Category: Site twiddles

Dull stuff about tweaks to the site. Hardly worth bothering with.

I found the hacker… and I’m wondering where else he might be

So finally I had to take Stefan’s advice. Having upgraded to WordPress 2.8.5 over at the Free Our Data blog (where I’ve been having problems with a hacker who’s been inserting spam links invisibly into the end of the page), I …

Oh. And while I was writing that, I noticed – from the FTP transfer that was going on to do a second comparison – that there are a ton of spam pages in the site. Sodding hackers.

Anyway. I downloaded the entire blog content, and then ran a diff – that is, Filemerge (which comes with the Apple Developer Tools, free on your OSX install disk). It compares the content of any set of files, or directories.

Of course the site I’d downloaded was pretty old, and had been upgraded loads of times, so there were loads of files that were on the left (old) and not on the right (new). They just hadn’t been deleted.
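For anyone without Filemerge, the same left-versus-right comparison can be roughed out in a few lines of Python. This is only a sketch using the standard library's filecmp module; the function name compare_trees and the idea of passing the old and new download folders as arguments are my own assumptions, not anything Filemerge does internally.

```python
import filecmp
import os

def compare_trees(old_dir, new_dir):
    """Return (only_in_old, only_in_new, changed) as sorted relative paths."""
    only_old, only_new, changed = [], [], []

    def walk(cmp, prefix=""):
        # Names present on one side only (files or whole directories).
        only_old.extend(os.path.join(prefix, n) for n in cmp.left_only)
        only_new.extend(os.path.join(prefix, n) for n in cmp.right_only)
        # Files present on both sides that differ.
        # Caveat: dircmp compares shallowly by default, so two files with the
        # same size and modification time are treated as identical.
        changed.extend(os.path.join(prefix, n) for n in cmp.diff_files)
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(old_dir, new_dir))
    return sorted(only_old), sorted(only_new), sorted(changed)
```

Anything that turns up in the "only in old" list and is not part of a fresh WordPress download is worth opening and reading.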

Slogging on… I came across a WordPress page which explains which files have been deleted in the move up from 2.7 to 2.8. It’s a useful list and I was working my way through it. Slowly.

By now I’d got to blog/wp-includes/js/tinymce/themes/advanced/skins/o2k7/ and was starting to marvel at how deep WordPress is, when I came across a rather odd one – ui.php – which had the interesting opening:

Codz by angel(4ngel)

Make in China


Hmm, is it very likely that a valid WordPress file would really have that sort of comment? And more telling was that when you loaded it in a PHP editor with live PHP generation, you get this:

Yup, it's a hacker's login

Which in essence says: oh, lordy, you’ve been hacked.

Much digging around followed. It’s a fascinating file: it allows the hacker to download your database, and possibly upload chunks as well. I’m going to have to do an SQL dump now to see whether the content of older posts has been hacked (a favourite trick, apparently).

I also discovered a slew of website pages hidden in a directory called “Online” in the “Default” theme folder – which of course every WordPress install will generally have, so that’s a smart place to put it. (That also makes it a good one to delete.)

But as far as I can tell, the site is clean now. My best guess for how they did this is that it was one of the WordPress weaknesses via user registration – this one? This one? There are so many to choose from – and that it’s been sitting there for an age, just waiting to be exploited, or perhaps being exploited and I didn’t spot it. (Certainly neither Google’s indexing nor I discovered the hack of the /default/images folder – which is intriguing. Have you checked that folder lately?)

I hope this is the end of the tale. I’m not pinning everything on it though.

One other point: thanks again to Stefan Pause, who has helped a lot on this (what’s your site, Stefan?). I’m now alerted to the WordPress Exploit Scanner plugin, which will look through your site and find any suspicious CSS, HTML or similar. It reckons that there’s nothing suspicious in the older posts. Good-o, though I’d like to (and will) make sure myself.

Endnote: interestingly, Google won’t allow the ui.php file to be emailed, even in zip form. (I wanted to send it to my web host to explain what I’d found and tell them to search for it.) So presumably Google Mail’s already got some sort of hashing going on to detect malware being passed around. Impressive.

The hackers? Boring, really, that this sort of endless diversion from site to site is how they make their money. All that enthusiasm and knowledge and ability, turned to trying to persuade people lacking self-esteem to buy pills of unknown quality from sites of extremely dubious status. Isn’t there something better we could do with all our time here?

Super-endnote: And then I find another file – this one in /wp-includes/Text/Renderer/Diff/, called online.php (a bit of a clue by now, because it’s all about the “online” crap these guys are selling).

The WP Exploit Scanner tipped me off – it notes that the file contains a base64 command (in PHP, typically base64_decode), which usually means “something to hide”.
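The same kind of check can be roughed out by hand on a downloaded copy of the site. What follows is a minimal sketch in Python: the pattern list is my own guess at things worth flagging, not the Exploit Scanner plugin's actual rules, and find_suspicious_php is a made-up name.

```python
import os
import re

# Illustrative patterns only (an assumption, not the plugin's real rule set).
SUSPICIOUS = re.compile(rb"base64_decode\s*\(|eval\s*\(|gzinflate\s*\(", re.IGNORECASE)

def find_suspicious_php(root):
    """Yield (path, line_number, snippet) for each PHP line matching SUSPICIOUS."""
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.lower().endswith(".php"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if SUSPICIOUS.search(line):
                        # Keep the snippet short; hacked files often have very long lines.
                        yield path, lineno, line.strip()[:80].decode("latin-1")
```

A hit is a prompt to go and read the file, not proof of a hack: legitimate code uses these functions too.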

And so it proves: here’s the picture you get
Hacker control interface dropped inside WordPress

(You can see the full-size thing at Flickr.) And hey – what is it about hackers and the black backgrounds? Too much watching the Matrix, I think. Forget it, guys – you’re not The One, you’re pushing junk pills.

Astronauts.. Customs.. yeah, been there, written that.

So I see that the top story on Reddit (atm) is about how the astronauts had to come through Customs when they returned from the Moon.

Yup, I recall writing that one.. ooh, was it really as long ago as February 2001? Yes it was. Text below: it appeared on February 19 2001. Maybe it’s still there somewhere on the Indie site. (Update: yes it is.) Nothing new under the sun. Tom Wilkie, who was science editor when I joined, felt it was time to move on when the same story came around the third time (life on Mars, are we all doooooomed, how old is the universe, etc). Now I think: only three times? With the web, that only gives you about four years maximum.

Which reminds me that I must get my full-text articles database up. Code is written, only remains to load it in and generate an interface.

First published in The Independent February 19 2001; you can link, but please don’t copy. A side note: this appeared as the “basement” (the story on the bottom of the page) of the broadsheet, light relief from whatever was the main news that day. No room for such frivolities these days.

It was a small step for a man, a giant leap for mankind, but for the US Customs it was just another day at the office. Which is why when the triumphant crew of Apollo 11, led by Neil Armstrong, returned to Earth, one of the first questions they faced was: are you going through the red channel or the green channel?

Documents which have just come to light via the Internet show that even if you’ve just travelled to the Moon and back – especially if you’ve just travelled to the Moon and back – the US Customs wants to know what you’ve got. Anyone who has visited the US will be familiar with the huge litany of items which travellers are required to declare, such as plants, drugs and other preparations.

Historians at the US space agency Nasa have confirmed that the document, headed “General declaration” and signed by the three crew members, Neil Armstrong, Edwin “Buzz” Aldrin and Michael Collins, is authentic. It lists the departure point as “Moon” and arrival as “Honolulu” on July 24, 1969, where the travellers set foot on Earth again after splash-landing in the Pacific Ocean.

But what, Customs wanted to know, was in those bags? “Moon rock and Moon dust samples”, the crew responded. How many people had disembarked or joined the round trip from Cape Kennedy? Thankfully, the answer to both was “nil”: no lost souls and no extra aliens. And was anyone ill, and were there “any other conditions on board which may lead to the spread of disease” – which in this case would presumably be mysterious space viruses? “To be determined”, the crew responded to the latter question, though the test of time suggests that nothing untoward happened.

It is unclear whether this practice became the pro forma for returning lunar astronauts from Apollo 12, 14, 15, 16 and 17. “We have a lot of records here, but that would be something really for Customs,” said Colin Fries of the Nasa historian’s office. “We think that maybe it was one of those cases where everyone was trying to get in on the act, because it was such a big thing, after all.” But he is not certain that other crews did not also have to fill out a similar form: “it’s hard to prove a negative”, he commented.

And here’s the scanned image, from (I think, from memory) Steve Bellovin.

Fire up the engine of serendipity: getting the database to mine the Long Tail of this blog automagically

When I worked for The Register, the only tedious task after you’d written a story – in what is one of the most friction-free environments I’ve ever worked in – was finding the “Related stories” to go at the end. (See this story for an example.) You had to plug in certain keywords, see what came up, and select them. Very good for getting people to click through (Long Tail and all that), but booooring to do. One’s obvious reaction: can’t machines do this sort of thing?

Turns out – yes. I think El Reg has moved on, and got the machines to do it, and now so have I. I’ve turned on a WordPress plugin called “related posts” (actually, nice-related-posts, which I’ve tweaked, which is itself based on related-posts).

“Related posts” works by getting the database that holds all of the content of your posts and telling MySQL to make an index of the full text. MySQL then has a function where you throw a couple of words at it and it will give you a “score” of other entries according to those words. Take the top-scoring entries, and you have a set of related posts.

However, I’ve put that on steroids by using MySQL’s “WITH QUERY EXPANSION”. The reason is that the normal “score” system has a weakness: it only searches on the post title words. (Too hard to do the whole post.) But “WITH QUERY EXPANSION” runs the search twice: it finds the best-matching posts for the original words, adds the words from those posts to the search, and then scores everything again. Result: you get better matches.
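To make the two-pass idea concrete, here is a toy version in Python. It is emphatically not MySQL's algorithm: the scoring is a crude word count rather than MySQL's full-text relevance, it borrows only the commonest longer words from the best matches rather than their whole text, and every name in it (words, score, rank, related_posts) is invented for illustration.

```python
import re
from collections import Counter

def words(text):
    """Lower-case a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(terms, text):
    """Crude relevance: how many times the search terms occur in the text."""
    counts = Counter(words(text))
    return sum(counts[t] for t in set(terms))

def rank(terms, posts):
    """Return (title, score) pairs with a non-zero score, best first."""
    scored = [(t, score(terms, t + " " + body)) for t, body in posts.items()]
    return sorted([p for p in scored if p[1] > 0], key=lambda p: p[1], reverse=True)

def related_posts(title, posts, top_n=3, expand_from=2, extra_terms=5):
    """posts maps title -> body; returns up to top_n related titles."""
    others = {t: b for t, b in posts.items() if t != title}
    terms = words(title)
    # Pass 1: search on the title words alone.
    first_pass = rank(terms, others)
    # Expansion: borrow the commonest longer words from the best matches.
    pool = Counter()
    for t, _ in first_pass[:expand_from]:
        pool.update(w for w in words(t + " " + others[t]) if len(w) > 3)
    expanded = terms + [w for w, _ in pool.most_common(extra_terms)]
    # Pass 2: search again with the expanded word list.
    return [t for t, _ in rank(expanded, others)[:top_n]]
```

With a post titled “My bank”, a first pass on the title words only finds posts that mention “bank”; the second pass can also pull in a post about overdraft fees that never uses the word, because the bank post talked about overdrafts.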

Now, I have no idea how well this will work (though a brief glance suggests it’s quite good). I thought that it would get readers to discover stuff I’ve written that might have some relevance (there’s all the stuff about banks, and PKRSER.COM/Partygaming, and G4). And I also realised that I’ve written so much that I’ve forgotten a lot of it. “Related posts” might remind me of it.

Only thing – you’ll have to come to the site to see it. It isn’t in the feed (because it isn’t in the database.) It does mine both forwards and backwards from older posts, which I think is nifty.

And if you visit the site, you’ll also see a few other changes around here. You’ll find a smart picture in the right-hand column showing the Guardian’s front page today (whichever today you’re reading it in). There are also links to stories that have appeared in the paper, and stuff that I’ve blogged there. If I had literary stalkers, this would be a goldmine.

I haven’t decided how many is the best number of related posts to show. Five, as I’ve done at the Free Our Data blog, which I also run? Four? I thought most people would only bother with three. Tell me what you think – it’s not as if it’s a difficult change.

A PHP/MySQL question – geeks only need apply: how to get a formatted date from an associative array

(He writes, noting all the comments piling up about PKRSER.COM.. hmm…)

I’ve got an intriguing PHP problem. Well, I hope it’s intriguing. I’m trying to format a date that comes out of MySQL. Presently I have to do it as
"SELECT ID, post_title, post_date,MATCH (post_name, post_content) AGAINST ('$terms') AS score.. "

That produces post_date which is in the MySQL datetime format – 2007-03-08 07:55:37.

But I want to put that in the form of “8 March 2007”. Yes, I know, you’re saying “Use MySQL’s date_format command!” Not so fast. The query produces an associative array called (surprise!) $results, where the results are then called one by one in a foreach loop (foreach ($results as $result)) thus:

$title = htmlspecialchars(stripslashes($result->post_title));
$permalink = get_permalink($result->ID);

See? You asked it for ID and post_title and it comes directly out of $result. Trouble is, post_date comes out in that boring “2007-03-08 07:55:37” format.

Now, MySQL’s date_format could be used: SELECT ID, post_title, date_format(post_date, "%d %M %Y"),MATCH (post_name, post_content)...

And that would yield the post date nicely formatted. But how the hell do I call it?
$published = $result->date_format fails.
$published = $result[2] fails (which I found odd.)
How do I call the variable created by MySQL’s date formatting? Or am I doomed to use the rather boring substr on the boring date?

(I don’t really want to go off into parsing the months of the date, just to make the pages run quicker.)

You might be able to guess what this is for, though it requires some low-level tweaking too.

The original I’m working with is the tweaked Nice Related Posts plugin for WordPress, which I’m tweaking further. If you want, I can release the code, assuming I’m ever done.

(And you have to give its author kudos for the domain name. “Some fool with a .com”. Yeah.)

Update: solved – tip o’ the hat to Jason and L in the comments, who point out that one can use the SQL command SELECT date_format(post_date,"%e %M %Y") as publ_date… and then call $result->publ_date which will be formatted as 8 March 2007 (in that setting). Thank you, the lazyweb.
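The trick generalises: give the computed column a name with AS, then ask for it by that name. A small self-contained sketch of the same idea, in Python with an in-memory SQLite database standing in for PHP and MySQL. The posts table, the sample row and the human_date function are all invented here; human_date is only a stand-in for MySQL's date_format.

```python
import sqlite3
from datetime import datetime

def human_date(mysql_datetime):
    """Turn '2007-03-08 07:55:37' into '8 March 2007' (English month names assumed)."""
    dt = datetime.strptime(mysql_datetime, "%Y-%m-%d %H:%M:%S")
    return f"{dt.day} {dt:%B} {dt.year}"

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row                      # lets us fetch columns by name
conn.create_function("human_date", 1, human_date)   # stand-in for date_format

conn.execute("CREATE TABLE posts (ID INTEGER PRIMARY KEY, post_title TEXT, post_date TEXT)")
conn.execute("INSERT INTO posts VALUES (1, 'Hello', '2007-03-08 07:55:37')")

# The alias after AS is what makes the formatted value addressable by name.
row = conn.execute(
    "SELECT ID, post_title, human_date(post_date) AS publ_date FROM posts"
).fetchone()
print(row["publ_date"])
```

As for why $result[2] failed in the PHP: the rows there are read with the arrow syntax ($result->post_title), so they are presumably objects rather than numbered arrays, and an object has no position 2 to ask for.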

Wait, no, don’t stop reading this!

Interesting read over at problogger on 34 reasons why readers unsubscribe from your blog:

Thanks to everyone who has added their thoughts on why they unsubscribe from a blog’s RSS feed. There have been 109 comments left on that post so far and some interesting recurring themes have emerged.

…Obviously with 103 opinions (and most people giving multiple reasons all in their own words) I’ve had to make some judgement calls in classifying comments left. Some of the categories below have overlap but I think you’ll get a pretty good picture of what motivates people to unsubscribe from RSS feeds.

34 Reasons Why People Unsubscribe from RSS feeds:

  1. Too many posts (the post levels are too overwhelming) – 37
  2. Infrequent Posting (or the blog is effectively dead) – 29
  3. Partial Excerpts Feeds – 25
  4. Blog Changes Focus (too much off topic posting) – 23
  5. Too many posts that I see elsewhere (Redundant, Repeated or Recycled News) – 19
  6. Uninteresting Content – 16
  7. Irrelevant Content – 13
  8. The Blogger’s Ego – Too much self promotion – 11
  9. Low Quality Content – 11
  10. Too many posts that are too long – 10
  11. Negative blogging – 7
  12. Feed Errors – Especially when a Feed Reloads the latest 10-20 posts every time – 7
  13. Offensive Content/Personal attacks/Discrimination – 6
  14. ‘infomercials’ (too much selling) – 6
  15. Blog Titles that Don’t Tell what the post is about – 5
  16. No or Poor Formatting in posts – 5
  17. My own interests as a reader change – 5
  18. No Longer Useful or Valuable – 4
  19. Too many links in the text and not enough content – 4
  20. Advertising – 3
  21. Inconsistent writing (style and focus) – 2
  22. Too Many Grammatical Errors – 2

…and so on. It gets a bit trivial after that.

I find it interesting that “partial RSS feeds” is No. 3 (after the rather difficult to reconcile “too much” and “too little”). It’s certainly what frustrates me the most; there aren’t that many feeds that I stay subscribed to which only give partial content. (I still don’t get why Martin Stabe doesn’t do a full feed, for instance. Are there adverts I’ve not noticed, Martin? And none of the RSS 0.92, RSS 2.0 or Atom 0.3 feeds offers a full feed.)

So anyway, I shall endeavour not to post too much, nor too little. And the feed remains full.

How are would-be spammers registering on my WordPress blog if I’ve disabled registering?

Recognising that “registering” on a blog can, in some cases, give you privileges such as being able to write, post or comment, spammers have set up systems to register automatically on people’s blogs so they can spam them.

I don’t allow registration on this blog: I’ve turned it off, and also hacked the file that would let you register (it’s ..). [see below]

You shouldn’t be able to register, because I’ve hacked the file that processes the registration form’s output. Yet some spammers do manage to register – there were two “accepted” registrations yesterday. How’s it done, so I can block that hole?

Update: Durr, worked it out. I didn’t copy over the old register file, which would have blocked registrations, when I updated the blog a while ago. Thanks to all those who’ve proved that it does work. Link now killed. :-)

I’m now going to break the blog..

OK, I’m now going to upgrade this blog to the latest version of WordPress. Which may break quite a few things. Hold tight..

..ooh, it all worked. Splendid – only had to restart one plugin. WordPress is definitely a triumph of open source.

Let the bug-ironing begin! Plus, does teaching Java dumb would-be programmers down?

Ah, so there’s a new version of WordPress – version 2 – officially released. It has all sorts of whizzy enhancements, apparently, though I have to say that this site (using 1.5) seems pretty good to me. Personally I’m going to hold off taking it up until I’ve seen how well its anti-spam functionality works, and in particular whether Spam Karma 2 works with it. I know there’s been some upgrading of SK to cope with 2.0, but as there’s also an anti-spam plugin in 2.0, and as I’m running various other plugins (like recent comments, recent posts, which while simple in PHP terms still might need some hacking around that I’m unwilling to do for time reasons – I mean, look at how I still haven’t got the nested LI items right on the bottom of the RH column here) I think I’ll just wait and see how the bugs unravel, or whatever it is that bugs do. (Appear? Emerge?)

Meanwhile, I’ll just point you to The Perils of JavaSchools – Joel on Software

You may be wondering if teaching object oriented programming (OOP) is a good weed-out substitute for pointers and recursion. The quick answer: no. Without debating OOP on the merits, it is just not hard enough to weed out mediocre programmers. OOP in school consists mostly of memorizing a bunch of vocabulary terms like “encapsulation” and “inheritance” and taking multiple-choice quizzicles on the difference between polymorphism and overloading. Not much harder than memorizing famous dates and names in a history class, OOP poses inadequate mental challenges to scare away first-year students. When you struggle with an OOP problem, your program still works, it’s just sort of hard to maintain. Allegedly. But when you struggle with pointers, your program produces the line Segmentation Fault and you have no idea what’s going on, until you stop and take a deep breath and really try to force your mind to work at two different levels of abstraction simultaneously.

A fascinating article, showing that dumbing-down doesn’t only happen on TV; it can happen in software too.

Really trivial tweaks of our time.. and the value of this blog. You’ll be astounded

For anyone who also runs a blog, this is more relevant. Anyway, I’ve tweaked the “recent comments” thing (it’s on the right-hand side, further down) so that it now

  • shows who commented (OK, it already did that)
  • shows what post they’re commenting on – NEW!
  • if you hold your mouse over the post name, shows the first 15 words of the comment – NEW!
    (though do you think it should show more words from the comment? Tell me)
  • shows the URL of anyone who puts a URL in the comment field – NEW!

This has been done by a judicious bit of tweaking of the “Recent Comments” plugin, available here.
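The “first 15 words” part is the only bit with any logic in it. Roughly, and in Python rather than the plugin's PHP, it amounts to something like the sketch below; comment_excerpt is a made-up name, and escaping the text for use in a tooltip is my assumption about what a careful version would do, not a description of the plugin.

```python
import html

def comment_excerpt(comment, max_words=15):
    """Return the first max_words words of a comment, escaped for an HTML attribute."""
    parts = comment.split()
    text = " ".join(parts[:max_words])
    if len(parts) > max_words:
        text += "..."          # signal that the comment was cut short
    # Escape quotes and angle brackets so the text is safe inside title="...".
    return html.escape(text, quote=True)
```

Showing more words is then just a matter of changing max_words.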

Oh and by the way..

Isn’t that great?

Spam Karma: read those stats and cheer

Here’s what Spam Karma is telling me about how much it’s caught since I installed it on Sept 26 (that’s just over a month ago):

  • Total Spam Caught: 2019 (average karma: -254.08)
  • Total Comments Approved: 151 (average karma: 8.74)
  • Total Comments Moderated: 17

And in that period there have been 150 real comments, and 1 (count it!) comment that slipped through – a “free iPod” site advert that was manually posted.

Hmm, 2020 spams and 150 real comments. Spam outweighs real people just as in email.
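For the record, the arithmetic behind that, as a few lines of Python using only the figures quoted above (the variable names are mine):

```python
caught = 2019    # spam caught by Spam Karma
slipped = 1      # spam that got through
real = 150       # genuine comments

total_spam = caught + slipped                    # 2020
spam_share = total_spam / (total_spam + real)    # spam as a share of everything submitted
catch_rate = caught / total_spam                 # share of spam that was stopped

print(f"spam share of all comments: {spam_share:.1%}")
print(f"catch rate: {catch_rate:.2%}")
print(f"spams per real comment: {total_spam / real:.1f}")
```

That works out at about 93.1% spam, a catch rate of 99.95%, and roughly 13.5 spams for every real comment.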