Fire up the engine of serendipity: getting the database to mine the Long Tail of this blog automagically

When I worked for The Register, the only tedious task after you’d written a story – in what is one of the most friction-free environments I’ve ever worked in – was finding the “Related stories” to go at the end. (See this story for an example.) You had to plug in certain keywords, see what came up, and select them. Very good for getting people to click through (Long Tail and all that), but booooring to do. One’s obvious reaction: can’t machines do this sort of thing?

Turns out – yes. I think El Reg has moved on, and got the machines to do it, and now so have I. I’ve turned on a WordPress plugin called “related posts” (actually, nice-related-posts, which I’ve tweaked, which is itself based on related-posts).

“Related posts” works by telling MySQL to build a full-text index over the database table that holds the content of all your posts. MySQL then has a function where you throw a couple of words at it and it gives the other entries a “score” according to how well they match those words. Take the top-scoring entries, and you have a set of related posts.
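
For the curious, here is a rough sketch of what that boils down to. It is not the plugin's actual code. It only builds the SQL strings, assuming the stock WordPress `wp_posts` table with its `post_title` and `post_content` columns; the index name `post_related` and the helper `related_posts_query` are made up for illustration.

```python
# Sketch only: builds SQL strings, does not talk to a database.
# Assumes the stock WordPress table `wp_posts`; the index name is hypothetical.

CREATE_INDEX_SQL = (
    "ALTER TABLE wp_posts "
    "ADD FULLTEXT INDEX post_related (post_title, post_content)"
)


def related_posts_query(title, current_post_id, limit=5):
    """Return (sql, params) for a MySQL full-text search seeded by a post title."""
    # %s placeholders follow the style used by common Python MySQL drivers.
    sql = (
        "SELECT ID, post_title, "
        "MATCH (post_title, post_content) AGAINST (%s) AS score "
        "FROM wp_posts "
        "WHERE MATCH (post_title, post_content) AGAINST (%s) "
        "AND ID <> %s AND post_status = 'publish' "
        "ORDER BY score DESC "
        "LIMIT %s"
    )
    # The title appears twice: once to compute the score, once to filter rows.
    # The current post is excluded so it is never "related" to itself.
    params = (title, title, current_post_id, limit)
    return sql, params
```

Calling `related_posts_query("Free Our Data", 42, limit=3)` hands back the query text plus the four values to bind, ready to pass to whatever database driver you use.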

However, I’ve put that on steroids by using MySQL’s “WITH QUERY EXPANSION”. The reason is that the normal “score” system has a weakness: it only searches on the words in the post title. (Too hard to do the whole post.) But “WITH QUERY EXPANSION” runs the search twice: it throws up a list of posts, pulls the words out of the best matches, adds them to the original search terms, and then runs a score on them all. Result: you get better matches.
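
In MySQL itself the modifier simply goes inside the search, as in `AGAINST ('some words' WITH QUERY EXPANSION)`. To show the idea behind it, here is a toy version in plain Python. It is an illustration under loud assumptions, not MySQL's real algorithm: the four "posts" are invented, the score is a crude word count rather than MySQL's weighted relevance, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for a blog's posts.
POSTS = {
    1: "mysql database tuning tips",
    2: "mysql versus oracle and db2",
    3: "oracle licensing costs explained",
    4: "holiday photos from cornwall",
}


def score(query_words, text):
    """Crude relevance: total occurrences in the text of each distinct query word."""
    counts = Counter(text.split())
    return sum(counts[word] for word in set(query_words))


def search(query_words, top=None):
    """Return IDs of posts with a non-zero score, best first (ties by ID)."""
    ranked = sorted(
        ((score(query_words, text), pid) for pid, text in POSTS.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
    hits = [pid for points, pid in ranked if points > 0]
    return hits if top is None else hits[:top]


def search_with_expansion(query_words, seed_count=2):
    """Two-pass search: widen the query with words from the top first-pass hits."""
    first_pass = search(query_words, top=seed_count)
    expanded = list(query_words)
    for pid in first_pass:
        expanded.extend(POSTS[pid].split())
    return search(expanded)
```

Here `search(["mysql"])` finds only posts 1 and 2, while `search_with_expansion(["mysql"])` also pulls in post 3, which never mentions MySQL but shares the word "oracle" with one of the first-pass matches. The holiday photos stay out of it either way.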

Now, I have no idea how well this will work (though a brief glance suggests it’s quite good). I thought that it would get readers to discover stuff I’ve written that might have some relevance (there’s all the stuff about banks, and PKRSER.COM/Partygaming, and G4). And I also realised that I’ve written so much that I’ve forgotten a lot of it. “Related posts” might remind me of it.

Only thing – you’ll have to come to the site to see it. It isn’t in the feed (because it isn’t in the database). It does mine both forwards and backwards from older posts, which I think is nifty.

And if you visit the site, you’ll also see a few other changes around here. You’ll find a smart picture in the right-hand column showing the Guardian’s front page today (whichever today you’re reading it in). There are also links to stories that have appeared in the paper, and stuff that I’ve blogged there. If I had literary stalkers, this would be a goldmine.

I haven’t decided on the best number of related posts to show. Five, as I’ve done at the Free Our Data blog, which I also run? Four? I thought most people would only bother with three. Tell me what you think – it’s not as if it’s a difficult change.

2 Comments

  1. Good luck with the MySQL stuff. Thanks for the ‘heads up’ on “WITH QUERY EXPANSION” which looks interesting. However, any chance of making these comment boxes a bit wider (FF with 1152 x 864 screen) – it’s the function wpopen().

  2. The Reg hasn’t moved on quite that much, unfortunately. The keywords are now automatic, but we still have to find related stories ourselves. Guh.

    Nice to meet you, by the way.
