I've just launched a redesign of I Thought He Came With You. The main thrust is to make the site more usable on desktops. Which seems nuts, but the data doesn't lie. The site has low mobile traffic and for a while I thought this was some kind of technical issue. I optimized the design heavily for mobile and spent a lot of time on speed and some AMP. I guess it's the content. Google loves it when I write documentation for them and doesn't think I have anything useful to say on politics. They're probably right. So I've gone back to having an old school sidebar and I've taken the performance hit of using Bootstrap to get some better looking forms and navigation without spending a lot of time on it. I hope you enjoy it, and if you find anything broken please email or leave a comment.
I have been experimenting with word2vec recently. Word2vec trains a neural network to guess which word is likely to appear given the context of the surrounding words. The result is a vector representation of each word in the trained vocabulary with some amazing properties (the canonical example is king - man + woman = queen). You can also find similar words by looking at cosine distance - words that are close in meaning have vectors that are close in orientation.
This sounds like it should work well for finding related posts. Spoiler alert: it does!
My old system listed posts with similar tags. This worked reasonably well, but it depended on me remembering to add enough tags to each post and a lot of the time it really just listed a few recent posts that were loosely related. The new system (live now) does a much better job which should be helpful to visitors and is likely to help with SEO as well.
I don't have a full implementation to share as it's reasonably tightly coupled to my custom CMS but here is a code snippet which should be enough to get this up and running anywhere:
The first step is getting a vector representation of a post. Word2vec just gives you a vector for a word (or short phrase depending on how the model is trained). A related technology, doc2vec, learns a vector for the whole document alongside the word vectors. This could be useful but isn't really what I needed here (e.g. I could solve my forgetfulness around adding tags by training a model to suggest them for me - might be a good project for another day). I ended up using a pre-trained model and then averaging together the vectors for each word in the post. This paper (PDF) suggests that this isn't too crazy.
For the model I used word2vec-slim which condenses the Google News model down from 3 million words to 300k. This is because my blog runs on a very modest EC2 instance and a multi-gigabyte model might kill it. I load the model into Word2vec.Tools (available via NuGet) and then just get the word vectors (GetRepresentationFor(...).NumericVector) and average them together.
I haven't included code to build the word list but I just took every word from the post, title, meta description and tag list, removed stop words (the, and, etc) and converted to lower case.
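As a rough illustration of the word list and averaging steps described above, here is a minimal sketch in Python rather than the C# libraries mentioned in the post. Everything in it is an assumption for illustration: the stop word list is a tiny made-up one, `build_word_list` and `average_vector` are hypothetical helper names, and `toy_model` is a hand-written stand-in for the pre-trained vectors. Skipping words the model doesn't know is also an assumption, not something the post specifies.

```python
import re

# Hypothetical, tiny stop word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is"}

def build_word_list(*fields):
    """Lower-case each field, split it into words and drop stop words."""
    words = []
    for field in fields:
        for word in re.findall(r"[a-z0-9']+", field.lower()):
            if word not in STOP_WORDS:
                words.append(word)
    return words

def average_vector(words, model):
    """Average the vectors of the words the model knows; None if it knows none."""
    vectors = [model[w] for w in words if w in model]
    if not vectors:
        return None
    size = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]

# Toy 3-dimensional "model" standing in for the pre-trained word vectors.
toy_model = {
    "hiking": [1.0, 0.0, 0.0],
    "trail": [0.0, 1.0, 0.0],
}

# Fields might be the post body, title, meta description and tags.
words = build_word_list("Hiking the Trail", "a trail in the fog")
post_vector = average_vector(words, toy_model)
```

Repeated words are kept, so a word that appears often pulls the average towards its own vector. Whether that is desirable probably depends on the content.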
Now that each post has a vector representation it's easy to compute the most related posts. For a given post compute the cosine distance between the post vector and every other post. Sort the list in ascending order and pick however many you want from the top (the distance between the post and itself would be 0, a totally unrelated post would be close to 1; cosine similarity is the reverse, with 1 for vectors pointing the same way). The last line in the code sample shows this comparison for one post pair using Accord.Math, also on NuGet.
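The ranking step can be sketched in a few lines of plain Python. This is not the Accord.Math call from the post, just an illustration that assumes cosine distance is defined as 1 minus cosine similarity, so 0 means the same orientation and values near 1 mean unrelated. The post IDs and vectors are made up, and `related_posts` is a hypothetical helper name.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same orientation, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat an all-zero vector as unrelated
    return 1.0 - dot / (norm_a * norm_b)

def related_posts(post_id, vectors, count=3):
    """Return the IDs of the `count` posts closest to `post_id`."""
    target = vectors[post_id]
    scored = [
        (cosine_distance(target, vector), other_id)
        for other_id, vector in vectors.items()
        if other_id != post_id  # skip the post itself
    ]
    scored.sort()  # ascending distance: most related first
    return [other_id for _, other_id in scored[:count]]

# Made-up post vectors for illustration only.
post_vectors = {
    "hike-a": [1.0, 0.1, 0.0],
    "hike-b": [0.9, 0.2, 0.0],
    "code": [0.0, 0.1, 1.0],
}
related = related_posts("hike-a", post_vectors, count=2)
```

Comparing every post against every other post is quadratic in the number of posts. That is likely fine for a personal blog, and the results could be cached per post if it ever became slow.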
I'm really happy with the results. This was a fast implementation and a huge improvement over tag based related posts.
The first notification is when a comment is approved. You'll always be notified in this case if you enter an email address.
When you leave a comment you can opt in to receiving notifications when another comment is added to the same post.
Finally, you can also subscribe to the monthly newsletter when leaving a comment.
I'm switching hosts so there will be various DNS changes and some downtime today.
You know how you're debugging and comment out that return statement that stops book reviews from being posted more than once a month so you can get to the bottom of a problem without constantly deleting posts? And then you get distracted and push a new version of the blog software with that return statement still commented out? Thankfully that task is only scheduled to run every four hours. Sorry.
I'm over social media - the Facebook page for this blog is a hopeless way to reach people and I removed the slow horrible sharing widgets a while ago. But I have this nagging suspicion that RSS is a super-niche activity for techno-libertarians harking back to the good old days of the Internet with open protocols and wall-free gardens and isn't entirely up to snuff either. So I'm going to experiment with a monthly email list for people who vaguely follow the blog or use Catfood Software products but don't quite manage to come back here every day to check for updates. Sign up here.
Why? Excellent question. The rules for blogs are to pick a narrow topic of interest, know your audience and do keyword research and drop SEO honeypot bombs to draw that audience in. I did that for Catfood Software but this isn't that kind of blog. It's a random collection of my hobbies and interests. So if you're not sure read through the Featured section in the side bar to get a preview.
I write a lot of code so what you'll get for sure is updates from Catfood Software and other occasional side projects. When I struggle with the process or discover something I write about that as well - these posts are more interesting to other developers and less exciting if you just want your desktop wallpaper (or Android phone) to look awesome. I love to make videos that don't have me in as well, mainly complicated time-lapses so you'll find a lot of those too. Also hikes in and around the San Francisco Bay Area. Occasionally politics.
No books this month.
RT @drclue: "drclue: #pearlhunt making progress... http://t.co/FAgLQ2UH" --http://www.twitter.com/drclue/status/185829093244280832
ITHCWY: Agua: Little known fact, geologists would tell you that Bernal Hill is made of chert, actually it's mostly… http://t.co/xMRm2J9n
ITHCWY: Mangler: I don't know what the machine attached to our office does but it's giving me nightmares. http://t.co/BSbvoshq
ITHCWY: It was where he left it: Not to bang on about the BBC and their horrible headlines but 'lost' is a bit… http://t.co/bDKjUbl4
ITHCWY: Executive Clubbing: I used to really love British Airways. I even got over their silly new livery and… http://t.co/NC2Bt9bM
ITHCWY: Sand Ladder at Fort Funston http://t.co/5aeQjoti
ITHCWY: SFO http://t.co/sB1QdXCt
ITHCWY: Robot Ahead http://t.co/tn7mHTI8
ITHCWY: Goldilocks: Israel just banned models with a BMI under 18.5. That's not severely underweight, it's the… http://t.co/G5HCE5Ey
Good weekend to skip Fort Funston: http://t.co/UE8blE4c
ITHCWY: Catfood: PdfScan 1.40: Catfood PdfScan 1.40 is a small bug fix release. PdfScan converts documents to PDFs… http://t.co/YXdMn6ux
ITHCWY: Three reasons the dream of a robot companion isn't over: David Lee reports from the Innorobo 2012… http://t.co/JndJZahn
ITHCWY: Fixing dropped wireless connection for Linksys E4200: I've been going quietly mad trying to fix a constant… http://t.co/VVZ2dl2m
ITHCWY: Sweeney Ridge: Sweeney Ridge, starting from Skyline College and walking up to the Portola Expedition… http://t.co/g1HIms1F
"not a threat to the penguins, we don't suspect" - http://t.co/oIrzQOEj - it wasn't a dream!
The Snowman by Jo Nesbø
Very good, enjoying the entire Harry Hole series. Wishing for translations of the first two now!
The Devil's Star by Jo Nesbø
Slightly weaker than the others in the series I've read so far but still knocked it back quickly.
The Redbreast by Jo Nesbø
Best so far on my quest to read through Nesbø...
Nemesis by Jo Nesbø
On a Jo Nesbø binge...
The Leopard by Jo Nesbø
Compelling crime thriller, rather worryingly one of a series featuring Harry Hole so I'm going to have to go back to the beginning and read all of them.
Catfood.Shapefile 1.51: http://t.co/BKtkx9Zq (ESRI Shapefile Parser, fixed release binary issue).
4 of 5 stars to The Snowman by Jo Nesbø http://t.co/IrvdrDBf
Breaking Good: how to synthesize Pseudoephedrine (Sudafed) From N-Methylamphetamine (crystal meth): http://t.co/fviYaj5P
ITHCWY: Catfood.Shapefile 1.50: I've just released a small update to my C# Shapefile library on Codeplex. Catfood… http://t.co/lXoGoBsY
4 of 5 stars to The Redbreast by Jo Nesbø http://t.co/PqrOQnQL
A History of the Sky for One Year: http://t.co/UKMjosCK (very cool)
ITHCWY: Badge Driven Development: Microsoft has released Visual Studio Achievements, an extension that brings… http://t.co/5BOyNF03
ITHCWY: GGNRA Dog Management Plan Update: I love it when making some noise works. The NPS is pushing its dog… http://t.co/fzqaJWM2
Unicode Character 'PILE OF POO' (U+1F4A9): http://t.co/LkGffsvW
BBC News - Can the US Army embrace atheists? http://t.co/5ubkKT7r
4 of 5 stars to The Leopard by Jo Nesbø http://t.co/tIIPs1M5
ITHCWY: Reviews and Links for January 2012: Damned by Chuck Palahniuk 3/5 Very much a vehicle for Palahniuk to rant… http://t.co/6kvApyf1
The Spire by Richard North Patterson
A good enough holiday read and nice to see Patterson return to a straight psychological thriller rather than the last few OpEds loosely wrapped with some plot.
Advanced .NET Debugging (Addison-Wesley Microsoft Technology Series) by Mario Hewardt
Comprehensive introduction to low level .NET debugging - when you need to fire up WinDbg to check out the state of the managed heap, or debug a crash dump from the field you'll find this book invaluable. I wish it had been available when I started figuring out how to use SOS.
The Complete Stories of J. G. Ballard by J.G. Ballard
Wonderful collection of all of Ballard's short stories. It's a huge book with surprisingly few duds. My favorites include The Illuminated Man, clearly the inspiration for The Crystal World, which includes meaning bombs like "It's almost as if a sequence of displaced but identical images were being produced by refraction through a prism, but with the element of time replacing the role of light." and The Ultimate City (which isn't using ultimate in the sense of being good...). I've read most of Ballard's novels but not many of the short stories before. They're well worth the time.
- Microsoft Agrees With Apple And Google: “The Future Of The Web Is HTML5” from TechCrunch (Which makes it all the more tragic that a huge number of clients will still be running IE6 :().
- New Google Maps Option: "Avoid Arizona" (alright, it's a joke) from Boing Boing (ROFLATSDOON).