I Thought He Came With You is Robert Ellison’s blog about software, marketing, politics, photography and time lapse.

Better related posts with word2vec (C#)

I have been experimenting with word2vec recently. Word2vec trains a neural network to guess which word is likely to appear given the context of the surrounding words. The result is a vector representation of each word in the trained vocabulary with some amazing properties (the canonical example is king - man + woman = queen). You can also find similar words by looking at cosine distance - words that are close in meaning have vectors that are close in orientation.

This sounds like it should work well for finding related posts. Spoiler alert: it does!

My old system listed posts with similar tags. This worked reasonably well, but it depended on me remembering to add enough tags to each post, and a lot of the time it really just listed a few recent posts that were loosely related. The new system (live now) does a much better job, which should be helpful to visitors and is likely to help with SEO as well.

I don't have a full implementation to share as it's reasonably tightly coupled to my custom CMS but here is a code snippet which should be enough to get this up and running anywhere:

The first step is getting a vector representation of a post. Word2vec just gives you a vector for a word (or short phrase depending on how the model is trained). A related technology, doc2vec, learns a vector for the whole document as well as for its words. This could be useful but isn't really what I needed here (e.g. I could solve my forgetfulness around adding tags by training a model to suggest them for me - might be a good project for another day). I ended up using a pre-trained model and then averaging together the vectors for each word. This paper (PDF) suggests that this isn't too crazy.

For the model I used word2vec-slim which condenses the Google News model down from 3 million words to 300k. This is because my blog runs on a very modest EC2 instance and a multi-gigabyte model might kill it. I load the model into Word2vec.Tools (available via NuGet) and then just get the word vectors (GetRepresentationFor(...).NumericVector) and average them together.
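The averaging step might look something like the sketch below. It is not the code from my CMS: PostVectors, Average and the lookup delegate are hypothetical names, and the lookup stands in for the model so that the sample runs without any NuGet packages. It assumes the lookup returns null for words the model doesn't know.

```csharp
using System;
using System.Collections.Generic;

public static class PostVectors
{
    // Averages the vectors of every word the model knows about.
    // 'lookup' is a hypothetical stand-in for the model: it returns
    // the vector for a word, or null if the word is not in the vocabulary.
    public static float[] Average(IEnumerable<string> words, Func<string, float[]> lookup, int dimensions)
    {
        var sum = new float[dimensions];
        int found = 0;

        foreach (string word in words)
        {
            float[] vector = lookup(word);
            if (vector == null) continue; // skip words the model has never seen

            for (int i = 0; i < dimensions; i++) sum[i] += vector[i];
            found++;
        }

        if (found == 0) return sum; // no known words: return all zeros

        for (int i = 0; i < dimensions; i++) sum[i] /= found;
        return sum;
    }
}
```

With Word2vec.Tools the lookup would presumably wrap GetRepresentationFor(word).NumericVector, plus a check for words that aren't in the model.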

I haven't included code to build the word list but I just took every word from the post, title, meta description and tag list, removed stop words (the, and, etc.) and converted to lower case.
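A minimal sketch of that word list step, under some assumptions: WordList and Build are hypothetical names, the stop word list is a tiny illustrative one, and splitting on runs of letters, digits and apostrophes is just one reasonable way to pull words out of text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordList
{
    // A tiny illustrative stop word list; a real one would be much longer.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "a", "an", "and", "in", "is", "it", "of", "the", "to" };

    // Takes any number of text fields (post body, title, meta description,
    // tags), lower-cases them, splits them into words and drops stop words.
    public static List<string> Build(params string[] fields)
    {
        return fields
            .Where(f => !string.IsNullOrEmpty(f))
            .SelectMany(f => Regex.Matches(f.ToLowerInvariant(), @"[\p{L}\p{Nd}']+").Cast<Match>())
            .Select(m => m.Value)
            .Where(w => !StopWords.Contains(w))
            .ToList();
    }
}
```

Duplicates are kept on purpose, so a word that appears many times in a post counts for more when the vectors are averaged.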

Now that each post has a vector representation it's easy to compute the most related posts. For a given post compute the cosine similarity between the post vector and the vector of every other post. Sort the list in descending order and pick however many you want from the top (the similarity between the post and itself would be 1, a totally unrelated post would be close to 0). The last line in the code sample shows this comparison for one post pair using Accord.Math, also on NuGet.
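To make the ranking concrete, here is a rough sketch that computes cosine similarity by hand rather than through Accord.Math, so it runs without any packages. RelatedPosts, CosineSimilarity and MostRelated are hypothetical names, and the dictionary of post vectors keyed by post id is an assumption about how the vectors are stored. Higher scores mean more related here, so the list is sorted from highest to lowest and the post itself is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RelatedPosts
{
    // Cosine similarity: 1 when two vectors point the same way,
    // 0 when they are orthogonal (or when either vector is all zeros).
    public static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Returns the ids of the 'count' posts most similar to 'postId',
    // best match first, excluding the post itself.
    public static List<string> MostRelated(string postId, IDictionary<string, float[]> postVectors, int count)
    {
        float[] target = postVectors[postId];

        return postVectors
            .Where(p => p.Key != postId)
            .OrderByDescending(p => CosineSimilarity(target, p.Value))
            .Take(count)
            .Select(p => p.Key)
            .ToList();
    }
}
```

For a blog with a few hundred posts, comparing every pair like this is cheap enough to do whenever a post is published and then cache.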

I'm really happy with the results. This was a fast implementation and a huge improvement over tag based related posts.

Privacy Policy Update and Comment Notifications

The ITHCWY privacy policy has been updated to reflect changes in the blog comment system. Previously email addresses submitted with comments were only used to display a Gravatar. Starting today they will also be used for notifications and newsletter signup.

The first notification is when a comment is approved. You'll always be notified in this case if you enter an email address.

When you leave a comment you can opt in to receiving notifications when another comment is added to the same post.

Finally, you can also subscribe to the monthly newsletter when leaving a comment.

Subscribe via Messenger

Thanks to revoice.me you can now subscribe to I Thought He Came With You on Facebook Messenger, Telegram, Slack and/or Chrome Notifications. To sign up visit ITHCWY on revoice.me.

Host change

Updated on Wednesday, June 28, 2017

I'm switching hosts so there will be various DNS changes and some downtime today.

Get ITHCWY By Email

I'm over social media - the Facebook page for this blog is a hopeless way to reach people and I removed the slow horrible sharing widgets a while ago. But I have this nagging suspicion that RSS is a super-niche activity for techno-libertarians harking back to the good old days of the Internet with open protocols and wall-free gardens and isn't entirely up to snuff either. So I'm going to experiment with a monthly email list for people who vaguely follow the blog or use Catfood Software products but don't quite manage to come back here every day to check for updates. Sign up here.

Why? Excellent question. The rules for blogs are to pick a narrow topic of interest, know your audience and do keyword research and drop SEO honeypot bombs to draw that audience in. I did that for Catfood Software but this isn't that kind of blog. It's a random collection of my hobbies and interests. So if you're not sure, read through the Featured section in the sidebar to get a preview.

I write a lot of code so what you'll get for sure is updates from Catfood Software and other occasional side projects. When I struggle with the process or discover something I write about that as well - these posts are more interesting to other developers and less exciting if you just want your desktop wallpaper (or Android phone) to look awesome. I also love to make videos that don't have me in them, mainly complicated time-lapses, so you'll find a lot of those too. Also hikes in and around the San Francisco Bay Area. Occasionally politics.

If that works for you and you're not an RSS type then please join and let me know how I'm doing.