I Thought He Came With You is Robert Ellison’s blog about software, marketing, politics, photography and time lapse.

Better related posts with word2vec (C#)

I have been experimenting with word2vec recently. Word2vec trains a neural network to guess which word is likely to appear given the context of the surrounding words. The result is a vector representation of each word in the trained vocabulary with some amazing properties (the canonical example is king - man + woman = queen). You can also find similar words by looking at cosine distance - words that are close in meaning have vectors that are close in orientation.

This sounds like it should work well for finding related posts. Spoiler alert: it does!

My old system listed posts with similar tags. This worked reasonably well, but it depended on me remembering to add enough tags to each post and a lot of the time it really just listed a few recent posts that were loosely related. The new system (live now) does a much better job which should be helpful to visitors and is likely to help with SEO as well.

I don't have a full implementation to share as it's reasonably tightly coupled to my custom CMS but here is a code snippet which should be enough to get this up and running anywhere:

The first step is getting a vector representation of a post. Word2vec just gives you a vector for a word (or short phrase depending on how the model is trained). A related technology, doc2vec, adds the document to the vector. This could be useful but isn't really what I needed here (i.e. I could solve my forgetfulness around adding tags by training a model to suggest them for me - might be a good project for another day). I ended up using a pre-trained model and then averaging together the vectors for each word. This paper (PDF) suggests that this isn't too crazy.

For the model I used word2vec-slim which condenses the Google News model down from 3 million words to 300k. This is because my blog runs on a very modest EC2 instance and a multi-gigabyte model might kill it. I load the model into Word2vec.Tools (available via NuGet) and then just get the word vectors (GetRepresentationFor(...).NumericVector) and average them together.

I haven't included code to build the word list but I just took every word from the post, title, meta description and tag list, removed stop words (the, and, etc) and converted to lower case.

Now that each post has a vector representation it's easy to compute the most related posts. For a given post compute the cosine distance between the post vector and every other post. Sort the list in ascending order and pick however many you want from the top (the distance between the post and itself would be 1, a totally unrelated post would be 0). The last line in the code sample shows this comparison for one post pair using Accord.Math, also on Nuget.

I'm really happy with the results. This was a fast implementation and a huge improvement over tag based related posts.

Reading and Writing Office 365 Excel from a Console app using the Microsoft.Graph C# Client API

Updated on Sunday, September 30, 2018

I needed a console app that reads some inputs from an online Excel workbook, does some processing and then writes back the results to a different worksheet. Because I enjoy pain I decided to use the thinly documented new Microsoft.Graph client library. The sample code below assumes that you have a work or education Office 365 subscription.

Paste the code into a new console project and then follow the instructions at the top to add the necessary NuGet packages. You'll also need to register an application at https://portal.azure.com/. You want a Native application and you'll need the Application ID and the redirect URL (just make up some non-routable URL for this). Under Required Permissions for the app you should add read and write files delegated permissions for the Microsoft Graph API.

Hope this saves you a few hours. Comment below if you need a more detailed explanation for any of the above.

Shapefile Update

A few people have asked for 3D shape support in my ESRI Shapefile library. I've never got around to it, but CodePlex user ekleiman has forked a version in his ESRI Shapefile to Image Convertor that supports PointZ, PolygonZ and PolyLineZ shapes. If that's what you need please check it out.

Crushing PNGs in .NET

Updated on Sunday, September 30, 2018

Crushing PNGs in .NET

I'm working on page speed and Google PageSpeed Insights is telling me that my PNGs are just way too large. Sadly .NET does not provide any way to optimize PNG images so there is no easy fix - just unmanaged libraries and command line tools.

I have an allergy to manual processes so I've lashed up some code to automatically find and optimize PNGs in my App_Data folder using PNGCRUSH. I can call CrushAllImages() to fix up everything or CrushImage() when I need to fix up a specific PNG. Code below:

Minify and inline CSS for ASP.NET MVC

Updated on Sunday, September 30, 2018

ASP.NET has a CssMinify class (and a JavaScript variant as well) designed for use in the bundling pipeline. But what if you want to have your CSS minified and inline? Here is an action that is working for me (rendered into a style tag on my _Layout.cshtml using @Html.Action("InlineCss", "Home")).

Note that I'm using this to inline CSS for this blog. The pages are cached so I'm not worried about how well this action performs. My blog is also basically all landing pages so I'm also not worried about caching a non-inline version for later use, I just drop all the CSS on every page.

Personal Finger Daemon for Windows

Updated on Sunday, September 30, 2018

Did you know that Windows still has a vestigial finger command with just about nothing left to talk to? One of my New Year's resolutions is to bring finger back and unlike the stalled webfinger project I need to make some progress. Here's some C# to run your own personal finger daemon... you just need to create a .plan file in your home directory (haven't done that for a while):

ZoneInfo Update (tzdata for .NET)

Updated on Thursday, November 12, 2015

ZoneInfo Update (tzdata for .NET)

I've used the ZoneInfo (PublicDomain.ZoneInfo) project from CodePlex for quite a few years, especially in Catfood Earth. The project had rusted a little so I emailed the author (Mark Rodrigues) and he was kind enough to add me as a developer. I've just updated ZoneInfo with some of the local changes I'd made and a variety of patches from the CodePlex community. It now works with the latest IANA tzdata file, at least for the test cases I can run. Let me know if I missed something (and thanks Mark for letting me contribute back to this very helpful project).

Google Spreadsheets API and Column Names

Updated on Thursday, November 12, 2015

Google Spreadsheets API and Column Names

I had a play with the Google Spreadsheets API recently to feed in some data from a C# application. The getting started guide is great and I was authenticated and adding dummy data in no time. But as soon as I started to work with real data I got:

"The remote server returned an error: (400) Bad Request."

And digging deeper into the response:

"We're sorry, a server error occurred. Please wait a bit and try reloading your spreadsheet."

The original sample code still worked so it didn't seem like any sort of temporary glitch as the message suggests. After much hair torn it turns out I was getting this error because I had used the literal column names from my spreadsheet. The API expects them to be lower case with spaces removed. If not columns match you get the unhelpful error above, if at least one column matches you get a successful insert with some missing data.

Error messages are one of the hardest parts of an API to get right. If you're not very detailed then what seems obvious to you can leave your developers stumped.

Hope this helps someone else...

Fix search on enter problem in BlogEngine.NET

Updated on Thursday, November 12, 2015

Search on enter has been broken for a while in BlogEngine.NET (I'm running the latest 2.8.0.1 version). Finally got a chance to look at this today and there is a simple patch to the JavaScript to fix it. See the issue I just filed on CodePlex for details.

How to get SEO credit for Facebook Comments (the missing manual)

Updated on Sunday, September 30, 2018

How to get SEO credit for Facebook Comments (the missing manual)

I've been using the Facebook Comments Box on this blog since I parted ways with Disqus. One issue with the Facebook system is that you won't get SEO credit for comments displayed in an iframe. They have an API to retrieve comments but the documentation is pretty light and so here are three critical tips to get it working.

The first thing to know is that comments can be nested. Once you've got a list of comments to enumerate through you need to check each comment to see if it has it's own list of comments and so on. This is pretty easy to handle.

The second thing is that the first page of JSON returned from the API is totally different from the other pages. This is crazy and can bite you if you don't test it thoroughly. For https://developers.facebook.com/docs/reference/plugins/comments/ the first page is https://graph.facebook.com/comments/?ids=https://developers.facebook.com/docs/reference/plugins/comments/. The second page is embedded at the bottom of the first page and is currently https://graph.facebook.com/10150360250580608/comments?limit=25&offset=25&__after_id=10150360250580608_28167854 (if that link is broken check the first page for a new one). The path to the comment list is "https://developers.facebook.com/docs/reference/plugins/comments/" -> "comments" -> "data" on the first page and just "data" on the second. So you need to handle both formats as well as the URL being included as the root object on the first page. Don't know why this would be the case, just need to handle it.

Last but not least you want to include the comments in a way that can be indexed by search engines but not visible to regular site visitors. I've found that including the SEO list in the tag does the trick, i.e.

I've included the source code for an ASP.NET user control below - this is the code I'm using on the blog. You can see an example of the output on any page with Facebook comments. The code uses Json.net.

FacebookComments.ascx:

FacebookComments.ascx.cs