Over the past few months I’ve been reading a lot about Bayesian nonparametrics. My current thoughts on the matter will make up a future post. A good starting point for understanding the hierarchical Dirichlet process mixture model is to understand the parametric form, an example of which is LDA. LDA, or latent Dirichlet allocation  is an approach to modeling a process by which natural-language documents are constructed. LDA uses the bag-of-words paradigm, where documents are really just collections of words, ignoring the order in which they appear. Obviously a simplistic view, but one that is reasonable. For the remainder of this post, I’m going to be intentionally vague about the details, as rather than get bogged down in them, I’ll leave it to those who are interested to read one of the hundreds of existing tutorials.
So the LDA story goes like this. There are K possible topics that people can write about: technology, sports, pop culture, computer science, biology, etc. Associated with each topic is distribution over words, so, for example, in the biology topic the probability associated with the word “cell” would be higher than the probability of the word “robot”, while the opposite would likely be true for the technology topic. Associated with each document is a distribution over topics, so, for example, an article in the New York Times might be about how technology is affecting the study of biological organisms, making it roughly half about the technology topic and half biology. Finally, a document is generated, first by choosing its length N, and then N times choosing first a topic t from the topic distribution, then a word from the word distribution, conditioned on t. So, in our imaginary article, approximately half of the words would be related to technology, and half to biology.
The hierarchical part of this is that we want topics to be shared among different documents, and so we need a way of tying the distributions for each document. The nonparametric extension does away with K, allowing the number of topics to depend on the data.
The trouble, then, is to estimate all of these distributions from data. That is, no one tells us what the topics are, just that there are K of them, and no one tells us what words are associated with what topics, or in what proportions. This is all learned; given a number, K, and a bunch of documents, learn everything else. Traditionally, this required having a huge batch of documents, and chugging away with a sampler or by optimising some variational bound on the posterior. More recently, Matt Hoffmann developed a way to optimise the variational bound using an online stochastic gradient approach . This allows us to incrementally feed LDA documents in small batches, rather than all at once. Matt also released code. He also has an implementation in Vowpal Wabbit, but I can’t seem to figure out how to interpret the results, so I’m sticking with the python version for now.
So, to get some experience with this online version of LDA, I wrote some scripts to listen to the Twitter streaming API, collect sets of 256 Tweets from the live feed, and performing batch updates for LDA. I am using two dictionaries, a big one containing 38138 words, and a smaller one containing 7702 words, and run separate processes, one for each dictionary. The scripts run non-stop, 24/7, parsing the Twitterverse. I used the default parameters that Matt used for analysing Wikipedia, but increased the batch size, since Tweets are much shorter than Wikipedia articles. I think that it’ll be really interesting to see how these topic models hold up on very short documents.
The results are updated every hour or so, and can be viewed here. On the results page, you first select a dictionary (I prefer the results on the smaller dictionary), and then you can scroll through the topics, viewing the 50 most likely words for that topic, by clicking on the arrows, or using the left and right arrow keys on your keyboard (much more convenient). It’s interesting to note that even with the python implementation of online LDA, it takes more time to parse 256 tweets from the streaming API than it does for the algorithm to process a batch of 256 tweets (although I think that says more about the performance of the tweet parser than it does about the LDA code).
The results are kind of cool. I’m not convinced that it’s not just my mind drawing relationships where there aren’t any, but some of the topics seem to make sense. At the time of writing this, the scripts had parsed 2.5 million random tweets. There aren’t any obviously identifiable topics, like technology or biology, but different words that are used in similar contexts seem to arise in the same topics. Remember, the words are just meaningless features in a document, so the system has no idea that words like woman and women are similar. Therefore, it’s nice to see them both appear in a common topic. Other examples are “didn’t, doesn’t, shouldn’t” all being in a topic, and “london, ireland, dublin” all appearing in another topic. There are a whole bunch more, but I don’t want to ruin all of the surprises.
So anyway, I thought that this was a neat little project, and I’m looking forward to any feedback that you have. It’s possible that LDA is just going to fail horribly on such small documents, and likely that I’m using awful parameters, but the results look a bit promising. Currently, we need a dictionary of acceptable words, and given the 140 character limit on tweets, people tend to use shortened versions of words that might not show up in the dictionary. Vowpal Wabbit doesn’t have this restriction (it hashes words, rather than associating an index with every word in the dictionary), so if I can just figure out how to interpret the results, I’ll switch over and see what happens. I’ll also probably switch to a C++ version, at least for the JSON parsing of the twitter stream, because currently the performance is abysmal, and although I haven’t really profiled it to ensure it’s the parsing that is the bottleneck, I’m fairly confident that it is.
Now I’m on to thinking about how to visualize the output. I have in my mind some far-out interface where there is a graph of tweets with edges connecting those that are close in topic-space, and you can zoom around the graph. Of course you’d want to do this in a web browser, and so the trick is how to avoid having to send 5 million tweets to the client. I have a script that can take a tweet, or any sample sentence, and find the 10 nearest neighbours in topic-space (using the dot product between topic vectors as a measure of similarity), but the thing takes 10 minutes just to load and parse the 2.5 million tweets that I’ve collected. Not really suited for a web service.
I’m also (finally) getting used to talking about tweets without grinning foolishly every time I write or say the word.
tldr: Scraping twitter, handing batches to online LDA with 100 topics, results here.
 D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation, JMLR, 3(Jan):993-1022, 2003.
 M. Hoffman, D. Blei, and F. Bach, Online Learning for Latent Dirichlet Allocation, NIPS, 2010.