Soooo winter break is here! I had some time to spare (between working at Burst and founding Hive), so I spent the remaining few hours in my day doing something more interesting. Here’s the overview:
At Hive (our startup), one of our technical needs involves taking a corpus of text a student types (such as this blog post, an essay, or simply notes in class) and using a computer to identify exactly what they’re writing about, while also providing intelligent suggestions. Sounds easy, eh? Not really: the keyword here is “intelligent”. One could just extract all the nouns from a document and throw some relevant Wikipedia articles at the user, but that’s just annoying (since there’s more to a document than its nouns). There had to be a smarter way to approach this.
As of writing this, I don’t have a great understanding of Natural Language Processing (much less of complex algorithms like Latent Dirichlet Allocation), so I will be sharing what I learn here. This will also be a place to dump some of the results from my rookie attempts at sorting through large datasets like Wikipedia. Please comment if you have any ideas you would like me to try.
Building a crawler
Here’s how the program works: you run it with a “seed” Wikipedia article (which I chose to be London), and it runs through the article finding all the links within it. Then it downloads the articles behind those links and recursively finds all the links within each of them. So ideally, if this program runs for a million years, you could download all of Wikipedia (about 40 GB uncompressed, if you’re curious). However, my intention was to grab only 10-12 articles about London and do some analysis on those; we can scale up later.
Coding this up was pretty straightforward. I used the Restler library to download the article, then the Cheerio library to query the downloaded HTML page for Wikipedia links. Then I took each link and downloaded the corresponding article to repeat the process.
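Since I can’t paste the actual crawler here, below is a dependency-free sketch of the link-extraction step. The real version uses Cheerio’s selectors on the fetched page; the regex, function name, and sample HTML are my own stand-ins:

```javascript
// Hypothetical sketch of the crawler's link-extraction step.
// Wikipedia article links in the HTML look like href="/wiki/Article_Title".
function extractWikiLinks(html) {
  const links = new Set();
  const re = /href="\/wiki\/([^"]+)"/g;
  let m;
  while ((m = re.exec(html)) !== null) {
    // Strip any #section anchor from the title
    const title = decodeURIComponent(m[1].split('#')[0]);
    // Skip namespace pages like "File:..." or "Help:..."
    if (!title.includes(':')) links.add(title);
  }
  return [...links];
}

// Usage: feed it a fetched article's HTML
const sample =
  '<a href="/wiki/City_of_London">City</a> <a href="/wiki/London#History">x</a>';
console.log(extractWikiLinks(sample)); // extracts "City_of_London" and "London"
```

The `Set` keeps the link list de-duplicated, since popular articles are linked many times within a single page.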
After 2-3 minutes, my initial article on “London” extracted the following 11 articles in html format:
City of London
Palace of Westminster
Geographic coordinate system
List of sovereign states
Countries of the United Kingdom
Making Bumblebee (The backend)
So we now have a crawler that sifts through articles and finds all the links, which is kinda cool. Since the crawler doesn’t actually store any data (and we want to be able to run multiple crawlers simultaneously), I had to write a program to be in charge of all the crawlers (let’s call it ‘Bumblebee’).
Bumblebee is a program that keeps track of a list of articles to “crawl”, and manages all the individual instances of the crawler. To better explain this, here’s how the flow will work:
1. Bumblebee starts the first crawler on the document London
2. The crawler finds all the links within London and returns them to Bumblebee, along with the content of the article (the list of words in it).
3. Bumblebee stores all the links in the MongoDB database, takes the list of words the crawler returned, and calculates the number of times each word occurred, as well as its density within the article. For instance, if 100 of the 1400 words in the article were the word “Scottish”, then “Scottish” would have a density of about 0.07 and a frequency of 100.
In the end, MongoDB will hold a list of documents, and each document will have an integer totalWords and an array words of unique words. Each entry in words holds a string wordValue (like “scottish”), a decimal density (0.07), and an integer frequency (100). As you can imagine, this can get pretty large really quickly, but I haven’t run into any scaling issues (yet) since we are only dealing with 11 articles.
4. Bumblebee starts another crawler to sift through each of the documents referenced in the London article.
For the sake of naming things, let’s call each word a “Pollen”. A “Pollen” object contains the wordValue, the density, and the frequency. As you can tell, I like to name things 😛
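To make step 3 above concrete, here’s a rough sketch of how a crawler’s word list could be turned into Pollen objects. The function name and exact shape are my guesses based on the schema described; this isn’t the actual Hive code:

```javascript
// Hypothetical sketch: turn an article's word list into Pollen objects
// matching the described schema (wordValue, frequency, density).
function makePollens(words) {
  const counts = new Map();
  for (const w of words) {
    const key = w.toLowerCase(); // normalize case, so "London" === "london"
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  const totalWords = words.length;
  const pollens = [];
  for (const [wordValue, frequency] of counts) {
    // Density is the word's share of the article, e.g. 100/1400 ≈ 0.07
    pollens.push({ wordValue, frequency, density: frequency / totalWords });
  }
  return { totalWords, pollens };
}
```

The `{ totalWords, pollens }` result mirrors the MongoDB document layout: one document per article, with the Pollen array nested inside it.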
It takes about 1-2 seconds to download an article, and about 10-15 seconds to find all the links, compute the frequency and density of each word, and save each Pollen in the database. That’s up to about 17 seconds per article, which is kinda slow.
Mapping the frequency (our first data analysis)
So now we have a small-ish database of 11 documents, with all the words in each document and each word’s density and frequency. Wouldn’t it be cool if we could visualize which words are most frequent across the 11 documents?
As you can see, each square represents a unique word found within ANY of the 11 parsed documents. When you put your mouse over any of the squares, the grey box gives you details about that word. In the example above, the word “London” occurred a total of 1822 times within the 11 parsed documents.
From this graph, we can tell (pretty obviously) that the word “London” is prominent within the 11 given articles. However, I also found that words like “the” had a ridiculously high density within the documents, so I had to find a way to filter them out efficiently. For now, I simply ignore any word whose frequency is greater than some threshold. There has to be a better way to do this (I’ll try one in the next section).
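The stopgap filter just described could look something like this. The cutoff value is my own pick for illustration; the post doesn’t say what threshold was actually used:

```javascript
// Hypothetical sketch of the fixed-threshold filter: drop any Pollen
// whose raw frequency exceeds maxFrequency (junk words like "the"
// dominate the counts, so a high cutoff removes them).
function filterByThreshold(pollens, maxFrequency) {
  return pollens.filter(p => p.frequency <= maxFrequency);
}

// Example data shaped like the Pollen objects described earlier
const pollens = [
  { wordValue: 'the',    frequency: 2400, density: 0.12 },
  { wordValue: 'london', frequency: 1822, density: 0.09 },
  { wordValue: 'thames', frequency: 40,   density: 0.002 },
];
// With a cutoff of 2000, "the" is dropped but "london" survives.
```

The obvious weakness, as noted above, is that a fixed cutoff is brittle: pick it too low and you lose real topic words like “london”.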
Since everything is indexed well in Bumblebee and MongoDB, it takes about 10 seconds to download the data for the 11 articles from Bumblebee (only once), and about 1 second to generate this heat map.
Filtering out junk words
So I made another graph, and this time you can control the percentile of the data to ignore (words like “the” are far too frequent and mess up our data). Check out the video at the top.
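A percentile control like the one in the graph could be sketched as follows. This is a hypothetical stand-in for whatever the graph actually does behind the scenes:

```javascript
// Hypothetical sketch of percentile-based filtering: instead of a fixed
// frequency cutoff, drop the top `percentile` percent most frequent words.
function filterTopPercentile(pollens, percentile) {
  const sorted = [...pollens].sort((a, b) => b.frequency - a.frequency);
  // Number of words to discard from the top of the frequency ranking
  const cut = Math.floor(sorted.length * percentile / 100);
  return sorted.slice(cut);
}
```

The nice property over a fixed threshold is that the cutoff adapts to the corpus: ignoring the top 1% works the same whether you have 11 articles or a million.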