I ran my first couple of training sets today. I must confess, the results are not pretty. Let’s start with the summary:
The training set for the text categorization example given by Joachims contains 2000 weighted example vectors. The precision of the resultant model, as estimated by svm_learn, is 93.07%.
My first training set used a search for “cars” for positive examples and a search for “film -cars” for negative examples. It contains 63 binary example vectors. The estimated precision is 12.90%.
My second training set used a search for “basketball” for positive examples and a search for “racing -basketball” for negative examples. It contains 61 binary example vectors. The estimated precision is 9.09%. Furthermore, I turned the sentence “Michael Jordan is out to shoot some hoops on the court this week.” into a test vector. It was categorized incorrectly.
There aren’t the sort of statistics I was hoping to see. There are a number of reasons why I might be getting these subpar results.
Sixty-some example vectors simply aren’t going to stand up to the example set of 2000. Of course, the internet is a big place, so there’s no reason (other than Google’s API limitations) that I shouldn’t be generating my own large training sets.
- Quality (Part A)
The example set uses weighted vectors while my sets use only binary vectors. In short, I’m not including information about word frequency, only about word appearance.
- Quality (Part B)
I don’t know how counterexamples were selected for the example set, but I’ll admit that my current strategy for finding negative examples is flawed. The selection of a counterexample search term was arbitrary, and using a single search term probably produces an undesirably uniform counterexample set.
- Quality (Part C)
The example set was generated by a system trained to ignore trivial words and to reduce complex words to word-parts for consistency. My system currently has no such bells and whistles. I had hoped that the equal presence of elements like markup in positive and negative examples would lead the vector machine to ignore those elements, but the results say otherwise.
What remains to be seen is whether these factors can account for an 80% difference in accuracy. Next steps:
Time to switch to Yahoo’s API and start pulling down large result sets.
- Quality (Part A)
I can try switching to using word frequency within a document, but I’ll need to modify my shared dictionary class to use the same weight calculation that the example set does.
- Quality (Part B)
I’ll either generate counterexamples using a set of searches over other category keywords, or just an OR search. One counter-keyword is not enough.
- Quality (Part C)
I’ll start a word filter list to ignore low-content words like “the.”
Everything is runtime right now. I need to rebuild some things and include a mechanism for saving and reloading a common dictionary at the very least. I also need to be able to consult the dictionary to get a feel for which words it’s picking up.
Today I learned about some under-documented limits on Google’s AJAX search API. While working on my Searcher class (that will eventually generate training sets for the SVM) I asked Java to print the first 50 page titles that Google returned. Every time I ran the program I would get a JSONException after 28 results. Upon further examination, I found that Google returned the following 400 Bad Request JSON whenever I sent a request with the parameter
&start greater than 28:
"responseDetails": "out of range start",
This seemed a little absurd, considering that in previous queries Google claimed to have found over 14 million results for the same search terms. Naturally, I started digging online to see if anyone else had encountered this magic 28 barrier. I soon learned that the AJAX search API is limited to 32 results, and that in order to get all 32 you must include the
&rsz=large directive in your request, dictating 8 results per request instead of 4.
This could really hinder the quality of my training sets. I suppose I can just add results to the 100 most recent for each category (I wrote a nice little class to do just that) but then it could take a while to build a diverse training set, several days even if the results changed every day. On the other hand, I read that Yahoo’s web search API offers up to 1000 results with a cap of 5000 queries in 24 hours. Switching to Yahoo might be a good option, if their results are kept as up-to-date as Google’s. I’ll have to do some research, or maybe make the search interface modular so I can try both.
Today I found a link on JSON.org to I can’t believe it’s not XML! by James Bennett back in 2006. I think it gives good reasons why both JSON and REST are good choices for Google, which are humorously summarized in this quote:
So it may not be formal or robust, but JSON is cheap, it’s fast, and it works. And by now, it’s fairly widespread; having never heard of this protocol, I feel behind the times. I should bounce this off my advisor.
My efforts today revolved around Google’s search API. It took a while to find out how to search Google from Java. The first lead I found was an old page from Pace University’s CS department, which mentioned a “Google API” and the need for a developer key. It didn’t take me long to find out that they were referring to the SOAP search API. Unfortunately, Google stopped issuing API keys back in 2006, so it’s not an option for me. Almost a dead end.
JSONObject json = new JSONObject(builder.toString());
// now have some fun with the results...
So I fiddled with the code sample and JSON classes for a while and got it to work, in a limited capacity. The JSON format is actually very clear, once you figure out how to parse through it. I tried a few search queries, and discovered that the results returned by the AJAX interface aren’t necessarily the same as the ones returned by Google Web Search. Of more concern, however, I only got four results back at a time, and there were no instructions for getting the next page of results. One line in the output gave a “moreResultsUrl” but it pointed me to a web results page, not another JSON file.
I’m beginning a new project this month, to run through December. I’m going to learn how to train a Support Vector Machine (SVM) to categorize text, and then write a program that will automatically train the SVM using web searches generate training material. Once I’ve got a semblance of a working system, I’ll be building a ‘web game’ to evaluate the machine’s performance accuracy against human feedback. I hope that an automatically trained SVM will be able to catch references to current events in news and pop culture, and use those to assist in categorizing paragraphs of text.
I’ll be using SVMlight (or a related work from Thorsten Joachims of Cornell University) as the SVM backend. I just finished Probability, so the mathematics involved here are far beyond me; however, there is an SVM tutorial by Chris Burges out of Microsoft Research for those interested in the theory of SVMs.
My first challenge of this project is learning how to represent a text document as a vector. The most common representation (and the one used in the Inductive SVM example on Joachims’ page) is a Bag-of-Words or BOW. There’s a tutorial covering variations on the BOW model by José María Gómez Hidalgo of the Universidad Europea de Madrid. Basically, you build a dictionary for the categorization domain and then assign a value to each word based on whether it is in the document or not: Zero if the word does not appear, and either a one or a weighted value if it does. I think I will begin with a simple binary document representation while I work out the program flow, and tweak the representation later to see if it improves my results.