To-do from CCSC-NW 08

Here is a list of the potential changes to SVMTrainer that were suggested to me during this weekend’s conference.

Searcher

  • Implement conditions on acceptable web document sizes to optimize document retrieval time
  • Try using a small initial search as a seed to get other search terms and expand the diversity of my training set – Yahoo! Term Extraction might be good for this, too.

WordFilter

  • Try implementing WordNet in the WordFilter class
  • Find a use for Yahoo! Term Extraction

WebDocument

  • Implement parallelism in the retrieval of search results and the retrieval of web documents
  • Implement a document retrieval timeout and a URL blacklist to prevent hanging on bad downloads

Other

  • Investigate the use of SVMstruct for categorization/ranking problem in multiple dimensions
  • Start doing an independent check on the accuracy of trained sets by keeping 10% of results for categorization rather than training
  • Learn about Xi Alpha estimates and what exactly they mean

Focusing

The last month has taken me through a couple of choices in how to focus this project.  First, I attempted to design a ‘version 2’ that would use project files.  The idea was to give a project a group of categories to use, as well as persistent result sets and a dictionary that can be kept up-to-date with the results.  It didn’t take long to realize that, although this might be a good application, I needed to reduce my scope.  I needed to build the underlying set of classes that this sort of application would call to do its searching and parsing.

Coincidentally, my Systems Development instructor suggested that I might make my project more manageable by making it a platform for further research.  Instead of trying to study all of the variables involved, it would be productive to focus on making the code highly modular and well-documented, and then I could move on to doing a study of one variable.  In future years, students needing a research project could pick up the code and do a more thorough study of the variables that impact the quality of the training sets.  Alternatively,  a student could take the SVMTrainer classes and use them to implement a higher-level application.

So for my own purposes I am calling the current model ‘version 3.’  It is designed with several basic classes that are designed to be extended.  Here’s a summary:

  • The Searcher class is in charge of going online and retrieving a set of Documents.  (I am considering making the Searcher return a set of results and giving the implementation the job of creating Documents, but this is cleaner for now.)
  • The Document class converts its source text to a bag-of-words representation on construction.  It uses a DocumentParser to do so.  It also remembers whether it is supposed to be a positive or negative example.
  • A WebDocument is just a Document that is constructed with a URL that subsequently fetches itself.
  • The DocumentParser decides what parts of the document to process and splits that part into words that it wants to put into the word bag.  It asks a Lexicon for each word’s ID before it puts them in the word bag.
  • The Lexicon tracks all of the words it has seen, the number of times it’s seen each one and a unique ID for each.  It asks a WordFilter to preprocess every word it gets from the DocumentParser.
  • The WordFilter serves a dual purpose – to ignore low-content words (such as pronouns) and to unify different word forms and concepts.  It has been suggested to me that using WordNet synsets here to recognize synonyms would be a good study.
  • Finally, a SetGenerator will take a set of Documents, and (potentially using statistics from the DocumentParser) format them in the correct format for a training set using normalized word frequencies.  At this point, the Lexicon‘s dataset is also saved to disk so it can be reloaded and used to convert any text that needs categorization.

This is basically the content being shared on my poster at CCSC-NW 2008.

While implementing version 3 I switched to the Yahoo! Web Search API.  While writing this post, I just noticed another Yahoo! search service called Term Extraction that could simplify my project… I’m not sure how I missed it before.  I’ll just add that to my list of potential changes.

Oh, and some good news:  Shortly after writing the bare-bones implementation of version 3, training on “dogs -cats” with 896 examples returned a XiAlpha-estimate of precision of 47.60%, an encouraging improvement over the precisions I reported at the end of August.  It suggests that the concept really has potential and further research is merited.

Humble Beginnings

I ran my first couple of training sets today. I must confess, the results are not pretty. Let’s start with the summary:

Summary

The training set for the text categorization example given by Joachims contains 2000 weighted example vectors. The precision of the resultant model, as estimated by svm_learn, is 93.07%.

My first training set used a search for “cars” for positive examples and a search for “film -cars” for negative examples. It contains 63 binary example vectors. The estimated precision is 12.90%.

My second training set used a search for “basketball” for positive examples and a search for “racing -basketball” for negative examples. It contains 61 binary example vectors. The estimated precision is 9.09%. Furthermore, I turned the sentence “Michael Jordan is out to shoot some hoops on the court this week.” into a test vector. It was categorized incorrectly.

Analysis

There aren’t the sort of statistics I was hoping to see. There are a number of reasons why I might be getting these subpar results.

  1. Quantity
    Sixty-some example vectors simply aren’t going to stand up to the example set of 2000. Of course, the internet is a big place, so there’s no reason (other than Google’s API limitations) that I shouldn’t be generating my own large training sets.
  2. Quality (Part A)
    The example set uses weighted vectors while my sets use only binary vectors. In short, I’m not including information about word frequency, only about word appearance.
  3. Quality (Part B)
    I don’t know how counterexamples were selected for the example set, but I’ll admit that my current strategy for finding negative examples is flawed. The selection of a counterexample search term was arbitrary, and using a single search term probably produces an undesirably uniform counterexample set.
  4. Quality (Part C)
    The example set was generated by a system trained to ignore trivial words and to reduce complex words to word-parts for consistency. My system currently has no such bells and whistles. I had hoped that the equal presence of elements like markup in positive and negative examples would lead the vector machine to ignore those elements, but the results say otherwise.

What remains to be seen is whether these factors can account for an 80% difference in accuracy. Next steps:

  1. Quantity
    Time to switch to Yahoo’s API and start pulling down large result sets.
  2. Quality (Part A)
    I can try switching to using word frequency within a document, but I’ll need to modify my shared dictionary class to use the same weight calculation that the example set does.
  3. Quality (Part B)
    I’ll either generate counterexamples using a set of searches over other category keywords, or just an OR search. One counter-keyword is not enough.
  4. Quality (Part C)
    I’ll start a word filter list to ignore low-content words like “the.”
  5. Persistence
    Everything is runtime right now. I need to rebuild some things and include a mechanism for saving and reloading a common dictionary at the very least. I also need to be able to consult the dictionary to get a feel for which words it’s picking up.

Search Limits

Today I learned about some under-documented limits on Google’s AJAX search API.  While working on my Searcher class (that will eventually generate training sets for the SVM) I asked Java to print the first 50 page titles that Google returned.  Every time I ran the program I would get a JSONException after 28 results.  Upon further examination, I found that Google returned the following 400 Bad Request JSON whenever I sent a request with the parameter &start greater than 28:

{
"responseData": null,
"responseDetails": "out of range start",
"responseStatus": 400
}

This seemed a little absurd, considering that in previous queries Google claimed to have found over 14 million results for the same search terms. Naturally, I started digging online to see if anyone else had encountered this magic 28 barrier. I soon learned that the AJAX search API is limited to 32 results, and that in order to get all 32 you must include the &rsz=large directive in your request, dictating 8 results per request instead of 4.

This could really hinder the quality of my training sets. I suppose I can just add results to the 100 most recent for each category (I wrote a nice little class to do just that) but then it could take a while to build a diverse training set, several days even if the results changed every day. On the other hand, I read that Yahoo’s web search API offers up to 1000 results with a cap of 5000 queries in 24 hours. Switching to Yahoo might be a good option, if their results are kept as up-to-date as Google’s. I’ll have to do some research, or maybe make the search interface modular so I can try both.

Why JSON?

I was wondering as I researched last night why Google AJAX Search API was using JSON. I have never even heard of JSON (JavaScript Object Notation). I fully expected the AJAX API to be using XML… it is Asynchronous JavaScript And XML, after all. But, at least for the RESTfulinterface (another term I’ve never heard) Google has eschewed XML for the more compact JSON format.

Today I found a link on JSON.org to I can’t believe it’s not XML! by James Bennett back in 2006. I think it gives good reasons why both JSON and REST are good choices for Google, which are humorously summarized in this quote:

So the XML-the-protocol-stack people are more than a little bit scared and defensive right now because of the REST folks. And now here are these kids with their startup companies and their weblogs who are getting data exchange and even things that kind of look like APIs out of… JavaScript arrays? The XML guys are sitting up on the mountaintop like the Grinch, with his pile of stolen presents, wondering how Christmas still managed to happen: it came without specs! It came without hype! It came without angle brackets, envelopes or types!

So it may not be formal or robust, but JSON is cheap, it’s fast, and it works. And by now, it’s fairly widespread; having never heard of this protocol, I feel behind the times. I should bounce this off my advisor.