Humble Beginnings

I ran my first couple of training sets today. I must confess, the results are not pretty. Let’s start with the summary:

Summary

The training set for the text categorization example given by Joachims contains 2000 weighted example vectors. The precision of the resultant model, as estimated by svm_learn, is 93.07%.

My first training set used a search for “cars” for positive examples and a search for “film -cars” for negative examples. It contains 63 binary example vectors. The estimated precision is 12.90%.

My second training set used a search for “basketball” for positive examples and a search for “racing -basketball” for negative examples. It contains 61 binary example vectors. The estimated precision is 9.09%. Furthermore, I turned the sentence “Michael Jordan is out to shoot some hoops on the court this week.” into a test vector. It was categorized incorrectly.
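
For context, here's roughly what the moving parts look like. svm_learn consumes a plain-text file with one labeled example per line, in SVMlight's sparse-vector format; the feature IDs below are invented for illustration (in my binary representation every present word simply gets a 1, and IDs must appear in ascending order):

    +1 4:1 19:1 102:1 337:1  # a positive example from the "cars" search
    -1 7:1 19:1 88:1 415:1   # a negative example from the "film -cars" search

Training and classification are then one command each; the estimated-precision figures above are the ones svm_learn reports at the end of training:

    svm_learn cars_train.dat cars_model
    svm_classify cars_test.dat cars_model cars_predictions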

Analysis

These aren’t the sort of statistics I was hoping to see. There are a number of reasons why I might be getting such subpar results.

  1. Quantity
    Sixty-some example vectors simply aren’t going to stand up to the example set of 2000. Of course, the internet is a big place, so there’s no reason (other than Google’s API limitations) that I shouldn’t be generating my own large training sets.
  2. Quality (Part A)
    The example set uses weighted vectors, while my sets use only binary vectors. In short, I’m not including information about word frequency, only about word appearance (see the sketch after this list).
  3. Quality (Part B)
    I don’t know how counterexamples were selected for the example set, but I’ll admit that my current strategy for finding negative examples is flawed. The selection of a counterexample search term was arbitrary, and using a single search term probably produces an undesirably uniform counterexample set.
  4. Quality (Part C)
    The example set was generated by a system trained to ignore trivial words and to reduce complex words to word-parts for consistency. My system currently has no such bells and whistles. I had hoped that the equal presence of elements like markup in positive and negative examples would lead the vector machine to ignore those elements, but the results say otherwise.
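
To make Part A concrete, here's a minimal Python sketch of the difference between my current binary representation and a frequency-weighted one. The tokenizer is a crude stand-in for my actual code, and I don't know exactly which weighting the example set uses; normalized term frequency is just one common choice:

    from collections import Counter

    def tokenize(text):
        # crude stand-in for my real tokenizer: lowercase, split on whitespace
        return text.lower().split()

    def binary_vector(text, word_ids):
        # what I do now: a feature is 1 if the word appears at all
        present = set(tokenize(text))
        return {word_ids[w]: 1 for w in present if w in word_ids}

    def frequency_vector(text, word_ids):
        # one guess at a weighted representation: normalized term frequency
        counts = Counter(tokenize(text))
        total = sum(counts.values()) or 1  # guard against empty documents
        return {word_ids[w]: n / total for w, n in counts.items() if w in word_ids}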

What remains to be seen is whether these factors can account for an eighty-point gap in estimated precision. Next steps:

  1. Quantity
    Time to switch to Yahoo’s API and start pulling down large result sets.
  2. Quality (Part A)
    I can try switching to word frequency within a document, but I’ll need to modify my shared dictionary class to use the same weight calculation that the example set does.
  3. Quality (Part B)
    I’ll generate counterexamples using either a set of searches over other category keywords or a single OR search; one counter-keyword is not enough.
  4. Quality (Part C)
    I’ll start a word filter list to ignore low-content words like “the.”
  5. Persistence
    Everything lives in memory at runtime right now. I need to rebuild some things and include, at the very least, a mechanism for saving and reloading a common dictionary (see the sketch after this list). I also need to be able to consult the dictionary to get a feel for which words it’s picking up.
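
Here's a rough sketch of what the persistence and filtering pieces might look like; the names and the JSON format are placeholders, not a final design:

    import json

    # starter filter list for low-content words; it will grow over time
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

    class SharedDictionary:
        """Maps words to stable integer feature IDs, skipping filtered words."""

        def __init__(self):
            self.ids = {}

        def id_for(self, word):
            if word in STOP_WORDS:
                return None  # low-content words never get a feature ID
            if word not in self.ids:
                self.ids[word] = len(self.ids) + 1  # SVMlight IDs start at 1
            return self.ids[word]

        def save(self, path):
            # persist so training and later classification share one vocabulary
            with open(path, "w") as f:
                json.dump(self.ids, f)

        @classmethod
        def load(cls, path):
            d = cls()
            with open(path) as f:
                d.ids = json.load(f)
            return d

A human-readable format like JSON would also make it easy to eyeball which words the dictionary is picking up.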

Self-Training Categorizer

I’m beginning a new project this month, to run through December. I’m going to learn how to train a Support Vector Machine (SVM) to categorize text, and then write a program that will automatically train the SVM, using web searches to generate training material. Once I’ve got a semblance of a working system, I’ll build a ‘web game’ to evaluate the machine’s accuracy against human feedback. I hope that an automatically trained SVM will be able to catch references to current events in news and pop culture, and use those to help categorize paragraphs of text.

I’ll be using SVMlight (or a related work from Thorsten Joachims of Cornell University) as the SVM backend. I’ve only just finished Probability, so the mathematics involved here is far beyond me; however, there is an SVM tutorial by Chris Burges of Microsoft Research for those interested in the theory of SVMs.

My first challenge of this project is learning how to represent a text document as a vector. The most common representation (and the one used in the Inductive SVM example on Joachims’ page) is the Bag-of-Words, or BOW. There’s a tutorial covering variations on the BOW model by José María Gómez Hidalgo of the Universidad Europea de Madrid. Basically, you build a dictionary for the categorization domain and then assign a value to each word based on whether it appears in the document: zero if it does not, and either a one or a weighted value if it does. I think I’ll begin with a simple binary document representation while I work out the program flow, and tweak the representation later to see if it improves my results.
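
As a first cut, converting a document to a line of SVMlight input might look something like this; tokenize and word_ids here are simplified placeholders for whatever dictionary machinery I end up with:

    def tokenize(text):
        # placeholder tokenizer: lowercase, split on whitespace
        return text.lower().split()

    def to_svmlight_line(label, text, word_ids):
        # render one document as "<label> <feature>:1 ..." with ascending IDs
        present = sorted({word_ids[w] for w in tokenize(text) if w in word_ids})
        return label + " " + " ".join(f"{i}:1" for i in present)

    # e.g. to_svmlight_line("+1", "fast cars go fast", {"cars": 3, "fast": 7})
    # returns "+1 3:1 7:1"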