I was wondering as I researched last night why the Google AJAX Search API was using JSON. I had never even heard of JSON (JavaScript Object Notation). I fully expected the AJAX API to be using XML… it is Asynchronous JavaScript And XML, after all. But, at least for the RESTful interface (another term I’d never heard), Google has eschewed XML for the more compact JSON format.

Today I found a link on JSON.org to “I can’t believe it’s not XML!”, written by James Bennett back in 2006. I think it gives good reasons why both JSON and REST are good choices for Google, which are humorously summarized in this quote:

So the XML-the-protocol-stack people are more than a little bit scared and defensive right now because of the REST folks. And now here are these kids with their startup companies and their weblogs who are getting data exchange and even things that kind of look like APIs out of… JavaScript arrays? The XML guys are sitting up on the mountaintop like the Grinch, with his pile of stolen presents, wondering how Christmas still managed to happen: it came without specs! It came without hype! It came without angle brackets, envelopes or types!

So it may not be formal or robust, but JSON is cheap, it’s fast, and it works. And by now, it’s fairly widespread; having never heard of this format, I feel behind the times. I should bounce this off my advisor.

Google’s AJAX search from Java

My efforts today revolved around Google’s search API. It took a while to find out how to search Google from Java. The first lead I found was an old page from Pace University’s CS department, which mentioned a “Google API” and the need for a developer key. It didn’t take me long to find out that they were referring to the SOAP search API. Unfortunately, Google stopped issuing API keys back in 2006, so it’s not an option for me. Almost a dead end.

Almost. Instead, I was sent to the newer AJAX search API, which is more versatile, requires no key and has no daily limit on searches. It makes web search, news search and blog search available (among others), which could be useful for my application: perhaps training the SVM on news will produce more accurate categorization of current events, or perhaps a mixture of searches will produce better results. In any case, the developer’s guide makes much ballyhoo about the standard JavaScript/AJAX interface, which is completely useless to someone wanting to use Java.

The light at the end of the tunnel (and the developer’s guide) is an exposed interface for Flash and other Non-Javascript Environments. It includes a single code snippet of Java and a link to JSON in Java, a set of free Java classes for handling the JSON format search results. The guide claims this as a RESTful interface (I had to look it up) and gives no implementation clues beyond a single comment:

JSONObject json = new JSONObject(builder.toString());
// now have some fun with the results...

So I fiddled with the code sample and JSON classes for a while and got it to work, in a limited capacity. The JSON format is actually very clear, once you figure out how to parse through it. I tried a few search queries, and discovered that the results returned by the AJAX interface aren’t necessarily the same as the ones returned by Google Web Search. Of more concern, however, I only got four results back at a time, and there were no instructions for getting the next page of results. One line in the output gave a “moreResultsUrl” but it pointed me to a web results page, not another JSON file.

The lack of instructions for handling results was frustrating, not to mention that I wasn’t sure how to use the news or blog searches. I’m only slightly ashamed to admit that it took me all evening to find this Guide to using the AJAX search RESTful interface sitting at the bottom of the AJAX Search’s JavaScript class reference. I would think it deserves its own page, since it is so different from the JS interface. In any case, I now understand how to use Google’s AJAX search from Java and have (barely) started on an application to generate training sets for the SVM. To be continued…
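For anyone stuck at the same point, here is a minimal sketch of how the request URLs for successive pages of results might be assembled in Java. The endpoint path and the v, q and start parameter names are my assumptions about the RESTful interface, not something quoted from the guide, and the class and method names are made up for illustration. The step of four simply mirrors the four results I was getting per request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds request URLs for the RESTful search interface.
final class AjaxSearchUrl {

    // Assumed base endpoint; verify against the guide before relying on it.
    static final String BASE = "http://ajax.googleapis.com/ajax/services/search/";

    // searcher: e.g. "web", "news" or "blogs" (assumed searcher names).
    // start: zero-based index of the first result wanted (assumed parameter).
    static String build(String searcher, String query, int start) {
        try {
            return BASE + searcher
                    + "?v=1.0"
                    + "&start=" + start
                    + "&q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Four results came back per request, so step the offset by four.
        for (int start = 0; start < 12; start += 4) {
            System.out.println(build("web", "support vector machine", start));
        }
    }
}
```

Each URL would then be fetched and the response handed to the JSONObject constructor, as in the guide's snippet.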

Self-Training Categorizer

I’m beginning a new project this month, to run through December. I’m going to learn how to train a Support Vector Machine (SVM) to categorize text, and then write a program that will automatically train the SVM using web searches to generate training material. Once I’ve got a semblance of a working system, I’ll be building a ‘web game’ to evaluate the machine’s accuracy against human feedback. I hope that an automatically trained SVM will be able to catch references to current events in news and pop culture, and use those to assist in categorizing paragraphs of text.

I’ll be using SVMlight (or a related work from Thorsten Joachims of Cornell University) as the SVM backend. I just finished Probability, so the mathematics involved here are far beyond me; however, there is an SVM tutorial by Chris Burges out of Microsoft Research for those interested in the theory of SVMs.

My first challenge of this project is learning how to represent a text document as a vector. The most common representation (and the one used in the Inductive SVM example on Joachims’ page) is a Bag-of-Words or BOW. There’s a tutorial covering variations on the BOW model by José María Gómez Hidalgo of the Universidad Europea de Madrid. Basically, you build a dictionary for the categorization domain and then assign a value to each word based on whether it is in the document or not: Zero if the word does not appear, and either a one or a weighted value if it does. I think I will begin with a simple binary document representation while I work out the program flow, and tweak the representation later to see if it improves my results.
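The binary representation described above can be sketched in a few lines of Java. This is only a rough illustration under my own assumptions, not the method from the tutorial: the class and method names are hypothetical, the tokenizer just lower-cases and splits on anything that is not a letter or digit, and the dictionary is built from whatever documents are passed in. The output line follows the sparse layout SVMlight reads for training examples (a target followed by feature:value pairs in increasing feature order, with zero-valued features left out).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of a binary bag-of-words document representation.
final class BagOfWords {

    // Lower-case the text and split on runs of non-letter, non-digit characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Give every distinct word an index starting at 1, in alphabetical order.
    static Map<String, Integer> buildDictionary(List<String> documents) {
        TreeSet<String> words = new TreeSet<>();
        for (String doc : documents) words.addAll(tokenize(doc));
        Map<String, Integer> dictionary = new TreeMap<>();
        int next = 1;
        for (String w : words) dictionary.put(w, next++);
        return dictionary;
    }

    // One training line: label, then "index:1" for each dictionary word present.
    // Words outside the dictionary are ignored; absent words are implicit zeros.
    static String toSvmLightLine(int label, String document, Map<String, Integer> dictionary) {
        TreeSet<Integer> present = new TreeSet<>();
        for (String t : tokenize(document)) {
            Integer index = dictionary.get(t);
            if (index != null) present.add(index);
        }
        StringBuilder line = new StringBuilder(label > 0 ? "+1" : "-1");
        for (int index : present) line.append(' ').append(index).append(":1");
        return line.toString();
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "the election results are in",
                "the team won the game");
        Map<String, Integer> dictionary = buildDictionary(corpus);
        System.out.println(dictionary);
        System.out.println(toSvmLightLine(+1, corpus.get(0), dictionary));
        System.out.println(toSvmLightLine(-1, corpus.get(1), dictionary));
    }
}
```

Switching to a weighted representation later should only mean replacing the constant 1 with a per-word weight, which is why the feature values are written out explicitly even in the binary case.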