Search Limits

Today I learned about some under-documented limits on Google’s AJAX search API. While working on my Searcher class (which will eventually generate training sets for the SVM), I asked Java to print the first 50 page titles that Google returned. Every time I ran the program, I got a JSONException after 28 results. On closer examination, I found that Google returned the following 400 Bad Request JSON whenever I sent a request with the &start parameter greater than 28:

{
  "responseData": null,
  "responseDetails": "out of range start",
  "responseStatus": 400
}

This seemed a little absurd, considering that in previous queries Google claimed to have found over 14 million results for the same search terms. Naturally, I started digging online to see if anyone else had run into this magic 28 barrier. I soon learned that the AJAX search API is limited to 32 results, and that to get all 32 you must include the &rsz=large directive in your request, which asks for 8 results per request instead of 4.
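To make the paging arithmetic concrete, here is a rough sketch of how the request URLs could be generated. The endpoint and the v=1.0 parameter are written from memory of the developer's guide, and the class and method names are made up for illustration, so treat all of them as assumptions rather than the API's official usage. With 8 results per page and a cap of 32, the only useful &start values are 0, 8, 16 and 24.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds one request URL per page of results. */
final class PagedSearchUrls {
    // Assumption: endpoint and version parameter as recalled from the developer's guide.
    static final String BASE = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0";
    static final int PAGE_SIZE = 8;    // results per request with rsz=large
    static final int MAX_RESULTS = 32; // overall cap described above

    /** Returns the URLs needed to fetch up to `wanted` results, never more than the cap. */
    static List<String> buildUrls(String query, int wanted) {
        int limit = Math.min(wanted, MAX_RESULTS);
        String encoded;
        try {
            encoded = URLEncoder.encode(query, "UTF-8"); // spaces become '+'
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        List<String> urls = new ArrayList<String>();
        // Step through the offsets 0, 8, 16, ... and stop before reaching the limit.
        for (int start = 0; start < limit; start += PAGE_SIZE) {
            urls.add(BASE + "&rsz=large&start=" + start + "&q=" + encoded);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Asking for 50 results still yields only four requests.
        for (String url : buildUrls("support vector machine", 50)) {
            System.out.println(url);
        }
    }
}
```

Capping the loop up front means the program never sends a request that would come back as "out of range start", so there is no JSONException to catch for that case.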

This could really hinder the quality of my training sets. I suppose I can just keep adding new results to the 100 most recent for each category (I wrote a nice little class to do just that), but then it could take a while to build a diverse training set: several days, even if the results changed every day. On the other hand, I read that Yahoo’s web search API offers up to 1000 results, with a cap of 5000 queries in 24 hours. Switching to Yahoo might be a good option if their results are kept as up-to-date as Google’s. I’ll have to do some research, or maybe make the search interface modular so I can try both.
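A minimal sketch of what the modular approach could look like. This is not the class mentioned above; SearchBackend, RecentResults and TrainingSetDemo are hypothetical names, and the back end here is a stub rather than a real Google or Yahoo client. The idea is that the training-set code only talks to the interface, while a small bounded set keeps the most recent unique results per category.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;

/** Hypothetical abstraction so a Google or Yahoo back end can be swapped in. */
interface SearchBackend {
    /** Returns up to maxResults result URLs for the query. */
    List<String> search(String query, int maxResults);
}

/** Keeps the most recent unique results for one category, up to a fixed cap. */
final class RecentResults {
    private final int capacity;
    // LinkedHashSet drops duplicates and remembers insertion order (oldest first).
    private final LinkedHashSet<String> urls = new LinkedHashSet<String>();

    RecentResults(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Merges fresh results and returns how many were not already present. */
    int addAll(List<String> fresh) {
        int added = 0;
        for (String url : fresh) {
            if (urls.add(url)) { // false when the URL is already in the set
                added++;
            }
        }
        // Evict the oldest entries until the set fits the cap again.
        Iterator<String> oldestFirst = urls.iterator();
        while (urls.size() > capacity) {
            oldestFirst.next();
            oldestFirst.remove();
        }
        return added;
    }

    /** Returns a copy of the current contents, oldest first. */
    List<String> snapshot() {
        return new ArrayList<String>(urls);
    }
}

final class TrainingSetDemo {
    public static void main(String[] args) {
        // Stub standing in for a real search back end.
        SearchBackend stub = (query, max) ->
                Arrays.asList("http://example.com/a", "http://example.com/b");
        RecentResults sports = new RecentResults(100);
        int added = sports.addAll(stub.search("sports", 32));
        System.out.println(added + " new results, " + sports.snapshot().size() + " kept");
    }
}
```

The count returned by addAll gives a rough measure of how quickly a category's results are changing, which is the thing that decides how many days it takes to build a diverse set.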

Google’s AJAX search from Java

My efforts today revolved around Google’s search API. It took a while to find out how to search Google from Java. The first lead I found was an old page from Pace University’s CS department, which mentioned a “Google API” and the need for a developer key. It didn’t take me long to find out that they were referring to the SOAP search API. Unfortunately, Google stopped issuing API keys back in 2006, so it’s not an option for me. Almost a dead end.

Almost. Instead, I was sent to the newer AJAX search API, which is more versatile, requires no key and has no daily limit on searches. It makes web search, news search and blog search available (among others), which could be useful for my application: perhaps training the SVM on news will produce more accurate categorization of current events, or perhaps a mixture of searches will produce better results. In any case, the developer’s guide makes much ballyhoo about the standard JavaScript/AJAX interface, which is completely useless to someone wanting to use Java.

The light at the end of the tunnel (and the developer’s guide) is an exposed interface for Flash and other Non-Javascript Environments. It includes a single code snippet of Java and a link to JSON in Java, a set of free Java classes for handling the JSON format search results. The guide claims this as a RESTful interface (I had to look it up) and gives no implementation clues beyond a single comment:

JSONObject json = new JSONObject(builder.toString());
// now have some fun with the results...

So I fiddled with the code sample and JSON classes for a while and got it to work, in a limited capacity. The JSON format is actually very clear, once you figure out how to parse through it. I tried a few search queries, and discovered that the results returned by the AJAX interface aren’t necessarily the same as the ones returned by Google Web Search. Of more concern, however, I only got four results back at a time, and there were no instructions for getting the next page of results. One line in the output gave a “moreResultsUrl” but it pointed me to a web results page, not another JSON file.

The lack of instructions for handling results was frustrating, not to mention that I wasn’t sure how to use the news or blog searches. I’m only slightly ashamed to admit that it took me all evening to find this Guide to using the AJAX search RESTful interface sitting at the bottom of the AJAX Search’s JavaScript class reference. I would think it deserves its own page, since it is so different from the JS interface. In any case, I now understand how to use Google’s AJAX search from Java and have (barely) started on an application to generate training sets for the SVM. To be continued…