In order to foster greater creativity within myself, I would like to make a habit of sketching daily. To be honest, at this point in my life I don’t necessarily have time to sketch every day, but I figure it can’t hurt to start now – maybe after graduation I can sketch more often.
So I’ll just upload my sketches to this blog. In addition to pen-and-paper sketching, I’ve been practicing with Google SketchUp. So some of my sketches (like this first one) will actually be 3D sketches… neato.
It’s a Space Base! I know it looks like something out of Star Wars – the door on the left is straight out of Jedi Outcast. I’ve been on a Star Wars kick lately, and I’m not sure it will end anytime soon.
This is a 3D model – click on the image to go to the Google 3D Warehouse and download it to SketchUp. Or, you can find detail shots (different angles) on Flickr.
Juan Enriquez coins “Homo Evolutus,” predicts humans will take control over their own evolution within my lifetime.
I’m actually excited about this! What a fascinating threshold our faith will cross when the average person’s direct experience of creation is enhanced by technology.
The source code for SVMTrainer 0.30 is now available. Go to the new SVMTrainer page, download the code, and see if you can get better results than I did.
Here is a list of the potential changes to SVMTrainer that were suggested to me during this weekend’s conference.
- Implement conditions on acceptable web document sizes to optimize document retrieval time
- Try using a small initial search as a seed to get other search terms and expand the diversity of my training set – Yahoo! Term Extraction might be good for this, too.
- Try implementing WordNet in the WordFilter class
- Find a use for Yahoo! Term Extraction
- Implement parallelism in the retrieval of search results and the retrieval of web documents
- Implement a document retrieval timeout and a URL blacklist to prevent hanging on bad downloads
- Investigate the use of SVMstruct for the categorization/ranking problem in multiple dimensions
- Start doing an independent check on the accuracy of training sets by holding out 10% of results for evaluation rather than training
- Learn about XiAlpha estimates and what exactly they mean
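A couple of the items above (parallel retrieval, a download timeout, a URL blacklist) fit together naturally. Here is a minimal Python sketch of that combination; the function name `fetch_documents` and all of its parameters are my own illustration, not part of SVMTrainer:

```python
# Sketch: parallel document retrieval with a per-download timeout and a
# host blacklist. All names here are illustrative, not from SVMTrainer.
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def fetch_documents(urls, fetch, blacklist=frozenset(), timeout=10.0, workers=8):
    """Fetch each URL in parallel; skip blacklisted hosts and give up on
    any single download that takes longer than `timeout` seconds."""
    allowed = [u for u in urls if urlparse(u).netloc not in blacklist]
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {url: pool.submit(fetch, url) for url in allowed}
        for url, fut in futures.items():
            try:
                results[url] = fut.result(timeout=timeout)
            except Exception:
                pass  # timed out or errored: drop it rather than hang the run
    return results
```

The `fetch` argument is whatever actually downloads one URL (e.g. a `urllib.request.urlopen` wrapper), so the retrieval strategy stays swappable. One caveat of this sketch: a truly hung worker thread will still delay the pool's shutdown, so a real implementation would also want a socket-level timeout inside `fetch`.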
The last month has taken me through a couple of choices in how to focus this project. First, I attempted to design a ‘version 2’ that would use project files. The idea was to give a project a group of categories to use, as well as persistent result sets and a dictionary that can be kept up-to-date with the results. It didn’t take long to realize that, although this might be a good application, I needed to reduce my scope. I needed to build the underlying set of classes that this sort of application would call to do its searching and parsing.
Coincidentally, my Systems Development instructor suggested that I might make my project more manageable by making it a platform for further research. Instead of trying to study all of the variables involved, it would be productive to focus on making the code highly modular and well-documented, and then I could move on to doing a study of one variable. In future years, students needing a research project could pick up the code and do a more thorough study of the variables that impact the quality of the training sets. Alternatively, a student could take the SVMTrainer classes and use them to implement a higher-level application.
So for my own purposes I am calling the current model ‘version 3.’ It consists of several basic classes designed to be extended. Here’s a summary:
- The Searcher class is in charge of going online and retrieving a set of Documents. (I am considering making the Searcher return a set of results and giving the implementation the job of creating Documents, but this is cleaner for now.)
- The Document class converts its source text to a bag-of-words representation on construction. It uses a DocumentParser to do so. It also remembers whether it is supposed to be a positive or negative example.
- A WebDocument is just a Document that is constructed with a URL and fetches its own source text.
- The DocumentParser decides which parts of the document to process and splits that text into the words it wants to put into the word bag. It asks a Lexicon for each word’s ID before putting it in the word bag.
- The Lexicon tracks all of the words it has seen, the number of times it’s seen each one, and a unique ID for each. It asks a WordFilter to preprocess every word it gets from the DocumentParser.
- The WordFilter serves a dual purpose – to ignore low-content words (such as pronouns) and to unify different word forms and concepts. It has been suggested to me that using WordNet synsets here to recognize synonyms would be a good study.
- Finally, a SetGenerator will take a set of Documents and (potentially using statistics from the DocumentParser) convert them into a correctly formatted training set using normalized word frequencies. At this point, the Lexicon’s dataset is also saved to disk so it can be reloaded and used to convert any text that needs categorization.
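The relationships above can be sketched in Python. The class names come from the summary, but every method name, the stopword list, and the whitespace tokenizer standing in for DocumentParser are my own illustrative assumptions:

```python
# Sketch of the version-3 pipeline. Class names follow the post;
# method names and implementation details are assumptions.
from collections import Counter

class WordFilter:
    STOPWORDS = {"the", "a", "an", "it", "he", "she", "they"}  # illustrative

    def process(self, word):
        w = word.lower()
        return None if w in self.STOPWORDS else w  # None = ignore this word

class Lexicon:
    def __init__(self, word_filter):
        self.filter = word_filter
        self.ids, self.counts = {}, Counter()

    def id_for(self, word):
        w = self.filter.process(word)
        if w is None:
            return None
        if w not in self.ids:
            self.ids[w] = len(self.ids) + 1  # SVMlight feature IDs start at 1
        self.counts[w] += 1
        return self.ids[w]

class Document:
    def __init__(self, text, positive, lexicon):
        self.positive = positive  # positive or negative example
        self.bag = Counter()      # bag-of-words: word ID -> count
        for token in text.split():  # whitespace split stands in for DocumentParser
            wid = lexicon.id_for(token)
            if wid is not None:
                self.bag[wid] += 1

class SetGenerator:
    def lines(self, documents):
        """Emit one SVMlight-style line per Document, with each feature
        value normalized by the document's total word count."""
        out = []
        for doc in documents:
            total = sum(doc.bag.values()) or 1
            feats = " ".join(f"{i}:{doc.bag[i] / total:.4f}"
                             for i in sorted(doc.bag))
            out.append(f"{'+1' if doc.positive else '-1'} {feats}")
        return out
```

For example, `Document("the dog chased a ball", True, lex)` drops the stopwords and yields the line `+1 1:0.3333 2:0.3333 3:0.3333`. Keeping WordFilter behind the Lexicon is what makes the suggested WordNet-synset experiment a drop-in replacement.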
This is basically the content being shared on my poster at CCSC-NW 2008.
While implementing version 3 I switched to the Yahoo! Web Search API. While writing this post, I just noticed another Yahoo! search service called Term Extraction that could simplify my project… I’m not sure how I missed it before. I’ll just add that to my list of potential changes.
Oh, and some good news: Shortly after writing the bare-bones implementation of version 3, training on “dogs -cats” with 896 examples returned an XiAlpha-estimate of precision of 47.60%, an encouraging improvement over the precision figures I reported at the end of August. It suggests that the concept really has potential and merits further research.