Varchar(MAX) – More like Varchar(AUTO)

Recently I was working with a database where certain tables were reporting page overflow errors. I subsequently learned a bit about how Microsoft SQL Server works, and this is my attempt to write it down.

SQL Server has a page size of 8KB. In simple terms, that’s about eight thousand characters of text. For a simple database table, an average row will be, what, 100 characters? Rows are made up of columns, and each column has a size. Say you’ve got ten columns of ten characters, which adds up to 100 characters per row. So you’ll squeeze about 80 rows of data onto a page. Great.
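
To make the arithmetic concrete, here’s what such a table might look like (the name and columns are hypothetical, made up just for illustration):

    -- Hypothetical table: ten fixed-width columns of ten characters each,
    -- or roughly 100 bytes of data per row (plus a little per-row overhead).
    CREATE TABLE SmallRows (
        Col1 CHAR(10), Col2 CHAR(10), Col3 CHAR(10), Col4 CHAR(10), Col5  CHAR(10),
        Col6 CHAR(10), Col7 CHAR(10), Col8 CHAR(10), Col9 CHAR(10), Col10 CHAR(10)
    );
    -- A page holds a bit over 8,000 bytes of row data, so something on the
    -- order of 80 of these rows fit on one page.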

Sometimes you need to store a whole bunch of data in a row. Say, the complete text of an essay, or an image, or something that’s going to take more space than 8000 characters would. If you drop that data into your row, suddenly BOOM… your row is bigger than your page size. That’s no good. The classic solution is to use a BLOB data type: TEXT, NTEXT or IMAGE. When you use a BLOB, instead of taking up row space with this one particularly large column, you just drop a little signpost in the row that says “Look over there!” and store the data somewhere else. There are drawbacks – there’s a little delay in reading data because you have to follow the signpost, and BLOBs don’t permit sorting and string functions and all our favorite tricks. But it works pretty well.
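
As a sketch, the classic approach looks something like this (a hypothetical table; the TEXT column keeps only a small pointer in the row and stores the actual text on separate pages):

    -- Hypothetical table using the classic BLOB approach: the row holds a
    -- small pointer ("signpost") for EssayText, and the full text lives on
    -- separate text/image pages.
    CREATE TABLE Essays (
        EssayId   INT IDENTITY PRIMARY KEY,
        Title     VARCHAR(200),
        EssayText TEXT  -- deprecated as of SQL Server 2005
    );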

In SQL Server 2005, Microsoft introduced three new data types: VARCHAR(MAX), NVARCHAR(MAX) and VARBINARY(MAX). They are meant to replace the old BLOB data types, which are now deprecated. These new types are stored in-row as much as possible so we don’t have to worry about following signposts, but as soon as the row gets too long, one of them automatically packs its bags and moves out, leaving a signpost behind. In effect, it’s an automatic switch between the two old storage styles: in-row like a plain VARCHAR, out-of-row like a TEXT. There are other benefits – sorting and string functions always work on a VARCHAR(MAX) – but we still have to follow a signpost if the row happens to have grown too large, and now there’s an additional check every time the row size changes to see if anybody needs to move out.
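
Continuing the hypothetical Essays table from above, the migration is a one-line change, and there’s even a table option if you want to force the old always-out-of-row behavior:

    -- Switch the old TEXT column to the new type. Values are now stored
    -- in-row while they fit and move off-row automatically when the row
    -- would grow past the page size.
    ALTER TABLE Essays ALTER COLUMN EssayText VARCHAR(MAX);

    -- Optional: always store MAX values off-row, mimicking the old TEXT
    -- behavior.
    EXEC sp_tableoption 'Essays', 'large value types out of row', 1;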

The suggestions I have seen regarding VARCHAR(MAX) assume that it will be used in much the same situations as the old TEXT data type – single fields expected to exceed 8KB. However, this database had a different issue. It was already using TEXT fields where it seemed appropriate; the tables just had so many relatively short fields that they were overflowing. Two hundred columns of fifty characters each add up to ten thousand characters, well beyond the page size. Now, these tables worked most of the time because they were using VARCHAR fields, which only store the data they actually contain. If each column only held twenty-five characters, everything was fine. But as soon as someone came along and used all the space they were allowed, things would break.
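
Here’s a scaled-down sketch of the kind of table that causes trouble (hypothetical names, and only a few columns standing in for the couple hundred real ones):

    -- Hypothetical wide table: the declared column sizes add up to more
    -- than one 8 KB page, even though each stored value is usually short.
    CREATE TABLE SurveyResponses (
        ResponseId INT IDENTITY PRIMARY KEY,
        Answer1 VARCHAR(3000),
        Answer2 VARCHAR(3000),
        Answer3 VARCHAR(3000)  -- declared widths total roughly 9,000 bytes
    );
    -- Rows that stay short are fine, but a row that actually fills every
    -- column runs past the 8,060-byte row limit. (Whether an oversized row
    -- is rejected or pushed off-row depends on the SQL Server version.)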

This points to a different situation in which VARCHAR(MAX) is useful. If I made several columns VARCHAR(MAX) (in spite of only expecting them to store fifty characters), then the table would automatically adjust its storage as needed, where before it would break. This might seem like an extreme waste, but the biggest cost is paid when modifying a row, and this system reads data far more often than it writes it. If Microsoft feels its algorithm is efficient enough to eliminate the old BLOB types altogether, it’s probably good enough here. I won’t deny that better database design would have alleviated this particular problem altogether, but refactoring was not an option.
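
In the hypothetical table above, that workaround would just mean switching the overflow-prone columns, even though each one usually holds only fifty-odd characters:

    -- Short values are still stored in-row, but when a row grows too large
    -- for its page, SQL Server moves the largest MAX values off-row instead
    -- of failing.
    ALTER TABLE SurveyResponses ALTER COLUMN Answer1 VARCHAR(MAX);
    ALTER TABLE SurveyResponses ALTER COLUMN Answer2 VARCHAR(MAX);
    ALTER TABLE SurveyResponses ALTER COLUMN Answer3 VARCHAR(MAX);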

I’m glad of the change; I think Microsoft is taking a step in the right direction here. However, I’d say they’re still way behind PostgreSQL. For some time now, all of PostgreSQL’s character types have had this kind of automatic overflow handling, so it’s actually most efficient to just use TEXT for all of your character fields. Freaky, but cool.

For more information on the new BLOB types from people with more technical know-how than me, check out Understanding VARCHAR(MAX) in SQL Server 2005, NTEXT vs NVARCHAR(MAX) in SQL 2005 and the official Char and Varchar (Transact-SQL).

Guncho

I recently heard about GunchoComp through a blog post by Emily Short. Guncho is an extension of the Inform 7 interactive fiction platform that allows the user to create multiplayer/MUD experiences. Not much has been done with multi-user interactive fiction yet, so this competition sounds like a great opportunity to try something new.

I’ve never entered anything into an interactive fiction competition before. I fiddled around with Inform 7 for a while and thoroughly enjoyed other people’s work, but I’ve never made anything substantial myself. The more time I spend with the language, though, the more I admire it – natural language programming is an amazing experience.

So I think I’ll try and create an entry for GunchoComp this year. The deadline is August 6, 2009.

To-do from CCSC-NW 08

Here is a list of the potential changes to SVMTrainer that were suggested to me during this weekend’s conference.

Searcher

  • Implement conditions on acceptable web document sizes to optimize document retrieval time
  • Try using a small initial search as a seed to get other search terms and expand the diversity of my training set – Yahoo! Term Extraction might be good for this, too.

WordFilter

  • Try implementing WordNet in the WordFilter class
  • Find a use for Yahoo! Term Extraction

WebDocument

  • Implement parallelism in the retrieval of search results and the retrieval of web documents
  • Implement a document retrieval timeout and a URL blacklist to prevent hanging on bad downloads

Other

  • Investigate the use of SVMstruct for categorization/ranking problem in multiple dimensions
  • Start doing an independent check on the accuracy of trained sets by keeping 10% of results for categorization rather than training
  • Learn about Xi Alpha estimates and what exactly they mean

Focusing

The last month has taken me through a couple of choices in how to focus this project.  First, I attempted to design a ‘version 2’ that would use project files.  The idea was to give a project a group of categories to use, as well as persistent result sets and a dictionary that can be kept up-to-date with the results.  It didn’t take long to realize that, although this might be a good application, I needed to reduce my scope.  I needed to build the underlying set of classes that this sort of application would call to do its searching and parsing.

Coincidentally, my Systems Development instructor suggested that I might make my project more manageable by making it a platform for further research.  Instead of trying to study all of the variables involved, it would be productive to focus on making the code highly modular and well-documented, and then I could move on to doing a study of one variable.  In future years, students needing a research project could pick up the code and do a more thorough study of the variables that impact the quality of the training sets.  Alternatively,  a student could take the SVMTrainer classes and use them to implement a higher-level application.

So for my own purposes I am calling the current model ‘version 3.’  It consists of several basic classes that are designed to be extended.  Here’s a summary:

  • The Searcher class is in charge of going online and retrieving a set of Documents.  (I am considering making the Searcher return a set of results and giving the implementation the job of creating Documents, but this is cleaner for now.)
  • The Document class converts its source text to a bag-of-words representation on construction.  It uses a DocumentParser to do so.  It also remembers whether it is supposed to be a positive or negative example.
  • A WebDocument is just a Document that is constructed with a URL that subsequently fetches itself.
  • The DocumentParser decides which parts of the document to process and splits them into the words that it wants to put into the word bag.  It asks a Lexicon for each word’s ID before adding it to the bag.
  • The Lexicon tracks all of the words it has seen, the number of times it’s seen each one and a unique ID for each.  It asks a WordFilter to preprocess every word it gets from the DocumentParser.
  • The WordFilter serves a dual purpose – to ignore low-content words (such as pronouns) and to unify different word forms and concepts.  It has been suggested to me that using WordNet synsets here to recognize synonyms would be a good study.
  • Finally, a SetGenerator will take a set of Documents and (potentially using statistics from the DocumentParser) convert them into a training set of normalized word frequencies; one possible normalization is sketched just after this list.  At this point, the Lexicon’s dataset is also saved to disk so it can be reloaded and used to convert any text that needs categorization.
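
I haven’t settled on the exact normalization, so take this as an assumption rather than a description of what SVMTrainer does today: one common convention is to divide each word’s count by the total number of words in the document,

    f_{i,d} = \frac{\mathrm{count}(w_i, d)}{\sum_j \mathrm{count}(w_j, d)}

so that each document’s feature values sum to one regardless of its length.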

This is basically the content being shared on my poster at CCSC-NW 2008.

While implementing version 3 I switched to the Yahoo! Web Search API.  Then, in writing this post, I noticed another Yahoo! search service called Term Extraction that could simplify my project… I’m not sure how I missed it before.  I’ll just add that to my list of potential changes.

Oh, and some good news:  Shortly after writing the bare-bones implementation of version 3, training on “dogs -cats” with 896 examples returned a XiAlpha-estimate of precision of 47.60%, an encouraging improvement over the precisions I reported at the end of August.  It suggests that the concept really has potential and further research is merited.