{"id":929,"date":"2008-08-22T22:54:00","date_gmt":"2008-08-23T05:54:00","guid":{"rendered":"http:\/\/islemaster.wordpress.com\/?p=68"},"modified":"2014-03-17T00:40:07","modified_gmt":"2014-03-17T07:40:07","slug":"humble-beginnings","status":"publish","type":"post","link":"https:\/\/www.bradleycbuchanan.com\/b\/humble-beginnings\/","title":{"rendered":"Humble Beginnings"},"content":{"rendered":"<p>I ran my first couple of training sets today.  I must confess, the results are not pretty.  Let&#8217;s start with the summary:<\/p>\n<p><strong>Summary<\/strong><\/p>\n<p>The training set for the <a href=\"http:\/\/svmlight.joachims.org\/\">text categorization example<\/a> given by Joachims contains 2000 weighted example vectors.  The precision of the resultant model, as estimated by svm_learn, is 93.07%.<\/p>\n<p>My first training set used a search for &#8220;cars&#8221; for positive examples and a search for &#8220;film -cars&#8221; for negative examples.  It contains 63 binary example vectors.  The estimated precision is 12.90%.<\/p>\n<p>My second training set used a search for &#8220;basketball&#8221; for positive examples and a search for &#8220;racing -basketball&#8221; for negative examples.  It contains 61 binary example vectors.  The estimated precision is 9.09%.  Furthermore, I turned the sentence &#8220;Michael Jordan is out to shoot some hoops on the court this week.&#8221; into a test vector.  It was categorized incorrectly.<\/p>\n<p><strong>Analysis<\/strong><\/p>\n<p>These aren&#8217;t the sort of statistics I was hoping to see.  There are a number of reasons why I might be getting these subpar results.<\/p>\n<ol>\n<li><strong>Quantity<\/strong><br \/>\nSixty-some example vectors simply aren&#8217;t going to stand up to the example set of 2000.  
Of course, the internet is a big place, so there&#8217;s no reason (other than Google&#8217;s API limitations) that I shouldn&#8217;t be generating my own large training sets.<\/li>\n<li><strong>Quality (Part A)<\/strong><br \/>\nThe example set uses weighted vectors while my sets use only binary vectors.  In short, I&#8217;m not including information about word frequency, only about word appearance.<\/li>\n<li><strong>Quality (Part B)<\/strong><br \/>\nI don&#8217;t know how counterexamples were selected for the example set, but I&#8217;ll admit that my current strategy for finding negative examples is flawed.  The selection of a counterexample search term was arbitrary, and using a single search term probably produces an undesirably uniform counterexample set.<\/li>\n<li><strong>Quality (Part C)<\/strong><br \/>\nThe example set was generated by a system trained to ignore trivial words and to reduce complex words to word-parts for consistency.  My system currently has no such bells and whistles.  I had hoped that the equal presence of elements like markup in positive and negative examples would lead the vector machine to ignore those elements, but the results say otherwise.<\/li>\n<\/ol>\n<p>What remains to be seen is whether these factors can account for a gap of roughly 80 percentage points in estimated precision.  Next steps:<\/p>\n<ol>\n<li><strong>Quantity<\/strong><br \/>\nTime to switch to Yahoo&#8217;s API and start pulling down large result sets.<\/li>\n<li><strong>Quality (Part A)<\/strong><br \/>\nI can try switching to word frequency within a document, but I&#8217;ll need to modify my shared dictionary class to use the same weight calculation that the example set does.<\/li>\n<li><strong>Quality (Part B)<\/strong><br \/>\nI&#8217;ll either generate counterexamples using a set of searches over other category keywords, or just use a single OR search.  
One counter-keyword is not enough.<\/li>\n<li><strong>Quality (Part C)<\/strong><br \/>\nI&#8217;ll start a word filter list to ignore low-content words like &#8220;the.&#8221;<\/li>\n<li><strong>Persistence<\/strong><br \/>\nEverything is runtime right now.  I need to rebuild some things and include a mechanism for saving and reloading a common dictionary at the very least.  I also need to be able to consult the dictionary to get a feel for which words it&#8217;s picking up.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>I ran my first couple of training sets today. I must confess, the results are not pretty. Let&#8217;s start with the summary: Summary The training set for the text categorization example given by Joachims contains 2000 weighted example vectors. The precision of the resultant model, as estimated by svm_learn, is 93.07%. My first training set&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[41,344],"class_list":["post-929","post","type-post","status-publish","format-standard","hentry","category-programmer","tag-svm-trainer","tag-svmlight"],"_links":{"self":[{"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/posts\/929","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/comments?post=929"}],"version-history":[{"count":1,"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/posts\/929\/revisions"}],"predecessor-version":[{"id":1221,"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/posts\/929\/
revisions\/1221"}],"wp:attachment":[{"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/media?parent=929"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/categories?post=929"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/tags?post=929"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}