{"id":930,"date":"2008-10-10T23:37:58","date_gmt":"2008-10-11T06:37:58","guid":{"rendered":"http:\/\/islemaster.wordpress.com\/?p=76"},"modified":"2014-03-17T00:40:02","modified_gmt":"2014-03-17T07:40:02","slug":"focusing","status":"publish","type":"post","link":"https:\/\/www.bradleycbuchanan.com\/b\/focusing\/","title":{"rendered":"Focusing"},"content":{"rendered":"<p>The last month has taken me through a couple of choices in how to focus this project.\u00a0 First, I attempted to design a &#8216;version 2&#8217; that would use project files.\u00a0 The idea was to give a project a group of categories to use, as well as persistent result sets and a dictionary that can be kept up-to-date with the results.\u00a0 It didn&#8217;t take long to realize that, although this might be a good application, I needed to reduce my scope.\u00a0 I needed to build the underlying set of classes that this sort of application would call to do its searching and parsing.<\/p>\n<p>Coincidentally, my Systems Development instructor suggested that I might make my project more manageable by making it a platform for further research.\u00a0 Instead of trying to study all of the variables involved, it would be productive to focus on making the code highly modular and well-documented, and then I could move on to doing a study of one variable.\u00a0 In future years, students needing a research project could pick up the code and do a more thorough study of the variables that impact the quality of the training sets.\u00a0 Alternatively, a student could take the SVMTrainer classes and use them to implement a higher-level application.<\/p>\n<p>So for my own purposes I am calling the current model &#8216;version 3.&#8217;\u00a0 It is built from several basic classes that are designed to be extended.\u00a0 Here&#8217;s a summary:<\/p>\n<ul>\n<li>The <strong>Searcher<\/strong> class is in charge of going online and retrieving a set of <strong>Document<\/strong>s.\u00a0 (I am considering making 
the Searcher return a set of results and giving the implementation the job of creating <strong>Document<\/strong>s, but this is cleaner for now.)<\/li>\n<li>The <strong>Document<\/strong> class converts its source text to a bag-of-words representation on construction.\u00a0 It uses a <strong>DocumentParser<\/strong> to do so.\u00a0 It also remembers whether it is supposed to be a positive or negative example.<\/li>\n<li>A <strong>WebDocument<\/strong> is just a <strong>Document<\/strong> that is constructed from a URL and then fetches its own source text.<\/li>\n<li>The <strong>DocumentParser<\/strong> decides what parts of the document to process and splits those parts into the words that it wants to put into the word bag.\u00a0 It asks a <strong>Lexicon<\/strong> for each word&#8217;s ID before it puts them in the word bag.<\/li>\n<li>The <strong>Lexicon<\/strong> tracks all of the words it has seen, the number of times it&#8217;s seen each one, and a unique ID for each.\u00a0 It asks a <strong>WordFilter<\/strong> to preprocess every word it gets from the <strong>DocumentParser<\/strong>.<\/li>\n<li>The <strong>WordFilter<\/strong> serves a dual purpose &#8211; to ignore low-content words (such as pronouns) and to unify different word forms and concepts.\u00a0 It has been suggested to me that using <a title=\"WordNet\" href=\"http:\/\/wordnet.princeton.edu\" target=\"_top\">WordNet<\/a> synsets here to recognize synonyms would be a good study.<\/li>\n<li>Finally, a <strong>SetGenerator<\/strong> will take a set of <strong>Document<\/strong>s and (potentially using statistics from the <strong>DocumentParser<\/strong>) write them out in the correct format for a training set using normalized word frequencies.\u00a0 At this point, the <strong>Lexicon<\/strong>&#8217;s dataset is also saved to disk so it can be reloaded and used to convert any text that needs categorization.<\/li>\n<\/ul>\n<p>This is basically the content being shared on my poster at <a title=\"CCSC-NW 2008\" 
href=\"http:\/\/www.ccsc.org\/northwest\/2008\/\" target=\"_top\">CCSC-NW 2008<\/a>.<\/p>\n<p>While implementing version 3 I switched to the <a href=\"http:\/\/developer.yahoo.com\/search\/\" target=\"_top\">Yahoo! Web Search API<\/a>.\u00a0 While writing this post, I just noticed another Yahoo! search service called <a href=\"http:\/\/developer.yahoo.com\/search\/content\/V2\/termExtraction.html\" target=\"_top\">Term Extraction<\/a> that could simplify my project&#8230; I&#8217;m not sure how I missed it before.\u00a0 I&#8217;ll just add that to my list of potential changes.<\/p>\n<p>Oh, and some good news:\u00a0 Shortly after writing the bare-bones implementation of version 3, training on &#8220;dogs -cats&#8221; with 896 examples returned a XiAlpha-estimate of precision of 47.60%, an encouraging improvement over the precisions I reported at the end of August.\u00a0 It suggests that the concept really has potential and further research is merited.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The last month has taken me through a couple of choices in how to focus this project.\u00a0 First, I attempted to design a &#8216;version 2&#8217; that would use project files.\u00a0 The idea was to give a project a group of categories to use, as well as persistent result sets and a dictionary that can 
be&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[36,41,375,380],"class_list":["post-930","post","type-post","status-publish","format-standard","hentry","category-programmer","tag-search-api","tag-svm-trainer","tag-wordnet","tag-yahoo"],"_links":{"self":[{"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/posts\/930","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/comments?post=930"}],"version-history":[{"count":1,"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/posts\/930\/revisions"}],"predecessor-version":[{"id":1222,"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/posts\/930\/revisions\/1222"}],"wp:attachment":[{"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/media?parent=930"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/categories?post=930"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.bradleycbuchanan.com\/b\/wp-json\/wp\/v2\/tags?post=930"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}