I need to create a ruleset for how the script determines this. My first thought was that if two articles share enough words, they are similar in content. The problem is that two articles both containing the word "the" 10 times doesn't mean they have anything to do with each other.
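To make that problem concrete, here is a minimal sketch of the naive shared-word count. The function names and the two sample sentences are made up for illustration and are not from any actual script.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def shared_words(a, b):
    """Words present in both texts, with the smaller of the two counts."""
    return word_counts(a) & word_counts(b)

def shared_word_count(a, b):
    """Naive similarity score: total occurrences of shared words."""
    return sum(shared_words(a, b).values())

# Two unrelated (made-up) snippets that overlap only on "the".
article_a = "The cat sat on the mat while the dog slept by the door."
article_b = "The stock fell as the bank raised the rate and the market closed."

print(shared_words(article_a, article_b))
print(shared_word_count(article_a, article_b))
```

The score comes entirely from "the", even though the two snippets have nothing to do with each other.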
My next thought was to create a list of stopwords, like "is", "the", "was", "it", etc., and remove those, then use what's left to indicate the topic. Unfortunately, I think more words than not would count as stopwords, so this would be a long list to create by hand.
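A rough sketch of the stopword idea follows. The tiny `STOPWORDS` set and the sample sentence are placeholders I made up; a real list would be far longer, which is exactly the problem.

```python
import re
from collections import Counter

# Hypothetical starter list; a usable one would need many more entries.
STOPWORDS = {"is", "the", "was", "it", "a", "an", "and", "of", "on", "by"}

def content_words(text, stopwords=STOPWORDS):
    """Count the words in text that are not on the stopword list."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

text = "It was the juggling that the crowd loved, and the juggling is hard."
counts = content_words(text)
print(counts.most_common(1))
```

Note that "that" survives the filter because it is missing from the hand-made list, which shows how easily such a list ends up incomplete.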
So I acquired a digital English dictionary in which each word is labeled with its part of speech. This would be the starting point for my list of stopwords, and I'd remove the words that did not belong on it. I decided that nouns were the words that describe the topic of an article, and wrote a script to remove all nouns from the list. Unfortunately, I found that it's more than just nouns that describe an article. For instance, the word "juggling" appearing in an article, especially multiple times, is a good indicator of its topic, even though it is a verb. However, "is" is also a verb, so I can't say that all verbs describe an article.
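Here is a small sketch of the dictionary approach and where it breaks. The dictionary format (word mapped to a set of part-of-speech tags) and the five entries are assumptions for illustration; the actual dictionary file may be laid out differently.

```python
# Hypothetical slice of a part-of-speech dictionary: word -> set of tags.
pos_dictionary = {
    "article": {"noun"},
    "ball": {"noun"},
    "juggling": {"verb"},
    "is": {"verb"},
    "the": {"determiner"},
}

def stopwords_from_dictionary(dictionary, topical_pos=frozenset({"noun"})):
    """Start from every dictionary word and drop those tagged as topical.

    Assumption: a word with any topical tag is kept off the stopword list.
    """
    return {
        word
        for word, tags in dictionary.items()
        if not (tags & topical_pos)
    }

stopwords = stopwords_from_dictionary(pos_dictionary)
print(sorted(stopwords))
```

With nouns as the only topical category, "juggling" lands on the stopword list next to "is". Adding "verb" to `topical_pos` would rescue "juggling" but would also take "is" off the list, so part of speech alone can't separate the two.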
I have no problem with the programming part. I'm looking to create a rule. How can I categorize the words that describe an article, so that I can start with an English dictionary as the base of a stopword list and easily remove those words from it?