For example, it should be able to take out the words (Does, have, an, on, how, I) from the first sentence of this post. It must also be able to cope with variants like (Does, Doesn't), plurals, etc.
I am not expecting the code on how to do it, just some help on the logic behind doing it, or the steps to program.
Thanks.
Then you'll want to filter out words from a certain list ("a", "it", "and" ... etc.)
you can then implement some form of stem matching:
strip a trailing "s" from a word if the same word also appears in the text without the "s"
strip a trailing "er" from a word if the same word also appears in the text without the "er"
etc...
Then sort your array by how often each word occurs, and you'll get some idea of the most commonly used words that aren't on your exceptions list.
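To make the stem-matching idea concrete, here is a rough sketch in Python (the thread doesn't say which language is in use, so treat that as an assumption). The function names, the sample words and the exceptions list are all made up for illustration. It only strips "s" or "er" when the shorter form really appears in the text, then counts what is left.

```python
from collections import Counter

def naive_stem(word, vocabulary, suffixes=("s", "er")):
    """Strip a suffix only if the shorter form also appears in the text."""
    for suffix in suffixes:
        if word.endswith(suffix) and word[:-len(suffix)] in vocabulary:
            return word[:-len(suffix)]
    return word

def most_common_words(words, exceptions):
    """Drop excepted words, merge simple variants, rank by frequency."""
    vocabulary = set(words)
    kept = [naive_stem(w, vocabulary) for w in words if w not in exceptions]
    # most_common() returns (word, count) pairs, highest count first
    return Counter(kept).most_common()

# Hypothetical sample text and exceptions list
words = "the cat and the cats saw a fast dog and a faster dog".split()
exceptions = {"the", "and", "a"}
ranked = most_common_words(words, exceptions)
```

Here "cats" is folded into "cat" and "faster" into "fast", so each of those counts twice. This is much cruder than a real stemmer: it will not handle "Doesn't" or irregular plurals.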
a) split the text on spaces to get the words into an array
b) set up an array with all your "stop" words, e.g. the, as, is, etc.
c) filter your words array with your stop words array
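Steps a) to c) could look something like this in Python (again an assumption, as no language was specified). The stop word list is a tiny illustrative one, the sample sentence is invented, and lowercasing plus trimming punctuation are extras I've added so that "Does" and "keywords?" match properly.

```python
# b) a small, illustrative stop word list; a real one would be much longer
STOP_WORDS = {"does", "have", "an", "on", "how", "i", "the", "as", "is"}

def extract_keywords(text, stop_words=STOP_WORDS):
    # a) split the text on whitespace to get the words into a list
    raw_words = text.lower().split()
    # trim punctuation stuck to the start or end of a word (added assumption)
    words = [w.strip(".,;:!?()\"'") for w in raw_words]
    # c) keep only the words that are not in the stop word list
    return [w for w in words if w and w not in stop_words]

# Hypothetical sample sentence
keywords = extract_keywords("Does anyone have an idea on how I can extract keywords?")
# keywords is ['anyone', 'idea', 'can', 'extract', 'keywords']
```

Using a set for the stop words keeps each lookup cheap even when the list gets long.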
You could look into stemming the words before or after - there are a number of algorithms for doing this - the Porter stemming algorithm is a common one.
You could also look at words that occur in all your articles/pages and treat those as stop words.
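One possible way to find the words that occur in every article is a set intersection. This is only a sketch in Python with made-up page text and a hypothetical function name; in practice you might relax "every page" to "most pages".

```python
def words_in_every_document(documents):
    """Return the words that appear in all of the given texts."""
    word_sets = [set(doc.lower().split()) for doc in documents]
    if not word_sets:
        return set()
    # a word survives only if it is present in every document's word set
    return set.intersection(*word_sets)

# Hypothetical pages
pages = [
    "our widgets ship fast",
    "our gadgets ship worldwide",
    "contact our team to ship today",
]
derived_stop_words = words_in_every_document(pages)
# derived_stop_words is {'our', 'ship'}
```

The result can be merged into your hand-written stop word list before filtering.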
It all depends on what you are trying to achieve.