Keywords from text

How to get the keywords from text.

         

thing3b

10:26 am on May 5, 2004 (gmt 0)

10+ Year Member



Does anyone have an idea on how I could go about using PHP to get the keywords of a section of text?

For example, it should be able to take out the words (Does, have, an, on, how, I) from the first sentence of this post. It must also be able to cope with word variants like (Does, Doesn't), plurals, etc.

I am not expecting the code for it, just some help with the logic behind it, or the steps to program.

Thanks.

bufferzone

10:32 am on May 5, 2004 (gmt 0)

10+ Year Member



read this

[webmasterworld.com...]

It should give you an idea of how the search engines find keywords.

vincevincevince

1:50 pm on May 5, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think first you'll want to read every word into an array, along with its frequency of occurrence

then you'll want to filter out the words that appear on a fixed exceptions list (a, it, and, etc.)

you can then implement some form of stem matching:
strip trailing s from words if they exist without trailing s in the text
strip er from words if they exist without trailing er in the text
etc...

then sort your array, you'll get some idea of the most commonly used words that aren't on your exceptions list
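
The steps in this post could be sketched roughly as follows. This is only a minimal illustration, not a tested keyword extractor: the function name keyword_frequencies and the exceptions list passed to it are made up for the example, words are assumed to be plain ASCII letters and apostrophes, and the suffix handling covers only the two cases mentioned (trailing "s" and trailing "er").

```php
<?php
// Hypothetical helper (not from the thread): count words, drop the
// exceptions, fold simple suffix variants together, then sort.
function keyword_frequencies(string $text, array $exceptions): array
{
    // Lowercase, then pull out runs of letters and apostrophes as words.
    preg_match_all("/[a-z']+/", strtolower($text), $matches);

    // word => number of occurrences
    $counts = array_count_values($matches[0]);

    // Filter out the words on the exceptions list.
    foreach ($exceptions as $word) {
        unset($counts[$word]);
    }

    // Naive stem matching: fold "cats" into "cat" and "worker" into "work",
    // but only when the shorter form also occurs in the text.
    foreach (['s', 'er'] as $suffix) {
        foreach (array_keys($counts) as $word) {
            if (substr($word, -strlen($suffix)) !== $suffix) {
                continue;
            }
            $base = substr($word, 0, -strlen($suffix));
            if ($base !== '' && isset($counts[$base])) {
                $counts[$base] += $counts[$word];
                unset($counts[$word]);
            }
        }
    }

    arsort($counts); // most frequent first
    return $counts;
}

$sample = "The cat and the cats chase a worker who likes to work and work";
print_r(keyword_frequencies($sample, ['the', 'and', 'a', 'who', 'to']));
```

Requiring the shorter form to exist in the text keeps the stripping conservative: "likes" is left alone here because "like" never appears on its own.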

Netizen

7:23 pm on May 5, 2004 (gmt 0)

10+ Year Member



Depending on how complex you want to get you could:

a) split the text on spaces to get the words into an array

b) set up an array with all your "stop" words, e.g. the, as, is, etc.

c) filter your words array with your stop words array
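
A bare-bones sketch of steps a) to c), assuming lowercase comparison is acceptable and using a tiny made-up stop list and sample sentence purely for illustration:

```php
<?php
$text = "the quick brown fox is as quick as the lazy dog";

// a) split the text on spaces to get the words into an array
$words = explode(' ', strtolower($text));

// b) an array of "stop" words (a short illustrative list, not a real one)
$stopWords = ['the', 'as', 'is'];

// c) filter the words array with the stop words array;
//    array_values() just renumbers the keys after the filtering
$keywords = array_values(array_diff($words, $stopWords));

print_r($keywords);
```

Splitting on a single space leaves punctuation attached to words and produces empty entries when there are double spaces, so real text would probably need a regex-based split instead.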

You could look into stemming the words before or after - there are a number of algorithms for doing this - the Porter stemming algorithm is a common one.

You could also look at words that occur in all your articles/pages and treat those as stop words.
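
One possible way to find the words shared by every article is to intersect the word sets of all documents. The sketch below assumes the articles are already loaded as plain-text strings; corpus_stop_words is a hypothetical name, and str_word_count() only recognises ASCII letters, apostrophes and hyphens by default.

```php
<?php
// Hypothetical helper: any word that appears in every document
// is treated as a stop word for this particular corpus.
function corpus_stop_words(array $documents): array
{
    $common = null;
    foreach ($documents as $doc) {
        // Unique lowercase words of this document.
        $words = array_unique(str_word_count(strtolower($doc), 1));

        // Keep only the words also seen in every earlier document.
        $common = $common === null ? $words : array_intersect($common, $words);
    }
    return array_values($common ?? []);
}

$docs = [
    "Our site sells red widgets",
    "Our site reviews blue gadgets",
    "Contact our site team",
];
print_r(corpus_stop_words($docs));
```

In practice a strict "appears in every page" rule may be too harsh or too lenient depending on the corpus, so a percentage threshold could be a reasonable variation.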

It all depends on what you are trying to achieve.

Nova Reticulis

12:07 pm on May 6, 2004 (gmt 0)

10+ Year Member



also see token functions and preg_* functions
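
As a rough illustration of both suggestions, here is the same sentence tokenized with preg_split() and with strtok(). Reading "token functions" as strtok() is an assumption, and the delimiter set and sample sentence are arbitrary choices for the example.

```php
<?php
$text = "Does anyone know? It doesn't work, apparently.";

// preg_split(): break on any run of characters that is not a letter
// or an apostrophe, discarding empty pieces.
$viaPreg = preg_split("/[^a-z']+/i", $text, -1, PREG_SPLIT_NO_EMPTY);

// strtok(): walk the string one token at a time, using a fixed
// set of delimiter characters.
$delimiters = " ,.?!";
$viaStrtok = [];
for ($tok = strtok($text, $delimiters); $tok !== false; $tok = strtok($delimiters)) {
    $viaStrtok[] = $tok;
}

print_r($viaPreg);
print_r($viaStrtok);
```

Keeping the apostrophe inside the word pattern means "doesn't" survives as one token, which matters if words like (Does, Doesn't) are to be matched up later.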