Extracting the ten most common words

turbohost

7:41 am on Oct 11, 2004 (gmt 0)

Hi Guys,

I want to extract the ten most common words from a number of texts. The texts are stored in a MySQL table and the rest of the website is written in PHP. What is the best way to do this? Should I make a separate table for these keywords? Is there an easy way to extract the ten most common words?

Regards,
Turbo

mincklerstraat

8:23 am on Oct 11, 2004 (gmt 0)

You're going to have to define 'extract' a little better here: do you want the ten most common words across all the texts taken as a whole, or the ten most common words in each text separately?

Making an extra field sounds like a better idea to me than a whole extra table, if it's the second case. If it's the first case, it depends on whether you want to make this determination 'in one go' or if you want to 'save your work' along the way and go incrementally.

To do this I'd probably use preg_split with a negated character class like [^a-zA-Z] in my regex, so the text gets split on everything that isn't a letter.

turbohost

8:52 am on Oct 11, 2004 (gmt 0)

Well, I want to have a list of the ten most common words per text and I want to save the work along the way.

mincklerstraat

9:00 am on Oct 11, 2004 (gmt 0)

Then just add an extra field to your table and put that list in there, either comma-separated or serialized.
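As a rough illustration of the two storage formats mentioned above, here is a minimal sketch. The variable names and sample words are made up for the example, and writing the resulting string into the extra field (for instance with an UPDATE query) is left out.

```php
<?php
// Hypothetical top-word list for one text (sample data, not from the thread).
$topWords = ['the', 'dog', 'cat'];

// Option 1: comma-separated string. Easy to read directly in the database.
$csv = implode(',', $topWords);      // "the,dog,cat"
$fromCsv = explode(',', $csv);       // back to an array of words

// Option 2: serialized array. Can also keep the counts as word => count pairs.
$counts = ['the' => 5, 'dog' => 2, 'cat' => 2];
$serialized = serialize($counts);
$fromSerialized = unserialize($serialized);
```

The comma-separated form is probably enough if only the words matter; the serialized form is the easier choice if the counts should be kept too.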

To get the list, use preg_split to split your text into an array of words, splitting on everything that's not A-Za-z. Then use array_count_values to count how often each word occurs, sort the counts with arsort, and keep the first ten.
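A minimal sketch of that approach follows. The function name top_words and the sample text are hypothetical. Lowercasing the text first is an assumption (so that "The" and "the" count as one word), and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: returns the $limit most common words in $text
// as an array of word => count, most frequent first.
function top_words(string $text, int $limit = 10): array
{
    // Split on runs of anything that is not A-Z or a-z; drop empty pieces.
    // Lowercasing first is an assumption, so "The" and "the" are counted together.
    $words = preg_split('/[^a-zA-Z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Map each word to the number of times it occurs.
    $counts = array_count_values($words);

    // Sort by count, highest first, keeping the words as keys.
    arsort($counts);

    // Keep only the first $limit entries.
    return array_slice($counts, 0, $limit, true);
}

// Sample text made up for the example.
$text = 'The cat saw the dog. The dog saw the cat and the bird.';
print_r(top_words($text, 3));
```

Note that very common words such as "the" and "and" will usually dominate the result, so filtering them out before counting may be worth considering. Words with accented characters are also split apart by this character class.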