Regex question. Help!

Forum Moderators: coopster

I had to post it somewhere...

httpwebwitch

3:41 pm on Apr 6, 2005 (gmt 0)

Hey. I have a tricky regex question. I've been hacking away in RegexBuddy for a long time, and the time has come to ask for help. I figured PHP people might be the best folks to ask.

(mods, move this to a different forum if you think it would get answered better elsewhere)

I want to match 1-, 2-, and 3-word phrases from a large block of text, but I want to ignore any phrase that contains punctuation or other non-word characters (apart from the spaces between the words).

For example, if my text is
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Vivamus ut sapien in libero scelerisque bibendum. Lorem ipsum dolor sit amet, consectetuer adipiscing elit.

I want to get matches for
Lorem
Lorem ipsum
Lorem ipsum dolor
ipsum
ipsum dolor
ipsum dolor sit
dolor
dolor sit
dolor sit amet

and so on.

Even better would be if it could also ignore words that are under 4 letters long. In the example above, I would not get matches for any phrases containing "sit".

((\b[\w]{4,}+\b))

help!
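For reference, the pattern above can only ever match one word at a time, which is why it does not produce phrases on its own. A minimal sketch in Python (chosen here purely for illustration), assuming a simplified form of the posted pattern with the extra groups and the possessive "+" dropped; the names `long_words` and `only_long_words` are made up for this example:

```python
import re

TEXT = "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."

# Simplified form of the posted pattern (an assumption): whole words of 4+ characters.
LONG_WORD = re.compile(r"\b\w{4,}\b")

def long_words(text):
    """Return every single word of four or more word characters."""
    return LONG_WORD.findall(text)

def only_long_words(phrase, min_len=4):
    """True if every word in an already extracted phrase has at least min_len characters."""
    return all(len(word) >= min_len for word in phrase.split())

print(long_words(TEXT))                    # "sit" is skipped
print(only_long_words("dolor sit amet"))   # False, because of "sit"
```

The second helper shows one way the "no short words" rule could be applied after the phrases have been collected by some other means.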

coopster

1:19 am on Apr 13, 2005 (gmt 0)

One way to approach this might be to go for the three-word phrases without punctuation first, then parse the found patterns to get the onesey-twosey phrases. No, that won't work either ... you know, the more I think about this, the more it seems you are always going to get at least every word out of the text string, aren't you?

killroy

9:26 am on Apr 13, 2005 (gmt 0)

Yes. Since a quantifier can only be either greedy or not, but not both, I would FIRST get all the maximum-length fragments as you describe, i.e. including those with 4 or more words. So for example:

Hello World Over There, How Are you doing?

would yield:
"Hello World Over There"
and
"How Are you doing"

Then, once you have those fragments, you have to get all the contiguous 1-, 2- and 3-word sub-phrases with a different algorithm, i.e. not regexes. For the first fragment:

Hello, World, Over, There, Hello World, World Over, Over There, Hello World Over, World Over There
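The two steps just described could look roughly like this. It is a minimal Python sketch, not a definitive implementation: the function names `word_runs` and `sub_phrases` are invented for the example, and it assumes words inside a fragment are separated by exactly one space.

```python
import re

def word_runs(text):
    """Return maximal runs of words separated only by single spaces."""
    # The pattern is greedy, so each match keeps going until punctuation
    # (or any other non-word, non-space character) interrupts it.
    return re.findall(r"\w+(?: \w+)*", text)

def sub_phrases(run, max_words=3):
    """Return every contiguous phrase of 1..max_words words within one run."""
    words = run.split()
    phrases = []
    for size in range(1, max_words + 1):
        # Slide a window of `size` words across the run.
        for start in range(len(words) - size + 1):
            phrases.append(" ".join(words[start:start + size]))
    return phrases

runs = word_runs("Hello World Over There, How Are you doing?")
print(runs)                  # ['Hello World Over There', 'How Are you doing']
print(sub_phrases(runs[0]))  # the nine phrases listed above, in the same order
```

Text with line breaks or doubled spaces would split a fragment early under this pattern, so real input would probably need its whitespace normalised first.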

If you must use regexes, you could first get all 1-, 2- and 3-word phrases starting with the first word, then remove the first word and repeat until done.
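A rough sketch of that regex-only route, in Python for illustration. The helper names `phrases_from_start` and `all_phrases` are hypothetical, and single-space separation between words is assumed:

```python
import re

def phrases_from_start(text, max_words=3):
    """Match 1, 2, ... max_words words anchored at the very start of text."""
    phrases = []
    for extra in range(max_words):
        # First word plus exactly `extra` further words, single-space separated.
        # re.match only matches at position 0, so the phrase starts at the first word.
        m = re.match(r"\w+(?: \w+){%d}\b" % extra, text)
        if m:
            phrases.append(m.group(0))
    return phrases

def all_phrases(text, max_words=3):
    """Collect phrases at the first word, remove that word, and repeat until done."""
    phrases = []
    text = re.sub(r"^\W+", "", text)          # drop any leading punctuation
    while text:
        phrases.extend(phrases_from_start(text, max_words))
        text = re.sub(r"^\w+\W*", "", text)   # remove the first word and what follows it
    return phrases

print(all_phrases("Hello World Over There, How"))
```

A phrase never crosses the comma because the pattern only allows a space between words. The string is re-cut on every pass, so this is likely to be slow on a large block of text.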

httpwebwitch

5:00 pm on Apr 13, 2005 (gmt 0)

I kind of solved the problem without regex. Instead, I split the text into an array and walked through it with a loop-within-a-loop-within-a-loop.

Next I'll take all those 1-, 2-, and 3-word phrases, count them in the document itself, and sort by frequency. That's a step where regex will perform beautifully.
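One possible shape for that counting step, as a hedged Python sketch. The function name `phrase_frequencies` and the sample text are made up, and matching is case-sensitive here, which may or may not be what is wanted:

```python
import re

def phrase_frequencies(phrases, text):
    """Count occurrences of each distinct phrase in text, most frequent first."""
    counts = {}
    for phrase in set(phrases):
        # \b on both sides keeps "house" from matching inside "household".
        pattern = r"\b" + re.escape(phrase) + r"\b"
        counts[phrase] = len(re.findall(pattern, text))
    # Highest count first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

text = "Cheap house insurance protects your house. Compare house insurance quotes."
phrases = ["house", "insurance", "house insurance", "quotes"]
for phrase, count in phrase_frequencies(phrases, text):
    print(count, phrase)
```

`re.escape` is there in case a phrase ever contains a character that is special in a regex. `re.findall` counts non-overlapping matches only.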

The point of this exercise is to scan a page of text and find the most common key phrases. For instance, a page on the topic of house insurance would probably come up with "house" and "insurance" and then "house insurance" in the top spots.

It's a work in progress: a semantic topic matcher which will grab the most relevant key words for a home-grown search thing, to find other relevant documents similar to the current one.

I will eventually need to filter out stop words so my top phrase isn't always "the" or "a"...
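A stop-word filter could be as small as the sketch below (Python, for illustration). The word list is a tiny placeholder and an assumption, not a recommended list, and `without_stop_phrases` is a hypothetical name. Dropping a phrase when ANY of its words is a stop word is just one policy; swapping `any` for `all` would instead drop only phrases made up entirely of stop words.

```python
# Tiny illustrative stop list (an assumption); a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def without_stop_phrases(phrases, stop_words=STOP_WORDS):
    """Drop phrases containing any stop word, compared case-insensitively."""
    kept = []
    for phrase in phrases:
        words = phrase.lower().split()
        if not any(word in stop_words for word in words):
            kept.append(phrase)
    return kept

print(without_stop_phrases(["the", "the house", "house insurance", "A"]))
# ['house insurance']
```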

If you have any ideas or advice, I'm always grateful.
:)