(mods, move this to a different forum if you think it would get answered better elsewhere)
I want to match 1-, 2-, and 3-word phrases from a large block of text, but I want to ignore any phrase that spans punctuation or other non-word characters.
For example, if my text is
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Vivamus ut sapien in libero scelerisque bibendum. Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
I want to get matches for
Lorem
Lorem ipsum
Lorem ipsum dolor
ipsum
ipsum dolor
ipsum dolor sit
dolor
dolor sit
dolor sit amet
and so on.
Even better would be if it could also ignore words that are under 4 letters long. In the example above, I would not get matches for any phrases containing "sit".
((\b[\w]{4,}+\b))
help!
Hello World Over There, How Are you doing?
would yield:
"Hello World Over There"
and
"How Are you doing"
Then, once you have those fragments, you have to get all the 1-, 2-, and 3-word runs of consecutive words within each one, but with a different algorithm, i.e. not regexes. For the first fragment that gives:
Hello, World, Over, There, Hello World, World Over, Over There, Hello World Over, World Over There
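The two steps above (split at punctuation, then slide a window over each fragment) could be sketched roughly like this in Python. This is only a sketch: the function names `split_fragments` and `phrases` are made up for illustration, and it assumes a "word" is whatever `\w+` matches, so a word like "don't" would be split at the apostrophe.

```python
import re

def split_fragments(text):
    """Return the runs of words that are separated only by whitespace.

    Any punctuation (or other non-word, non-space character) ends a fragment.
    """
    return re.findall(r"\w+(?:\s+\w+)*", text)

def phrases(fragment, max_words=3):
    """Return every run of 1..max_words consecutive words in one fragment."""
    words = fragment.split()
    result = []
    for n in range(1, max_words + 1):           # phrase length
        for i in range(len(words) - n + 1):     # start position of the window
            result.append(" ".join(words[i:i + n]))
    return result

fragments = split_fragments("Hello World Over There, How Are you doing?")
print(fragments)
print(phrases(fragments[0]))
```

Because phrases are only built inside a fragment, nothing like "There How" can come out of it.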
If you must use regexes, you could first get all 1-, 2-, and 3-word phrases starting with the first word. Then remove the first word and repeat until done.
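For what it's worth, the regex variant might look something like the following in Python. It is a rough sketch that assumes the input is already a punctuation-free fragment; `LEADING` and `phrases_by_peeling` are hypothetical names, not anything from a library.

```python
import re

# Up to three whitespace-separated words anchored at the start of the string.
# The second and third words are optional, so short fragments still match.
LEADING = re.compile(r"^(\w+)(?:\s+(\w+))?(?:\s+(\w+))?")

def phrases_by_peeling(fragment):
    """Collect the 1-, 2-, and 3-word phrases starting at each word in turn."""
    result = []
    rest = fragment.strip()
    while rest:
        match = LEADING.match(rest)
        if match is None:        # fragment does not start with a word: stop
            break
        words = [w for w in match.groups() if w is not None]
        for n in range(1, len(words) + 1):
            result.append(" ".join(words[:n]))
        # Remove the first word and repeat on what is left.
        rest = rest[match.end(1):].lstrip()
    return result

print(phrases_by_peeling("Hello World Over There"))
```

The phrases come out grouped by starting word rather than by length, but it is the same set as the sliding-window approach.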
SN
Next I'll take all those 1-, 2-, and 3-word phrases, count them in the document itself, and sort by frequency. That's a step where regex will perform beautifully.
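As a possible alternative to re-scanning the document once per phrase, the counting can be done while the phrases are generated. The Python sketch below tallies with `collections.Counter` instead of a regex count, which is a swapped-in technique rather than the plan described above. Lowercasing the text is an assumption of mine so that "Lorem" and "lorem" count as the same word, and `count_phrases` is a hypothetical name.

```python
import re
from collections import Counter

def count_phrases(text, max_words=3):
    """Count every 1..max_words-word phrase that does not cross punctuation."""
    counts = Counter()
    # Each match is a run of words separated only by whitespace.
    for fragment in re.findall(r"\w+(?:\s+\w+)*", text.lower()):
        words = fragment.split()
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

sample = (
    "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. "
    "Vivamus ut sapien in libero scelerisque bibendum. "
    "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."
)

counts = count_phrases(sample)
# most_common() returns (phrase, count) pairs sorted by descending count.
print(counts.most_common(5))
```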
The point of this exercise is to scan a page of text and find the most common key phrases. For instance, a page on the topic of house insurance would probably come up with "house" and "insurance" and then "house insurance" in the top spots.
It's a work in progress: a semantic topic matcher which will grab the most relevant key words for a home-grown search thing, to find other relevant documents similar to the current one.
I will eventually need to filter out stop words so my top phrase isn't always "the" or "a"...
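A stop-word filter could be bolted on after counting, along with the under-4-letters rule from the original question. The Python sketch below is one way it might look. The `STOP_WORDS` set is a tiny made-up sample rather than a real stop list, and dropping a phrase when any one of its words fails the check is an assumption (it mirrors the "no phrases containing sit" rule).

```python
# Illustrative only: a real stop list would be much longer.
STOP_WORDS = {"the", "a", "that", "this", "with", "from", "have"}

def keep_phrase(phrase, min_len=4, stop_words=STOP_WORDS):
    """True if every word is at least min_len letters and not a stop word."""
    words = phrase.lower().split()
    return all(len(w) >= min_len and w not in stop_words for w in words)

def filter_counts(counts, min_len=4, stop_words=STOP_WORDS):
    """Return only the phrase counts that pass keep_phrase."""
    return {p: n for p, n in counts.items() if keep_phrase(p, min_len, stop_words)}

example = {"the": 12, "house": 5, "house insurance": 4, "the house": 3, "dolor sit": 2}
print(filter_counts(example))
```

With a 4-letter minimum, most of the very short stop words ("the", "a") are dropped by the length check alone, so the stop list mainly matters for longer ones like "that" or "with".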
If you have any ideas or advice, I'm always grateful.
:)