Find x words before and after keyword - PHP Server Side Scripting forum at WebmasterWorld - WebmasterWorld

Forum Moderators: coopster

Find x words before and after keyword

Globetrotter

3:27 pm on Oct 2, 2012 (gmt 0)

10+ Year Member

I would like to find x number of words before and after a given keyword, to make it more visible on the webpage.
So if you have an example text from a database: “PHP is presently the most popular scripting language in use on the Internet. You are able to code almost anything with it.”

If the keyword is code and I want 4 words before the keyword and 1 after it, I would like to find “You are able to code almost”. But if HTML code (e.g. <strong> or </strong>) or a punctuation mark (!.?’”) is found, it needs to stop matching.

So if I would like to find 5 words before the keyword and 3 words after the keyword, the result I would like to have is: "You are able to code".

After searching all day I’ve come really close.


$strResult = "PHP is presently the most popular scripting language in use on the Internet. You are able to code <strong>almost</strong> anything with it.";
$strPattern = "#(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,5}\b(code)\b(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,3}#iu";
preg_match($strPattern, $strResult, $arrMatch);

var_dump($arrMatch);

The only problem I can’t tackle is how to make the matching stop when there is HTML (syntax) or a punctuation mark in the text I’m trying to match.
Any idea how to solve this?
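For reference, a minimal sketch of what the pattern above returns on the sample string. It shows the problem: the match runs across the sentence boundary on the left and into the markup on the right. The variable names are the ones from the snippet; the echo lines are added for illustration only.

```php
<?php
// Sample text and pattern exactly as posted above.
$strResult = "PHP is presently the most popular scripting language in use on the Internet. You are able to code <strong>almost</strong> anything with it.";
$strPattern = "#(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,5}\b(code)\b(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,3}#iu";

preg_match($strPattern, $strResult, $arrMatch);

// The ". " after "Internet" and the "<", ">" and "</" of the tags all count as
// "non-word" separators, so nothing stops the match at those points.
echo $arrMatch[0], "\n"; // Internet. You are able to code <strong>almost</strong
echo $arrMatch[1], "\n"; // code
```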

lucy24

8:48 pm on Oct 2, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Preliminary disclaimer: I don't speak PHP. But I come from a background of making e-books, so this problem sounds wonderfully familiar.

So if I would like to find 5 words before the keyword and 3 words after the keyword, the result I would like to have is: "You are able to code".

Did your cat eat the following three words?

Can we assume for the sake of discussion that your target text will never contain titles or abbreviations such as "Mr." or "Ph.D." -- or, in the alternative, that you normally write these without a period? For that matter, would your mid-sentence words ever be capitalized at all? It's easier if you can exclude all names. Capital letters would then only occur at the beginning of your utterance, unless you anticipate meeting the single word "I". (Doesn't seem likely in this context.)

or a punctuation mark is found (!.?’”)

Any punctuation mark, or these specific ones? You can't exclude a right single quote, because that will also cut out contractions: "You'll be flying high when you start using PHP to code your pages!" They are the same HTML character, whether you use ' or &rsquo; or &#8217;. And it seems like you should allow at least commas; they're not a major syntactic break. Finally, what about numerals? In your own post you use "5" and "3" as words.

How will your quotation marks be encoded: as " or &quot; (&ldquo; &rdquo; etc.) or as “ and ” pairs? What encoding are you in? Are you working from the page source or the visible text?

"#(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,5}\b(code)\b(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,3}#iu"

Oi! What's my system language doing at the end of your string? ;)

One word looks like this:
\b[A-Z]?[a-z-]+(?:'[a-z]+)?\b

but multiple words will no longer have or need a following \b. Note too that \b is superfluous if you've followed the string with [^a-zA-Z'-]+. Unless you've got mixed-form words with numerals or lowlines in the middle; those can get messy. Instead you'll go to (keeping the sentence-initial option)

\b([A-Z]?[a-z-]+(?:'[a-z]+)?,?(?: [a-z-]+(?:'[a-z]+)?,?){4}) KEYWORDHERE((?: [a-z-]+(?:'[a-z]+)?,?){3})

Here I've used literal spaces. (See above about not speaking php.) Note exact position of spaces. The first quantity is 4 rather than 5 because the beginning word is coded differently. If I now decide that my keyword is "the" and ask the text editor to find any & all hits, it supplies me with (capitalization added):

would then only occur at THE beginning of your
and
my system language doing at THE end of your

Or, if the keyword is "your":

the sake of discussion that YOUR target text will
and
occur at the beginning of YOUR utterance, unless you

(illustrating the optional , which I put in my regex)

All this is of course assuming strictly ASCII text, as you'll get if your source is in modern English. Otherwise you'd have to replace [a-z] with the appropriate variant of \w -- but this gets language-specific. Both programming language and human language.
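A minimal PHP sketch of the literal-space pattern above, assuming ASCII English text as described. The helper name contextPattern and its parameters are made up for illustration; preg_quote() is added so the keyword is matched literally, and the word count before the keyword is assumed to be at least 1.

```php
<?php
// Hypothetical helper: builds the literal-space pattern for a keyword.
// $before must be >= 1 because the first word is coded separately.
function contextPattern(string $keyword, int $before, int $after): string
{
    // One lowercase word, optional contraction, optional trailing comma.
    $word = "[a-z-]+(?:'[a-z]+)?,?";

    return '/\b([A-Z]?' . $word                       // first word may be capitalized
        . '(?: ' . $word . '){' . ($before - 1) . '})' // remaining words before
        . ' ' . preg_quote($keyword, '/')
        . '((?: ' . $word . '){' . $after . '})/';     // words after
}

$sample = 'Capital letters would then only occur at the beginning of your utterance.';
preg_match(contextPattern('the', 5, 3), $sample, $m);

echo $m[0], "\n"; // would then only occur at the beginning of your
```

Because the counts are exact ({4} and {3}) rather than ranges, a keyword with fewer words around it does not match at all.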

Globetrotter

8:13 am on Oct 5, 2012 (gmt 0)

10+ Year Member

Sorry for the late reply. I had to rethink the whole reply a few times to be able to do something with it. I tried the regex in your post, but because I’m trying to find keywords in a non-English language I’m facing some different problems. I also had to deal with HTML.

The last couple of days I’ve been experimenting a lot. And now I might have found a big part of the solution.

/(\w+\s){0,2}keywordhere(\s\w+){0,3}/

This works quite well, but it still needs to skip HTML tags and BBCode, which this regex does not. I did manage to find a regex for that, but it’s now matching different things than before.
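A minimal PHP sketch of the pattern above with the keyword and both counts as parameters. The function name simpleContext is made up for illustration; preg_quote() is added so a keyword containing regex metacharacters is matched literally, and the u modifier is an added assumption so that pattern and text are treated as UTF-8.

```php
<?php
// Hypothetical wrapper around the \w-based pattern above.
function simpleContext(string $text, string $keyword, int $before, int $after): ?string
{
    $pattern = '/(?:\w+\s){0,' . $before . '}'
             . preg_quote($keyword, '/')
             . '(?:\s\w+){0,' . $after . '}/u';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$text = 'PHP is presently the most popular scripting language in use on the Internet.';
echo simpleContext($text, 'popular', 2, 3), "\n"; // the most popular scripting language in

// A tag next to a word is neither \w nor \s, so the match simply ends there.
echo simpleContext('the most popular <em>scripting</em> language', 'popular', 2, 3), "\n"; // the most popular
```

Note that without \b around the keyword this sketch would also find "popular" inside a longer word such as "unpopular".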

lucy24

9:42 am on Oct 5, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

it still needs to skip HTML tags

Skip, or stop? In the original post, it sounded as if you wanted the function to simply stop short when it meets an html tag. Are you looking for something to ignore the tags? For example if one word of the series is <i>italicized</i> you should carry on as if the markup weren't there?

HTML tag package: </?[^>]+>

It's easier to match if you have nice clean HTML, with tags opening right before words and closing right after them. (Also safer, for arcane compliant-user-agent reasons.)

word package:

(?:<[^>]+>)*([\w'-]+)(?:</[^>]+>)*,?\s+

Do you need to capture the series of words, or simply find them? Your example above looks as if you're looking for up to two words before and up to three words after, but not necessarily capturing anything. (Technically yes, but they look more like grouping parentheses.)

Both html tags and bbcode? Urk. HTML is easy because < and > don't have meaning in RegEx. Well, hardly ever. But bbcode is nasty.

(?:\[[^\]]+\])*

And then if you're using a constructor-type function you have to double all your backslashes.

(?:\[[^\]]+\]|<[^>]+>)*([\w'-]+)(?:\[[^\]]+\]|<[^>]+>)*,?\s+

\w is good, because it covers everything. No harm in including numerals, and your text probably doesn't have many lowlines _ in it.
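A minimal PHP sketch of the combined word package, run with preg_match_all() over a made-up sample string. The variable names and the sample are assumptions for illustration; the pattern itself is the one above, split into a reusable tags part.

```php
<?php
// Zero or more bbcode or HTML tags, as in the pattern above.
$tags = '(?:\[[^\]]+\]|<[^>]+>)*';

// One "word package": tags, the captured word, tags, optional comma, whitespace.
$wordPackage = '/' . $tags . '([\w\'-]+)' . $tags . ',?\s+/';

$sample = 'the <i>most</i> [b]popular[/b] language, really.';
preg_match_all($wordPackage, $sample, $matches);

// $matches[1] holds only the captured words, with the markup ignored.
$words = $matches[1];
print_r($words); // the, most, popular, language
```

The last word, "really", is not captured because the package requires whitespace after each word and here a full stop follows instead.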

I suppose it's no use asking why a close-parenthesis by itself turns into a wink if it's between a > and a *

Globetrotter

10:21 am on Oct 5, 2012 (gmt 0)

10+ Year Member

Thanks again for the great help! I appreciate that. I should have picked my words more carefully. What I’m trying to find is x words before and x words after a keyword, where the search for extra words should stop when HTML code or the beginning or end of a sentence (based on the characters .!?) is found, or when the maximum number of words before or after is reached. Let’s skip BBCode for now (because I could parse BBCode to HTML first and then search the text).

For all examples I use these "settings":
Keyword = popular
Words before keyword = 2
Words after keyword = 8

Example text one:
“PHP is presently the most popular scripting language in use on the Internet. You are able to code almost anything with it.”

Should give: “the most popular scripting language in use on the Internet”

Example text two:
“PHP is presently the most popular scripting language in use <strong>on</strong> the Internet. You are able to code almost anything with it.”

Should give: “the most popular scripting language in use”

To make it less complex I think I need to skip the ' and " part for now. Not sure if you can do anything about it.

In your example regex in the second post, where would you put the keyword? And it seems not to be limited to x words before and after, or am I mistaken? :)

For the record, I'm not a regex expert. I work by trial and error and research every option on a cheat sheet :)
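One possible way to get the behaviour described in the two examples above, as a sketch rather than a definitive solution: first split the text at every HTML tag and at every run of . ! ?, then look for the keyword with its surrounding words inside one plain-text piece at a time. The function name keywordContext and the two-step approach are assumptions for illustration and do not come from the regexes posted so far.

```php
<?php
// Hypothetical helper: up to $before words before and $after words after the
// first occurrence of $keyword, never crossing an HTML tag or . ! ?
function keywordContext(string $text, string $keyword, int $before, int $after): ?string
{
    // Tags and sentence enders become split points; each segment is a run of
    // plain text that a match is allowed to span.
    $segments = preg_split('/<[^>]*>|[.!?]+/u', $text);
    if ($segments === false) {
        return null; // e.g. the text is not valid UTF-8
    }

    $word = '[\w\'-]+';
    $pattern = '/(?:' . $word . ',?\s+){0,' . $before . '}'
             . '\b' . preg_quote($keyword, '/') . '\b'
             . '(?:,?\s+' . $word . '){0,' . $after . '}/iu';

    foreach ($segments as $segment) {
        if (preg_match($pattern, $segment, $m) === 1) {
            return $m[0];
        }
    }
    return null; // keyword not found in any segment
}

$one = 'PHP is presently the most popular scripting language in use on the Internet. You are able to code almost anything with it.';
$two = 'PHP is presently the most popular scripting language in use <strong>on</strong> the Internet. You are able to code almost anything with it.';

echo keywordContext($one, 'popular', 2, 8), "\n"; // the most popular scripting language in use on the Internet
echo keywordContext($two, 'popular', 2, 8), "\n"; // the most popular scripting language in use
```

Splitting first keeps the word pattern itself simple. The trade-off is that a keyword sitting directly inside a tag pair, such as <strong>code</strong>, is returned on its own with no surrounding words.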

lucy24

8:36 pm on Oct 5, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

This version

(?:\[[^\]]+\]|<[^>]+>)*([\w'-]+)(?:\[[^\]]+\]|<[^>]+>)*,?\s+

is a single word, with extra doodads to ignore html or php/bb tags. The word itself is the ([\w'-]+) in the middle. Put in lots of 'em --replacing \w+ or [\w'-]+ -- to make the complete package. If you leave out the tags option and fiddle with the original regex you're back at:

\b((?:[\w-]+(?:'[\w]+)?,?\s){0,2})KEYWORDHERE((?:\s[\w-]+(?:'[\w]+)?,?){0,8})

But I think you want at least {1,2} or {1,8} on each side. Otherwise you could come back with a bare "php". Many RegEx dialects will accept a simple {,2} but you'll need to double-check whether the implied first number is 0 or 1.

If you don't haha want to allow for contractions, simply leave out each
(?:'[\w]+)?
element. Similarly you can leave off all occurrences of
,?
if you're sure you don't want to continue across commas.

It's your call on whether you want to allow words to include - for hyphenated words. That's assuming you don't have-- boo! hiss! --em dashes expressed as -- instead of &mdash; or the actual UTF-8 character. (I use &mdash; because I edit in a monospaced font. Also &nbsp;.) If you wanted to be double-safe you could say

(\w+(?:-\w+)*(?:'\w+)?,?\s)

for each word.

You may also want to change \s to \s+ both to allow for multi-spaces-- since they don't affect the html-- and to cover yourself in case of line breaks. Since the Windows line break is two separate characters, \r and \n, some RegEx readers will interpret it as two spaces, though most will pretend it's merely \n. (Mine is bilingual so \r\n is taken as two characters-- and the $ anchor doesn't work in CRLF mode.) That makes the whole package

\b((?:\w+(?:-\w+)*(?:'\w+)?,?\s+){1,2})KEYWORDHERE((?:\s+\w+(?:-\w+)*(?:'\w+)?,?){1,8})
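A minimal PHP sketch of a pattern of this shape, built from a keyword and two counts. The function name wordPackagePattern is made up for illustration. Two assumptions are baked in: each word is written as \w+ (one or more characters) with optional hyphenated parts and an optional contraction, and the group after the keyword ends on the word or comma rather than on whitespace. The sample text contains a Windows line break and a double space to show what \s+ buys you.

```php
<?php
// Hypothetical builder for the whole package discussed above.
function wordPackagePattern(string $keyword, int $before, int $after): string
{
    // One word: \w+, optional -parts, optional 'contraction, optional comma.
    $word = '\w+(?:-\w+)*(?:\'\w+)?,?';

    return '/\b((?:' . $word . '\s+){1,' . $before . '})'
         . preg_quote($keyword, '/')
         . '((?:\s+' . $word . '){1,' . $after . '})/u';
}

// \r\n and the double space are both covered by \s+.
$text = "PHP is presently the\r\nmost  popular scripting language in use.";
preg_match(wordPackagePattern('popular', 2, 8), $text, $m);

// Group 1 is the words before (with trailing whitespace),
// group 2 the words after (with leading whitespace).
var_dump($m[1], $m[2]);
```

With {1,...} on both sides, a keyword at the very start or end of a sentence yields no match at all; switch to {0,...} if a shorter result is acceptable.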