Regular expression match specific words

Globetrotter

7:13 am on Jul 7, 2011 (gmt 0)

I'm looking for a regular expression that matches specific words in a string. I've tried a lot of options but none of them seem to work.

An oversimplified version which sort of works is


preg_match('/(ko|ao|an)/i',$output)

I would like to match the words "ko" or "ao" anywhere in a string when they are followed by a space. Besides this match I would also like to match the word "'an". If either of those two conditions is true it should return true.

At this time it sort of works but it sometimes finds a match when it shouldn't (for example if the word "cacao" is in the string it will match while it should only match on "ao some other word").
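
The false positive is easy to reproduce. Below is a minimal sketch that uses Python's re module as a stand-in for preg_match (this simple alternation behaves the same way in both). The sample strings are made up for illustration and are not from the original post.

```python
import re

# Same pattern as the preg_match call above, case-insensitive.
unanchored = re.compile(r"(ko|ao|an)", re.IGNORECASE)

# Hypothetical sample titles (assumptions, for illustration only).
samples = ["ao some other word", "hot cacao", "banana"]

# search() succeeds wherever the letters occur, even inside a longer word.
hits = {text: bool(unanchored.search(text)) for text in samples}

for text, matched in hits.items():
    print(f"{text!r}: {matched}")
```

All three strings report a match, including "hot cacao" and "banana", because nothing in the pattern says where a word has to start or end.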

lucy24

7:28 am on Jul 7, 2011 (gmt 0)

It's not totally clear what you mean by "word". Do you mean "a pair of free-standing letters"? Most RegEx dialects would say

\b(ko|an|ao)\b

where \b simply means "word break". Is there any possibility of numerals or lowlines adjoining the letters? Those count as "word" characters, so you'd have to exclude them too.
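
A quick way to see what \b does and does not treat as a break is the rough sketch below, written with Python's re module (where \b works the same way as in PCRE for ASCII text). The test strings are invented for the example.

```python
import re

# \b sits between a word character (letter, digit, underscore) and anything else.
bounded = re.compile(r"\b(ko|an|ao)\b", re.IGNORECASE)

# Hypothetical test strings (assumptions, for illustration only).
samples = [
    "ao some other word",  # free-standing "ao"
    "hot cacao",           # "ao" buried inside "cacao"
    "route ko_2",          # underscore is a word character, so no break after "ko"
    "ko2 island",          # a digit is a word character too
]

results = {text: bool(bounded.search(text)) for text in samples}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

Only the first string matches. The last two show why adjoining numerals or underscores matter: they extend the "word", so \b never fires right after "ko".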

Is that apostrophe in "'an" intentional? That is, your intended match is 'an ?

Is this what you're trying to match?
\b[ka]o
(you can't see the space but in most RegEx flavors it's just another character ;))
and
'an\b
?

Globetrotter

7:49 am on Jul 7, 2011 (gmt 0)

Thank you for your quick reply. By a "word" I mean a pair of free-standing letters. So the \b did seem to do the trick!

'/\b(ko|an|\'ao)\b/i'

The extra quote for 'an was intentional :) I am always amazed how simple the solution is when you see it :)

Jonesy

4:34 pm on Jul 10, 2011 (gmt 0)

Then it will fail on ko, an, and 'ao' or 'ao.

lucy24

8:51 pm on Jul 10, 2011 (gmt 0)

'ao is indeed a problem because the ' means the word break has already happened. But what's the problem with \bko\b or \ban\b? Or did you mean it exactly as you typed it: "ko" or "an" followed by a comma instead of a space? If they never actually occur in this context you can ignore the problem. Otherwise you have to be more exact. On the one side:

\b(ko|an)(?=\s)

And on the other

'ao(?=\s)

collapsing to

(\bko|\ban|'ao)(?=\s)

That's assuming you're allowed to use non-capturing lookaheads. It works in my text editor-- that is, it picks up the intended forms without the unintended ones, and it's OK with the \b where I put it. But that's a different RegEx dialect (I've got a choice of eight "flavors" but default to Ruby) so I can't swear it would work universally.
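
For anyone who wants to check the two versions side by side, here is a hedged sketch in Python's re module, which supports the same (?=...) lookahead syntax as PCRE. The sample strings are assumptions chosen to hit each case and are not real data.

```python
import re

# The earlier all-\b version, with the apostrophe form inside the group.
word_break = re.compile(r"\b(ko|an|'ao)\b", re.IGNORECASE)

# The collapsed lookahead version: the match must be followed by whitespace.
lookahead = re.compile(r"(\bko|\ban|'ao)(?=\s)", re.IGNORECASE)

# Hypothetical sample strings (assumptions, for illustration only).
samples = ["'ao beach", "ko, an, and 'ao", "cacao tree", "ko samui"]

# For each string: (matched by the \b version, matched by the lookahead version)
comparison = {
    text: (bool(word_break.search(text)), bool(lookahead.search(text)))
    for text in samples
}

for text, (wb, la) in comparison.items():
    print(f"{text!r}: word_break={wb} lookahead={la}")
```

With these inputs, "'ao beach" is missed by the \b version (there is no break between the start of the string and the apostrophe) but caught by the lookahead version. The comma-separated string goes the other way: \b accepts "ko," while the lookahead insists on whitespace and rejects everything in it.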

Jonesy

2:55 am on Jul 14, 2011 (gmt 0)

What I was trying to point out was you would
never get a hit on that sentence I posted...

You cannot blithely call a word something that is preceded by a
blank and followed by a blank. More exactly, it is something that
is preceded and followed by "white space". White space includes
spaces, tabs, commas, periods, begin- and end-of-line markers, and
other special characters as the situation dictates.

If possessive words could be encountered, then an apostrophe would be needed in the white space. Etc., usw.

It gets worse if the source is not 'normal' human language textual
material. Jargon and technical text all present their own issues.

lucy24

4:30 am on Jul 14, 2011 (gmt 0)

"White space includes spaces, tabs, commas, periods, begin- and end-of-line markers, and other special characters as the situation dictates."

In what dialect of RegEx? Every one I've ever met distinguishes between \W (non-word characters, meaning spaces, punctuation, or any old squiggle) and \s (spaces, meaning spaces of all kinds, tabs, and \r and \n).
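
The distinction can be checked character by character. Below is a small sketch in Python's re module, where \s and \W carry the same meaning for these ASCII characters as they do in PCRE. The list of characters is just an illustrative selection.

```python
import re

# Characters to classify (an illustrative selection, not an exhaustive list).
chars = {
    "space": " ",
    "tab": "\t",
    "newline": "\n",
    "comma": ",",
    "period": ".",
    "apostrophe": "'",
}

# For each character: (matches \s, matches \W)
classes = {
    name: (bool(re.fullmatch(r"\s", ch)), bool(re.fullmatch(r"\W", ch)))
    for name, ch in chars.items()
}

for name, (is_space, is_nonword) in classes.items():
    print(f"{name:<10} \\s={is_space} \\W={is_nonword}")
```

Spaces, tabs and newlines match both classes. Commas, periods and apostrophes match only \W, so a pattern that needs "followed by punctuation or space" has to ask for \W (or a lookahead on it) rather than \s.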

This is assuming for the sake of discussion that the original string is limited to ASCII, so you don't have to deal with, say, "ao" tucked into the middle of a Greek word. Then you have to pore over the documentation and figure out which of the eighteen variants of \p{ASCII} your specific dialect uses.

"It gets worse if the source is not 'normal' human language textual material."

I don't think the OP was talking about human words at all, just strings. I can't think offhand of a language in which "ko" "an" "'ao" ... and "cacao" are all words. And g### refuses to recognize the leading apostrophe (or is it a glottal stop?) even when I put every single thing in quotes.

Hm. I know someone who lives in Hawai'i. I could ask :)

Anyway, since OP hasn't come back with fresh problems, the ignore-the-complicated-stuff solutions probably worked.

Globetrotter

7:23 am on Jul 14, 2011 (gmt 0)

I was talking about human words :) Place names anywhere in the world, to be exact. I've got another regex that strips out any two-letter word. This is fine in most cases but sometimes it's not desirable. Therefore I added this extra regex to add an extra - (dash) so the word wouldn't be stripped out by the next regex.

I'm only using this regex on titles in my CMS, so tabs and page breaks should not occur.
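
For readers trying to picture the two-step cleanup, here is one possible sketch in Python's re module. Neither of the actual regexes was posted, so everything here is an assumption: the PROTECT and STRIP_TWO patterns, the clean_title helper, where the dash is placed, and the sample title are all hypothetical.

```python
import re

# Assumption: a protected word is glued to the next word with a dash.
PROTECT = re.compile(r"(\bko|\ban|'ao)\s+", re.IGNORECASE)

# Assumption: the stripper removes two-letter tokens that are bounded by
# whitespace (or the string edges), together with any trailing whitespace.
STRIP_TWO = re.compile(r"(?<!\S)[a-z]{2}(?!\S)\s*", re.IGNORECASE)


def clean_title(title: str) -> str:
    """Protect the listed words, then strip remaining two-letter words."""
    protected = PROTECT.sub(lambda m: m.group(1) + "-", title)
    return STRIP_TWO.sub("", protected).strip()


# Hypothetical CMS title.
cleaned = clean_title("Hotels in Ko Samui")
print(cleaned)
```

In this sketch "in" is removed while "Ko" survives as "Ko-Samui", because the dash means the token is no longer a free-standing two-letter word. If the real stripping regex relies on \b instead of whitespace, the dash alone would not protect the word, since a dash is a non-word character.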