Forum Library, Charter, Moderators: coopster & jatar k

PHP Server Side Scripting Forum

    
Regular expression match specific words
Globetrotter




msg:4336205
 7:13 am on Jul 7, 2011 (gmt 0)

I'm looking for a regular expression that matches specific words in a string. I've tried a lot of options but none of them seem to work.

An oversimplified version which sort of works is

preg_match('/(ko|ao|an)/i',$output)


I would like to match the words "ko" or "ao" anywhere in a string when they are followed by a space. Besides this match, I would also like to match the word "'an". If either of those two conditions is true, it should return true.

At this time it sort of works but it sometimes finds a match when it shouldn't (for example if the word "cacao" is in the string it will match while it should only match on "ao some other word").

 

lucy24




msg:4336208
 7:28 am on Jul 7, 2011 (gmt 0)

It's not totally clear what you mean by "word". Do you mean "a pair of free-standing letters"? Most RegEx dialects would say

\b(ko|an|ao)\b

where \b simply means "word break". Is there any possibility of numerals or lowlines adjoining the letters? Those count as "word" characters, so you'd have to exclude them too.

Is that apostrophe in "'an" intentional? That is, your intended match is 'an ?

Is this what you're trying to match?
\b[ka]o
(you can't see the space but in most RegEx flavors it's just another character ;))
and
'an\b
?
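A minimal PHP sketch of the difference between the original pattern and the \b version, assuming PCRE via preg_match. The sample strings and the variable names $loose and $bounded are made up for illustration; the "ao" spelling (no apostrophe) follows the pattern as suggested here.

```php
<?php
// Original pattern from the first post: matches the letters anywhere,
// even inside a longer word such as "cacao".
$loose = '/(ko|ao|an)/i';

// Word-boundary version, wrapped in PHP delimiters.
$bounded = '/\b(ko|an|ao)\b/i';

// Made-up sample strings.
$samples = ['cacao beans', 'ao some other word', 'ko_1'];

foreach ($samples as $s) {
    // preg_match returns 1 on a match and 0 on no match.
    printf("%-20s loose=%d bounded=%d\n", $s, preg_match($loose, $s), preg_match($bounded, $s));
}
```

The last sample shows the caveat about lowlines: "_" is a word character, so there is no word break between "ko" and "_1" and the bounded pattern does not match.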

Globetrotter




msg:4336220
 7:49 am on Jul 7, 2011 (gmt 0)

Thank you for your quick reply. By a "word" I mean a pair of free-standing letters. So the \b did seem to do the trick!

'/\b(ko|an|\'ao)\b/i'

The extra quote for 'an was intentional :) I am always amazed how simple the solution is when you see it :)

Jonesy




msg:4337658
 4:34 pm on Jul 10, 2011 (gmt 0)

Then it will fail on ko, an, and 'ao' or 'ao.

lucy24




msg:4337715
 8:51 pm on Jul 10, 2011 (gmt 0)

'ao is indeed a problem because the ' means the word break has already happened. But what's the problem with \bko\b or \ban\b? Or did you mean it exactly as you typed it: "ko" or "an" followed by a comma instead of a space? If they never actually occur in this context you can ignore the problem. Otherwise you have to be more exact. On the one side:

\b(ko|an)(?=\s)

And on the other

'ao(?=\s)

collapsing to

(\bko|\ban|'ao)(?=\s)

That's assuming you're allowed to use non-capturing lookaheads. It works in my text editor-- that is, it picks up the intended forms without the unintended ones, and it's OK with the \b where I put it. But that's a different RegEx dialect (I've got a choice of eight "flavors" but default to Ruby) so I can't swear it would work universally.
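For PHP specifically, PCRE does support lookaheads, so the collapsed pattern can be tried as-is. The sketch below is only an illustration: the constant PLACE_WORD, the helper hasPlaceWord, the /i flag (carried over from the earlier PHP pattern), and the sample strings are all assumptions, not part of the original posts.

```php
<?php
// Collapsed pattern from above in PHP delimiters. The apostrophe is
// escaped only because the PHP string is single-quoted.
const PLACE_WORD = '/(\bko|\ban|\'ao)(?=\s)/i';

// Hypothetical helper: true if the title contains ko, an, or 'ao
// immediately followed by whitespace.
function hasPlaceWord(string $title): bool
{
    return preg_match(PLACE_WORD, $title) === 1;
}

// Made-up sample titles.
var_dump(hasPlaceWord("wai 'ao nui"));       // 'ao followed by a space
var_dump(hasPlaceWord('cacao beans'));       // no free-standing word
var_dump(hasPlaceWord("ko, an, and 'ao'"));  // followed by punctuation, not whitespace

// The earlier pattern for comparison: \b cannot match between a space
// and an apostrophe, so the 'ao alternative fails here.
var_dump(preg_match('/\b(ko|an|\'ao)\b/i', "wai 'ao nui") === 1);
```

Because the lookahead insists on whitespace, a word at the very end of the string or one followed by a comma is not matched; whether that matters depends on the titles being processed.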

Jonesy




msg:4339329
 2:55 am on Jul 14, 2011 (gmt 0)

What I was trying to point out was that you would never get a hit on that sentence I posted...

You cannot blithely call a word something that is preceded by a blank and followed by a blank. More exactly, it is something that is preceded and followed by "white space". White space includes spaces, tabs, commas, periods, begin- and end-of-line markers, and other special characters as the situation dictates.

If possessive words could be encountered, then an apostrophe would be needed in the white space. Etc., usw.

It gets worse if the source is not 'normal' human language textual material. Jargon and technical text present their own issues.

lucy24




msg:4339348
 4:30 am on Jul 14, 2011 (gmt 0)

White space includes spaces, tabs, commas, periods, begin- and end-of-line markers, and other special characters as the situation dictates.

In what dialect of RegEx? Every one I've ever met distinguishes between \W (non-word characters, meaning spaces, punctuation, or any old squiggle) and \s (spaces, meaning spaces of all kinds, tabs, and \r and \n).
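A small PHP sketch of that distinction, assuming PCRE's default (non-Unicode) character classes. The helper name classify is hypothetical.

```php
<?php
// Hypothetical helper: reports whether a single character counts as
// whitespace (\s) and whether it counts as a non-word character (\W).
function classify(string $ch): array
{
    return [
        'space'   => preg_match('/\A\s\z/', $ch) === 1,
        'nonword' => preg_match('/\A\W\z/', $ch) === 1,
    ];
}

// A comma is a non-word character but not whitespace; a space and a
// tab are both; a letter and an underscore are neither.
foreach ([',', ' ', "\t", 'a', '_'] as $ch) {
    $r = classify($ch);
    printf("%-6s space=%d nonword=%d\n", var_export($ch, true), $r['space'], $r['nonword']);
}
```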

This is assuming for the sake of discussion that the original string is limited to ASCII, so you don't have to deal with, say, "ao" tucked into the middle of a Greek word. Then you have to pore over the documentation and figure out which of the eighteen variants of \p{ASCII} your specific dialect uses.
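If the titles are UTF-8 rather than plain ASCII, one possible approach in PHP is to replace \b with explicit letter lookarounds under the /u modifier. This is a swapped-in technique, not something proposed in the thread: the helper hasFreeStandingAo is hypothetical, and the sketch assumes a PCRE build with Unicode property support and PHP 7 or later for the "\u{...}" escape.

```php
<?php
// Hypothetical helper: matches "ao" only when it is not directly
// preceded or followed by a letter of any script.
function hasFreeStandingAo(string $s): bool
{
    // /u treats pattern and subject as UTF-8; \p{L} is "any letter".
    return preg_match('/(?<!\p{L})ao(?!\p{L})/u', $s) === 1;
}

var_dump(hasFreeStandingAo('ao nang'));      // free-standing
var_dump(hasFreeStandingAo('cacao'));        // preceded by a Latin letter
var_dump(hasFreeStandingAo("\u{03BA}ao"));   // preceded by Greek kappa
```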

It gets worse if the source is not 'normal' human language textual material.

I don't think the OP was talking about human words at all, just strings. I can't think offhand of a language in which "ko" "an" "'ao" ... and "cacao" are all words. And g### refuses to recognize the leading apostrophe (or is it a glottal stop?) even when I put every single thing in quotes.

Hm. I know someone who lives in Hawai'i. I could ask :)

Anyway, since OP hasn't come back with fresh problems, the ignore-the-complicated-stuff solutions probably worked.

Globetrotter




msg:4339385
 7:23 am on Jul 14, 2011 (gmt 0)

I was talking about human words :) Places anywhere in the world, to be exact. I've got another regex that strips out any two-letter word. This is fine in most cases, but sometimes it's not desirable. Therefore I added this extra regex to add an extra - (dash) so the word wouldn’t be stripped out by the next regex.

I'm only using this regex on titles in my CMS, so tabs and page breaks should not occur.
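The two-step idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the actual stripping regex was never posted, so the function names, the whitespace-delimited stripping pattern, and the choice to put the dash in place of the following space are all hypothetical.

```php
<?php
// Step 1 (hypothetical): join a protected word to the next word with
// a dash so it no longer stands alone between spaces.
function protectPlaceWords(string $title): string
{
    return preg_replace('/(\bko|\ban|\'ao)\s+/i', '$1-', $title);
}

// Step 2 (hypothetical stand-in for the existing regex): remove any
// two-letter token delimited by whitespace or the string edges.
function stripTwoLetterWords(string $title): string
{
    return trim(preg_replace('/(?<!\S)[a-z]{2}(?!\S)\s*/i', '', $title));
}

// Made-up title.
$title = 'Trip to Ko Samui';

echo stripTwoLetterWords($title), "\n";                     // without protection
echo stripTwoLetterWords(protectPlaceWords($title)), "\n";  // with protection
```

Note that a stripper built on \b alone would still remove "Ko" from "Ko-Samui", since a dash is a non-word character; the sketch only works because the stand-in stripper requires whitespace on both sides.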
