Searching a string if it matches any of these words

Forum Moderators: coopster



ocon

12:23 am on Sep 30, 2009 (gmt 0)

I'm writing a script that sets a value if part of a word or phrase is found in a string.

Is this a good approach:

if(preg_match("/student|school|teacher|classroom|class\sroom/ismU",$text)>0) $category="education";

jd01

5:32 am on Sep 30, 2009 (gmt 0)

It is a good approach, but personally, since there doesn't seem to be the need for the power of preg_match's regex ability I would probably be inclined to use stripos(); because it's more efficient to process.

Something like this should be close:

$WordsToFind='student^school^teacher^classroom^class room';
$WordsToFind=explode('^',$WordsToFind);

for($i=0;$i<count($WordsToFind);$i++) {
if(stripos($text,$WordsToFind[$i])!==false) { $category='education'; break; }
}

BTW: Welcome to WebmasterWorld!

EDITED: Forgot stripos() is even less memory intensive than stristr() for a min.
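
One detail worth spelling out: stripos() returns the offset of the match, and an offset of 0 is falsy in PHP, so a word at the very start of the text is missed unless the result is compared against false. A minimal sketch of the same loop with that check; the helper name matchesAny() is hypothetical and only here for illustration:

```php
<?php
// Hypothetical helper (not from the thread): returns true if any needle
// occurs anywhere in $text, ignoring case.
function matchesAny($text, array $needles)
{
    foreach ($needles as $needle) {
        // stripos() returns the integer offset of the match, or false.
        // An offset of 0 is falsy, so compare against false explicitly.
        if (stripos($text, $needle) !== false) {
            return true;
        }
    }
    return false;
}

$words = explode('^', 'student^school^teacher^classroom^class room');

var_dump(matchesAny('School starts on Monday', $words)); // bool(true), match at offset 0
var_dump(matchesAny('Nothing relevant here', $words));   // bool(false)
```

With a plain if(stripos(...)) test, the first call would be treated as "no match" because the word sits at offset 0.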

rocknbil

8:18 pm on Sep 30, 2009 (gmt 0)

Does strpos account for case insensitivity? Note the "i" modifier for the original regexp.

It's really going to depend on context. If these words come from user input - even if it's input one time by an administrator - the potential always exists for something you didn't expect, Class Room or classroom, for example. Then there's also the potential for changes and additions. Regexps will provide the maximum flexibility.

I suggest a little of both:

// Collect an array, doesn't matter if it's hard coded at
// the top of the script, from a database, or user input
// note * means zero or more, matches on student or students

$category=NULL;

$edu = Array(
'students*',
'schools*',
'teachers*',
'class\s*room\s*',
'uni',
'university*i*e*s*'
);

foreach ($edu as $wd) {
if(preg_match("/$wd/i",$text)) { // double quotes so $wd is interpolated
$category='education';
break;
}
}

if ($category==NULL) { echo "Whoops no category found"; }

jd01

8:46 pm on Sep 30, 2009 (gmt 0)

Uh, no strpos() doesn't, stripos() does.
[us2.php.net...]

Please note the 'i' between str and pos in the code I suggested.

And, of course, this might work too:

$WordsToFind='student^school^teacher^classroom^class room';
$WordsToFind=explode('^',$WordsToFind);

for($i=0;$i<count($WordsToFind);$i++) {
if(strpos(strtolower($text),$WordsToFind[$i])!==false) { $category='education'; break; }
}

* It's also unnecessary to check for the plural using stripos, because we are not checking for \sword\s we are checking to see if the word (string) is contained in the string, so, if the string contains schools stripos will return a 'true' match for the 'needle' school.

** According to [phpbench.com...] a foreach loop is ridiculous compared to using either a for or a while statement.

Also:
From the documentation:

Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() or strstr() instead as they will be faster.
[us3.php.net...]
From the documentation:
Note: If you only want to determine if a particular needle occurs within haystack , use the faster and less memory intensive function strpos() instead.
[us3.php.net...]
One final note:
$edu = Array(
'students*',
'schools*',
'teachers*',
'class\s*room\s*',
'uni',
/* The preceding 'uni' will match union, uniform, unicode, punitive, and anything else containing uni, including universities or university, because you do not set a boundary on the match anywhere within your pattern. It's completely un-anchored and will perform *very* unexpectedly. */
'university*i*e*s*'
);
foreach ($edu as $wd) {
if(preg_match("/$wd/i",$text)) {
$category='education';
break;
}
}
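
If the regex route is kept, one way to avoid the un-anchored 'uni' problem is to put a word boundary (\b) on each side of the word. This is only a sketch under that assumption; the helper name containsWholeWord() and the optional trailing "s" are illustrative choices, not something from the posts above:

```php
<?php
// Hypothetical helper: true if any word in $words appears as a whole word
// (optionally followed by "s") in $text, ignoring case.
function containsWholeWord($text, array $words)
{
    foreach ($words as $word) {
        // preg_quote() escapes regex metacharacters in the literal word;
        // \b requires a word boundary on each side of the match.
        $pattern = '/\b' . preg_quote($word, '/') . 's?\b/i';
        if (preg_match($pattern, $text) === 1) {
            return true;
        }
    }
    return false;
}

$edu = array('student', 'school', 'teacher', 'uni');

var_dump(containsWholeWord('A punitive uniform policy', $edu)); // bool(false)
var_dump(containsWholeWord('Two schools merged', $edu));        // bool(true)
var_dump(containsWholeWord('She is at uni now', $edu));         // bool(true)
```

The trade-off is that anchored matching no longer catches the word inside longer words, so variants other than a plain "s" plural have to be listed explicitly.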

ocon

3:48 am on Oct 1, 2009 (gmt 0)

Thank you very much for your replies. I knew about strpos, but didn't use it because of the case sensitivity issue. I didn't know about stripos; that's very useful.

With this new approach, I now have a whole page of category filters, and I'm wondering if there's another step I should take.

$needle=array("student","school","teacher","classroom","class room");
for($i=0;$i<count($needle);$i++) if(stripos($haystack,$needle[$i])!==false){ $category.="159,";break; }

$needle=array("marriage","civil union","domestic partner","wedding");
for($i=0;$i<count($needle);$i++) if(stripos($haystack,$needle[$i])!==false){ $category.="59,";break; }

$needle=array("military","army","navy","air force","marine");
for($i=0;$i<count($needle);$i++) if(stripos($haystack,$needle[$i])!==false){ $category.="256,";break; }

$needle=array("vote","Rep.","Sen.","republican","democrat","elected","mayor","government","political");
for($i=0;$i<count($needle);$i++) if(stripos($haystack,$needle[$i])!==false){ $category.="347,";break; }
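
With a whole page of filters like this, one option is to keep the category IDs and their words in a single array and run one loop over it, so adding a category means adding a row rather than another copy of the loop. A rough sketch, using the IDs and words from the post; the names $filters and categorize() are made up for illustration:

```php
<?php
// Category ID => needles, using the IDs and words from the post above.
$filters = array(
    '159' => array('student', 'school', 'teacher', 'classroom', 'class room'),
    '59'  => array('marriage', 'civil union', 'domestic partner', 'wedding'),
    '256' => array('military', 'army', 'navy', 'air force', 'marine'),
    '347' => array('vote', 'Rep.', 'Sen.', 'republican', 'democrat',
                   'elected', 'mayor', 'government', 'political'),
);

// Hypothetical helper: returns the IDs of every category with at least one
// needle found in $haystack (case-insensitive substring match).
function categorize($haystack, array $filters)
{
    $matched = array();
    foreach ($filters as $id => $needles) {
        foreach ($needles as $needle) {
            if (stripos($haystack, $needle) !== false) {
                $matched[] = $id;
                break; // one hit is enough for this category
            }
        }
    }
    return $matched;
}

$ids = categorize('The mayor visited a local school', $filters);
echo implode(',', $ids); // 159,347
```

The result is an array of IDs; implode(',', $ids) gives a comma-separated string similar to the $category value built above, without the trailing comma.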

jd01

4:10 am on Oct 2, 2009 (gmt 0)

Hmmmmm...

I looked at this about noon and couldn't think of anything right away. I looked at it about an hour ago and couldn't think of anything right away... The only thing I would consider (and it should be benchmarked) is:

ADDED /EDITED:
$newHaystack=strtolower($haystack);

$newHaystack=explode(" ",$newHaystack);
$newHaystack=array_unique($newHaystack);
$newHaystack=implode(" ",$newHaystack);

ADDED / EDITED:
strtolower 1st and switch from stripos to strpos...
You'll need to keep all your $needles in lowercase, or strtolower them too, but IMO by running a single case, you should compare faster overall since you are comparing so many times. If you are matching A or a, there are two possibilities for every a; switch everything to the same case and there's only one. It cuts the possible matches down.

$needle=array("military","army","navy","air force","marine");
for($i=0;$i<count($needle);$i++) if(strpos($newHaystack,$needle[$i])!==false){ $category.="256,";break; }

IMO: It'll really depend on which is faster: stripos or array_unique and your exact application. The length of your $haystack will probably be a factor. The reason I think it might be an option is as soon as a word begins with a different letter or 'doesn't match', array_unique should break from matching that piece of the array and move on to the next one, where stripos is going to compare every letter for a match to every needle... IOW array_unique will break off matching an entire word as soon as a match is not found, while stripos will continue checking the entire word and since you are running stripos multiple times, eliminating duplicates may be to your advantage.

If you were only running stripos one-time-through a text-block I wouldn't worry about it, but the elimination of duplicates to be checked might show you some gains with the number of times you're running stripos on any given string.

* Make sure you keep the original intact, so you can put it in the appropriate place. ;)

jd01

5:48 am on Oct 2, 2009 (gmt 0)

BTW: I wrote out what I thought, then had the 'strtolower()' idea, so I added it... Go with what I mean more than exactly what I say for the rest of the post.

EG: I know I say stripos in the last couple paragraphs, after I said to switch to strpos, etc.

jd01

2:29 pm on Oct 2, 2009 (gmt 0)

In looking at this again, I think you're going to lose some of your two word matches out of the haystack if you try to eliminate duplicates, unless you check for matches with spaces (2+ words) prior to eliminating duplicates...

EG if air force is a match 3/4 of the way through, but air is found by itself in the first 1/4, you could lose the word combination air force, since my guess is the first air will be stored and the word force will not appear next to it, but rather 3/4 the way through.

Personally, I might try to think of a way to overcome this, maybe by using strpos for the two word combinations before the haystack gets the duplicates removed and then if there is not a 2 word match, eliminating duplicates and checking for the single words.
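
A rough sketch of that idea, assuming needles containing a space are treated as phrases and everything else as single words; the helper name matchesWithDedup() is hypothetical, and whether the de-duplication step actually saves time would still need to be benchmarked:

```php
<?php
// Hypothetical helper: checks multi-word needles against the full lowercased
// text first, then single-word needles against a de-duplicated word list.
function matchesWithDedup($haystack, array $needles)
{
    $lower = strtolower($haystack);

    $single = array();
    foreach ($needles as $needle) {
        $needle = strtolower($needle);
        if (strpos($needle, ' ') !== false) {
            // Phrase: word order matters, so search the full text.
            if (strpos($lower, $needle) !== false) {
                return true;
            }
        } else {
            $single[] = $needle;
        }
    }

    // Removing duplicate words is only safe for single-word needles.
    $unique = implode(' ', array_unique(explode(' ', $lower)));
    foreach ($single as $needle) {
        if (strpos($unique, $needle) !== false) {
            return true;
        }
    }
    return false;
}

$needles = array('military', 'army', 'navy', 'air force', 'marine');

// "air" appears alone first, then again in "air force" later in the text.
var_dump(matchesWithDedup('Fresh air today. Later the air force flew by.', $needles)); // bool(true)
var_dump(matchesWithDedup('Fresh air today and more air tomorrow.', $needles));        // bool(false)
```

In the first example, removing duplicates before the phrase check would drop the second "air" and leave "the force flew by", so "air force" would no longer be found.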

Sorry I didn't think of this earlier.

I still think it might be faster to use strtolower and strpos, even if you don't end up pulling it apart and removing duplicates, just because of the number of comparisons you will be making.

Also, make sure you put the 'most contributed to' categories at the top, so you don't run as many 'no match' cycles.