
Accessing MySQL's stopwords via PHP

         

csdude55

7:19 pm on Mar 28, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have my message board set up to show the subject in the URL. I'm not sure if it actually has value anymore other than just optics, but I originally did it for search engines.

So this thread might look like:

example.com/accessing-mysqls-stopwords-via-php/12345

I currently do this via PHP:

$uri_txt = preg_replace("~
^[\s-]+ | # opening whitespace
\" |      # stray double quotes

# bad words
\b(
[a-z]'(?:s|t|d|ve|re|ll|m) | # contractions
an? |
about |
are |
a[st] |
b[ey] |
de |
for |
from |
how |
i[nst]? |
la |
o[fnr] |
th(?:at|e|is) |
to |
was |
what |
when |
where |
who |
will |
with
)\b |

# http, www, com
https?:// |
www\. |
\.(com|net|org|co|us) |

[^a-zA-Z0-9\s-] |
[\s-]+$ # trailing whitespace
~x", '', strtolower($text));


The list of words I auto-remove comes from InnoDB's list of stopwords, but MyISAM's list has more like 300 words:

[dev.mysql.com...]

And I'm not entirely sure whether I should remove contractions (isn't, aren't, who's, webmasterworld's, etc.).

Before I manually code all of those words into the preg_replace(), can you guys and gals suggest a faster / better / easier way to do this?

JayDub

2:38 am on Mar 29, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



That's a small enough list that I think what you're doing should be fine. If it were longer, you could probably save a blip of time by switching the full word list to an array() and using str_replace(), then just running the preg for the [^a-z0-9\s-] pattern and the check for the space at the end. And if you enjoy a challenge, you could shorten what you have so it only checks each starting letter once, with something like this: w(?:as|h(?:at|en|ere|o)|i(?:ll|th)) | <-- EDITED: I think that's close now.
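Just to sketch what I mean by the array + str_replace() version (word list shortened for illustration, function name made up, untested beyond the one example):

```php
<?php
// Rough sketch: handle the plain literal stopwords with str_replace(),
// then let one short preg clean up everything that actually needs a pattern.
// Note: back-to-back stopwords ("of a the") may need a second pass, since
// each replacement consumes the shared space.
function slugify($text, $stopwords) {
    // Pad with spaces so words at the very start/end still match ' word '
    $text = ' ' . strtolower($text) . ' ';
    $text = str_replace($stopwords, ' ', $text);
    // One small preg for the leftover punctuation
    $text = preg_replace('/[^a-z0-9\s-]/', '', $text);
    // Collapse runs of whitespace/hyphens and trim the ends
    return trim(preg_replace('/[\s-]+/', '-', $text), '-');
}

$stopwords = array(' a ', ' an ', ' the ', ' and ', ' of ', ' to ');
echo slugify('Accessing the stopwords of MySQL', $stopwords);
// accessing-stopwords-mysql
```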

But personally I think what you have is fine. I would stick with the shorter InnoDB list and even leave the contractions, since you could get some unexpected/undesired results with the full MyISAM list: "one can't equal four or five" would be reduced to just "equal" using the MyISAM stopword list.

csdude55

3:34 am on Mar 30, 2020 (gmt 0)




That's a good point on speed... I did a benchmark test, and if I could use str_replace() on the bad words it really would be about 10 times faster! I have this script running 20 times on some page loads, and while microseconds are "usually" negligible, I've been focused on load time for a while so I still think about it.

The problem I hit, though, is that I can't figure out a way to set up word boundaries for str_replace. So this:

$text = str_replace(array('a', 'an'), '', $text);

removes every "a" in the string, not every "\ba\b".

If anyone can suggest a way to only match whole words with str_replace(), I'd love to have the speed increase :-)

And after some testing, I agree on contractions; it was catching too many things that were relevant.

JayDub

4:27 am on Mar 30, 2020 (gmt 0)




With str_replace you have to put the spaces or other boundaries in: $text = str_replace(array(' a ', ' an '), '', $text);

You should be able to use strpos & strrpos to see if the string starts or ends with one of the words, then use preg or substr_replace if it does, with something like this:

$words = array('a ', 'an ');
foreach ($words as $word) {
    if (strpos($text, $word) === 0) {
        $text = substr($text, strlen($word));
    }
}

You reverse the space in the list for strrpos (put it before each word instead of after). Doing it that way keeps the regex engine from firing unless you actually need it.
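The end-of-string check would look something like this (just a sketch, function name made up; it only strips one trailing word per call, so repeated trailing stopwords would need a loop):

```php
<?php
// Sketch: strip a stopword from the very end of the string without
// firing the regex engine. The words carry their space on the LEFT
// for this end-of-string check (reversed from the start-of-string list).
function stripEndWord($text, $endWords) {
    foreach ($endWords as $word) {
        // strrpos finds the last occurrence; if that occurrence sits
        // exactly at strlen($text) - strlen($word), the string ends with it
        if (strrpos($text, $word) === strlen($text) - strlen($word)) {
            return substr($text, 0, -strlen($word));
        }
    }
    return $text;
}

echo stripEndWord('this is a test in', array(' a', ' an', ' in', ' the'));
// this is a test
```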

csdude55

6:52 pm on Mar 30, 2020 (gmt 0)




Interesting thing I just discovered... I moved the strings from the preg_replace() that didn't actually NEED regex to a separate str_replace(), and it cut the process time a LOT!

I'll post my entire benchmark test, but the one using str_replace() and preg_replace() combined was 10 times faster than using preg_replace() alone.

I would appreciate feedback on whether my test is accurate!

I ran it here:

[sandbox.onlinephpfunctions.com...]

// Test 1
$test = false;
$start_time = microtime(TRUE);

function urify($text) {
return
trim(
preg_replace("#[^a-z\d]+#", '-',
preg_replace("#
\b(
an?d?|
about|
are|
a[st]|
b[ey]|
for|
from|
how|
i[nst]?|
o[fnr]|
th(?:at|e|is)|
to|
was|
what|
when|
where|
who|
will|
with |

de|
el |
la
)\b\s* |

&quot; |

https?:// |
www\. |
@(?:gmail|yahoo|outlook|hotmail) |
\.(?:com|net|org|co|us)\b |

[^a-z\d_\s\"`~@\#&*()=+\;:/.?!]
#mx", '', strtolower($text)
)
), ' -'); // trim() takes literal characters, so ' -' rather than '\s-'
}

for ($i = 0; $i < 10000; $i++) {

$test = urify("this is a temp about test... for -csdude- 100% and isn't csdude's version of csdude@gmail.com, https://www.example.rest, and we're OK with it ");

}

$end_time = microtime(TRUE);

echo 'Test 1: ';
echo "#$test#<br>\n";
echo $end_time - $start_time;
echo "<br>\n<br>\n\n";

// Test 2
$test = false;
$start_time = microtime(TRUE);

function urifyTwo($text) {
return
trim(
preg_replace("#[^a-z\d]+#", '-',
preg_replace("#
\b(
an?d?|
about|
are|
a[st]|
b[ey]|
for|
from|
how|
i[nst]?|
o[fnr]|
th(?:at|e|is)|
to|
was|
what|
when|
where|
who|
will|
with |

de|
el |
la
)\b\s* |

[^a-z\d_\s\"`~@\#&*()=+\;:/.?!]
#mx", '', strtolower(
str_replace([
'&quot;',
'http://',
'https://',
'www.',
'@gmail',
'@yahoo',
'@outlook',
'@hotmail',
'.com',
'.net',
'.org',

// these next 2 could have false positives, so maybe move them back to regex
'.co',
'.us'
], '', $text))
)
), ' -'); // trim() takes literal characters, so ' -' rather than '\s-'
}

for ($i = 0; $i < 1000; $i++) {

$test = urifyTwo("this is a temp about test... for -csdude- 100% and isn't csdude's version of csdude@gmail.com, https://www.example.rest, and we're OK with it ");

}

$end_time = microtime(TRUE);

echo 'Test 2: ';
echo "#$test#<br>\n";
echo $end_time - $start_time;

csdude55

5:47 am on Mar 31, 2020 (gmt 0)




Sorry to reply to myself, but there was a typo in there that messed up everything... the first run was 10,000 iterations, the second was only 1,000!

The second (using str_replace) was still faster, but only marginally so.

preg_replace(): 0.41304802894592
preg_replace with str_replace(): 0.31034588813782

JayDub

7:10 am on Mar 31, 2020 (gmt 0)




I actually just remembered one of those 'old tricks' for using str_replace, so I'm thinking this (or something similar) might be even a bit quicker. It could obviously be condensed, but I wanted it to be more readable/understandable:

$text="this is a temp about test... for -csdude- 100% and isn't csdude's version of example@gmail.com, https://www.example.rest, and we're OK with it ";

# Convert to lower case
$text=strtolower($text);

# Remove '
$text=str_replace("'",'',$text);

# Replace stop words ... Adding a space to the beginning and end of the text string allows the replace to work for stop words at the beginning and end of the string

$replace=array(' an ',' and ',' about ',' are ',' as ',' at ',' be ',' by ',' or ',' from ',' how ',' i ',' in ',' is ',' it ',' of ',' on ',' that ',' the ',' this ',' to ',' was ',' what ',' when ',' where ',' who ',' will ',' with ',' de ',' el ',' la ','&quot;','http://','https://','www.','@gmail','@yahoo','@outlook','@hotmail','.com ','.net ','.org ','.co ','.us ');
$text=str_replace($replace,' ',' '.$text.' ');

# Replace any non-letter or non-digit character remaining with a -
$text=preg_replace("#[^a-z0-9-]+#",'-',$text);
$text=trim($text);

csdude55

8:07 pm on Mar 31, 2020 (gmt 0)




That's a cool trick! I had to play with it a little, though, to catch something like, "what's this about?" Or worse, "hey guys!What should we talk about?"

With the original long text, this version is a bit faster; on 10,000 iterations, this one is 0.20 while my previous one with preg_replace() and str_replace() was 0.33. With the shorter text of "hey guys!What should we talk about?", though, they're virtually the same (0.118 vs 0.130). But faster is faster! And I think it's easier to read :-)

$text = "this is a temp about test... for -csdude- 100% and isn't csdude's version of example@gmail.com, https://www.example.rest, and we're OK with it ";
// $text = "hey guys!What should we talk about?";

function urify($text) {

$text = str_replace([
// common punctuation
'.',
'?',
'!',
'"', // double-quote
':',
',',

// bad words
' a ',
' an ',
' and ',
' about ',
' are ',
' as ',
' at ',
' be ',
' by ',
' for ',
' from ',
' how ',
' i ',
' in ',
' is ',
' it ',
' of ',
' on ',
' or ',
' that ',
' the ',
' this ',
' to ',
' was ',
' what ',
' when ',
' where ',
' who ',
' will ',
' with ',
' de ',
' el ',
' la ',
'&quot;',
'http //',
'https //',
'www ',
'@gmail',
'@yahoo',
'@outlook',
'@hotmail',
' com ',
' net ',
' org ',
' co ',
' us '
], ' ',

// Add space to the beginning and end
' ' .
// Convert to lower case
strtolower(
// Remove single quote
str_replace(["'", '`'], '', $text)
)
. ' ');

// Replace any non-letter or non-digit character remaining with a -, then
// trim from ends
return
trim(
preg_replace("/[^a-z\d]+/", '-', $text)
, ' -'); // trim() takes literal characters, so ' -' rather than '\s-'
}


I'm tempted to remove duplicate words, too; in the example above, I end up with:

temp-test-csdude-100-isnt-csdudes-version-example-example-rest-were-ok

I could convert it to an array, use array_unique() to remove dupes, then convert it back to a string, like so:

return implode('-', array_unique(explode('-', $text)));

That moves the run time up to 0.26, but I end up with a better result, I think:

temp-test-csdude-100-isnt-csdudes-version-example-rest-were-ok
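For anyone curious, here's the dedupe step on its own (standalone version of that last line; the helper name is just for the demo):

```php
<?php
// Standalone demo of the dedupe step: split the slug on '-', drop
// repeated words (array_unique keeps the FIRST occurrence of each),
// then glue it back together. implode() doesn't care that array_unique
// leaves gaps in the keys, so no re-indexing is needed.
function dedupeSlug($slug) {
    return implode('-', array_unique(explode('-', $slug)));
}

echo dedupeSlug('temp-test-csdude-100-isnt-csdudes-version-example-example-rest-were-ok');
// temp-test-csdude-100-isnt-csdudes-version-example-rest-were-ok
```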

JayDub

8:31 pm on Mar 31, 2020 (gmt 0)




Nice! I think removing the duplicates is a good idea, and even that way, like you said, it's faster than the original, so that's great :)