
Accessing MySQL's stopwords via PHP

         

csdude55

7:19 pm on Mar 28, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have my message board set up to show the subject in the URL. I'm not sure if it actually has value anymore other than just optics, but I originally did it for search engines.

So this thread might look like:

example.com/accessing-mysqls-stopwords-via-php/12345

I currently do this via PHP:

$uri_txt = preg_replace("~
^[\s-]+ | # opening whitespace
\" |      # stray double quotes

# bad words
\b(
[a-z]'(?:s|t|d|ve|re|ll|m) | # contractions
an? |
about |
are |
a[st] |
b[ey] |
de |
for |
from |
how |
i[nst]? |
la |
o[fnr] |
th(?:at|e|is) |
to |
was |
what |
when |
where |
who |
will |
with
)\b |

# http, www, com
https?:// |
www\. |
\.(com|net|org|co|us) |

[^a-zA-Z0-9\s-] |
[\s-]+$ # trailing whitespace
~x", '', strtolower($text));


The list of words I auto-remove comes from InnoDB's list of stopwords, but MyISAM's list has more like 300 words:

[dev.mysql.com...]

And I'm not entirely sure whether I should remove contractions (isn't, aren't, who's, webmasterworld's, etc.).

Before I manually code all of those words into the preg_replace(), can you guys and gals suggest a faster / better / easier way to do this?

JayDub

2:38 am on Mar 29, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



That's a small enough list that I think what you're doing should be fine. If it were longer, you could probably save a blip of time by switching the full word list to an array() and using str_replace(), then just running the preg for the [^a-z0-9\s-] pattern and the check for the space at the end. And if you enjoy a challenge, you could shorten what you have so it only checks each starting letter once, with something like this: w(?:as|h(?:at|en|ere|o)|i(?:ll|th)) | <-- EDITED: I think that's close now.
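Just to sketch what I mean by the array + str_replace() version (word list shortened for illustration, function name made up, untested beyond the one example):

```php
<?php
// Rough sketch: handle the plain literal stopwords with str_replace(),
// then let one short preg clean up everything that actually needs a pattern.
// Note: back-to-back stopwords ("of a the") may need a second pass, since
// each replacement consumes the shared space.
function slugify($text, $stopwords) {
    // Pad with spaces so words at the very start/end still match ' word '
    $text = ' ' . strtolower($text) . ' ';
    $text = str_replace($stopwords, ' ', $text);
    // One small preg for the leftover punctuation
    $text = preg_replace('/[^a-z0-9\s-]/', '', $text);
    // Collapse runs of whitespace/hyphens and trim the ends
    return trim(preg_replace('/[\s-]+/', '-', $text), '-');
}

$stopwords = array(' a ', ' an ', ' the ', ' and ', ' of ', ' to ');
echo slugify('Accessing the stopwords of MySQL', $stopwords);
// accessing-stopwords-mysql
```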

But personally I think what you have is fine. I would stick with the shorter InnoDB list and even leave the contractions, since you could get some unexpected/undesired results with the full MyISAM list: "one can't equal four or five" would be reduced to just "equal" using the MyISAM stopword list.

csdude55

3:34 am on Mar 30, 2020 (gmt 0)




That's a good point on speed... I did a benchmark test, and if I could use str_replace() on the bad words it really would be about 10 times faster! I have this script running 20 times on some page loads, and while microseconds are "usually" negligible, I've been focused on load time for a while so I still think about it.

The problem I hit, though, is that I can't figure out a way to set up word boundaries for str_replace. So this:

$text = str_replace(array('a', 'an'), '', $text);

removes every "a" in the string, not every "\ba\b".

If anyone can suggest a way to only match whole words with str_replace(), I'd love to have the speed increase :-)

And after some testing, I agree on contractions; it was catching too many things that were relevant.

JayDub

4:27 am on Mar 30, 2020 (gmt 0)




With str_replace you have to put the spaces or other boundaries in: $text = str_replace(array(' a ', ' an '), '', $text);

You should be able to use strpos & strrpos to see if the string starts or ends with one of the words, then use preg or substr_replace if it does, with something like this:

$words = array('a ', 'an ');
foreach ($words as $word) {
    if (strpos($text, $word) === 0) {
        $text = substr($text, strlen($word));
    }
}

You reverse the space in the list for strrpos (put it before each word instead of after). Doing it that way keeps the regex engine from firing unless you actually need it.
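The end-of-string check would look something like this (just a sketch, function name made up; it only strips one trailing word per call, so repeated trailing stopwords would need a loop):

```php
<?php
// Sketch: strip a stopword from the very end of the string without
// firing the regex engine. The words carry their space on the LEFT
// for this end-of-string check (reversed from the start-of-string list).
function stripEndWord($text, $endWords) {
    foreach ($endWords as $word) {
        // strrpos finds the last occurrence; if that occurrence sits
        // exactly at strlen($text) - strlen($word), the string ends with it
        if (strrpos($text, $word) === strlen($text) - strlen($word)) {
            return substr($text, 0, -strlen($word));
        }
    }
    return $text;
}

echo stripEndWord('this is a test in', array(' a', ' an', ' in', ' the'));
// this is a test
```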

csdude55

6:52 pm on Mar 30, 2020 (gmt 0)




Interesting thing I just discovered... I moved the strings from the preg_replace() that didn't actually NEED regex to a separate str_replace(), and it cut the process time a LOT!

I'll post my entire benchmark test, but the one using str_replace() and preg_replace() combined was 10 times faster than using preg_replace() alone.

I would appreciate feedback on whether my test is accurate!

I ran it here:

[sandbox.onlinephpfunctions.com...]

// Test 1
$test = false;
$start_time = microtime(TRUE);

function urify($text) {
return
trim(
preg_replace("#[^a-z\d]+#", '-',
preg_replace("#
\b(
an?d?|
about|
are|
a[st]|
b[ey]|
for|
from|
how|
i[nst]?|
o[fnr]|
th(?:at|e|is)|
to|
was|
what|
when|
where|
who|
will|
with |

de|
el |
la
)\b\s* |

&quot; |

https?:// |
www\. |
@(?:gmail|yahoo|outlook|hotmail) |
\.(?:com|net|org|co|us)\b |

[^a-z\d_\s\"`~@\#&*()=+\;:/.?!]
#mx", '', strtolower($text)
)
), ' -'); // trim() takes literal characters, so ' -' rather than '\s-'
}

for ($i = 0; $i < 10000; $i++) {

$test = urify("this is a temp about test... for -csdude- 100% and isn't csdude's version of csdude@gmail.com, https://www.example.rest, and we're OK with it ");

}

$end_time = microtime(TRUE);

echo 'Test 1: ';
echo "#$test#<br>\n";
echo $end_time - $start_time;
echo "<br>\n<br>\n\n";

// Test 2
$test = false;
$start_time = microtime(TRUE);

function urifyTwo($text) {
return
trim(
preg_replace("#[^a-z\d]+#", '-',
preg_replace("#
\b(
an?d?|
about|
are|
a[st]|
b[ey]|
for|
from|
how|
i[nst]?|
o[fnr]|
th(?:at|e|is)|
to|
was|
what|
when|
where|
who|
will|
with |

de|
el |
la
)\b\s* |

[^a-z\d_\s\"`~@\#&*()=+\;:/.?!]
#mx", '', strtolower(
str_replace([
'&quot;',
'http://',
'https://',
'www.',
'@gmail',
'@yahoo',
'@outlook',
'@hotmail',
'.com',
'.net',
'.org',

// these next 2 could have false positives, so maybe move them back to regex
'.co',
'.us'
], '', $text))
)
), ' -'); // trim() takes literal characters, so ' -' rather than '\s-'
}

for ($i = 0; $i < 1000; $i++) {

$test = urifyTwo("this is a temp about test... for -csdude- 100% and isn't csdude's version of csdude@gmail.com, https://www.example.rest, and we're OK with it ");

}

$end_time = microtime(TRUE);

echo 'Test 2: ';
echo "#$test#<br>\n";
echo $end_time - $start_time;

csdude55

5:47 am on Mar 31, 2020 (gmt 0)




Sorry to reply to myself, but there was a typo in there that messed up everything... the first run was 10,000 iterations, the second was only 1,000!

The second (using str_replace) was still faster, but only marginally so.

preg_replace(): 0.41304802894592
preg_replace with str_replace(): 0.31034588813782

JayDub

7:10 am on Mar 31, 2020 (gmt 0)




I actually just remembered one of those 'old tricks' for using str_replace, so I'm thinking this (or something similar) might be even a bit quicker. It could obviously be condensed, but I wanted it to be more readable/understandable:

$text="this is a temp about test... for -csdude- 100% and isn't csdude's version of example@gmail.com, https://www.example.rest, and we're OK with it ";

# Convert to lower case
$text=strtolower($text);

# Remove '
$text=str_replace("'",'',$text);

# Replace stop words ... Adding a space to the beginning and end of the text string allows the replace to work for stop words at the beginning and end of the string

$replace=array(' an ',' and ',' about ',' are ',' as ',' at ',' be ',' by ',' or ',' from ',' how ',' i ',' in ',' is ',' it ',' of ',' on ',' that ',' the ',' this ',' to ',' was ',' what ',' when ',' where ',' who ',' will ',' with ',' de ',' el ',' la ','&quot;','http://','https://','www.','@gmail','@yahoo','@outlook','@hotmail','.com ','.net ','.org ','.co ','.us ');
$text=str_replace($replace,' ',' '.$text.' ');

# Replace any non-letter or non-digit character remaining with a -
$text=preg_replace("#[^a-z0-9-]+#",'-',$text);
$text=trim($text);

csdude55

8:07 pm on Mar 31, 2020 (gmt 0)




That's a cool trick! I had to play with it a little, though, to catch something like, "what's this about?" Or worse, "hey guys!What should we talk about?"

With the original long text, this version is a bit faster; on 10,000 iterations, this one is 0.20 while my previous one with preg_replace() and str_replace() was 0.33. With the shorter text of "hey guys!What should we talk about?", though, they're virtually the same (0.118 vs 0.130). But faster is faster! And I think it's easier to read :-)

$text = "this is a temp about test... for -csdude- 100% and isn't csdude's version of example@gmail.com, https://www.example.rest, and we're OK with it ";
// $text = "hey guys!What should we talk about?";

function urify($text) {

$text = str_replace([
// common punctuation
'.',
'?',
'!',
'"', // double-quote
':',
',',

// bad words
' a ',
' an ',
' and ',
' about ',
' are ',
' as ',
' at ',
' be ',
' by ',
' for ',
' from ',
' how ',
' i ',
' in ',
' is ',
' it ',
' of ',
' on ',
' or ',
' that ',
' the ',
' this ',
' to ',
' was ',
' what ',
' when ',
' where ',
' who ',
' will ',
' with ',
' de ',
' el ',
' la ',
'&quot;',
'http //',
'https //',
'www ',
'@gmail',
'@yahoo',
'@outlook',
'@hotmail',
' com ',
' net ',
' org ',
' co ',
' us '
], ' ',

// Add space to the beginning and end
' ' .
// Convert to lower case
strtolower(
// Remove single quote
str_replace(["'", '`'], '', $text)
)
. ' ');

// Replace any non-letter or non-digit character remaining with a -, then
// trim from ends
return
trim(
preg_replace("/[^a-z\d]+/", '-', $text)
, ' -'); // trim() takes literal characters, so ' -' rather than '\s-'
}


I'm tempted to remove duplicate words, too; in the example above, I end up with:

temp-test-csdude-100-isnt-csdudes-version-example-example-rest-were-ok

I could convert it to an array, use array_unique() to remove dupes, then convert it back to a string, like so:

return implode('-', array_unique(explode('-', $text)));

That moves the run time up to 0.26, but I end up with a better result, I think:

temp-test-csdude-100-isnt-csdudes-version-example-rest-were-ok
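For anyone curious, here's the dedupe step on its own (standalone version of that last line; the helper name is just for the demo):

```php
<?php
// Standalone demo of the dedupe step: split the slug on '-', drop
// repeated words (array_unique keeps the FIRST occurrence of each),
// then glue it back together. implode() doesn't care that array_unique
// leaves gaps in the keys, so no re-indexing is needed.
function dedupeSlug($slug) {
    return implode('-', array_unique(explode('-', $slug)));
}

echo dedupeSlug('temp-test-csdude-100-isnt-csdudes-version-example-example-rest-were-ok');
// temp-test-csdude-100-isnt-csdudes-version-example-rest-were-ok
```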

JayDub

8:31 pm on Mar 31, 2020 (gmt 0)




Nice! I think removing the duplicates is a good idea, and even that way, like you said, it's faster than the original, so that's great :)