Forum Moderators: coopster

preg_replace/str_replace problem

Some strange problems!

         

ahmedtheking

10:34 pm on Sep 25, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, I have created a search engine script that goes through my database and indexes all the words. I created a 'noise word' list that gets rid of noise words. First I used str_replace, and that didn't seem to work. Then I tried preg_replace, and that's when I got some strange problems!

The script works in a foreach loop: each page runs the loop over its words, and all the noise words should be stripped out in that loop. When there's only one page (so the loop runs only once), it seems fine and the noise words are taken out. As soon as there's more than one page, I get this:

Warning: preg_replace(): Unknown modifier 'a' in XXX

So then I went about escaping all the 'a's, but that's no good because now the noise words aren't found and removed!

Am I doing something wrong? I can attach the code, but it's a bit long!

[edited by: coopster at 7:49 pm (utc) on Sep. 26, 2005]
[edit reason] no email sigs please and thanks :-) [/edit]

JamShady

11:52 pm on Sep 25, 2005 (gmt 0)

10+ Year Member



You probably forgot the delimiter which needs to go around the regular expression.

ahmedtheking

8:43 am on Sep 26, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You mean "/XXX/"? It's there.

RonPK

4:36 pm on Sep 26, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The modifier is the part of the regexp after the end delimiter. So in /[a-z]{5}/i , i is the modifier.
Strange that you only get the warning in certain cases. Maybe you can post the regexp?
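
To illustrate the delimiter/modifier split described above, here is a minimal sketch. The sample strings are made up, and the doubled pattern is only a guess at what a pattern looks like if it gets wrapped in "/ ... /i" a second time; nothing in the thread confirms that this is what happened.

```php
<?php
// Illustration only: the patterns and sample strings are made up for this sketch.

// '/' is the delimiter, '[a-z]{5}' is the pattern, 'i' is the modifier.
$pattern = '/[a-z]{5}/i';
$matched = preg_match($pattern, 'HELLO');   // matches because of the 'i' modifier

// Everything after the closing delimiter is read as modifiers. In this string
// the pattern is just ' ', and ' about /i /i' is parsed as the modifier list.
// Spaces are skipped, so the first rejected character is the letter 'a'.
$broken = '/ / about /i /i';
$result = @preg_replace($broken, '', ' about ');  // @ hides the warning in this demo
$failed = ($result === null);                     // preg_replace returns null on error
```

Without the @ operator, the second call raises the same "Unknown modifier 'a'" warning quoted in the first post.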

ahmedtheking

10:18 pm on Sep 26, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok, here is the preg code:

// get rid of noise words
foreach ($noisewords as $k => $v) {
    $noisewords[$k] = "/ ".strtolower($v)." /i";
}
foreach ($words as $kk => $v) {
    $words[$kk] = " ".strtolower($v)." ";
}

// do find and replace
foreach ($words as $kkk => $v) {
    $words[$kkk] = preg_replace($noisewords,"",$v);
}

and here is the noise words list (feel free to copy and use!):

<?php
// noise words
$noisewords = array(
"about","after","all","also","an","and","another","any","are","as","at","be","because","been","before","being",
"between","both","but","by","came","can","come","could","did","do","each","for","from","get","got","has","had","he",
"have","her","here","him","himself","his","how","if","in","into","is","it","like","make","many","me","might","more",
"most","much","must","my","never","now","of","on","only","or","other","our","out","over","said","same","see","should",
"since","some","still","such","take","than","that","the","their","them","then","there","these","they","this","those",
"through","to","too","under","up","very","was","way","we","well","were","what","where","which","while","who","with",
"would","you","your","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z",
"0","1","2","3","4","5","6","7","8","9","&zwj;","&ensp;","&thinsp;","&emsp;","`","&acute;","&tilde;",
"\^","&macr;","&oline;","&uml;","&uml;","&cedil;","_","&shy;","-","&ndash;","&mdash;",
";",":","!","&iexcl;","\?","&iquest;","\.","&hellip;","&middot;","'","&lsquo;","&rsquo;","&sbquo;","&lsaquo;",
"&rsaquo;","&quot;","&ldquo;","&rdquo;","&bdquo;","&laquo;","&raquo;","\(","\)","\[","\]","\{","\}","&sect;","&para;",
"&copy;","&reg;","@","\*","\/","&frasl;&quot;","\\","&amp;","#","%","&permil;","&dagger;","&Dagger;","&bull;","&prime;",
"&Prime;","&circ;","&deg;","&larr;","&rarr;","&uarr;","&darr;","&harr;","&crarr;","&larr;","&uarr;","&rarr;","&darr;",
"&harr;","&forall;","&part;","&exist;","&empty;","&nabla;","&isin;","&notin;","&ni;","&prod;","&sum;","+","&plusmn;",
"&divide;","&times;","&not;","\¦","&brvbar;","~","&minus;","&lowast;","&radic;","&prop;","&infin;","&ang;","&and;",
"&or;","&cap;","&cup;","&int;","&there4;","&sim;","&cong;","&asymp;","&equiv;","&le;","&ge;","&sub;","&nsub;","&sup;",
"&sube;","&supe;","&oplus;","&otimes;","&perp;","&sdot;","&loz;","&spades;","&clubs;","&hearts;","&diams;","&curren;","&cent;",
"\$","&pound;","&yen;","&euro;","&weierp;","&sup1;","&frac12;","&frac14;","&sup2;","&sup3;","&frac34;","&ordf;",
"&aacute;","&aacute;","&agrave;","&agrave;","&agrave;","&acirc;","&acirc;","&aring;","&aring;","&auml;","&auml;",
"&atilde;","&atilde;","&aelig;","&aElig;",
"&Ccedil;","&ccedil;","&ccedil;","&eth;","&ETH;",
"&Eacute;","&eacute;","&Egrave;","&egrave;","&Ecirc;","&ecirc;","&euml;","&Euml;",
"&fnof;","&fnof;","&image;","&iacute;",
"&Iacute;","&Igrave;","&igrave;","&icirc;","&Icirc;","&Iuml;","&iuml;",
"&Ntilde;","&ntilde;","&ordm;",
"&Oacute;","&oacute;","&ograve;","&Ograve;","&ocirc;","&Ocirc;","&Ouml;","&ouml;","&otilde;","&Otilde;","&oelig;","&OElig;",
"o&ugrave;","&oslash;","&Oslash;","qu&rsquo;",
"&real;","&Scaron;","&scaron;","&szlig;","&trade;",
"&uacute;","&Uacute;","&Ugrave;","&ugrave;","&ucirc;","&Ucirc;","&uuml;","&Uuml;","&nbsp;",
"&Yacute;","&yacute;","&Yuml;","&yuml;","&thorn;","&THORN;","&alpha;","&alpha;","&beta;","&Beta;","&gamma;","&Gamma;",
"&delta;","&Delta;","&epsilon;","&Epsilon;","&Zeta;","&zeta;","&Eta;","&eta;","&Theta;","&theta;","&iota;","&Iota;",
"&kappa;","&Kappa;","&lambda;","&Lambda;","&mu;","&Mu;","&micro;","&nu;","&Nu;","&Xi;","&xi;","&Omicron;","&omicron;",
"&Pi;","&pi;","&Rho;","&rho;","&sigma;","&Sigma;","&sigmaf;","&Tau;","&tau;","&upsilon;","&Upsilon;","&Phi;","&phi;",
"&chi;","&Chi;","&Psi;","&psi;","&omega;","&Omega;","&alefsym;"
);
?>

Sorry, it's a bit long!
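
The list mixes hand-escaped entries such as "\?" with unescaped ones such as "+", which is easy to get wrong when the entries end up inside a regular expression. A minimal sketch of an alternative, assuming the entries are stored as plain text without any hand-added backslashes: let preg_quote do the escaping when the pattern is built. The short $rawNoise list is hypothetical.

```php
<?php
// Hypothetical raw entries, written WITHOUT hand-added backslashes.
$rawNoise = array('and', '+', '$', '?', '/');

$patterns = array();
foreach ($rawNoise as $w) {
    // preg_quote escapes regex metacharacters; the second argument also
    // escapes the '/' delimiter used around each pattern.
    $patterns[] = '/ ' . preg_quote(strtolower($w), '/') . ' /i';
}

$a = preg_replace($patterns, '', ' + ');     // standalone plus sign
$b = preg_replace($patterns, '', ' a+b ');   // plus sign inside a token
$c = preg_replace($patterns, '', ' AND ');   // upper-case noise word
```

Here $a and $c come back empty, while $b is returned unchanged because no padded entry occurs in it.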

RonPK

8:40 am on Sep 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No problems here. Maybe you can insert echo $v; right before the preg_replace, in order to see what input causes the error.

BTW, you should try to use str_replace when you don't really need regular expressions, i.e. in simple replace operations as required here. It's faster and less demanding for the server.

BTW 2, this way HTML entities will not be stripped from words. So arriv&eacute; will remain as it is. (I get the impression that that is not what you want).
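
Both remarks can be checked with a small sketch. The three-entry list is a shortened, hypothetical stand-in for the real one, and each entry is padded with spaces exactly once. The html_entity_decode call at the end is only one possible way to handle entities before indexing, not something the thread settles on.

```php
<?php
// Hypothetical, shortened noise list; each entry is padded exactly once.
$padded = array(' and ', ' to ', ' &eacute; ');

$r1 = str_replace($padded, '', ' and ');            // standalone noise word
$r2 = str_replace($padded, '', ' random ');         // 'and' inside a longer word
$r3 = str_replace($padded, '', ' arriv&eacute; ');  // entity inside a word

// If entities should become real characters first, decode before indexing.
$decoded = html_entity_decode('arriv&eacute;', ENT_QUOTES, 'UTF-8');
```

Only $r1 is emptied. $r2 and $r3 come back unchanged, which is the behaviour described above for arriv&eacute;.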

ahmedtheking

8:58 am on Sep 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok, changed to str_replace, here is 'echo $v;':

welcome
to
firestarter
media
...
and
offline
contactfirestarter

(downsized!)

-- END --

If you have a look at the foreachs before the str_replace:

foreach ($noisewords as $k => $v) {
    $noisewords[$k] = " ".strtolower($v)." ";
}
foreach ($words as $kk => $v) {
    $words[$kk] = " ".strtolower($v)." ";
}

I've added spaces (i.e. " ".strtolower($v)." ";) so that it will only replace full, standalone words and not get rid of 'and' in 'random'.

Does that make sense? But it's still not working! :(
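
Since each entry in $words is already a single word, the whole-word requirement can also be met without any padding. The sketch below swaps in a different technique from the one in the post: a key lookup instead of str_replace. The short input arrays are hypothetical.

```php
<?php
// Hypothetical, shortened inputs for illustration.
$noisewords = array('and', 'the', 'to');
$words      = array('Welcome', 'to', 'random', 'and', 'media');

// Flip the list so each noise word becomes an array key for fast lookup.
$noiseLookup = array_flip(array_map('strtolower', $noisewords));

$kept = array();
foreach ($words as $word) {
    $word = strtolower($word);
    // Keep the word only if it is not an exact match for a noise word.
    if (!isset($noiseLookup[$word])) {
        $kept[] = $word;
    }
}
```

Because the comparison is on the whole word, 'random' survives even though it contains 'and'.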

[edited by: ahmedtheking at 9:03 am (utc) on Sep. 27, 2005]

ahmedtheking

9:02 am on Sep 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hang on, I added an echo after the str_replace and it echoed an array, so I print_r'd it and got this:

FROM ECHO $V >>>> welcome
FROM PRINT_R AFTER STR_REPLACE >>>> Array ( [0] => welcome [1] => to [2] => firestarter [3] => media ... [130] => newslist )

(I've downsized the array btw!)

ahmedtheking

5:44 pm on Oct 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Anything?

RonPK

8:08 pm on Oct 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How come $v is an array? It's supposed to be the string value of an item in the array $words.

ahmedtheking

10:23 pm on Oct 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That was my fault! It was because I was echoing $words instead of $words[$kkk]!

ahmedtheking

10:29 pm on Oct 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Right, I still have a strange problem! The noise words are only filtered on the first pass of the foreach. Here are the echoes:

131
70
Indexed http://www.example.com/main.php?goto=index
259
259
Indexed http://www.example.com/main.php?goto=ab_index
167
167
Indexed http://www.example.com/main.php?goto=cn_index

Key:
1st number is the number of words
2nd number is the number of words after the noise words have been taken away
3rd line is the URL that was indexed

If I change the URL to a different one, only the first one has the words replaced!

[edited by: coopster at 1:04 am (utc) on Oct. 4, 2005]
[edit reason] generalized url per TOS [webmasterworld.com] [/edit]
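
One possible explanation for the pattern in those numbers, offered as a guess because the thread never shows the surrounding code: if the loop that pads $noisewords with spaces sits inside the per-page loop, every entry is padded again on each page, so from the second page on nothing matches. The sketch below reproduces that effect. The data and the helper countKeptWords are hypothetical.

```php
<?php
// Hypothetical stand-ins for the real data: a short noise list and two pages.
$noisewords = array('and', 'the', 'to');
$pages = array(
    'page1' => array('welcome', 'to', 'the', 'media'),
    'page2' => array('news', 'and', 'the', 'list'),
);

// Hypothetical helper: pad each word, strip padded noise words, count survivors.
function countKeptWords(array $words, array $paddedNoise)
{
    $kept = 0;
    foreach ($words as $w) {
        $w = str_replace($paddedNoise, '', ' ' . strtolower($w) . ' ');
        if (trim($w) !== '') {
            $kept++;
        }
    }
    return $kept;
}

// Variant 1 (assumed to mirror the thread): the list is padded inside the
// page loop, so on page 2 every entry has two spaces per side and never matches.
$list = $noisewords;
$countsInside = array();
foreach ($pages as $name => $words) {
    foreach ($list as $k => $v) {
        $list[$k] = ' ' . strtolower($v) . ' ';
    }
    $countsInside[$name] = countKeptWords($words, $list);
}

// Variant 2: pad the list once, before the page loop.
$padded = array();
foreach ($noisewords as $v) {
    $padded[] = ' ' . strtolower($v) . ' ';
}
$countsOutside = array();
foreach ($pages as $name => $words) {
    $countsOutside[$name] = countKeptWords($words, $padded);
}
```

In variant 1 only the first page loses its noise words, as in the counts above. In variant 2 both pages are filtered. The same re-wrapping would turn a preg pattern such as "/ about /i" into "/ / about /i /i" on the second page, which would fit the earlier "Unknown modifier 'a'" warning.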

RonPK

7:11 am on Oct 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You've lost me. Please show the relevant parts of the code.