Forum Moderators: coopster

preg replace Help please

remove unwanted characters from url

         

RateLadder

6:44 am on Aug 30, 2008 (gmt 0)

10+ Year Member



I have a bunch of HTML text containing URLs. Some of the URLs point to locations in subdirectories. When they point to these subdirectories, they have extra characters that need to be removed; when they don't, I don't want to remove anything.

For example, with these URLs (http:// removed to break the link, but it would exist in the text):

www.<someurl>.com/money/es_MX/
www.<someurl>.com/money/de/
www.<someurl>.com/money/fr/
www.<someurl>.com/money/Coins-and-Paper-Money/Coins-Ancient/es_MX/

would become
www.<someurl>.com/money
www.<someurl>.com/money
www.<someurl>.com/money
www.<someurl>.com/money/Coins-and-Paper-Money/Coins-Ancient

Other links would not change
www.<someurl>.com/paper/blah/de/
www.<someurl>.com/paper/blah/fr/
www.<someurl>.com/paper/blah/es_MX/

There is a series of 6 subdirectories that need URL fixing. Let's call them dir1...dir6,

and a series of 32 possible /xx[xx]/ link endings that need removing.

Any help is appreciated.

eelixduppy

3:24 pm on Aug 30, 2008 (gmt 0)



Hello and Welcome to WebmasterWorld! :)

Well, assuming the strings don't appear anywhere else in the URLs then you can create an array of the subdirectories you want to remove, then replace those with nothing. For example:


$remove = array(
    '/es_MX/',
    '/de/',
    '/fr/',
    // ...and so on for the remaining language directories
);
# then replace them
$html = str_replace($remove, '', $html);

Try that and see where it gets you.

coopster

3:30 pm on Aug 30, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Welcome to WebmasterWorld, RateLadder.

You could put the subdirectories you want addressed in an array, apply each value to your pattern on each loop iteration, and run your preg_replace once the pattern has been adjusted. In pseudocode, something like this ...

$subs = array('money', 'anotherdir', 'nextdir', 'lastdir');
foreach ($subs as $sub) {
    $pattern = "/pattern here utilizing your '$sub' variable/";
    $subject = preg_replace($pattern, $replacement, $subject);
}
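
To make the pseudocode above concrete, here is a minimal sketch of that loop. The function name strip_lang_suffix, the example.com domain, and the short directory and language lists are placeholders for illustration, not values from the thread; the pattern follows the one RateLadder tested in EditPadPro.

```php
<?php
// Placeholder data: swap in the real directories and language codes.
$subs  = array('money', 'anotherdir');
$langs = array('es_MX', 'de', 'fr');

// Hypothetical helper: strips a trailing /lang/ segment, but only from
// URLs that live under one of the listed subdirectories.
function strip_lang_suffix($subject, array $subs, array $langs)
{
    // Escape each language code for use inside a #-delimited pattern.
    $quote   = function ($s) { return preg_quote($s, '#'); };
    $langAlt = implode('|', array_map($quote, $langs));

    foreach ($subs as $sub) {
        // Group 1 captures everything before the language segment.
        $pattern = '#(http://www\.example\.com/' . preg_quote($sub, '#')
                 . '(?:/[^/]+)*)/(?:' . $langAlt . ')/#';
        $subject = preg_replace($pattern, '$1', $subject);
    }
    return $subject;
}

echo strip_lang_suffix('http://www.example.com/money/es_MX/', $subs, $langs), "\n";
// prints http://www.example.com/money
echo strip_lang_suffix('http://www.example.com/paper/blah/de/', $subs, $langs), "\n";
// prints http://www.example.com/paper/blah/de/ (unchanged)
```

Running one preg_replace per subdirectory keeps each pattern simple; joining the subdirectories into a single alternation, as RateLadder does further down, should work just as well.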

RateLadder

3:55 pm on Aug 30, 2008 (gmt 0)

10+ Year Member



Here is where I got to... but it doesn't work... I tested the pattern using EditPadPro... Also if I remove the # for the pattern I get a / is an unknown modifier....


<?php
$subs = "dir1|dir2|dir3|dir4|dir5|dir6";
$domain = "somedomain";
$langs = "de|pt|pt_BR|ko|es|es_MX|fr|it|ja|zh|zh_TW|ar|nl|el|ru|no|en|bg|hr|cs|da|fi|hu|is|tl|pl|ro|sr|sl|cy|tr|la|sv";
$src = fopen("src.txt","r");
$dest = fopen("dest.txt","w");
while (!feof($src))
{
$data = fgets($src);
fwrite($dest,preg_replace("#(http://www\.".domain."\.com/(?:".subs.")(?:/[^/]+)*)/(".langs.")/#","\1",$data));
}
fclose($src);
fclose($dest);
?>

RateLadder

4:09 pm on Aug 30, 2008 (gmt 0)

10+ Year Member



@eelixduppy

Sometimes the language directories are valid and desired, so that solution won't work...

@coopster

The pattern I used in the code above came from a regular expression I tested in EditPadPro. It found all the matches and captured the new value in the first backreference. I think I am just doing something wrong in the PHP code or the PHP regular expression syntax...

RateLadder

4:12 pm on Aug 30, 2008 (gmt 0)

10+ Year Member



Here is my test expression that worked in EditPadPro as is:

(http://www\.someurl\.com/(?:dir1|dir2|dir3|dir4|dir-4|dir-5)(?:/[^/]+)*)/(es_MX|de)/

coopster

4:27 pm on Aug 30, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



At first glance that looks good. So at this point I'm assuming you are having trouble incorporating that within your loop or perhaps just getting the pattern syntax down?

"Also if I remove the # for the pattern I get a / is an unknown modifier"

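Perl-compatible regular expressions must be enclosed in delimiters [php.net]. You are using the pound sign (#) as the delimiter. As soon as you remove it, PHP takes the opening parenthesis as the delimiter instead, so everything after its matching closing parenthesis, starting with the /, is read as pattern modifiers. That is why you get the error.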
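
As a small illustration of the delimiter point, the sketch below runs the same replacement with # and with / as the delimiter. The example.com URL is a placeholder, and the pattern is cut down to a single directory and language code to keep it readable.

```php
<?php
// Placeholder input; example.com stands in for the real domain.
$url = 'http://www.example.com/money/de/';

// '#' as delimiter: the literal slashes in the URL need no escaping.
$withHash = preg_replace('#(http://www\.example\.com/money)/de/#', '$1', $url);

// '/' as delimiter: every literal slash in the pattern must be escaped.
$withSlash = preg_replace('/(http:\/\/www\.example\.com\/money)\/de\//', '$1', $url);

echo $withHash, "\n";   // prints http://www.example.com/money
echo $withSlash, "\n";  // prints http://www.example.com/money
```

Both calls give the same result; # is simply easier to read when the pattern is full of slashes.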

RateLadder

4:34 pm on Aug 30, 2008 (gmt 0)

10+ Year Member



What I get is an exact duplicate of the source file... with several links not adjusted as expected given the pattern testing in EditPadPro

RateLadder

4:40 pm on Aug 30, 2008 (gmt 0)

10+ Year Member



Ok... I took out the variables and it worked... I put the variables back in and it worked... Not sure what, but I had a minor syntax error with the variables...

THANKS!
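
The thread never says what the syntax error was. Two candidates are visible in the script posted earlier: domain, subs and langs are written without the leading $, and the replacement "\1" sits in a double-quoted string, where PHP reads it as an octal escape rather than a backreference. The sketch below assumes those were the culprits. The fix_line helper and the shortened value lists are hypothetical and only there to make the example testable.

```php
<?php
// Placeholder values; the thread does not give the real domain.
$domain = 'example';
$subs   = 'dir1|dir2';
$langs  = 'de|es_MX|fr';

// Hypothetical helper wrapping the preg_replace call from the script.
function fix_line($line, $domain, $subs, $langs)
{
    // Every variable carries its $, and the pattern is built from
    // single-quoted pieces so the backslashes reach PCRE untouched.
    $pattern = '#(http://www\.' . $domain . '\.com/(?:' . $subs . ')'
             . '(?:/[^/]+)*)/(?:' . $langs . ')/#';

    // '$1' in single quotes is the first capture group.
    return preg_replace($pattern, '$1', $line);
}

echo fix_line("http://www.example.com/dir1/a/b/es_MX/\n", $domain, $subs, $langs);
// prints http://www.example.com/dir1/a/b
```

In the original file loop, the same call would replace the preg_replace inside fwrite(), with each line read by fgets() passed in as $line.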