Apache Web Server Forum

    
Stuck! Substitution #1 works but the rest of the url disappears
mod_rewrite, substitution, [N], 301, redirect
blurped




msg:4606339
 7:41 am on Aug 31, 2013 (gmt 0)

I'm creating 301 redirects to remove the phrase e28099 which appears in many urls, sometimes multiple times. I can get the first substitution to work, but the rest of the url string drops off. Here's a hypo to clarify:

Hypo url pre-redirect http://example.com/news/news-archives/childrene28099s-home-fills-patientse28099-needs
Hypo url post-redirect http://example.com/childrens-home-fills-patients-needs
This is the rule that I'm using:

RewriteCond %{HTTP_HOST} ^example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule ^news\/news\-archives\/?(.*)$ "http\:\/\/example\.com\/$1" [N]
RewriteRule (.*)e28099(.*)? $1 [R=301,L]


But it only puts out the first part: http://example.com/childrens

This is the first week I've really dug into redirects and rewrites at this level, enjoyed working on them so far despite some snags. I'd very much appreciate some guidance!

sidenote question: the canonical is set up and seems to work so that everything's directed to http://example.com... so can the 2nd RewriteCond be dropped?

 

robzilla




msg:4606405
 1:42 pm on Aug 31, 2013 (gmt 0)

RewriteRule (.*)e28099(.*)? $1 [R=301,L]

But it only puts out the first part

You forgot $2 in the target. $1 is whatever the first parenthesized group in your pattern captured, $2 the second, and so on.
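
As an illustration only (this uses Python's `re` module rather than mod_rewrite, so the backreferences are `group(1)` and `group(2)` instead of `$1` and `$2`), here is a minimal sketch of how the two groups split a path. The sample path with a single e28099 is made up for the demo.

```python
import re

# Same pattern as the RewriteRule, minus the redundant trailing "?".
pattern = re.compile(r"(.*)e28099(.*)")

# Hypothetical path containing the unwanted string exactly once.
path = "childrene28099s-home"

m = pattern.match(path)
only_first = m.group(1)            # comparable to a target of "$1"
both = m.group(1) + m.group(2)     # comparable to a target of "$1$2"

print(only_first)  # children
print(both)        # childrens-home
```

Everything after the matched e28099 lives in the second group, so a target that uses only the first group drops it.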

blurped




msg:4606424
 4:08 pm on Aug 31, 2013 (gmt 0)

So that works for replacing the phrase, but then part of the url gets repeated.

Modified Code:
RewriteCond %{HTTP_HOST} ^example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule ^news\/news\-archives\/?(.*)$ "http\:\/\/example\.com\/$1" [N]
RewriteRule (.*)e28099(.*)? $1$2 [R=301,L]


Result:
http://example.com/childrens-home-fills-patients-needs/news-archives/childrens-home-fills-patients-needs/

Any ideas?

g1smd




msg:4606426
 4:14 pm on Aug 31, 2013 (gmt 0)

This is one of those occasions where I would internally rewrite the request to a special "fix-e28099.php" script that groks the requested URL, then does a string replace operation on it to work out what the new URL should be, then sends a 301 HEADER with the new URL.

This keeps the RewriteRule simple and the PHP is processed only for those requests.

blurped




msg:4606428
 4:22 pm on Aug 31, 2013 (gmt 0)

I really appreciate your help with this. Would you happen to have a code snippet for that by chance?

lucy24




msg:4606483
 10:09 pm on Aug 31, 2013 (gmt 0)

the canonical is set up and seems to work so that everything's directed to http://example.com... so can the 2nd RewriteCond be dropped?

The Conditions aren't needed at all unless you need to exclude other domains or subdomains that pass through the same config/htaccess. But it can still be expressed as a single
^(www\.)?example\.com
without ending anchor (to allow for maverick requests that come in with port number attached)

Two more serious issues on the same theme:

-- RewriteRules go from most specific to most general, so at the point a request meets this rule, a canonicalization redirect has not yet happened. If it comes earlier, move it.

-- It seems as if the same conditions are intended to apply to both rules. If so, you need to list them over again each time. Conditions apply only to the immediately following rule.

In real life, does the string e28099 occur in any other URLs (not under the target hostname)? If not, you can simply leave off the conditions.

The form
(.*)e28099
is always a bit iffy. Are there any real-life limits to what might come before "e28099"? Put them in the rule, and non-matching requests will be out of there all the sooner.

So that works for replacing the phrase, but then part of the url gets repeated.

Modified Code:
<snip, see above about repeating Conditions>
RewriteRule ^news\/news\-archives\/?(.*)$ "http\:\/\/example\.com\/$1" [N]
RewriteRule (.*)e28099(.*)? $1$2 [R=301,L]

Result:
{ http://example.com/news/news-archives/childrene28099s-home-fills-patientse28099-needs }

http://example.com/childrens-home-fills-patients-needs/news-archives/childrens-home-fills-patients-needs/


You need to pay close attention to the ordering of rules here, because you've got multiple scenarios:

request contains "news/news-archives"
request contains "e28099" (one or more times)
request contains both

Unless your name is jdMorgan, do not use the [N] flag. Instead, rearrange your rules:
FIRST rule for requests containing both
THEN two rules for requests containing one pattern or the other

If the element e28099 never occurs more than two or three times (how did it get there? did your cat walk across the keyboard?) you can make separate RewriteRules. Otherwise make a quick php detour to get rid of all of them. Since the php page will end up issuing a redirect, it will also have to check for any other elements-- such as "news/news-archives" --that might occur in a request containing "e28099".

Finally: Targets of RewriteRules don't need quotation marks. In mod_rewrite, colons and directory slashes don't need to be escaped.

robzilla




msg:4606484
 10:14 pm on Aug 31, 2013 (gmt 0)

Sorry, I hadn't looked at the rules in detail before. I think you have your rules upside down: you want to repeat ([N]-flag) the search and replace for paths containing 'e28099'. Only when you've replaced all occurrences, should you do the redirect (once).

RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$
RewriteRule (.*)e28099(.*) $1$2 [N]
RewriteRule ^news/news-archives/(.*) /$1 [R=301,L]


Untested, and you might have to add a leading slash before news/ if you're editing httpd.conf (preferable) rather than .htaccess.
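
The "repeat until clean, then redirect once" idea can be sketched outside Apache. The Python below only models the repeated substitution; it does not model how mod_rewrite restarts its ruleset, and the function name `strip_all` is hypothetical.

```python
import re

PATTERN = re.compile(r"(.*)e28099(.*)")

def strip_all(path):
    """Apply the one-occurrence substitution until it no longer matches."""
    passes = 0
    m = PATTERN.fullmatch(path)
    while m:
        path = m.group(1) + m.group(2)   # same effect as a "$1$2" target
        passes += 1
        m = PATTERN.fullmatch(path)
    return path, passes

result = strip_all(
    "news/news-archives/childrene28099s-home-fills-patientse28099-needs"
)
print(result)
# ('news/news-archives/childrens-home-fills-patients-needs', 2)
```

Because the first `.*` is greedy, each pass removes the last remaining occurrence, so a path with two occurrences needs two passes before the single redirect.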

Postscript: Sound advice from lucy24. Try to reduce the impact of any and all regular expressions whenever you can.

g1smd




msg:4606529
 6:38 am on Sep 1, 2013 (gmt 0)

This rule goes near the beginning of your htaccess (after all of the rules that block, and before most of the rules that redirect):

RewriteRule e28099 /fix-this.php [L]


On your non-www to www redirect (further down your htaccess file) you must add this exclusion:

RewriteCond %{THE_REQUEST} !e28099


In the PHP file:

1. Extract the requested URL
2. Extract the requested path from that URL
3. String replace e28099 with nothing
4. Prepend protocol and www hostname to the new string
5. Send 301 HEADER


Step 4 is important because then it will always redirect all requests to www without the need for a separate canonicalisation redirect causing an unwanted redirection chain.
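
The string handling in steps 1 to 5 can be sketched as follows. This is Python rather than the PHP the post has in mind, and it is only a sketch under assumptions: the canonical origin `http://www.example.com`, the names `CANONICAL_ORIGIN` and `build_redirect`, and the choice to keep any query string are all illustrative, not taken from the thread.

```python
from urllib.parse import urlsplit

CANONICAL_ORIGIN = "http://www.example.com"  # assumption: protocol + www hostname
BAD_TOKEN = "e28099"

def build_redirect(request_uri):
    """Return (status, Location) for a request whose path contains BAD_TOKEN."""
    parts = urlsplit(request_uri)              # steps 1-2: take the URL, extract its path
    path = parts.path.replace(BAD_TOKEN, "")   # step 3: replace the token with nothing
    location = CANONICAL_ORIGIN + path         # step 4: prepend protocol and hostname
    if parts.query:                            # assumption: keep the query string as-is
        location += "?" + parts.query
    return 301, location                       # step 5: what the 301 header would carry

print(build_redirect("/childrene28099s-home-fills-patientse28099-needs"))
# (301, 'http://www.example.com/childrens-home-fills-patients-needs')
```

A plain string replace removes every occurrence in one pass, however many there are, which is the point of handing these requests to a script.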

lucy24




msg:4606541
 7:35 am on Sep 1, 2013 (gmt 0)

3. String replace e28099 with nothing

3b. String replace news/news-archives/ with nothing.

There may be others too, but that's the one that came up in the OP. If "e28099" only occurs in filenames, there will at least be no need for
3c. String replace index.xtn with nothing ;)
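
A minimal sketch of that extended replace step, again in Python for illustration only. The list of substrings and the name `clean_path` are assumptions; note that a plain replace removes "news/news-archives/" wherever it appears in the path, not just at the start.

```python
# Substrings to delete from the requested path (assumed list; extend as needed).
STRIP = ("news/news-archives/", "e28099")

def clean_path(path):
    """Remove every occurrence of each unwanted substring from the path."""
    for unwanted in STRIP:
        path = path.replace(unwanted, "")
    return path

print(clean_path(
    "/news/news-archives/childrene28099s-home-fills-patientse28099-needs"
))
# /childrens-home-fills-patients-needs
```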

blurped




msg:4606550
 8:30 am on Sep 1, 2013 (gmt 0)

does the string e28099 occur in any other URLs (not under the target hostname)? If not, you can simply leave off the conditions.

Yes, the group that handled this site before did not have a proper redirect in place for an old domain name, resulting in duplicate content issues. Thank you for clarifying the conditions, I went ahead and removed them.

The form
(.*)e28099
is always a bit iffy. Are there any real-life limits to what might come before "e28099"? Put them in the rule, and non-matching requests will be out of there all the sooner.

ok, will do

You need to pay close attention to the ordering of rules here, because you've got multiple scenarios:

Unless your name is jdMorgan, do not use the [N] flag. Instead, rearrange your rules:
FIRST rule for requests containing both
THEN two rules for requests containing one pattern or the other

ok, that makes sense

how did it get there? did your cat walk across the keyboard?

The CMS the site was on before auto-generated page titles. That little phrase was the replacement for an apostrophe. But I like the cat story better :)

Targets of RewriteRules don't need quotation marks.

cool, fixed that too

Thanks for your help! I'm going to tinker around with it a bit more. The Redirect 301 is what is currently in place for the rest of the site urls. They work fine, but it seems like they might be inefficient since there's no flag for it to stop searching. So naturally, this little phrase problem led me to the Rewrite rule.

lucy24




msg:4606553
 9:06 am on Sep 1, 2013 (gmt 0)

That little phrase was the replacement for an apostrophe.

:: quick detour to Character Viewer ::

D'oh! UTF-8 E2 80 99 = right single quotation mark (curly apostrophe) = &rsquo;

childrene28099s-home-fills-patientse28099-needs = children's-home-fills-patients'-needs

Edit:
Even if you're doing it all in htaccess, you don't need three rulesets, because you can start with

^(?:news/news-archives/)?blahblah-including-e28099

and carry on as before. Just make sure the e28099 stuff comes before the rule involving "news/news-archives/" by itself.

[N] is icky because it doesn't just mean "keep executing this same rule over and over again until it rinses clean", it means "go all the way back to the beginning and do all your mod_rewrite stuff over again, including stuff you've already tested for that can't possibly have changed".

blurped




msg:4606593
 5:40 pm on Sep 1, 2013 (gmt 0)

oh oh oh! This seems to work!

RewriteRule ^news/news-archives/(.*)e28099(.*)e28099(.*)$ http\:\/\/example\.org\/$1$2$3 [R=301,L]

RewriteRule ^news/news-archives/(.*)e28099(.*)$ http\:\/\/example\.org\/$1$2 [R=301,L]

RewriteRule ^news/news-archives/(.*)?$ http\:\/\/example\.org\/$1? [R=301,L]


Line #1 replaces the phrase if used 2x:
{ http://example.com/news/news-archives/childrene28099s-home-fills-patientse28099-needs }
http://example.com/childrens-home-fills-patients-needs

Line #2 replaces the phrase if used only 1x:
{ http://example.com/news/news-archives/childrene28099s-home-fills }
http://example.com/childrens-home-fills

Line #3 redirects a page without the phrase
{ http://example.com/news/news-archives/children }
http://example.com/children

[edited by: phranque at 12:58 am (utc) on Sep 2, 2013]
[edit reason] exemplified domain [/edit]

blurped




msg:4606597
 6:17 pm on Sep 1, 2013 (gmt 0)

Also, g1smd, thank you for the advice, I'll read up more about that

phranque




msg:4606651
 1:03 am on Sep 2, 2013 (gmt 0)

welcome to WebmasterWorld, blurped!


http\:\/\/example\.org\/$1$2$3

there is no need to escape anything in the Substitution string.
you should remove all those backslashes.

blurped




msg:4606662
 2:14 am on Sep 2, 2013 (gmt 0)

there is no need to escape anything in the Substitution string.
you should remove all those backslashes.

yea, I noticed that just earlier and have been cleaning it up. Thank you all for your help! These forums are great & I'll definitely start contributing (and asking lots of questions in this area :) Glad to have found ya'll!

g1smd




msg:4606779
 4:04 pm on Sep 2, 2013 (gmt 0)

Multiple .* within those patterns makes for really inefficient code. The .* means "capture the entire remainder of the string". This will force many thousands of "back off and retry" trial matches per request.

There isn't a more efficient pattern you can use in place of .* here, which is why handing those requests off to a PHP script is a much better approach.

[edited by: incrediBILL at 10:33 pm (utc) on Sep 4, 2013]
[edit reason] fixed typo [/edit]

lucy24




msg:4606905
 11:20 pm on Sep 2, 2013 (gmt 0)

Do the URLs ever contain numerals? If not,

([a-z-]+)e28099

et cetera will reduce hiccuping. Since the URLs are no longer being actively created, you know how many e28099 there are. Is two the absolute ceiling?

You can say [a-z-]* for the last capture (the package after the final e28099) but the others should be + since you will never have two consecutive apostrophes, and never a leading one. Well, unless you've got a page whose name started out as
'tain't so
or similar.
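
To check how the tighter character classes behave, here is a sketch using Python's `re` module (mod_rewrite uses PCRE, which treats these particular constructs the same way as far as I know). It covers the two-occurrence case only, combines the classes with the optional non-capturing prefix suggested earlier in the thread, and the name `rewrite` is hypothetical.

```python
import re

pattern = re.compile(
    r"^(?:news/news-archives/)?"   # optional prefix, not captured
    r"([a-z-]+)e28099"             # text before the first e28099
    r"([a-z-]+)e28099"             # text between the two occurrences
    r"([a-z-]*)$"                  # whatever follows the last one (may be empty)
)

def rewrite(path):
    """Return the cleaned target, or None if the path does not match."""
    m = pattern.match(path)
    if m is None:
        return None
    return "/" + "".join(m.groups())

print(rewrite("news/news-archives/childrene28099s-home-fills-patientse28099-needs"))
# /childrens-home-fills-patients-needs
print(rewrite("news/news-archives/children"))
# None
```

Since `[a-z-]` cannot match a digit or a slash, each group can only back off across a handful of letters before the literal e28099 either fits or the match fails.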

lucy24




msg:4607358
 9:33 pm on Sep 4, 2013 (gmt 0)

This was in htaccess on a production server? I missed that part :(
