Home / Forums Index / Code, Content, and Presentation / Apache Web Server

Apache Web Server Forum

    
Stuck! Substitution #1 works but the rest of the url disappears
mod_rewrite, substitution, [N], 301, redirect
blurped



 
Msg#: 4606337 posted 7:41 am on Aug 31, 2013 (gmt 0)

I'm creating 301 redirects to remove the phrase e28099 which appears in many urls, sometimes multiple times. I can get the first substitution to work, but the rest of the url string drops off. Here's a hypo to clarify:

Hypo url pre-redirect http://example.com/news/news-archives/childrene28099s-home-fills-patientse28099-needs
Hypo url post-redirect http://example.com/childrens-home-fills-patients-needs
This is the rule that I'm using:

RewriteCond %{HTTP_HOST} ^example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule ^news\/news\-archives\/?(.*)$ "http\:\/\/example\.com\/$1" [N]
RewriteRule (.*)e28099(.*)? $1 [R=301,L]


But it only puts out the first part: http://example.com/childrens

This is the first week I've really dug into redirects and rewrites at this level, enjoyed working on them so far despite some snags. I'd very much appreciate some guidance!

Side question: the canonical redirect is set up and seems to work, so everything is directed to http://example.com. Can the 2nd RewriteCond be dropped?

 

robzilla

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4606337 posted 1:42 pm on Aug 31, 2013 (gmt 0)

RewriteRule (.*)e28099(.*)? $1 [R=301,L]

But it only puts out the first part

You forgot $2 in the target. $1 is whatever the first parenthesized group in your pattern captured, $2 the second, and so on.
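To see the difference outside Apache, here is a minimal sketch in Python, where \1 and \2 play the role of mod_rewrite's $1 and $2. The sample path is hypothetical and contains the phrase only once; this illustrates how the capture groups behave and is not a drop-in replacement for the RewriteRule.

```python
import re

# Same pattern as in the rule above; Python's \1 and \2 stand in for $1 and $2.
PATTERN = re.compile(r"(.*)e28099(.*)")

# Hypothetical path with a single occurrence of the phrase.
path = "childrene28099s-home"

# Using only the first group drops everything after the phrase.
only_first = PATTERN.sub(r"\1", path)

# Using both groups keeps the text on either side of the phrase.
both_groups = PATTERN.sub(r"\1\2", path)

print(only_first)
print(both_groups)
```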

blurped



 
Msg#: 4606337 posted 4:08 pm on Aug 31, 2013 (gmt 0)

So that works for replacing the phrase, but then part of the url gets repeated.

Modified Code:
RewriteCond %{HTTP_HOST} ^example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule ^news\/news\-archives\/?(.*)$ "http\:\/\/example\.com\/$1" [N]
RewriteRule (.*)e28099(.*)? $1$2 [R=301,L]


Result:
http://example.com/childrens-home-fills-patients-needs/news-archives/childrens-home-fills-patients-needs/

Any ideas?

g1smd

WebmasterWorld Senior Member; g1smd is a WebmasterWorld Top Contributor of All Time; 10+ Year Member



 
Msg#: 4606337 posted 4:14 pm on Aug 31, 2013 (gmt 0)

This is one of those occasions where I would internally rewrite the request to a special "fix-e28099.php" script that groks the requested URL, then does a string replace operation on it to work out what the new URL should be, then sends a 301 HEADER with the new URL.

This keeps the RewriteRule simple and the PHP is processed only for those requests.

blurped



 
Msg#: 4606337 posted 4:22 pm on Aug 31, 2013 (gmt 0)

I really appreciate your help with this. Would you happen to have a code snippet for that by chance?

lucy24

WebmasterWorld Senior Member; lucy24 is a WebmasterWorld Top Contributor of All Time; Top Contributors Of The Month



 
Msg#: 4606337 posted 10:09 pm on Aug 31, 2013 (gmt 0)

the canonical redirect is set up and seems to work, so everything is directed to http://example.com. Can the 2nd RewriteCond be dropped?

The Conditions aren't needed at all unless you need to exclude other domains or subdomains that pass through the same config/htaccess. But the pair can still be expressed as a single Condition pattern:
^(www\.)?example\.com
without ending anchor (to allow for maverick requests that come in with port number attached)

Two more serious issues on the same theme:

-- RewriteRules go from most specific to most general, so at the point a request meets this rule, a canonicalization redirect has not yet happened. If it comes earlier, move it.

-- It seems as if the same conditions are intended to apply to both rules. If so, you need to list them over again each time. Conditions apply only to the immediately following rule.

In real life, does the string e28099 occur in any other URLs (not under the target hostname)? If not, you can simply leave off the conditions.

The form
(.*)e28099
is always a bit iffy. Are there any real-life limits to what might come before "e28099"? Put them in the rule, and non-matching requests will be out of there all the sooner.

So that works for replacing the phrase, but then part of the url gets repeated.

Modified Code:
<snip, see above about repeating Conditions>
RewriteRule ^news\/news\-archives\/?(.*)$ "http\:\/\/example\.com\/$1" [N]
RewriteRule (.*)e28099(.*)? $1$2 [R=301,L]

Result:
{ http://example.com/news/news-archives/childrene28099s-home-fills-patientse28099-needs }

http://example.com/childrens-home-fills-patients-needs/news-archives/childrens-home-fills-patients-needs/


You need to pay close attention to the ordering of rules here, because you've got multiple scenarios:

request contains "news/news-archives"
request contains "e28099" (one or more times)
request contains both

Unless your name is jdMorgan, do not use the [N] flag. Instead, rearrange your rules:
FIRST rule for requests containing both
THEN two rules for requests containing one pattern or the other

If the element e28099 never occurs more than two or three times (how did it get there? did your cat walk across the keyboard?) you can make separate RewriteRules. Otherwise make a quick php detour to get rid of all of them. Since the php page will end up issuing a redirect, it will also have to check for any other elements-- such as "news/news-archives" --that might occur in a request containing "e28099".

Finally: Targets of RewriteRules don't need quotation marks. In mod_rewrite, colons and directory slashes don't need to be escaped.

robzilla




 
Msg#: 4606337 posted 10:14 pm on Aug 31, 2013 (gmt 0)

Sorry, I hadn't looked at the rules in detail before. I think you have your rules upside down: you want to repeat (with the [N] flag) the search and replace for paths containing 'e28099'. Only when you've replaced all occurrences should you do the redirect (once).

RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$
RewriteRule (.*)e28099(.*) $1$2 [N]
RewriteRule ^news/news-archives/(.*) /$1 [R=301,L]


Untested, and you might have to add a leading slash before news/ if you're editing httpd.conf (preferable) rather than .htaccess.

Postscript: Sound advice from lucy24. Try to reduce the impact of any and all regular expressions whenever you can.

g1smd




 
Msg#: 4606337 posted 6:38 am on Sep 1, 2013 (gmt 0)

This rule goes near the beginning of your htaccess (after all of the rules that block, and before most of the rules that redirect):

RewriteRule e28099 /fix-this.php [L]


On your non-www to www redirect (further down your htaccess file) you must add this exclusion:

RewriteCond %{THE_REQUEST} !e28099


In the PHP file:

1. Extract the requested URL
2. Extract the requested path from that URL
3. String replace e28099 with nothing
4. Prepend protocol and www hostname to the new string
5. Send 301 HEADER


Step 4 is important because then it will always redirect all requests to www without the need for a separate canonicalisation redirect causing an unwanted redirection chain.
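The five steps above can be sketched roughly as follows. This is written in Python rather than PHP purely so the logic is easy to run and check; the function name, its parameters, and the default www.example.com host are all assumptions for illustration, and the real script would read the request from the server environment and emit the header itself.

```python
from urllib.parse import urlsplit


def build_redirect(request_uri, host="www.example.com", scheme="http",
                   also_remove=()):
    """Return (status, location) for a request URI containing e28099.

    host, scheme and also_remove are hypothetical parameters; set host to
    whichever hostname your site treats as canonical.
    """
    # Steps 1 and 2: take the requested URL and isolate its path.
    parts = urlsplit(request_uri)
    path = parts.path

    # Step 3: string-replace e28099 (and any other unwanted fragments) with nothing.
    for fragment in ("e28099", *also_remove):
        path = path.replace(fragment, "")

    # Step 4: prepend protocol and canonical hostname.
    location = f"{scheme}://{host}{path}"
    if parts.query:
        location += "?" + parts.query

    # Step 5: the caller sends this as a 301 with a Location header.
    return 301, location


status, location = build_redirect(
    "/news/news-archives/childrene28099s-home-fills-patientse28099-needs"
)
print(status, location)
```

Because the hostname is added in the same step, a single 301 covers both the cleanup and the canonicalisation. Passing also_remove=("news/news-archives/",) would strip that directory prefix in the same pass.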

lucy24




 
Msg#: 4606337 posted 7:35 am on Sep 1, 2013 (gmt 0)

3. String replace e28099 with nothing

3b. String replace news/news-archives/ with nothing.

There may be others too, but that's the one that came up in the OP. If "e28099" only occurs in filenames, there will at least be no need for
3c. String replace index.xtn with nothing ;)

blurped



 
Msg#: 4606337 posted 8:30 am on Sep 1, 2013 (gmt 0)

does the string e28099 occur in any other URLs (not under the target hostname)? If not, you can simply leave off the conditions.

Yes, the group that handled this site before did not have a proper redirect in place for an old domain name, resulting in duplicate content issues. Thank you for clarifying the conditions, I went ahead and removed them.

The form
(.*)e28099
is always a bit iffy. Are there any real-life limits to what might come before "e28099"? Put them in the rule, and non-matching requests will be out of there all the sooner.

ok, will do

You need to pay close attention to the ordering of rules here, because you've got multiple scenarios:

Unless your name is jdMorgan, do not use the [N] flag. Instead, rearrange your rules:
FIRST rule for requests containing both
THEN two rules for requests containing one pattern or the other

ok, that makes sense

how did it get there? did your cat walk across the keyboard?

The CMS the site was on before auto-generated page titles. That little phrase was the replacement for an apostrophe. But I like the cat story better :)

Targets of RewriteRules don't need quotation marks.

cool, fixed that too

Thanks for your help! I'm going to tinker around with it a bit more. The Redirect 301 is what is currently in place for the rest of the site urls. They work fine, but it seems like they might be inefficient since there's no flag for it to stop searching. So naturally, this little phrase problem led me to the Rewrite rule.

lucy24




 
Msg#: 4606337 posted 9:06 am on Sep 1, 2013 (gmt 0)

That little phrase was the replacement for an apostrophe.

:: quick detour to Character Viewer ::

D'oh! UTF-8 E2 80 99 = right single quotation mark (curly apostrophe) = &rsquo;

childrene28099s-home-fills-patientse28099-needs = children's-home-fills-patients'-needs
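The byte sequence is easy to confirm with a few lines of Python. The hexify_non_ascii helper below is only a guess at what the old CMS might have done to page titles (writing each non-ASCII character as the hex of its UTF-8 bytes); the thread does not say how the CMS actually worked.

```python
import unicodedata

# Decode the three bytes E2 80 99 as UTF-8.
char = bytes.fromhex("e28099").decode("utf-8")
char_name = unicodedata.name(char)
print(repr(char), char_name)


def hexify_non_ascii(text):
    """Hypothetical reconstruction: replace each non-ASCII character
    with the lowercase hex of its UTF-8 encoding."""
    return "".join(
        ch if ch.isascii() else ch.encode("utf-8").hex() for ch in text
    )


slug_word = hexify_non_ascii("children\u2019s")
print(slug_word)
```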

Edit:
Even if you're doing it all in htaccess, you don't need three rulesets, because you can start with

^(?:news/news-archives/)?blahblah-including-e28099

and carry on as before. Just make sure the e28099 stuff comes before the rule involving "news/news-archives/" by itself.

[N] is icky because it doesn't just mean "keep executing this same rule over and over again until it rinses clean", it means "go all the way back to the beginning and do all your mod_rewrite stuff over again, including stuff you've already tested for that can't possibly have changed".

blurped



 
Msg#: 4606337 posted 5:40 pm on Sep 1, 2013 (gmt 0)

oh oh oh! This seems to work!

RewriteRule ^news/news-archives/(.*)e28099(.*)e28099(.*)$ http\:\/\/example\.org\/$1$2$3 [R=301,L]

RewriteRule ^news/news-archives/(.*)e28099(.*)$ http\:\/\/example\.org\/$1$2 [R=301,L]

RewriteRule ^news/news-archives/(.*)?$ http\:\/\/example\.org\/$1? [R=301,L]


Line #1 replaces the phrase if used 2x:
{ http://example.com/news/news-archives/childrene28099s-home-fills-patientse28099-needs }
http://example.com/childrens-home-fills-patients-needs

Line #2 replaces the phrase if used only 1x:
{ http://example.com/news/news-archives/childrene28099s-home-fills }
http://example.com/childrens-home-fills

Line #3 redirects a page without the phrase
{ http://example.com/news/news-archives/children }
http://example.com/children

[edited by: phranque at 12:58 am (utc) on Sep 2, 2013]
[edit reason] exemplified domain [/edit]

blurped



 
Msg#: 4606337 posted 6:17 pm on Sep 1, 2013 (gmt 0)

Also, g1smd, thank you for the advice, I'll read up more about that

phranque

WebmasterWorld Administrator; phranque is a WebmasterWorld Top Contributor of All Time; 10+ Year Member; Top Contributors Of The Month



 
Msg#: 4606337 posted 1:03 am on Sep 2, 2013 (gmt 0)

welcome to WebmasterWorld, blurped!


http\:\/\/example\.org\/$1$2$3

there is no need to escape anything in the Substitution string.
you should remove all those backslashes.

blurped



 
Msg#: 4606337 posted 2:14 am on Sep 2, 2013 (gmt 0)

there is no need to escape anything in the Substitution string.
you should remove all those backslashes.

Yeah, I noticed that earlier and have been cleaning it up. Thank you all for your help! These forums are great and I'll definitely start contributing (and asking lots of questions in this area) :) Glad to have found y'all!

g1smd




 
Msg#: 4606337 posted 4:04 pm on Sep 2, 2013 (gmt 0)

Multiple .* within those patterns makes for really inefficient code. The .* means "capture the entire remainder of the string". This will force many thousands of "back off and retry" trial matches per request.

There isn't a more efficient pattern you can use in place of .* here, which is why handing those requests off to a PHP script is a much better solution.

[edited by: incrediBILL at 10:33 pm (utc) on Sep 4, 2013]
[edit reason] fixed typo [/edit]

lucy24




 
Msg#: 4606337 posted 11:20 pm on Sep 2, 2013 (gmt 0)

Do the URLs ever contain numerals? If not,

([a-z-]+)e28099

et cetera will reduce hiccuping. Since the URLs are no longer being actively created, you know how many e28099 there are. Is two the absolute ceiling?

You can say [a-z-]* for the last capture (the package after the final e28099) but the others should be + since you will never have two consecutive apostrophes, and never a leading one. Well, unless you've got a page whose name started out as
'tain't so
or similar.
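As a rough check of how the narrower character class behaves, here is a sketch in Python using the same regex syntax. It assumes lowercase slugs with exactly two occurrences under news/news-archives/, as in the example URL from the opening post; the full pattern is an illustration, not a tested RewriteRule.

```python
import re

# + for the captures before each e28099, * for the tail after the last one.
PATTERN = re.compile(
    r"^news/news-archives/([a-z-]+)e28099([a-z-]+)e28099([a-z-]*)$"
)

path = "news/news-archives/childrene28099s-home-fills-patientse28099-needs"
match = PATTERN.match(path)
groups = match.groups()
cleaned = "".join(groups)
print(groups)
print(cleaned)

# A slug with a leading apostrophe does not match, because the first
# capture needs at least one character before the first e28099.
leading = PATTERN.match("news/news-archives/e28099taine28099t-so")
print(leading)
```

Since the digits in e28099 are outside [a-z-], each capture stops at the first digit it meets, so the engine only has to back off a character or two instead of retrying from the end of the string.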

lucy24




 
Msg#: 4606337 posted 9:33 pm on Sep 4, 2013 (gmt 0)

This was in htaccess on a production server? I missed that part :(


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved