Forum Moderators: Ocean10000 & incrediBILL & phranque

Stuck! Substitution #1 works but the rest of the url disappears

mod_rewrite, substitution, [N], 301, redirect

   
7:41 am on Aug 31, 2013 (gmt 0)



I'm creating 301 redirects to remove the phrase e28099 which appears in many urls, sometimes multiple times. I can get the first substitution to work, but the rest of the url string drops off. Here's a hypo to clarify:

Hypo url pre-redirect http://example.com/news/news-archives/childrene28099s-home-fills-patientse28099-needs
Hypo url post-redirect http://example.com/childrens-home-fills-patients-needs
This is the rule that I'm using:

RewriteCond %{HTTP_HOST} ^example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule ^news\/news\-archives\/?(.*)$ "http\:\/\/example\.com\/$1" [N]
RewriteRule (.*)e28099(.*)? $1 [R=301,L]


But it only puts out the first part: http://example.com/childrens

This is the first week I've really dug into redirects and rewrites at this level, enjoyed working on them so far despite some snags. I'd very much appreciate some guidance!

sidenote question, the canonical is set up and seems to work so that everything's directed to http://example.com...so can the 2nd RewriteCond be dropped?
1:42 pm on Aug 31, 2013 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



RewriteRule (.*)e28099(.*)? $1 [R=301,L]


But it only puts out the first part

You forgot $2 in the target. $1 is the first captured group of your pattern, $2 the second, and so on.
4:08 pm on Aug 31, 2013 (gmt 0)



So that works for replacing the phrase, but then part of the url gets repeated.

Modified Code:
RewriteCond %{HTTP_HOST} ^example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule ^news\/news\-archives\/?(.*)$ "http\:\/\/example\.com\/$1" [N]
RewriteRule (.*)e28099(.*)? $1$2 [R=301,L]


Result:
http://example.com/childrens-home-fills-patients-needs/news-archives/childrens-home-fills-patients-needs/


Any ideas?
4:14 pm on Aug 31, 2013 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



This is one of those occasions where I would internally rewrite the request to a special "fix-e28099.php" script that groks the requested URL, then does a string replace operation on it to work out what the new URL should be, then sends a 301 HEADER with the new URL.

This keeps the RewriteRule simple and the PHP is processed only for those requests.
4:22 pm on Aug 31, 2013 (gmt 0)



I really appreciate your help with this. Would you happen to have a code snippet for that by chance?
10:09 pm on Aug 31, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



the canonical is set up and seems to work so that everything's directed to http://example.com...so can the 2nd RewriteCond be dropped?

The Conditions aren't needed at all unless you need to exclude other domains or subdomains that pass through the same config/htaccess. But it can still be expressed as a single
^(www\.)?example\.com
without ending anchor (to allow for maverick requests that come in with port number attached)

Two more serious issues on the same theme:

-- RewriteRules go from most specific to most general, so at the point a request meets this rule, a canonicalization redirect has not yet happened. If it comes earlier, move it.

-- It seems as if the same conditions are intended to apply to both rules. If so, you need to list them over again each time. Conditions apply only to the immediately following rule.

In real life, does the string e28099 occur in any other URLs (not under the target hostname)? If not, you can simply leave off the conditions.
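
A minimal sketch of the point about Conditions, assuming .htaccess context and reusing the hostname and patterns from this thread. The two rules are placeholders; only the placement of the RewriteCond lines is being illustrated:

```apache
# Each RewriteCond applies only to the RewriteRule that directly follows it,
# so a condition meant to guard two rules has to be written twice.
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com
RewriteRule ^(.*)e28099(.*)$ http://example.com/$1$2 [R=301,L]

RewriteCond %{HTTP_HOST} ^(www\.)?example\.com
RewriteRule ^news/news-archives/(.*)$ http://example.com/$1 [R=301,L]
```

If no other hostname passes through the same file, both RewriteCond lines can simply be dropped.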

The form
(.*)e28099
is always a bit iffy. Are there any real-life limits to what might come before "e28099"? Put them in the rule, and non-matching requests will be out of there all the sooner.

So that works for replacing the phrase, but then part of the url gets repeated.

Modified Code:
<snip, see above about repeating Conditions>
RewriteRule ^news\/news\-archives\/?(.*)$ "http\:\/\/example\.com\/$1" [N]
RewriteRule (.*)e28099(.*)? $1$2 [R=301,L]

Result:
{ http://example.com/news/news-archives/childrene28099s-home-fills-patientse28099-needs }

http://example.com/childrens-home-fills-patients-needs/news-archives/childrens-home-fills-patients-needs/


You need to pay close attention to the ordering of rules here, because you've got multiple scenarios:

request contains "news/news-archives"
request contains "e28099" (one or more times)
request contains both

Unless your name is jdMorgan, do not use the [N] flag. Instead, rearrange your rules:
FIRST rule for requests containing both
THEN two rules for requests containing one pattern or the other

If the element e28099 never occurs more than two or three times (how did it get there? did your cat walk across the keyboard?) you can make separate RewriteRules. Otherwise make a quick php detour to get rid of all of them. Since the php page will end up issuing a redirect, it will also have to check for any other elements-- such as "news/news-archives" --that might occur in a request containing "e28099".

Finally: Targets of RewriteRules don't need quotation marks. In mod_rewrite, colons and directory slashes don't need to be escaped.
10:14 pm on Aug 31, 2013 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Sorry, I hadn't looked at the rules in detail before. I think you have your rules upside down: you want to repeat ([N]-flag) the search and replace for paths containing 'e28099'. Only when you've replaced all occurrences, should you do the redirect (once).

RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$
RewriteRule (.*)e28099(.*) $1$2 [N]
RewriteRule ^news/news-archives/(.*) /$1 [R=301,L]


Untested, and you might have to add a leading slash before news/ (i.e. ^/news/news-archives/) if you're editing httpd.conf (preferable) rather than .htaccess.

Postscript: Sound advice from lucy24. Try to reduce the impact of any and all regular expressions whenever you can.
6:38 am on Sep 1, 2013 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



This rule goes near the beginning of your htaccess (after all of the rules that block, and before most of the rules that redirect):

RewriteRule e28099 /fix-this.php [L]



On your non-www to www redirect (further down your htaccess file) you must add this exclusion:

RewriteCond %{THE_REQUEST} !e28099



In the PHP file:

1. Extract the requested URL
2. Extract the requested path from that URL
3. String replace e28099 with nothing
4. Prepend protocol and www hostname to the new string
5. Send 301 HEADER


Step 4 is important because then it will always redirect all requests to www without the need for a separate canonicalisation redirect causing an unwanted redirection chain.
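
Put together, the two .htaccess pieces from this post might look like the sketch below. The canonical redirect rule itself is not shown anywhere in the thread, so the one here is an assumed, typical non-www to www form; /fix-this.php is the script name used in the post, and .htaccess context is assumed:

```apache
# Near the top of .htaccess, after the blocking rules:
# internally rewrite any request whose path contains e28099 to the fix-up script.
RewriteRule e28099 /fix-this.php [L]

# Further down: canonical hostname redirect (assumed form, not from the thread).
# THE_REQUEST still holds the original request line after the internal rewrite,
# so this exclusion keeps the redirect away from requests handed to the script.
RewriteCond %{THE_REQUEST} !e28099
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

The first rule does not loop, because the rewritten path /fix-this.php no longer contains e28099.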
7:35 am on Sep 1, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



3. String replace e28099 with nothing

3b. String replace news/news-archives/ with nothing.

There may be others too, but that's the one that came up in the OP. If "e28099" only occurs in filenames, there will at least be no need for
3c. String replace index.xtn with nothing ;)
8:30 am on Sep 1, 2013 (gmt 0)



does the string e28099 occur in any other URLs (not under the target hostname)? If not, you can simply leave off the conditions.

Yes, the group that handled this site before did not have a proper redirect in place for an old domain name, resulting in duplicate content issues. Thank you for clarifying the conditions, I went ahead and removed them.

The form
(.*)e28099
is always a bit iffy. Are there any real-life limits to what might come before "e28099"? Put them in the rule, and non-matching requests will be out of there all the sooner.

ok, will do

You need to pay close attention to the ordering of rules here, because you've got multiple scenarios:

Unless your name is jdMorgan, do not use the [N] flag. Instead, rearrange your rules:
FIRST rule for requests containing both
THEN two rules for requests containing one pattern or the other

ok, that makes sense

how did it get there? did your cat walk across the keyboard?

The CMS the site was on before auto-generated page titles. That little phrase was the replacement for an apostrophe. But I like the cat story better :)

Targets of RewriteRules don't need quotation marks.

cool, fixed that too

Thanks for your help! I'm going to tinker around with it a bit more. The Redirect 301 is what is currently in place for the rest of the site urls. They work fine, but it seems like they might be inefficient since there's no flag for it to stop searching. So naturally, this little phrase problem led me to the Rewrite rule.
9:06 am on Sep 1, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



That little phrase was the replacement for an apostrophe.

:: quick detour to Character Viewer ::

D'oh! UTF-8 E2 80 99 = curly apostrophe (U+2019) = &rsquo;

childrene28099s-home-fills-patientse28099-needs = children's-home-fills-patients'-needs

Edit:
Even if you're doing it all in htaccess, you don't need three rulesets, because you can start with

^(?:news/news-archives/)?blahblah-including-e28099

and carry on as before. Just make sure the e28099 stuff comes before the rule involving "news/news-archives/" by itself.
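
A rough sketch of that layout, assuming .htaccess context, Apache 2.x (for the non-capturing group), a target of http://example.com/, and no more than two e28099 per URL. It is untested and only meant to show the ordering:

```apache
# Two occurrences of e28099, with or without the news/news-archives/ prefix
RewriteRule ^(?:news/news-archives/)?(.*)e28099(.*)e28099(.*)$ http://example.com/$1$2$3 [R=301,L]

# One occurrence, with or without the prefix
RewriteRule ^(?:news/news-archives/)?(.*)e28099(.*)$ http://example.com/$1$2 [R=301,L]

# Prefix by itself: must come after the e28099 rules
RewriteRule ^news/news-archives/(.*)$ http://example.com/$1 [R=301,L]
```

Because each request is caught by the most specific rule first, every old URL should reach its final form in a single 301 rather than a chain of redirects.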

[N] is icky because it doesn't just mean "keep executing this same rule over and over again until it rinses clean", it means "go all the way back to the beginning and do all your mod_rewrite stuff over again, including stuff you've already tested for that can't possibly have changed".
5:40 pm on Sep 1, 2013 (gmt 0)



oh oh oh! This seems to work!

RewriteRule ^news/news-archives/(.*)e28099(.*)e28099(.*)$ http\:\/\/example\.org\/$1$2$3 [R=301,L]

RewriteRule ^news/news-archives/(.*)e28099(.*)$ http\:\/\/example\.org\/$1$2 [R=301,L]

RewriteRule ^news/news-archives/(.*)?$ http\:\/\/example\.org\/$1? [R=301,L]


Line #1 replaces the phrase if used 2x:
{ http://example.com/news/news-archives/childrene28099s-home-fills-patientse28099-needs }
http://example.com/childrens-home-fills-patients-needs

Line #2 replaces the phrase if used only 1x:
{ http://example.com/news/news-archives/childrene28099s-home-fills }
http://example.com/childrens-home-fills

Line #3 redirects a page without the phrase
{ http://example.com/news/news-archives/children }
http://example.com/children

[edited by: phranque at 12:58 am (utc) on Sep 2, 2013]
[edit reason] exemplified domain [/edit]

6:17 pm on Sep 1, 2013 (gmt 0)



Also, g1smd, thank you for the advice, I'll read up more about that
1:03 am on Sep 2, 2013 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld, blurped!


http\:\/\/example\.org\/$1$2$3

there is no need to escape anything in the Substitution string.
you should remove all those backslashes.
2:14 am on Sep 2, 2013 (gmt 0)



there is no need to escape anything in the Substitution string.
you should remove all those backslashes.

yea, I noticed that just earlier and have been cleaning it up. Thank you all for your help! These forums are great & I'll definitely start contributing (and asking lots of questions in this area :) Glad to have found ya'll!
4:04 pm on Sep 2, 2013 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Multiple .* within those patterns makes for really inefficient code. The .* means "capture the entire remainder of the string". This will force many thousands of "back off and retry" trial matches per request.

There isn't a more efficient pattern you can use in place of .* here, which is why handing those requests off to a PHP script is a much better approach.

[edited by: incrediBILL at 10:33 pm (utc) on Sep 4, 2013]
[edit reason] fixed typo [/edit]

11:20 pm on Sep 2, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Do the URLs ever contain numerals? If not,

([a-z-]+)e28099

et cetera will reduce hiccuping. Since the URLs are no longer being actively created, you know how many e28099 there are. Is two the absolute ceiling?

You can say [a-z-]* for the last capture (the package after the final e28099) but the others should be + since you will never have two consecutive apostrophes, and never a leading one. Well, unless you've got a page whose name started out as
'tain't so
or similar.
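
As a sketch of what that could look like, assuming the page names contain only lowercase letters and hyphens, that two e28099 really is the ceiling, and .htaccess context with http://example.com/ as the target (untested):

```apache
# Two apostrophes: + before each e28099, * for the final package
RewriteRule ^(?:news/news-archives/)?([a-z-]+)e28099([a-z-]+)e28099([a-z-]*)$ http://example.com/$1$2$3 [R=301,L]

# One apostrophe
RewriteRule ^(?:news/news-archives/)?([a-z-]+)e28099([a-z-]*)$ http://example.com/$1$2 [R=301,L]
```

A page name containing a digit or an uppercase letter would match neither rule, so it is worth checking the real URL list before relying on the narrower character class.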
9:33 pm on Sep 4, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



This was in htaccess on a production server? I missed that part :(
 
