Forum Moderators: phranque

A bugfix to "Guide to fixing duplicate content & URL issues"

         

true_INFP

9:05 pm on Feb 22, 2010 (gmt 0)

10+ Year Member



The excellent mod_rewrite rule set described in the library thread "Guide to fixing duplicate content & URL issues" [webmasterworld.com] seems to incorrectly rewrite URLs that contain the percent symbol.

For example, if the visitor requests an existing page called "Document%20Index.htm" and there is a reason to rewrite the URL (such as the missing www. subdomain), the URL is rewritten to one that does not exist, i.e. "Document%2520Index.htm" (note the newly added string "25").

In some cases this even leads to a recursive chain of 301 redirects that ends with a segmentation fault (I'm not sure whether it is exploitable).

The fix appears to be quite simple:

Replace the line:
RewriteRule .? http://www.example.com%{ENV:myURI}%{ENV:myQS} [R=301,L]

with the following:
RewriteRule .? http://www.example.com%{ENV:myURI}%{ENV:myQS} [R=301,L,NE]

The only difference is the 'noescape' flag (NE), which stops mod_rewrite from escaping special characters in the redirect target, so the percent symbol is no longer re-encoded as %25.
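
The effect can be reproduced outside Apache. Below is a minimal Python sketch, assuming that the rule set's myURI variable holds the path still percent-encoded as the client sent it (which the observed "%2520" suggests). Here urllib.parse.quote is only a stand-in for mod_rewrite's escaping, not Apache's actual routine, and build_location is a hypothetical helper name.

```python
from urllib.parse import quote


def build_location(captured_uri: str, noescape: bool) -> str:
    """Hypothetical helper: mimic how the redirect target is assembled."""
    # Without NE, the already-encoded path is escaped a second time,
    # so every "%" in it turns into "%25".
    path = captured_uri if noescape else quote(captured_uri, safe="/")
    return "http://www.example.com" + path


# Same captured path, first without and then with the noescape behaviour.
print(build_location("/Document%20Index.htm", noescape=False))
print(build_location("/Document%20Index.htm", noescape=True))
```

The first call yields the broken "Document%2520Index.htm" target, and the second leaves the path exactly as it was requested.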


However, I'm not sure whether this fix breaks something else. What do you think?

jdMorgan

2:46 am on Feb 23, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The %25 is an escaped "%" sign, so the URL has been doubly encoded. Typically, this isn't a problem, since the server receiving the redirected request should fully decode the request.
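
To illustrate the difference between decoding once and decoding fully, here is a small Python sketch. It assumes nothing about Apache internals: urllib.parse.unquote stands in for the server's decoding step, and fully_unquote is a hypothetical helper. A server that decodes only once, which is what the reported failure suggests, ends up looking for a file literally named "Document%20Index.htm".

```python
from urllib.parse import unquote


def fully_unquote(path: str) -> str:
    """Hypothetical helper: decode repeatedly until the string stops changing."""
    previous = None
    while previous != path:
        previous, path = path, unquote(path)
    return path


doubly_encoded = "Document%2520Index.htm"

# One decoding pass only removes the outer layer of encoding.
print(unquote(doubly_encoded))

# Repeated passes recover the real file name with its space.
print(fully_unquote(doubly_encoded))
```

Decoding until nothing changes is shown only to make the point; it would also mangle names that legitimately contain a literal "%25", so it is not a general recommendation.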

I never use characters in my URLs which HTTP/1.1 requires to be encoded, so I've never had to deal with this problem personally.

But if you're seeing problems, then the [NE] flag is a perfectly valid fix.

Jim

true_INFP

4:10 pm on Feb 23, 2010 (gmt 0)

10+ Year Member



"I never use characters in my URLs which HTTP/1.1 requires to be encoded, so I've never had to deal with this problem personally."


I didn't think the rule set was posted as a solution just for you personally. If the code is meant to be universal (for everyone who reads the library thread), which I supposed was the case, it must support, for example, spaces in document names (which are perfectly valid).

The code doesn't do so, and it causes redirects to documents that do not exist and even segmentation faults in Apache 2.x. Therefore, it should be fixed (unless you see that the fix breaks something else).

jdMorgan

4:46 pm on Feb 23, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I appreciate your contribution to that old (and now closed and archived) thread, and will certainly keep this issue in mind going forward. I mentioned my personal avoidance of reserved and "unwise" characters (as defined by the URI spec that HTTP relies on, RFC 2396) only to explain why I did not provide for them in that code.

However, I certainly did not claim that code to be universal.

Thanks for pointing out this problem.

Jim

true_INFP

5:01 pm on Feb 23, 2010 (gmt 0)

10+ Year Member



"(and now closed and archived)"


OK, I thought the library threads were something like 'Sticky' threads, a sort of FAQ...