
Apache Web Server Forum

    
Question about %3F and %3D embedded in inbound links
The "?" and "=" signs are encoded on some inbound links.
maximillianos




msg:4138121
 1:54 am on May 25, 2010 (gmt 0)

I was hoping I could ask a question regarding a problem I'm having trouble resolving. I've noticed quite a few inbound links to my site that contain "%3F" and "%3D", etc in place of the "?" and the "=" signs. The links seem to fail since apache doesn't translate them properly.

I've tried a number of rewrites but haven't had any luck. My latest attempt (which I got from another site):

# Rewrite up to two instances of "%xx" to "?" & "=" in URL and do a 301-Moved Permanently redirect.
RewriteRule ([^\ ]+)\ ([^\ ]+)\ (.+) http://www.example.com/$1?$2=$3 [R=301,L]


This did not seem to do anything. At least in my implementation.

Does anyone know of a way to redirect/rewrite them to replace the %3F with a "?" and the "%3D" with an "="... ?

Interestingly, most of these "broken" inbound links are from scraper sites. Kind of makes me wonder if I should even bother fixing them, but I figure a link is a link right? =)

Thanks for any advice!

[edited by: jdMorgan at 1:28 pm (utc) on May 26, 2010]
[edit reason] example.com [/edit]

 

wesmaster




msg:4138125
 2:08 am on May 25, 2010 (gmt 0)

I did some research on this about a month ago because I have the same problem, and everything I found said it's not possible. It seemed ridiculous, but the techs at Rackspace confirmed it. If you find a solution, post it here.

jdMorgan




msg:4138149
 2:56 am on May 25, 2010 (gmt 0)

Impossible for whom? Not our gang here, mate! :)

After thinking about this, I realized that the code posted in several previous threads here is not suitable for the problem as described, which is that the "?" can be encoded as well as the "=" and "&" characters.

When the "?" used to demarcate a query string is encoded, Apache will treat the whole thing as a URL-path, and consider the query string to be empty. In other words, the "%3F" encoded question mark and anything following it is considered to be part of the so-called "filename" itself, and not a query string to be passed to the script at that location.
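The same splitting rule can be seen outside Apache. The short Python sketch below uses the standard library's `urlsplit`, not Apache itself, and the two URLs are made-up samples, so treat it only as an illustration of why an encoded "?" leaves the query string empty.

```python
from urllib.parse import urlsplit

# Hypothetical sample URLs; only the first one has its "?" percent-encoded.
encoded = urlsplit("http://www.example.com/page.cgi%3Fid%3D1")
plain = urlsplit("http://www.example.com/page.cgi?id=1")

# With an encoded "?", everything stays in the path and the query is empty.
print(repr(encoded.path), repr(encoded.query))

# With a literal "?", the path ends at the "?" and the rest is the query.
print(repr(plain.path), repr(plain.query))
```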

So, the solution must convert a URL-only containing an encoded question mark into a URL-plus-query-string, and in HTTP, only the query string is permitted to contain un-encoded "=" and "&" characters.

Furthermore, if more than one encoded "?" is present, only the first one is permitted to be decoded -- more than one un-encoded "?" would cause trouble (as evidenced by a concurrent thread).

The way this code is set up here, it will first look for an un-encoded "?" demarcating a 'real' query string. If one is found, then all encoded "=" and "&" characters that follow it will be decoded.

If no un-encoded "?" is found, then the code looks for the first "%3f" or "%3F" in the URL-path. If one is found, then it is replaced with a "?", and we pick up as in the previous paragraph.

Multiply-encoded %3f, %3d, and %26 characters (examples: %253F, %25253d, %2526) are decoded as well.
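The decoding order described above can be sketched in a few lines of Python. This is only an illustration of the intended behavior, not the mod_rewrite rule set from this thread; the function name `fix_request_target` and the sample strings are hypothetical.

```python
import re

# "%", then any number of "25" pairs (multiple encoding), then the hex code.
ENC_QMARK = re.compile(r"%(?:25)*3F", re.IGNORECASE)
ENC_EQ_AMP = re.compile(r"%(?:25)*(3D|26)", re.IGNORECASE)


def fix_request_target(target: str) -> str:
    """Hypothetical helper: decode "?", "=" and "&" as described above."""
    if "?" not in target:
        # No real query string: decode only the FIRST encoded "?".
        target = ENC_QMARK.sub("?", target, count=1)
    path, sep, query = target.partition("?")
    if not sep:
        return target  # nothing to fix
    # Decode "=" and "&" only in the part that follows the "?".
    query = ENC_EQ_AMP.sub(
        lambda m: "=" if m.group(1).upper() == "3D" else "&", query
    )
    return path + "?" + query


print(fix_request_target("page.cgi%3Fa%3D1%26b%3D2"))
print(fix_request_target("a%253Fb%25253dc%2526d"))
```

Note that a second encoded "?" is left alone, and encoded "=" or "&" characters in the path (before the "?") are not touched.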

Meeting all of these requirements is kind of a tough problem, but something like this might work better than the code posted previously. This is fresh-typed and un-tested code, so test with caution and please post results back here.

# If THE_REQUEST contains a URL with a percent-encoded "?" and/or a query string with
# one or more specific percent-encoded characters and we're not already fixing it,
# then copy the client-requested URL-plus-query-string into the "MyURI" variable.
RewriteCond %{ENV:MyURI} =""
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?\ ]*\?[^%\ ]*\%(25)*(3[Dd]|26)[^\ ]*)\ HTTP/ [OR]
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^%\ ]*\%(25)*([^3].|.[^Ff]))*[^%\ ]*\%(25)*3[Ff][^\ ]*)\ HTTP/
RewriteRule ^. - [NE,E=MyURI:%1]
#
# If any encoded question mark is present in the client-requested URI, and
# no unencoded question mark is present, replace the first encoded question
# mark first, queue up a redirect, and then re-start mod_rewrite processing
RewriteCond %{ENV:MyURI} ^[^?]+$
RewriteCond %{ENV:MyURI} ^([^%]*(\%(25)*([^3].|.[^F]))*[^%]*)\%(25)*3F(.*)$ [NC]
RewriteRule ^. - [NE,E=MyURI:%1?%6,E=QRedir:Yes,N]
#
# If any encoded "=" sign follows the "?", replace it, queue
# up a redirect, and re-start mod_rewrite processing
RewriteCond %{ENV:MyURI} ^(([^?]*\?([^%]*\%(25)*([^3].|.[^D]))*)[^%]*)\%(25)*3D(.*)$ [NC]
RewriteRule ^. - [NE,E=MyURI:%1=%7,E=QRedir:Yes,N]
#
# If any encoded ampersand follows the "?", replace it, queue
# up a redirect, and then re-start mod_rewrite processing
RewriteCond %{ENV:MyURI} ^(([^?]*\?([^%]*\%(25)*([^2].|.[^6]))*)[^%]*)\%(25)*26(.*)$
RewriteRule ^. - [NE,E=MyURI:%1&%7,E=QRedir:Yes,N]
#
# If we get here, there are no more percent-encoded characters which can
# and should be replaced by the rules above, so do the external redirect
RewriteCond %{ENV:QRedir} =Yes [NC]
RewriteRule ^. http://www.example.com/%{ENV:MyURI} [R=301,L]

Note that the complex regular expressions are required to be able to replace only the first "?", to replace only encoded characters following that "?", and to allow parsing the request line in a single left-to-right pass. I apologize to the many fans of multiple-dot-star regex patterns, as using that kind of pattern would indeed make the code look a lot simpler. But it would also make it even more horribly slow...

Note also that the first rule could just as well rewrite these requests to a small script to do the un-encoding and 301 redirect. If more than just a few encoded links are pointed at your site, this would be worthwhile, as mod_rewrite isn't really well-suited to this kind of problem, since it has to be completely re-started to make it 'loop' in order to do multiple replacements.
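To get a feel for the cost of that looping, here is a rough Python sketch that makes one replacement per pass, in the same order as the rules above, and counts the passes. It is a simplified model and not a measurement of mod_rewrite; the names `one_pass` and `count_passes` and the sample URL-path are hypothetical.

```python
import re

# Same order as the rule set: encoded "?", then encoded "=", then encoded "&".
QMARK = re.compile(r"%(?:25)*3F", re.IGNORECASE)
EQUALS = re.compile(r"%(?:25)*3D", re.IGNORECASE)
AMP = re.compile(r"%(?:25)*26")


def one_pass(uri: str) -> str:
    """Apply at most one replacement, like one trip through the rules."""
    if "?" not in uri:
        return QMARK.sub("?", uri, count=1)
    path, _, query = uri.partition("?")
    for pattern, char in ((EQUALS, "="), (AMP, "&")):
        fixed = pattern.sub(char, query, count=1)
        if fixed != query:
            return path + "?" + fixed
    return uri


def count_passes(uri: str):
    """Repeat one_pass until nothing changes; return (result, passes)."""
    passes = 0
    while True:
        passes += 1
        fixed = one_pass(uri)
        if fixed == uri:
            return uri, passes
        uri = fixed


print(count_passes("page.cgi%3Fa%3D1%26b%3D2"))
```

In this model every encoded character costs one full pass, plus a final pass that finds nothing left to fix, which is the overhead a single-pass script would avoid.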

This code should be placed as close as possible to the top of your pre-existing mod_rewrite code for better (but still not very good) efficiency.

Jim

[edit] Corrections as noted below. [/edit]
Please see updated version below!

[edited by: jdMorgan at 5:56 am (utc) on May 27, 2010]

maximillianos




msg:4138166
 3:36 am on May 25, 2010 (gmt 0)

Wow. Lot to digest here. Thanks Jim! I'll see what this does!

jdMorgan




msg:4138185
 4:32 am on May 25, 2010 (gmt 0)

It'll probably have errors -- I won't have time to test it today, but I figured you might make some progress anyway... or maybe we'll both be lucky and it might work without any/many changes.

Jim

maximillianos




msg:4138310
 11:43 am on May 25, 2010 (gmt 0)

There are a few lines I'll need to figure out and update first. My encoded characters are %3F and %3D, and I think you are using %2F and %3D.

I know, should be a simple replace, but I'm still trying to figure out what is going on. I've done a lot of basic rewriting, but nothing close to this complex.

I should be able to test it more today and let you know how it goes.

Thanks again Jim!

maximillianos




msg:4138469
 3:08 pm on May 25, 2010 (gmt 0)

I am having trouble trying to get the above to work. I don't get any errors, it is just that nothing happens when I load a page with the encoded characters in the URL. I just end up at my 404 not found page.

Which got me thinking, I know it is typically not recommended to 301 meta refresh redirect from a 404 page, so what if we simply checked the "THE_REQUEST" variable for an encoded "?" (ie - %3F) and then just redirected to a page/script that would then analyze the URL and do the meta-refresh 301 redirect?

Would this be easier and more efficient from a performance perspective? Or is it just a hack? =)

Let me know what you guys think.

Thanks!

jdMorgan




msg:4138510
 3:31 pm on May 25, 2010 (gmt 0)

It's a hack. A bad one. Figure out the real problem and fix that.
Don't ever do hacks -- They can put you out of business -- Fast.

Sorry, I've been told I'm rather opinionated... :)

---

What, did I get the encoded "?" character value wrong? Let me look at this a bit later today and correct it if so. None of it will work if that "?" value isn't right.

Jim

jdMorgan




msg:4138530
 3:41 pm on May 25, 2010 (gmt 0)

Sorry, I certainly got the encoded value of "?" wrong -- I spent too much time on the logic and the code, and not enough on looking up the encoded values... :)

I corrected the code above to prevent anyone else tripping on this problem -- The whole rule-set depends on finding the first "?" or "%3F" in the requested URL-path, and it certainly can't work if that match-value is wrong - as it was.

So, now we can move on to finding my next typo... ;)

Jim

maximillianos




msg:4138883
 6:27 pm on May 25, 2010 (gmt 0)

You are right. It would be a hack. =)

I tried the code again, no luck. It seems to be skipped/not processed. I tried pulling out the %25 stuff to simplify it, but I was not able to get it to compile after doing so.

I'll play around with it some more and report back.

jdMorgan




msg:4139169
 8:51 pm on May 25, 2010 (gmt 0)

Let's make sure we're back on the same page here...

How about posting an actual requested URL that clearly shows the "bad link" format -- in domain "example.com" of course? Give me "the worst of the worst" malformed links you've seen... :)

Jim

maximillianos




msg:4139205
 9:17 pm on May 25, 2010 (gmt 0)

Here is one of the worst cases, it has the "?" encoded along with two "=" signs:

http://www.example.com/myPage.cgi%3Fcid%3D483&sid%3D3

I want to translate the above to:

http://www.example.com/myPage.cgi?cid=483&sid=3

Ugly ain't it? ;-)

jdMorgan




msg:4139288
 10:06 pm on May 25, 2010 (gmt 0)

OK, fix the second RewriteCond in the second rule as corrected in my post above -- The parentheses nesting was wrong.

I'll wager it works better now... :)

Jim

maximillianos




msg:4140277
 4:02 pm on May 26, 2010 (gmt 0)

Unfortunately I'm at a loss due to my novice-level understanding of the reg-ex syntax. It seems to be ignoring it, which may mean one of the conditions is not matching up right.

I'm going to try implementing it one piece at a time and see if I can get any of the individual pieces to work.

Let you know how it goes!

maximillianos




msg:4140316
 4:19 pm on May 26, 2010 (gmt 0)

Here is what I've got so far:

The first block works:


# If THE_REQUEST contains a URL with a percent-encoded "?" and/or a query string with
# one or more specific percent-encoded characters and we're not already fixing it,
# then copy the client-requested URL-plus-query-string into the "MyURI" variable.
RewriteCond %{ENV:MyURI} =""
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?\ ]*\?[^%\ ]*\%(25)*(3[Dd]|26)[^\ ]*)\ HTTP/ [OR]
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^%\ ]*\%(25)*([^3].|.[^Ff]))*[^%\ ]*\%(25)*3[Ff][^\ ]*)\ HTTP/
# OUT FOR TEST -- RewriteRule ^. - [NE,E=MyURI:%1]
RewriteRule ^. http://www.example.com [R=301,L]


I am able to get through it and redirect to a test page okay.

However from this point, I cannot get any of the remaining blocks of code to work. I tried isolating each of the next two blocks that check for and replace the encoded "%3F" and the "%3D", but both seem to get skipped.

Any ideas?

jdMorgan




msg:4140789
 8:40 pm on May 26, 2010 (gmt 0)

Arrgh!

The first RewriteCond of the second rule was wrong -- I forgot to change it while making initial "improvements."

See correction above.

Jim

maximillianos




msg:4141008
 11:32 pm on May 26, 2010 (gmt 0)

Great work Jim. It works like a charm! So far it has correctly re-directed every link I tested. Nice!

Now the question, should I be allowing links from .ru sites, and scraper sites, etc? On one hand, I feel if they are going to scrape me, why not get a link back right? On the other hand, will it hurt my rankings if G sees all these spammy sites linking to me?

That is a question for another day perhaps. Great work and thanks so much! I owe you a beer!

jdMorgan




msg:4141145
 1:50 am on May 27, 2010 (gmt 0)

Just a heads-up that I found a few shortcomings in this code for general-purpose use. It may work fine for your specific URLs, but I am making some improvements to make it much more robust and more efficient.

I hope to post a revised version within 23 hours or so -- a lot of testing is needed.

Jim

jdMorgan




msg:4141354
 5:50 am on May 27, 2010 (gmt 0)

OK, here's an improved version of the code. However, one thing to note is that in many cases, it should not be necessary to replace anything except encoded "?" characters in the URI. Many if not most scripts will readily accept encoded "=" and "&" characters.

The encoded "?" is really the major problem, because it prevents the intended query string from being recognized as a query string. Instead, it is seen as part of the URL-path -- The "filepath" if we use that term loosely.

So the main reason you might want to redirect URI requests with encoded "=" and "&" characters would be SEO: redirecting them would "fix" search engine listings that had resulted from links with those characters encoded, and the redirect would reduce the number of links to your site with encoded characters -- at least you'd get fewer from Webmasters who test their outbound links before publishing them, and who notice that you redirect the request to a 'cleaned-up' URI.

Anyway, if you don't need or want to go through the bother of un-encoding the "=" and "&" characters, then a much shorter and much more efficient solution is possible. Just three lines of code:

# If an encoded "?" is present in the requested URI, and no unencoded "?" is
# present, then externally redirect to replace the encoded "?" character.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?\ ]+)\ HTTP/
RewriteCond %1 ^(([^%]*(\%(25)*([^3].|.[^F]))*)*)\%(25)*3F(.*)$ [NC]
RewriteRule ^. http://www.example.com/%1?%7 [NE,R=301,L]


That said, here's the new version, which is much more robust than the previous one as far as recognizing when it should act. The previous version could miss a few cases under certain circumstances.

# If THE_REQUEST contains a URL-path with a percent-encoded "?" and/or a query string with one
# or more specific percent-encoded characters, and we're not already in the process of fixing
# it, then copy the client-requested URL-path-plus-query-string into the "MyURI" variable.
RewriteCond %{ENV:MyURI}>%{THE_REQUEST} ^>[A-Z]+\ /([^\ ]+)\ HTTP/
RewriteCond %1 ^([^?]*\?([^%]*(\%(25)*([^3].|.[^D]))*)*\%(25)*3D.*)$ [NC,OR]
RewriteCond %1 ^([^?]*\?([^%]*(\%(25)*([^2].|.[^6]))*)*\%(25)*26.*)$ [OR]
RewriteCond %1 ^(([^%]*(\%(25)*([^3].|.[^F]))*)*\%(25)*3F.*)$ [NC]
RewriteRule ^. - [NE,E=MyURI:%1]
#
# If any encoded question mark is present in the client-requested URI, and
# no unencoded question mark is present, replace the first encoded question
# mark, queue up a redirect, and then re-start mod_rewrite processing
RewriteCond %{ENV:MyURI} ^[^?]+$
RewriteCond %{ENV:MyURI} ^(([^%]*(\%(25)*([^3].|.[^F]))*)*)\%(25)*3F(.*)$ [NC]
RewriteRule ^. - [NE,E=MyURI:%1?%7,E=QRedir:Yes,N]
#
# If any encoded "=" sign follows the "?", replace it, queue
# up a redirect, and re-start mod_rewrite processing
RewriteCond %{ENV:MyURI} ^([^?]*\?([^%]*(\%(25)*([^3].|.[^D]))*)*)\%(25)*3D(.*)$ [NC]
RewriteRule ^. - [NE,E=MyURI:%1=%7,E=QRedir:Yes,N]
#
# If any encoded ampersand follows the "?", replace it, queue
# up a redirect, and then re-start mod_rewrite processing
RewriteCond %{ENV:MyURI} ^([^?]*\?([^%]*(\%(25)*([^2].|.[^6]))*)*)\%(25)*26(.*)$
RewriteRule ^. - [NE,E=MyURI:%1&%7,E=QRedir:Yes,N]
#
# If we get here, there are no more percent-encoded characters which can
# and should be replaced by the rules above, so do the external redirect
RewriteCond %{ENV:QRedir} =Yes [NC]
RewriteRule ^. http://www.example.com/%{ENV:MyURI} [NE,R=301,L]

Jim

maximillianos




msg:4142069
 4:57 pm on May 27, 2010 (gmt 0)

Good observation regarding the main issue being the encoded "?". You are correct about that... I tested it. The scripts now accept the request and hand it off to the proper file, but in this scenario my programs would need to be tweaked to parse the query string properly with the encoded characters OR the real delimiters.
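As an illustration of the parsing problem, here is how one stock parser, Python's `urllib.parse.parse_qs`, treats the query string from the example URL earlier in the thread once only the "?" has been fixed. Other languages and frameworks may behave differently, so this is a sketch of the issue rather than a statement about any particular script.

```python
from urllib.parse import parse_qs

# Query string from the earlier example, with "=" still percent-encoded.
encoded_query = "cid%3D483&sid%3D3"
clean_query = "cid=483&sid=3"

# The parser splits on literal "=" only, so each encoded pair is seen as a
# parameter NAME with an empty value.
print(parse_qs(encoded_query, keep_blank_values=True))

# With real delimiters the names and values come out as intended.
print(parse_qs(clean_query))
```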

Since it kind of poses a dup content issue, I am leaning towards the original solution of just 301'ing them all to the same URL structure.

Great work Jim. I'm learning a lot. Thank you.

kidcobra




msg:4204969
 8:11 pm on Sep 21, 2010 (gmt 0)

Hi Jim. You helped me out with a headache a long time ago, and also encouraged me to fix some related issues, which I did spend a rather long day doing, and your time and help was greatly appreciated. I read this thread with great interest, because our site does get encoded incoming links that return 404's. I did some quick testing of the long solution above on some incoming links that Google had picked up over the past 6 months as 404's, and it appears to work seamlessly and perfectly. I just wanted to ask you if there have been any reports of extraneous or other issues or problems from anyone using the long solution above. Just figured to check with you before I leave it in htaccess for good. Thanks Jim,

Greg

wreilly202




msg:4259744
 12:17 am on Jan 29, 2011 (gmt 0)

Jim I have been using your stuff for some time and cannot thank you enough. And this is a masterpiece thank you!

And naturally I need to modify it and I don't pretend to understand it...yet.. Could this be limited to just a single uri like www.foo.com/bar/...

jdMorgan




msg:4260911
 1:36 am on Feb 1, 2011 (gmt 0)

Yes, but due to the complexity of the rule-set, the fact that I don't like to tweak complicated code after it's been widely-vetted, and the ambiguity of your "single URI" above, I'd suggest simply preceding the code above with a simple rule that skips the following five rules if the requested URL-path does NOT match the one you want to fix.

RewriteRule !^the-URL-I-want-to-fix - [S=5]

Jim

wreilly202




msg:4260929
 2:09 am on Feb 1, 2011 (gmt 0)

Thank you Jim that is perfect.

incrediBILL




msg:4289367
 4:58 am on Mar 30, 2011 (gmt 0)

Some things change the '+' in parms to %2520, such as "United+States" becomes "United%2520States" which could be added to this too for a more comprehensive version :)
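For reference, "%2520" is a space that has been percent-encoded twice ("%20", then the "%" itself encoded as "%25"). A small Python sketch of that round trip follows; the helper name `fully_unquote` and its pass limit are hypothetical, and it decodes every escape, so it is not a drop-in addition to the rules above.

```python
from urllib.parse import quote, unquote


def fully_unquote(value: str, limit: int = 5) -> str:
    """Hypothetical helper: unquote repeatedly until the value stops changing."""
    for _ in range(limit):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value


# Encoding a space twice yields the "%2520" form seen in such links.
double_encoded = quote(quote("United States"))
print(double_encoded)
print(fully_unquote(double_encoded))
```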
