i've noticed a thread that describes a problem that is very similar to what i am experiencing:
[webmasterworld.com...]
i know nothing about the htaccess, except for defining custom error pages :) well, here's my problem:
it seems that the AOL browser is messing with my URLs, as the server is receiving '&amp;'s instead of '&'s. i believe it is just aol causing the problem, because this same user browsed to my site in IE and it worked fine.
i believe this would be a simple fix, using the capabilities of the htaccess file.
does anyone think that a solution would be as simple as to replace any '&amp;'s with '&'s, and then redirect the result?
if not, could anyone point me in the right direction?
www.example.com aaa.bbb.ccc.ddd - - [14/Nov/2008:12:53:22 -0600] "GET /folder/file.html?var1=foo&var2=bar HTTP/1.1" 200 6399 "http://www.example.com/index.html?var3=true&var4=false" "Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
With that clarification in mind, are these &amp; queries being passed to one URL on your site, or to many? If many, are there any commonalities between the URLs that these queries could be passed to? As a start, do they all end in ".php" or reside in the same "directory" or anything?
The code to fix this kind of problem is awfully inefficient -- maybe awesomely-inefficient would be a more accurate description. Therefore it is critical to avoid running the code unless it's absolutely necessary. You don't want to be running it for every request to your server, or you may need to upgrade to a very-high-end dedicated server very soon!
So I'm looking for some URL descriptions here... Things like, "The &amp; query strings could be requested from
Something like any of those, and be aware I'm asking for URLs, not filepaths here. Be sure your answer is 100% accurate and 100% comprehensive, or we'll either go round and round in this thread, or your server will be crushed under the load of running the rules when it shouldn't have to...
Jim
[edited by: jdMorgan at 6:33 am (utc) on Nov. 20, 2008]
I agree. This is either a bad point-release of the AOL browser, or it's an exploit. But in either case, it may need to be fixed, especially if these deformed queries get into search engines' indexes.
If it is simply a badly-written scraper robot spoofing AOL, I suppose we could just 403 it.
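If it came to that, a minimal and untested sketch of such a block might look like the following. It assumes the offending client always claims "AOL" in its User-Agent string and always sends "&amp;" in the query string; both conditions are guesses, and real AOL users hitting the same problem would be refused as well.

```apache
# Assumption: the bad requests claim to be AOL and carry "&amp;" in the
# query string. Both conditions must match before the request is refused.
RewriteCond %{HTTP_USER_AGENT} AOL
RewriteCond %{QUERY_STRING} &amp;
# "-" leaves the URL alone; [F] returns 403 Forbidden
RewriteRule .* - [F]
```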
It could also be badly-copied links -- for example, cutting and pasting a link from a source-view into a WYSIWYG editor would result in HTML entity-encoded characters in the URL. If so, then this does fall into the canonicalize-everything bin, because it could happen to anyone.
Jim
when a script is redirecting a user, the php will echo a complete html page. this page includes a 'meta-refresh' tag, an animated gif to display that the browser has not frozen, and a short message to the user explaining where they are going.
this 'redirect' problem has only surfaced in aol, and only with this particular kind of page.
an example of this html output would be:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><snip>
<meta http-equiv="refresh" content="1;url=[b]http://www.example.com/site.php?pageid=123467&login=Ja20TgBNFyX2Utor06K3DgV3HFBfGne6wng90CjMQj6aU202545so2001s906m254[/b]">
<snip>
</head>
<body bgcolor="#000000">
<snip>
<p class="redir_verif" align="center">Please <a href="[b]http://www.example.com/site.php?pageid=123467&login=Ja20TgBNFyX2Utor06K3DgV3HFBfGne6wng90CjMQj6aU202545so2001s906m254[/b]">click here</a> if you are stuck on this page for more than 5 seconds.</p>
<snip>
</body>
</html>
The URLs in bold will always match.
I have researched this situation, and have found it to be a fact that AOL will replace any occurrences of an ampersand with '&amp;'.
This would be fine, as long as it steers clear of any URLs!
here are some example URLs that would be a problem (pre-AOL):
../site.php?&login=cC8wKlyu9W7Lgrv898300hFWB17kFjf728Q5vMqp37Xwz5Gzg65Gm7FvdrEnKFmD0&pageid=123467
http://www.example.com/site.php?&pageid=1&login=cC8wKlyu9W7Lgrv898300hFWB17kFjf728Q5vMqp37Xwz5Gzg65Gm7FvdrEnKFmD0&p=123467
http://www.example.com/site.php?pageid=13&x=7&y=1&login=cC8wKlyu9W7Lgrv898300hFWB17kFjf728Q5vMqp37Xwz5Gzg65Gm7FvdrEnKFmD0#top
I hope this clarifies things...
/Rich
[edited by: jdMorgan at 3:17 pm (utc) on Nov. 20, 2008]
[edit reason] Changed to example.com, removed irrelevant HTML. [/edit]
If we used filepaths as links on the Web, it would be necessary for everyone to know the disk drive, Apache install directory, Web site account directory, and filename for each "thing" on the Web site they wanted to see. The links would have to change if the server was moved from IIS to Apache, or vice versa -- for one thing, all the backslashes would have to be changed to slashes. Hosting companies would be unable to move a site from one server to another to improve performance of a busy site; if they did, all the links on the site might have to change.
OK, we will proceed on the assumption that this is not some hacked or improperly-constructed Web page that has been indexed by Google and included in AOL's search results. Give me a few minutes to figure out how to best address the character-string substitution problem...
Jim
So I would suggest adding something like the following to a .htaccess file with previously-tested-and-working RewriteRules in it:
# Skip this awfully-slow code section if the URL is not one that will legitimately have
# a query string appended, or if no "&amp;" sequences are present in the query string
RewriteCond $1 !^page-URL-path [OR]
RewriteCond %{QUERY_STRING} !&amp;
RewriteRule (.*) - [S=5]
#
# Fix-up a few &amp; sequences with each rule, set
# the AmpFix variable if any rule is invoked
RewriteCond %{QUERY_STRING} ^(.*)&amp;(.*)&amp;(.*)&amp;(.*)$
RewriteRule ^(page-URL-path)$ $1?%1&%2&%3&%4 [E=AmpFix:Yes]
#
RewriteCond %{QUERY_STRING} ^(.*)&amp;(.*)&amp;(.*)&amp;(.*)$
RewriteRule ^(page-URL-path)$ $1?%1&%2&%3&%4 [E=AmpFix:Yes]
#
RewriteCond %{QUERY_STRING} ^(.*)&amp;(.*)&amp;(.*)$
RewriteRule ^(page-URL-path)$ $1?%1&%2&%3 [E=AmpFix:Yes]
#
RewriteCond %{QUERY_STRING} ^(.*)&amp;(.*)$
RewriteRule ^(page-URL-path)$ $1?%1&%2 [E=AmpFix:Yes]
#
# If one or more of the above rules was invoked and set the AmpFix
# variable, do an external 301 redirect to correct the query string
RewriteCond %{ENV:AmpFix} ^Yes$
RewriteRule ^(page-URL-path)$ http://www.example.com/$1 [R=301,L]
Even this "optimized" approach is not very efficient. Take the first rule as an example, with a query string containing exactly three "&amp;" sequences (for the sake of simplicity of description). The first ".*" subpattern initially matches the entire query string, then "backs off" one character at a time until it frees up the last "&amp;" sequence in the requested query string, which then matches the first "&amp;" subpattern. But the next "&amp;" subpattern then fails, forcing another character-by-character back-off sequence, which results in a match on the first two "(.*)&amp;" subpatterns but a failure of the third. Finally, after yet another back-off sequence, all four ".*" subpatterns and all three "&amp;" subpatterns match.
The code above refers to "page-URL-path" in every rule. This should be modified to a pattern which matches any and all URLs which use query strings and which you feel could be corrupted. Since you did not refer to any particular URL or set of URLs, this is simply a "placeholder" for a pattern that matches your needs. It will not be necessary to match URLs where the query string will have no effect, and it is not necessary to match anything but "pages"; we specifically want to avoid running these rules for image, external CSS and JavaScript file, robots.txt, sitemap.xml, labels.rdf and other requests which won't ever use a query string.
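As an illustration only: if the sole affected URL were the /site.php script seen in the example URLs earlier in this thread, and the .htaccess file sat in the document root, the placeholder in the skip rule might be filled in as sketched below. Both of those conditions are assumptions, not something confirmed in the thread.

```apache
# Assumption: only /site.php legitimately takes a query string.
# In a document-root .htaccess the matched URL-path has no leading slash.
RewriteCond $1 !^site\.php$ [OR]
RewriteCond %{QUERY_STRING} !&amp;
# Skip the five rules that follow when there is nothing to fix
RewriteRule (.*) - [S=5]
```

The same site\.php pattern would then replace page-URL-path in each of the rules that follow the skip rule.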
Actually, a robust solution would be to separately detect requests for any of those non-script URLs with any query string attached, and generate a 301 redirect to remove those spurious query strings.
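A rough, untested sketch of that idea follows. It assumes a document-root .htaccess, and the list of file names and extensions is illustrative rather than complete.

```apache
# Assumption: none of these static resources ever needs a query string.
# The file list is illustrative only; adjust it to the site.
# Match only when the query string is non-empty
RewriteCond %{QUERY_STRING} .
# The trailing "?" on the substitution clears the query string
RewriteRule ^(robots\.txt|sitemap\.xml|labels\.rdf|.+\.(gif|jpe?g|png|css|js))$ http://www.example.com/$1? [R=301,L]
```

Placing a rule like this ahead of the ampersand fix-up code would keep such requests from ever reaching the slow rules.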
Warning: This code is not tested. The multiple sequential rewrites may themselves trigger a problem. If you start to see "mysterious repeats" of parts of the URL-path appearing as the URL, stop and let me know. This problem is caused by a long-known but still un-patched Apache bug that exists in all versions of Apache that I've tested (I've tested Apache 1.3.1 up to Apache 2.2). If we have to use the work-around for this known bug, this may become the slowest, ugliest code I've ever posted... :(
Jim
Ok, here is the version of the code that will avoid the nasty Apache mod_rewrite bug [archive.apache.org] I mentioned. I am not sure whether the code posted above will trigger that bug, but if it does, this will avoid it:
## Redirect to fix corrupted query strings containing up to ten
## "&amp;" HTML character-entities instead of "&" characters
#
# Skip this awfully-slow routine if the URL is not one that
# will legitimately have a query string appended, or if no
# "&amp;" sequences are present in the query string
RewriteCond $1 !^page-URL-path$ [OR]
RewriteCond %{QUERY_STRING} !&amp;
RewriteRule (.*) - [S=6]
#
# Get the query string to a server variable, fixing one
# "&amp;" sequence while we're at it.
RewriteCond %{QUERY_STRING} ^(.*)&amp;(.*)$
RewriteRule .* - [E=Qstring:%1&%2]
#
# Fix-up a few &amp; sequences with each rule, using the Qstring variable and
# leaving the URL unmodified in order to avoid the Apache URL-corruption bug
# otherwise triggered by sequential rewrites
RewriteCond %{ENV:Qstring} ^(.*)&amp;(.*)&amp;(.*)&amp;(.*)$
RewriteRule .* - [E=Qstring:%1&%2&%3&%4]
#
RewriteCond %{ENV:Qstring} ^(.*)&amp;(.*)&amp;(.*)&amp;(.*)$
RewriteRule .* - [E=Qstring:%1&%2&%3&%4]
#
RewriteCond %{ENV:Qstring} ^(.*)&amp;(.*)&amp;(.*)$
RewriteRule .* - [E=Qstring:%1&%2&%3]
#
RewriteCond %{ENV:Qstring} ^(.*)&amp;(.*)$
RewriteRule .* - [E=Qstring:%1&%2]
#
# Now do an external 301 redirect to correct the query string
# (environment variables are read back with the %{ENV:name} syntax)
RewriteRule (.*) http://www.example.com/$1?%{ENV:Qstring} [R=301,L]
#
# If the requested URL does not have a corrupted query string
# appended, we will skip to the rule that follows this line
#
Jim