Forum Moderators: phranque


AOL browser sends '&' as '&amp;' in URL

"&pageid=1234" is sent as "&pageid=1234"


phunky

3:22 am on Nov 20, 2008 (gmt 0)

10+ Year Member



Hi, I'm new to this community.

I've noticed a thread that describes a problem very similar to what I am experiencing:

[webmasterworld.com...]

I know nothing about .htaccess, except for defining custom error pages :) Well, here's my problem:

It seems that the AOL browser is messing with my URLs, as the server is receiving '&amp;'s instead of '&'s. I believe it is just AOL causing the problem, because this same user browsed to my site in IE and it worked fine.

I believe this would be a simple fix, using the capabilities of the .htaccess file.

Does anyone think a solution would be as simple as replacing any '&amp;'s with '&'s, and then redirecting the result?

If not, could anyone point me in the right direction?

jdMorgan

4:46 am on Nov 20, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's not quite a simple fix, because the tools available in .htaccess are not designed for "looping through" URLs and making multiple corrections efficiently.

For the sake of discussion, what is the maximum number of "&amp;" character sequences you might receive in a request?

Jim

phunky

5:00 am on Nov 20, 2008 (gmt 0)

10+ Year Member



Thanks Jim.

Anywhere from 1 to 9.

9 is a generous estimate; it should not come to 9 except in a few very special circumstances, but I would like to account for 9 occurrences.

The discrepancy occurs after the ? in the URL... but you probably already assumed that.

/Rich

caribguy

6:31 am on Nov 20, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Before jumping to conclusions: an issue like this should affect many webmasters. I normally see things like:

www.example.com aaa.bbb.ccc.ddd - - [14/Nov/2008:12:53:22 -0600] "GET /folder/file.html?var1=foo&var2=bar HTTP/1.1" 200 6399 "http://www.example.com/index.html?var3=true&var4=false" "Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

jdMorgan

6:32 am on Nov 20, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The query string name-value pairs are not part of a URL; they are data attached to the URL, to be passed to the resource (e.g. a script) at that URL. The "?" is just a delimiter between the URL and the query data.
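This split between the URL-path and the query data can be sketched with Python's standard library (the URL here is a made-up example):

```python
from urllib.parse import urlsplit, parse_qs

# A made-up request URL, for illustration only
url = "http://www.example.com/site.php?pageid=1234&login=abc"

parts = urlsplit(url)
print(parts.path)   # the URL-path the server maps to a resource: /site.php
print(parts.query)  # the query data handed to that resource: pageid=1234&login=abc

# The name-value pairs are data, not part of the location itself
print(parse_qs(parts.query))  # {'pageid': ['1234'], 'login': ['abc']}
```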

With that clarification in mind, are these "&amp;" queries being passed to one URL on your site, or to many? If many, are there any commonalities among the URLs these queries could be passed to? As a start, do they all end in ".php", or reside in the same "directory", or anything like that?

The code to fix this kind of problem is awfully inefficient -- maybe awesomely-inefficient would be a more accurate description. Therefore it is critical to avoid running the code unless it's absolutely necessary. You don't want to run it for every request to your server, or you may need to upgrade to a very-high-end dedicated server very soon!

So I'm looking for some URL descriptions here... Things like, "The &amp; query strings could be requested from

  • any URL-path ending in ".php"
  • any URL-path ending in .php, but only in the "categories" directory path
  • any URL-path except for a few files that physically exist: robots.txt, labels.rdf, sitemap.xml, and index.php

    Something like any of those, and be aware I'm asking for URLs, not filepaths here. Be sure your answer is 100% accurate and 100% comprehensive, or we'll either go round and round in this thread, or your server will be crushed under the load of running the rules when it shouldn't have to...

    Jim

    [edited by: jdMorgan at 6:33 am (utc) on Nov. 20, 2008]

    jdMorgan

    7:06 am on Nov 20, 2008 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    caribguy,

    I agree. This is either a bad point-release of the AOL browser, or it's an exploit. But in either case, it may need to be fixed, especially if these deformed queries get into search engine indexes.

    If it is simply a badly-written scraper robot spoofing AOL, I suppose we could just 403 it.

    It could also be badly-copied links -- for example, cutting and pasting a link from a source view into a WYSIWYG editor would result in HTML entity-encoded characters in the URL. If so, then this does fall into the canonicalize-everything bin, because it could happen to anyone.

    Jim

    phunky

    7:45 am on Nov 20, 2008 (gmt 0)

    10+ Year Member



    My site is set up to run almost every file through a 'site.php' in the root directory. This file acts as a filter.

    When a script redirects a user, the PHP echoes a complete HTML page. This page includes a 'meta refresh' tag, an animated GIF to show that the browser has not frozen, and a short message explaining to the user where they are going.

    This 'redirect' problem has only surfaced in AOL, and only with this particular kind of page.

    An example of this HTML output would be:


    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>

    <snip>

    <meta http-equiv="refresh" content="1;url=[b]http://www.example.com/site.php?pageid=123467&login=Ja20TgBNFyX2Utor06K3DgV3HFBfGne6wng90CjMQj6aU202545so2001s906m254[/b]">

    <snip>

    </head>
    <body bgcolor="#000000">

    <snip>

    <p class="redir_verif" align="center">Please <a href="[b]http://www.example.com/site.php?pageid=123467&login=Ja20TgBNFyX2Utor06K3DgV3HFBfGne6wng90CjMQj6aU202545so2001s906m254[/b]">click here</a> if you are stuck on this page for more than 5 seconds.</p>

    <snip>

    </body>
    </html>

    The URLs in bold will always match.

    I have researched this situation and have found it to be a fact that AOL will replace any occurrence of an ampersand with '&amp;'.

    This would be fine, as long as it steers clear of any URLs!

    Here are some example URLs that would be a problem (pre-AOL):

    ../site.php?&login=cC8wKlyu9W7Lgrv898300hFWB17kFjf728Q5vMqp37Xwz5Gzg65Gm7FvdrEnKFmD0&pageid=123467

    http://www.example.com/site.php?&pageid=1&login=cC8wKlyu9W7Lgrv898300hFWB17kFjf728Q5vMqp37Xwz5Gzg65Gm7FvdrEnKFmD0&p=123467

    http://www.example.com/site.php?pageid=13&x=7&y=1&login=cC8wKlyu9W7Lgrv898300hFWB17kFjf728Q5vMqp37Xwz5Gzg65Gm7FvdrEnKFmD0#top
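    Decoding these entities is straightforward outside of .htaccess; for instance, in Python (a minimal sketch, using an abbreviated version of one of the example query strings above):

```python
import html

# A corrupted query string of the kind AOL produces (illustrative value)
corrupted = "pageid=13&amp;x=7&amp;y=1&amp;login=cC8wKlyu"

# html.unescape() decodes HTML character entities, turning each
# "&amp;" back into a bare "&"
fixed = html.unescape(corrupted)
print(fixed)  # pageid=13&x=7&y=1&login=cC8wKlyu
```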

    I hope this clarifies things...

    /Rich

    [edited by: jdMorgan at 3:17 pm (utc) on Nov. 20, 2008]
    [edit reason] Changed to example.com, removed irrelevant HTML. [/edit]

    phunky

    12:08 pm on Nov 20, 2008 (gmt 0)

    10+ Year Member



    PS: I'm not sure of the difference between a filepath and a URL... I'm clear on a filepath, but what distinguishes a URL? That's neither here nor there...

    jdMorgan

    3:38 pm on Nov 20, 2008 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    A URL is used on the Web to locate a resource.
    A filepath is used in the server's filesystem to locate a file.
    A resource is not necessarily a file, unless that resource is a static image or hard-coded HTML page.
    A resource (for example, an HTML Web page) could actually be created by a file with a name totally unrelated to the URL (for example, a single PHP script that creates many of the "pages" on a site).
    Therefore, a URL and a filepath are "associated" but not equivalent.
    The basic function of a Web server is to translate a requested URL into a file request compatible with the file system of the operating system that runs that server, and to do so in a manner that is invisible to Web users.

    If we used filepaths as links on the Web, it would be necessary for everyone to know the disk drive, Apache install directory, Web site account directory, and filename for each "thing" on the Web site they wanted to see. The links would have to change if the server was moved from IIS to Apache, or vice versa -- for one thing, all the backslashes would have to be changed to slashes. Hosting companies would be unable to move a site from one server to another to improve performance of a busy site; if they did, all the links on the site might have to change.

    OK, we will proceed on the assumption that this is not some hacked or improperly-constructed Web page that has been indexed by Google and included in AOL's search results. Give me a few minutes to figure out how to best address the character-string substitution problem...

    Jim

    jdMorgan

    4:30 pm on Nov 20, 2008 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Because we are a discussion forum and our limited number of contributors cannot support a "help desk" or "free coding service" function, we do not usually post code solutions here, except as suggested corrections to code posted by the thread originator. But since this looks like a problem that could happen to anyone, it becomes another one of our many "URL canonicalization" problems, and many people may therefore need a solution.

    So I would suggest adding something like the following to a .htaccess file with previously-tested-and-working RewriteRules in it:


    # Skip this awfully-slow code section if the URL is not one that will legitimately have
    # a query string appended, or if no "&amp;" sequences are present in the query string
    RewriteCond $1 !^page-URL-path [OR]
    RewriteCond %{QUERY_STRING} !&amp;
    RewriteRule (.*) - [S=5]
    #
    # Fix-up a few &amp; sequences with each rule, set
    # the AmpFix variable if any rule is invoked
    RewriteCond %{QUERY_STRING} ^(.*)&amp;(.*)&amp;(.*)&amp;(.*)$
    RewriteRule ^(page-URL-path)$ $1?%1&%2&%3&%4 [E=AmpFix:Yes]
    #
    RewriteCond %{QUERY_STRING} ^(.*)&amp;(.*)&amp;(.*)&amp;(.*)$
    RewriteRule ^(page-URL-path)$ $1?%1&%2&%3&%4 [E=AmpFix:Yes]
    #
    RewriteCond %{QUERY_STRING} ^(.*)&amp;(.*)&amp;(.*)$
    RewriteRule ^(page-URL-path)$ $1?%1&%2&%3 [E=AmpFix:Yes]
    #
    RewriteCond %{QUERY_STRING} ^(.*)&amp;(.*)$
    RewriteRule ^(page-URL-path)$ $1?%1&%2 [E=AmpFix:Yes]
    #
    # If one or more of the above rules was invoked and set the AmpFix
    # variable, do an external 301 redirect to correct the query string
    RewriteCond %{ENV:AmpFix} ^Yes$
    RewriteRule ^(page-URL-path)$ http://www.example.com/$1 [R=301,L]

    I have intentionally broken this down into four rules with a total of nine &amp; fix-ups, instead of trying to code one rule for each case of one through eight occurrences of &amp; plus an additional two-rule set to fix nine occurrences, all of which would be just awesomely slow. I also avoided using the [N] recursion flag of mod_rewrite, which is likewise awesomely slow unless used in a tiny .htaccess file with very few preceding directives.

    Even this "optimized" approach is not very efficient. Taking the first rule as an example, and a query string with exactly three &amp; character sequences in it (for the sake of simplicity of description), this is because the first ".*" subpattern is going to initially match the entire query string, and then it will "back off" one character at a time until it frees-up the last &amp; sequence in the requested query string that it's looking at, and finds match with the first "&amp;" subpattern in the pattern. But then the next subpattern will fail, and force another character-by-character back-off sequence, resulting in a match on the first two "(.*)&amp;" subpatterns, but a failure of the third. Finally, after yet another back-off sequence, all four ".*" subpatterns and all three "&amp;" subpatterns will match.

    The code above refers to "page-URL-path" in every rule. This should be modified to a pattern which matches any and all URLs which use query strings and which you feel could be corrupted. Since you did not refer to any particular URL or set of URLs, this is simply a placeholder for a pattern that matches your needs. It is not necessary to match URLs where the query string will have no effect, and it is not necessary to match anything but "pages"; we specifically want to avoid running these rules for images, external CSS and JavaScript files, robots.txt, sitemap.xml, labels.rdf, and other requests which won't ever use a query string.

    Actually, a robust solution would be to separately detect requests for any of those non-script URLs with any query string attached, and generate a 301 redirect to remove those spurious query strings.

    Warning: This code is not tested. The multiple sequential rewrites may themselves trigger a problem. If you start to see "mysterious repeats" of parts of the URL-path appearing in the URL, stop and let me know. That problem is caused by a long-known but still-unpatched Apache bug that exists in every version I've tested, from Apache 1.3.1 up to Apache 2.2. If we have to use the work-around for this known bug, this may become the slowest, ugliest code I've ever posted... :(

    Jim

    jdMorgan

    6:09 pm on Nov 20, 2008 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Please read the post above before reading this one... Thanks!

    Ok, here is the version of the code that will avoid the nasty Apache mod_rewrite bug [archive.apache.org] I mentioned. I am not sure whether the code posted above will trigger that bug, but if it does, this will avoid it:


    ## Redirect to fix corrupted query strings containing up to ten
    ## "&amp;" HTML character-entities instead of "&" characters
    #
    # Skip this awfully-slow routine if the URL is not one that
    # will legitimately have a query string appended, or if no
    # "&amp;" sequences are present in the query string
    RewriteCond $1 !^page-URL-path$ [OR]
    RewriteCond %{QUERY_STRING} !&amp;
    RewriteRule (.*) - [S=6]
    #
    # Get the query string to a server variable, fixing one
    # "&amp;" sequence while we're at it.
    RewriteCond %{QUERY_STRING} ^(.*)&amp;(.*)$
    RewriteRule .* - [E=Qstring:%1&%2]
    #
    # Fix-up a few &amp; sequences with each rule, using the Qstring variable and
    # leaving the URL unmodified in order to avoid the Apache URL-corruption bug
    # otherwise triggered by sequential rewrites
    RewriteCond %{ENV:Qstring} ^(.*)&amp;(.*)&amp;(.*)&amp;(.*)$
    RewriteRule .* - [E=Qstring:%1&%2&%3&%4]
    #
    RewriteCond %{ENV:Qstring} ^(.*)&amp;(.*)&amp;(.*)&amp;(.*)$
    RewriteRule .* - [E=Qstring:%1&%2&%3&%4]
    #
    RewriteCond %{ENV:Qstring} ^(.*)&amp;(.*)&amp;(.*)$
    RewriteRule .* - [E=Qstring:%1&%2&%3]
    #
    RewriteCond %{ENV:Qstring} ^(.*)&amp;(.*)$
    RewriteRule .* - [E=Qstring:%1&%2]
    #
    # Now do an external 301 redirect to correct the query string
    RewriteRule (.*) http://www.example.com/$1?%{ENV:Qstring} [R=301,L]
    #
    # If the requested URL does not have a corrupted query string
    # appended, we will skip to the rule that follows this line
    #

    Due to the nature of the required changes from the previously-posted code, this version can actually fix ten "&amp;" character-entities in a query string. However, the code will be much faster if fewer character-entity corrections are provided for. I suggest providing only for the maximum number of character-entities that might actually be received with an otherwise-valid request; requests with unexpected query strings or other invalid characteristics should be corrected or rejected out of hand before this routine can be invoked.
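    The net effect of the rule cascade can be modeled in a few lines of Python (a sketch of the logic only; the hypothetical `fix_query` helper is not part of the .htaccess solution, where the real fix happens):

```python
def fix_query(qs: str, max_entities: int = 10) -> str:
    """Mimic the rule cascade: repeatedly replace one '&amp;' with '&',
    up to a fixed budget, then return the corrected query string."""
    for _ in range(max_entities):
        if "&amp;" not in qs:
            break  # nothing left to fix; no redirect would be issued
        qs = qs.replace("&amp;", "&", 1)
    return qs

print(fix_query("pageid=1234&amp;login=abc&amp;x=7"))  # pageid=1234&login=abc&x=7
```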

    Jim