Forum Moderators: phranque


Appended query strings in SERPs and reversing logspamming

         

Marcia

6:15 am on Feb 13, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are some who log-spam using www.example.com/?theirsite.com as their links, but even worse, Yahoo indexes URLs with appended query strings, so those get indexed in multiple variations, with the true URL getting bumped out (or penalized). The real URLs also drop out of Google.

I'd really like a RedirectMatch solution, but here's the closest I've found for a mod_rewrite fix, posted by jdMorgan in an older thread:

Jim's example

Options +FollowSymLinks
RewriteEngine on
RewriteCond %{QUERY_STRING} ^aaaa=bbbb\.htm
RewriteRule ^index\.htm$ http://www.example.com/? [R=301,L]

Nope! I tried this, and it gave a 500 server error.

RewriteCond %{query_STRING} ^M=A
RewriteRule ^/widgets$ http://www.example.com/widgets? [R=301,L]

It'll be either /widgets/?N=M or (mostly) /widgets?A=D - many of them - and with two different seasonal directories, /gadgets/ and /widgets/, so it would probably do better with a generic fix that works for whichever directory it is.

I've tried at least three dozen variations, done in other ways, and either it didn't work and I got the same URL returned, or (mostly) a 500 internal server error. I mean, at least three dozen tries today, and more over the last year or two, since this killed those /subdirectories/.

Is it the particular code, or could there be something else in (or missing from) .htaccess that prevents it from working? (I've had that happen.)

I'll be killing the directories altogether and setting up subdomains to escape the issue for the time being, unless a fix is found. I suspect it's something deliberate being done, though, and it's bound to happen again - and to other people as well.

[edited by: Marcia at 6:22 am (utc) on Feb. 13, 2008]

phranque

7:41 am on Feb 13, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



an internal server error is usually accompanied by an entry in the server error log.

what happens if you get rid of the space after the ? in the substitution string?
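also worth checking: in a per-directory (.htaccess) context, mod_rewrite strips the leading slash from the path before matching, so a pattern like ^/widgets$ never matches, and the variable name must be spelled QUERY_STRING in upper case. a minimal sketch with those two details corrected, using the /widgets example from above:

```apache
# per-directory (.htaccess) context: the matched path has no leading slash
Options +FollowSymLinks
RewriteEngine on

# variable name must be upper-case QUERY_STRING
RewriteCond %{QUERY_STRING} ^M=A
RewriteRule ^widgets$ http://www.example.com/widgets? [R=301,L]
```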

Marcia

8:09 am on Feb 13, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What I'm getting now is just the same URL with the query string, both with and without the space after the ?.

g1smd

11:47 am on Feb 13, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My solution on a site I am working on now is to redirect URLs that have the "correct" query strings (catering for every possible ordering of the parameters) so that they all redirect to one canonical order, but expressed as a static-looking URL.

So ?a=111&b=22&c=3 and ?b=22&a=111&c=3 and ?c=3&b=22&a=111 (and the other three orderings) are ALL redirected to www.example.com/page/111/22/3/, forcing the www at the same time.

However, all URLs with extra parameters, missing parameters, or the wrong number of characters in a parameter value are served the 404 page directly. The first parameter is always two digits, the second always five digits, and the third always one digit. The site is currently indexed with parameter URLs, so this migrates only "good" URLs to the new "static" format and rejects all others.

Old URLs with "session IDs" are also rejected in the move (a separate set of rules strips the session ID out and redirects to the static format). Parameter-based URLs with embedded session IDs are those where people have cut and pasted a URL from their browser address bar into a page on some other site. You only saw session IDs if you were logged in; sessions are now handled with cookies instead.

For the "static" URLs, the 404 page is served if anything other than exactly three "parameter folders" (www.example.com/page/111/22/3/) is present. If any ordinary parameters (?this.is.spam) are appended to the "static" URL, a 301 redirect removes them, forcing the www at the same time. If the number of characters in the URL is wrong, the 404 page is also served.

Index files are redirected to remove the index file filename, and non-www is redirected to www.

Finally, there is a rewrite to connect the /page/111/22/3/ URL to the internal /somefile.php?a=111&b=22&c=3 server file path. The script itself does a check on the parameter values and if they are incorrect in any way it has its own 404 handling (say value "b" runs from "00001" to "15690" and you asked for "84360" for example).
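A minimal .htaccess sketch of that arrangement - the directory name /page/, the parameter names a/b/c, the digit counts (taken from the /page/111/22/3/ example), and somefile.php are all placeholders, and a real ruleset would need one Cond/Rule pair per parameter ordering:

```apache
Options +FollowSymLinks
RewriteEngine on

# One parameter ordering -> the canonical static-looking URL; repeat the
# Cond/Rule pair for each of the six orderings, forcing www as well.
RewriteCond %{QUERY_STRING} ^a=([0-9]{3})&b=([0-9]{2})&c=([0-9])$
RewriteRule ^page$ http://www.example.com/page/%1/%2/%3/? [R=301,L]

# Any query string appended to the "static" URL is stripped by redirect.
RewriteCond %{QUERY_STRING} .
RewriteRule ^page/([0-9]{3})/([0-9]{2})/([0-9])/$ http://www.example.com/page/$1/$2/$3/? [R=301,L]

# Internal rewrite connecting the static URL to the real script; the
# script does its own range checks and 404 handling.
RewriteRule ^page/([0-9]{3})/([0-9]{2})/([0-9])/$ /somefile.php?a=$1&b=$2&c=$3 [L]
```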

Looking at the log files, it is apparent that there are a LOT of broken URLs in links pointing at the site. The Google site: command also shows many weird responses, but as the new URLs are being indexed, there are no oddities noted with those. The "old" URLs are slowly dropping out. Yahoo is taking much longer, but I am sure it will also get there in the end.

jdMorgan

3:00 pm on Feb 13, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For the case at hand, I'd suggest something like:

# Enable mod_rewrite
Options +FollowSymLinks
#
# Turn on the rewriting engine
RewriteEngine on
#
# If query string is bogus
RewriteCond %{QUERY_STRING} (^|&)A=D(&|$) [OR]
RewriteCond %{QUERY_STRING} (^|&)M=A(&|$) [OR]
RewriteCond %{QUERY_STRING} (^|&)N=M(&|$)
# and directory is /gadgets or /widgets, redirect to remove query string
RewriteRule ^((gadgets|widgets).*)$ http://www.example.com/$1? [R=301,L]

The code above is specific to all URL-paths within two specific subdirectories and to three specific query-string name/value pairs (although, as coded, those pairs may appear anywhere in the query string). The assumptions are that only those three name/value pairs are invalid for those specific directory and file paths, that those query strings are valid for other directories, and that other query strings are valid for the URL-paths in the two specified subdirectories. It is also possible to modify the code to remove *all* query strings in these two subdirectories:

# Enable mod_rewrite
Options +FollowSymLinks
#
# Turn on the rewriting engine
RewriteEngine on
#
# If query string is non-blank
RewriteCond %{QUERY_STRING} .
# and directory is /gadgets or /widgets, redirect to remove query string
RewriteRule ^((gadgets|widgets).*)$ http://www.example.com/$1? [R=301,L]

or to remove query strings on all resources within the site:


# Enable mod_rewrite
Options +FollowSymLinks
#
# Turn on the rewriting engine
RewriteEngine on
#
# If query string is non-blank
RewriteCond %{QUERY_STRING} .
# redirect to remove query string
RewriteRule (.*) http://www.example.com/$1? [R=301,L]

Everything depends on precisely which URLs and query strings are involved, whether your site is dynamic, and, if so, whether all URL-paths within your site are dynamic. Use the most specific solution if in doubt.
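Before deploying the first ruleset, its query-string conditions can be sanity-checked offline. This Python sketch mirrors the matching logic of the three conditions, anchored so that only whole name/value pairs match (the helper name is ours, not part of any Apache API):

```python
import re

# The three bogus name/value pairs from the RewriteCond lines above,
# anchored so each must be a whole query-string parameter.
BOGUS = re.compile(r'(^|&)(A=D|M=A|N=M)(&|$)')

def is_bogus(query_string: str) -> bool:
    """True if the query string would trigger the removal redirect."""
    return bool(BOGUS.search(query_string))

assert is_bogus('A=D')          # a bare bogus pair matches
assert is_bogus('x=1&M=A&y=2')  # so does one between other parameters
assert not is_bogus('DATA=DUMP')  # a substring coincidence does not
assert not is_bogus('')           # nor does an empty query string
```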

Notes:

  • The Options line may be required, or it may not be needed, or it may not be allowed. This depends on how your host has configured your server, so the only way to find out is to test it. You can test this without any of the other code in this post to save time.
  • The RewriteEngine line (and the Options line above it) should only appear once in your .htaccess file, at the top of the file before any other mod_rewrite directives.
  • If the pipe characters in the rules above appear as broken-bar "¦" characters, replace them with solid pipe "|" characters before use; the forum software sometimes modifies pipe characters in posted code.
  • Completely flush your browser cache (Temporary Internet files if using IE) before testing any new or modified .htaccess code to avoid 'stale' cached responses.
  • If you get a 500-Server Error, then examine your server error log file. It will often tell you exactly what the problem is. If not already done, see the first note above. Hosting services which do not provide access to server error logs are not suitable for sites which use mod_rewrite or server-side scripting.
  • If other mod_rewrite rules are used within your .htaccess that add or modify query strings, then undesirable interactions (such as infinite rewrite/redirect loops) are possible. There are too many possible interactions to describe here, but generally, those other rules and the ones posted here must be modified so that they will not act upon each other's output.
  • When testing redirects, the "Live HTTP Headers" add-on for Firefox/Mozilla browsers can be used to show all HTTP client request and server response headers, often providing very useful information for troubleshooting.

Jim

g1smd

12:19 pm on Feb 17, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've bookmarked this thread as a good example of code. It seems like one that can be referred to again the next time a similar question arises.

Marcia

5:58 am on May 16, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just thought I'd follow up on this. I don't have any query strings on my own sites (yet), but the solutions here worked perfectly.

Thanks again!

g1smd

11:02 pm on May 16, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Glad it helped.

librarian

1:45 pm on Sep 5, 2008 (gmt 0)

10+ Year Member



After much searching on WebmasterWorld and Google for a way to clean up some problems that have shown up in my error log file, I found this thread. Here are my problems.

For some reason, correct URLs show up with extra characters added after the .htm. Here are some examples:

.htm >
.htm!>!m<!
.htm!>
.htm!>
.htm&_hash_;great+outdoors
.htm&_hash_;sea
.htm&familyfilter=1
.htm_
.htmindex.php

What I would like to do is find a simple way, in the .htaccess file, to remove everything after the .htm so the visitor goes to the correct URL, which is there.

Recently another problem has appeared in the error log file, which seems similar and might be solved the same way. The correct URL is there, followed by a / and another file name. Here is an example:

/directory/filename.htm/filename.htm

The second file name is a file in the same directory; it just needs to be removed along with the /.

I have searched and searched for examples of how I might fix these problems. The information in this thread seems close to what might work, but I just don't know if it will; I don't have enough knowledge to tell.

Thanks for any help.

jdMorgan

5:04 pm on Sep 5, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Our focus here is on discussion and education, so we ask that you make an attempt to code a solution yourself -- see our Forum Charter for more information.

However, I've also got a serious "anti-abuse" streak in me, and malformed links are often intentional efforts to mess up a site's search results listings, so I'll make an exception here and present a short "Defense Against the Dark Arts" session:

For this specific case, where you have problems only with ".htm" files, the solution is easy, assuming you've got mod_rewrite or mod_alias installed, tested, and working on your server:

mod_alias:

RedirectMatch 301 ^(([^/]+/)*[^.]+\.htm).+$ http://www.example.com/$1

mod_rewrite:

RewriteRule ^(([^/]+/)*[^.]+\.htm).+$ http://www.example.com/$1 [R=301,L]

The regular-expression patterns in the lines above are built to be specific and to execute quickly, especially with longer URL-paths. An easier-to-understand but less efficient pattern would be "^(.+\.htm).+$".

However, the greedy ".+" in that shorter pattern can force many "trial" match attempts before it matches, whereas the more complex pattern above can be matched in a single left-to-right pass, except for the single internal loop looking for the last slash (if any) in the URL-path.
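Since mod_alias and mod_rewrite both use Perl-compatible regular expressions, and Python's re module handles these constructs the same way, the pattern's capture behavior can be checked offline (the sample paths are made up):

```python
import re

# The pattern from the RewriteRule above, matched against a URL-path
# with the leading slash already stripped (per-directory context).
CLEAN = re.compile(r'^(([^/]+/)*[^.]+\.htm).+$')

def cleaned(path: str):
    """Return the captured clean path, or None if nothing follows .htm."""
    m = CLEAN.match(path)
    return m.group(1) if m else None

# Junk appended after .htm is dropped, keeping only the real URL-path:
assert cleaned('directory/filename.htm/filename.htm') == 'directory/filename.htm'
assert cleaned('dir/page.htm_') == 'dir/page.htm'
# A clean URL does not match at all, so no redirect loop is possible:
assert cleaned('dir/page.htm') is None
```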

For more information, see the documentation cited in our Forum Charter.

Jim

g1smd

6:47 pm on Sep 6, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think that is several orders of magnitude more efficient than the code I currently use to eliminate a number of different issues.

It will likely serve as a basis for further experimentation.

librarian

6:36 pm on Sep 10, 2008 (gmt 0)

10+ Year Member



I want to thank Jim very much for the help with removing the text and characters that have been appearing after the .htm on my site's pages in the error log file. All the searching I did never came up with any examples that I thought would work, possibly because I wasn't sure what to call what I was looking for.

The mod_alias example worked like a dream, though it was not the first one I tried; I tried the mod_rewrite version first. That one didn't want to work at first, until I moved the line up near the top of the .htaccess file, and then it worked for some of the examples, but not for any containing a space or a +. That's when I tried the mod_alias example and ran down the list of lines I wanted to wipe out. They all worked, even the ones with the space or the +. There appeared to be no slowdown accessing the page while the odd characters were found and removed.

I really appreciate the help. I have the Library pages on the basics bookmarked so I can study them and better understand what to do next time.

Thank you very much for the help,
Rhoda