I'd really like a RedirectMatch solution, but here's the closest I've found for a mod_rewrite fix, posted by jdMorgan in an older thread:
Jim's example
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{QUERY_STRING} ^aaaa=bbbb\.htm
RewriteRule ^index\.htm$ http://www.example.com/? [R=301,L]
Nope! Tried this and it gave a 500 server error:
RewriteCond %{QUERY_STRING} ^M=A
RewriteRule ^/widgets$ http://www.example.com/widgets? [R=301,L]
The URLs will be either /widgets/?N=M or (mostly) /widgets?A=D. There are many of them, across two different seasonal directories, /gadgets/ and /widgets/, so a generic fix that works whichever directory it is would probably be better.
I've tried at least three dozen variations today, done in different ways, and more over the last year or two, since it killed those /subdirectories/. Either the rule didn't work and I got the same URL returned, or (mostly) I got a 500 Internal Server Error.
Is it the particular code, or could there be something else in, or missing from, .htaccess that prevents it from working? (I've had that happen.)
I'll be killing the directories altogether and setting up subdomains to escape the issue for the time being, unless a fix is found. I suspect it's something deliberate being done, though, and it's bound to happen again, and to other people too.
[edited by: Marcia at 6:22 am (utc) on Feb. 13, 2008]
So, ?a=111&b=22&c=3 and ?b=22&a=111&c=3 and ?c=3&b=22&a=111 (and another three orderings) are ALL redirected to www.example.com/page/111/22/3/, forcing the www at the same time.
However, all URLs with extra parameters, missing parameters, or the wrong number of characters in a parameter value are served the 404 page directly. The first parameter is always two digits, the second always five digits, and the third always one digit. The site is already indexed with parameter URLs, so this migrates only "good" URLs to the new "static" format and rejects all others. Old URLs with "session IDs" are also rejected in the move (a separate set of rules strips the session ID out and redirects to the static format). Parameter-based URLs that include session IDs come from people cutting and pasting a URL from their browser's URL bar into a page on some other site. You only saw session IDs if you were logged in; sessions are now handled with cookies instead.
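As a rough illustration of the approach described above, here is a minimal sketch rather than the actual rules in use. It assumes Apache 2.2 or later (PCRE patterns, R=404 support), an .htaccess file in the document root, and that the old parameter URLs were served by /somefile.php; the digit counts follow the description and would need adjusting to the real data.

```apache
RewriteEngine on
# Act only on URLs the client actually asked for, so that a later internal
# rewrite back to the script cannot cause a redirect loop
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/somefile\.php\?
# Exactly three name=value pairs, each named a, b, or c
RewriteCond %{QUERY_STRING} ^[abc]=[0-9]+&[abc]=[0-9]+&[abc]=[0-9]+$
# Find a (two digits) anywhere in the query string; its value is %2
RewriteCond %{QUERY_STRING} (^|&)a=([0-9]{2})(&|$)
# Carry a forward and find b (five digits); now a is %1 and b is %3
RewriteCond %2:%{QUERY_STRING} ^([0-9]+):(.*&)?b=([0-9]{5})(&|$)
# Carry a/b forward and find c (one digit); now a/b is %1 and c is %3
RewriteCond %1/%3:%{QUERY_STRING} ^([0-9]+/[0-9]+):(.*&)?c=([0-9])(&|$)
RewriteRule ^somefile\.php$ http://www.example.com/page/%1/%3/? [R=301,L]
# Any other direct request for the script gets a 404
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/somefile\.php[?\s]
RewriteRule ^somefile\.php$ - [R=404,L]
```

Each RewriteCond can only see the back-references of the condition matched just before it, which is why the values already found are prepended to the next test string and captured again.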
For the "static" URLs, the 404 page is served unless exactly three "parameter folders" (www.example.com/page/111/22/3/) are present. If there are any normal parameters (?this.is.spam) on the end of the "static" URL, a 301 redirect removes those parameters and forces the www at the same time. If the number of characters in the URL is wrong, the 404 page is also served.
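The handling of the "static" URLs might be sketched like this. It is an assumption-laden outline, not the live ruleset: it assumes Apache 2.2 or later, an .htaccess file in the document root, and purely numeric folder names without enforcing their exact lengths.

```apache
RewriteEngine on
# A query string on a well-formed static URL: redirect to the same path
# without it (the trailing ? drops the query string)
RewriteCond %{QUERY_STRING} .
RewriteRule ^(page/[0-9]+/[0-9]+/[0-9]+/)$ http://www.example.com/$1? [R=301,L]
# Anything else under /page/ that is not exactly three numeric folders
# followed by a trailing slash is answered with a 404
RewriteCond %{REQUEST_URI} !^/page/[0-9]+/[0-9]+/[0-9]+/$
RewriteRule ^page(/|$) - [R=404,L]
```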
Index files are redirected to remove the index file filename, and non-www is redirected to www.
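A common way to express those two redirects is sketched below. The index file names (index.html, index.htm, index.php) are assumptions, as is Apache 2.x with an .htaccess file in the document root.

```apache
RewriteEngine on
# Redirect direct client requests for an index file to the bare directory URL.
# Testing THE_REQUEST avoids looping when DirectoryIndex serves the file internally.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/([^?\s]*/)?index\.(html?|php)[?\s]
RewriteRule ^(([^/]+/)*)index\.(html?|php)$ http://www.example.com/$1 [R=301,L]
# Redirect the non-www hostname to www, keeping the path and query string
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```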
Finally, there is a rewrite to connect the /page/111/22/3/ URL to the internal /somefile.php?a=111&b=22&c=3 server file path. The script itself does a check on the parameter values and if they are incorrect in any way it has its own 404 handling (say value "b" runs from "00001" to "15690" and you asked for "84360" for example).
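The internal rewrite itself can be quite short. This sketch assumes an .htaccess file in the document root and leaves range checking of the values to the script, as described above.

```apache
RewriteEngine on
# Map the three numeric "folders" onto the script's parameters.
# No R flag, so this is an internal rewrite and the browser URL is unchanged.
RewriteRule ^page/([0-9]+)/([0-9]+)/([0-9]+)/$ /somefile.php?a=$1&b=$2&c=$3 [L]
```

Any rule that externally redirects requests for /somefile.php needs to test %{THE_REQUEST} rather than the current URL-path; otherwise the rewritten request would be redirected straight back to the static URL.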
Looking at the log files, it is apparent that there are a LOT of broken URLs in links pointing at the site. The Google site: command also shows many weird responses, but as the new URLs are being indexed, there are no oddities noted with those. The "old" URLs are slowly dropping out. Yahoo is taking very much longer, but I am sure they will also get there in the end.
# Needed for mod_rewrite to work in .htaccess
Options +FollowSymLinks
#
# Turn on the rewriting engine
RewriteEngine on
#
# If query string is bogus
RewriteCond %{QUERY_STRING} &?A=D&? [OR]
RewriteCond %{QUERY_STRING} &?M=A&? [OR]
RewriteCond %{QUERY_STRING} &?N=M&?
# and directory is /gadgets or /widgets, redirect to remove query string
RewriteRule ^((gadgets|widgets).*)$ http://www.example.com/$1? [R=301,L]
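Because the three conditions above are not anchored, they would also match a longer parameter such as DATA=DOG. A slightly stricter variant, offered only as a sketch and assuming the bogus pairs always appear as complete name=value parameters, could be:

```apache
# Match the bogus pairs only as whole name=value parameters
RewriteCond %{QUERY_STRING} (^|&)(A=D|M=A|N=M)(&|$)
# and directory is /gadgets or /widgets, redirect to remove query string
RewriteRule ^((gadgets|widgets).*)$ http://www.example.com/$1? [R=301,L]
```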
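Alternatively, to remove any non-blank query string from requests for those two directories:
# Needed for mod_rewrite to work in .htaccess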
Options +FollowSymLinks
#
# Turn on the rewriting engine
RewriteEngine on
#
# If query string is non-blank
RewriteCond %{QUERY_STRING} .
# and directory is /gadgets or /widgets, redirect to remove query string
RewriteRule ^((gadgets|widgets).*)$ http://www.example.com/$1? [R=301,L]
or to remove query strings on all resources within the site:
# Needed for mod_rewrite to work in .htaccess
Options +FollowSymLinks
#
# Turn on the rewriting engine
RewriteEngine on
#
# If query string is non-blank
RewriteCond %{QUERY_STRING} .
# redirect to remove query string
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
Everything depends on precisely what URLs and query strings are involved, whether your site is dynamic, and if so, whether all URL-paths within your site are dynamic. Use the most-specific solution if in doubt.
Notes:
Jim
For some reason correct URLs show up with extra characters added after the .htm. Here are some examples:
.htm >
.htm!>!m<!
.htm!>
.htm!>
.htm&_hash_;great+outdoors
.htm&_hash_;sea
.htm&familyfilter=1
.htm_
.htmindex.php
What I would like to do is find a simple way in the .htaccess file to remove everything after the .htm so the visitor would go to the correct URL which is there.
Recently another problem has appeared in the error log file which would seem to be similar and might be solved in the same way. The correct URL is there followed by a / and another file name. Here is an example:
/directory/filename.htm/filename.htm
The second file name is in the same directory but just needs to be removed with the /.
I have searched and searched for examples of how I might fix these problems. The information in this thread seems close to what might work, but I just don't have enough knowledge to know whether it will.
Thanks for any help.
However, I've also got a serious "anti-abuse" streak in me, and malformed links are often intentional efforts to mess up a site's search results listings, so I'll make an exception here and present a short "Defense Against the Dark Arts" session:
For this specific case, where you have problems only with ".htm" files, the solution is easy, assuming you've got mod_rewrite or mod_alias installed, tested, and working on your server:
mod_alias:
RedirectMatch 301 ^/(([^/]+/)*[^.]+\.htm).+$ http://www.example.com/$1
mod_rewrite:
RewriteRule ^(([^/]+/)*[^.]+\.htm).+$ http://www.example.com/$1 [R=301,L]
However, since this shorter pattern is essentially un-anchored, many "trial" match attempts may be needed to match it, whereas the more complex pattern above can be matched in a single left-to-right pass, except for the single internal loop looking for the last slash (if any) in the URL-path.
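For comparison, a shorter pattern of the kind being contrasted here might look like the following. The exact pattern is an assumption made for illustration; the thread does not show which shorter form was meant.

```apache
# Hypothetical shorter form: the greedy .* first runs to the end of the
# URL-path, then backtracks until ".htm" plus at least one more character fits
RewriteRule ^(.*\.htm).+$ http://www.example.com/$1 [R=301,L]
```

The longer pattern avoids most of that backtracking because each part of it can only match one kind of character: path segments up to a slash, then a filename up to the first period.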
For more information, see the documentation cited in our Forum Charter.
Jim
The first mod_alias example worked like a dream. However, it was not the first one I tried; that was the other one. That one didn't want to work at first, until I moved the line up near the top of the .htaccess file. Then it worked for some of the examples, but not for any containing a space or a +. That's when I tried the mod_alias example and ran down the list of lines I wanted to wipe out. They all worked, even the ones with the space or the +. There appeared to be no slowdown accessing the page while the odd characters were found and removed.
I really appreciate the help. I have the Library pages on the basics marked so I can study them to better understand what to do the next time.
Thank you very much for the help,
Rhoda