How would I redirect:
realfile.html/folder/somethingwrong.html
and realfile.html/somethingwrong.html to realfile.html?
Or anything with a trailing slash after a filename?
I already have:
# Remove trailing slash if filetype present in URL
RewriteRule ^(.+\.[^/]+)/$ http://www.example.com/$1 [R=301,L]
but so far I haven't been able to modify that, or find something else to correct this problem as well.
I'd be happy to send a 404 if that's easier.. just not a 200.
[edited by: LunaC at 7:41 pm (utc) on July 15, 2006]
This is the core problem, and you need to find out why. It's unacceptable.
One of the most common errors in server configuration occurs with the ErrorDocument directive. Even though the results are clearly explained in the documentation [httpd.apache.org], many servers are configured using a canonical URL for ErrorDocument, and this causes problems. To clarify by example, the following code will result in a 302 response followed by a 200-OK, rather than the desired 404-Not Found:
ErrorDocument 404 http://www.example.com/custom404.html
Use a local URL-path instead, so that the server returns the 404 status itself:
ErrorDocument 404 /custom404.html
Jim
I tried deleting my htaccess and testing the headers to see if I'd made an error but it still returns a 200. My 404 in my htaccess is written like you said, no canonical, just file.
ie: ErrorDocument 404 /404.shtml
If it's anything in my code, I can't find any errors. Header checks have been run against 301 redirects, error pages (ie. example.com/oops etc.) and normal URLs. All are behaving as I'd expect except .shtml/ and .shtml/folder/etc.html.
I did also see my 404 page indexed in google from a not-there page. Tested headers there too.. 404.. Google glitch or the same problem?
I shudder to ask, what "complicated work-around" am I going to have to do? I'm ready to give it a shot. The bad files are spreading fast in the search engines. The other site I saw this happen to is now unlisted everywhere.. more than a bit scary to see it again.
[edited by: LunaC at 10:39 pm (utc) on July 16, 2006]
apache.org was one of the few sites that served a 404. After a few hours of trying to find the answer there, I still can't find how to set it to return a 404.
So it brings me back to wondering why the spiders are seeing it on mine. Anyone else seen this in their logs? How the heck do I fix it?
On Apache, once the server finds a filetype, it stops parsing the URL, so /foo.html/bar.php will always resolve to /foo.html with a 200-OK response.
But I'm also not clear on your testing status: Please describe what happens when you use the code you posted in your first message above: Does it do anything? What is the server response code when/if the redirect is followed? (Use the Live HTTP Headers Firefox extension for a complete and 'trustable' server headers report.)
That code came straight off one of my working servers in my "Fix Yahoo" section.
Or does your server just do nothing with that code in place? If so, have you got any other working rewrite rules at this point? -- You may need the requisite preamble of
Options +FollowSymLinks
RewriteEngine on
Also, as stated in the comments of that code, it is intended only to remove a trailing slash from a URL-path that contains a filetype (as indicated by the presence of a "." in the URL-path) before that trailing slash. It won't remove an attached 'extra' local-URL-path. To do that, the code needs to be modified by removing the pattern end-anchor and making the trailing slash optional:
# Remove extra URL-path info if filetype present in URL
RewriteRule ^([^.]+\.[^/]+)/? http://www.example.com/$1 [R=301,L]
Jim
[edited by: jdMorgan at 4:45 pm (utc) on July 18, 2006]
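For readers who would rather get the 404 asked about earlier than a redirect: the behaviour described above, where /foo.html/bar.php resolves to /foo.html, is Apache's handling of trailing path information. Apache 2.0.30 and later have a core directive, AcceptPathInfo, that controls it. The following is only a minimal sketch, assuming such a version and that FileInfo overrides are allowed in .htaccess; neither is confirmed anywhere in this thread, and the directive does not exist in Apache 1.3.

```apache
# Assumption: Apache 2.0.30 or later with "AllowOverride FileInfo" (or All)
# in effect for this directory. Not available in Apache 1.3.
#
# Reject requests that carry extra path info after a real file,
# e.g. /foo.html/bar.php, so that only literal paths that exist are served.
AcceptPathInfo Off
```

With this in place the server should answer such requests with 404-Not Found rather than 200-OK. It does nothing for visitors arriving on the bad URLs, though, so a 301 redirect may still be the friendlier choice.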
I have this in my htaccess to correct an older problem:
# Remove trailing slash if filetype present in URL
RewriteRule ^(.+\.[^/]+)/$ http://www.example.com/$1 [R=301,L]
When I test the headers for real-file.shtml/ I get a smooth 301 to real-file.shtml (exactly as I'd expect). Pages are slowly disappearing from search results that had the wrong url (this is very, very good), correct urls are remaining (Whew, excellent!). Working exactly as I'd hoped. This is the result I want for real-file.shtml/junk/more.shtml .. bad urls dropped.
When I test real-file.html/something-wrong.shtml I get a 200 OK.
The problem is that the page shows the content of real-file.html, but at the location real-file.html/something-wrong.shtml (i.e. the bad URL stays in the address bar and any non-absolute links, such as CSS and images, are broken, but the content is readable and exactly the same as real-file.shtml). The engines are apparently seeing it as a perfectly valid URL.. but with 100% duplicate content.
When I check site:www.example.com I see more and more urls like:
www.example.com/real-file.shtml/something-wrong.html
www.example.com/another-file.shtml/folder-that-doesnt-exist/more-wrong.html
I had this happen on another site, I decided to let it go, thought the engines would sort it out. That site never recovered, 1 1/2 years later, still tanked in all engines. I can't know 100% this was the cause.. but it's eerily similar to what's starting now.
So what I'm trying to do is somehow let the bots know that those URLs are wrong, either by sending a 404 or 301ing them to the correct URL.
----------
I just tested the code you last posted in a totally empty htaccess, cache was cleared, browser restarted. Not what I expected at all.
It 301'd example.com/ to example.com/index.shtml ... weird! Firefox popped up the error page saying that the redirect will never finish. The page wouldn't load, no more headers sent after 301. I'm guessing the server's rules to force a trailing slash overwrote that and added the default file?
----------
apache.org and webmasterworld.com both serve a 404 for real-file.html/junk. Seems like that should be the default for a bad url that never existed. I want my websites to do the same or at least 301 to get bad urls dropped.
And thank you for taking the time to try to help here.. I know you get asked a ton of questions.
I did find a problem in the code I posted... Yes, we get a lot of questions here, so mistakes are inevitable... :)
Try this instead:
# Remove extra URL-path info if filetype present in URL
RewriteRule ^([^.]+\.[^/]+)/ http://www.example.com/$1 [R=301,L]
So, the only practical difference between this and the originally-posted rule is that the end-anchor ($) has been removed, allowing this rule to 'drop' the trailing slash and anything that follows it.
Casual readers can take this thread as an object lesson in the fact that regular-expressions are subtle but extremely powerful. Just one little typo and... big trouble.
Jim
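To make the difference between the three patterns in this thread concrete, here is an annotated sketch. It assumes the rules live in a per-directory .htaccess file, where the leading slash is stripped before matching; the filenames in the comments are the placeholder names used above, plus one hypothetical path (v1.2/file.html) added purely for illustration.

```apache
# 1) ^(.+\.[^/]+)/$      end-anchored: the URL-path must END in a slash.
#      real-file.shtml/                 -> matches, 301 to /real-file.shtml
#      real-file.shtml/junk/more.shtml  -> no match (does not end in "/")
#
# 2) ^([^.]+\.[^/]+)/?   slash optional: the bare filename matches as well.
#      index.shtml                      -> matches, 301 to /index.shtml,
#                                          which matches again: redirect loop
#
# 3) ^([^.]+\.[^/]+)/    slash required, no end-anchor.
#      real-file.shtml                  -> no match, served normally
#      real-file.shtml/                 -> matches, 301 to /real-file.shtml
#      real-file.shtml/junk/more.shtml  -> matches, 301 to /real-file.shtml
#
# Caveat (hypothetical path): a directory name containing a dot, such as
# v1.2/file.html, also matches pattern 3 and would be redirected to /v1.2.
RewriteEngine on
RewriteRule ^([^.]+\.[^/]+)/ http://www.example.com/$1 [R=301,L]
```

Pattern 2 also accounts for the loop reported earlier: a request for example.com/ is mapped to the default file index.shtml, which that pattern matches on every pass.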
That's working perfectly. A nice bonus I didn't expect, the rule for .html/ is covered by the code you wrote above with a smooth 301 as well.. so not even a larger .htaccess file.
Tested with the other rules I'd added before and all are smooth, 1 step, clean 301 redirect.
This couldn't have happened at a better time. I was having a burst of traffic from Yahoo.. I looked and it was to one of the bad urls. Fingers are crossed Slurp and Gbot see the redirect and get the right addresses listed eventually. For now though, the visitors are sent perfectly to where they wanted to go.
Again, a huge thanks!
I am having the exact same problem as LunaC and so I am going to try this fix...I don't want to do that bad thing with a single typo so just to be clear is this how you would put it all together: (I am not sure if the last lines were meant to replace part of the code or be added in.)
Options +FollowSymLinks
RewriteEngine on
# Remove multiple slashes anywhere in URL
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://www.example.com%1/%2 [R=301,L]
#
# Remove trailing slash if filetype present in URL
RewriteRule ^(.+\.[^/]+)/$ http://www.example.com/$1 [R=301,L]
# Remove extra URL-path info if filetype present in URL
RewriteRule ^([^.]+\.[^/]+)/ http://www.example.com/$1 [R=301,L]
Plus I have this in my .htaccess already so do I just leave that alone or should it be merged with the code above:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
And again a huge thanks for your time, I could not have figured this out on my own..
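One way the pieces above might be put together is sketched below. It is not a tested configuration: it assumes a single .htaccess file in the document root, relies on LunaC's report that the "extra URL-path info" rule also covers the plain trailing-slash case (so the older end-anchored rule is left out), and places the existing hostname redirect last. The [NC] flag and the escaped dot in the hostname pattern are small additions that are not in the original snippets.

```apache
Options +FollowSymLinks
RewriteEngine on
#
# Remove multiple slashes anywhere in URL
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://www.example.com%1/%2 [R=301,L]
#
# Remove extra URL-path info (including a bare trailing slash)
# if filetype present in URL
RewriteRule ^([^.]+\.[^/]+)/ http://www.example.com/$1 [R=301,L]
#
# Redirect the non-www hostname to www (existing rule; RewriteEngine
# only needs to be switched on once per file)
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

Because every rule redirects straight to www.example.com, a bad URL requested on either hostname should be fixed in a single 301. Keeping the older trailing-slash rule alongside the newer one should be harmless, just redundant. Checking the response headers after any change is still the safest way to confirm the result.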