.html/anotherfile.html how to 301 to proper URL?

         

LunaC

7:35 pm on Jul 15, 2006 (gmt 0)

10+ Year Member



I've started seeing Gbot, and now today Slurp, request very wrong URLs, and unfortunately my server is responding 200 OK. (I've checked: it isn't caused by a wrong link on my site. I use absolute URLs except for CSS files.)

How would I redirect:

realfile.html/folder/somethingwrong.html
and realfile.html/somethingwrong.html to realfile.html?

Or anything with a trailing slash after a filename?

I already have:

# Remove trailing slash if filetype present in URL
RewriteRule ^(.+\.[^/]+)/$ http://www.example.com/$1 [R=301,L]

but so far I haven't been able to modify that, or find something else to correct this problem as well.

I'd be happy to send a 404 if that's easier.. just not a 200.

[edited by: LunaC at 7:41 pm (utc) on July 15, 2006]

jdMorgan

6:25 pm on Jul 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> unfortunately my server is responding 200 OK.

This is the core problem, and you need to find out why. It's unacceptable.

One of the most common errors in server configuration occurs with the ErrorDocument directive. Even though the results are clearly explained in the documentation [httpd.apache.org], many servers are configured using a canonical URL for ErrorDocument, and this causes problems. To clarify by example, the following code will result in a 302 response followed by a 200-OK, rather than the desired 404-Not Found:


ErrorDocument 404 http://www.example.com/custom404.html

The correct syntax is:

ErrorDocument 404 /custom404.html

Before getting into some complicated work-around, it would be worthwhile to check your code and your server headers to see if this might be the problem.

Jim

LunaC

10:28 pm on Jul 16, 2006 (gmt 0)

10+ Year Member



I thought that seemed wrong, but I've seen this happen on another site that's on a different host.

I tried deleting my htaccess and testing the headers to see if I'd made an error but it still returns a 200. My 404 in my htaccess is written like you said, no canonical, just file.

ie: ErrorDocument 404 /404.shtml

If it's anything in my code, I can't find any errors. Header checks have been run against 301 redirects, error pages (i.e. example.com/oops etc.) and normal URLs. All are behaving as I'd expect except .shtml/ and .shtml/folder/etc.html.

I did also see my 404 page indexed in Google from a not-there page. Tested headers there too: 404. Google glitch or the same problem?

I shudder to ask, what "complicated work-around" am I going to have to do? I'm ready to give it a shot. The bad files are spreading fast in the search engines. The other site I saw this happen to is now unlisted everywhere.. more than a bit scary to see it again.

[edited by: LunaC at 10:39 pm (utc) on July 16, 2006]

LunaC

4:21 pm on Jul 18, 2006 (gmt 0)

10+ Year Member



OK, after a few quick tests on other sites, most return a code 200 for file.html/shouldbea404.php.. even the W3C site returned a 200.

apache.org was one of the few sites that served a 404. After a few hours of trying to find the answer there, I still can't find how to set it to return a 404.

So it brings me back to wondering why the spiders are seeing it on mine. Anyone else seen this in their logs? How the heck do I fix it?

jdMorgan

4:44 pm on Jul 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, I think I got confused.

On Apache, once the server finds a filetype, it stops parsing the URL, so /foo.html/bar.php will always resolve to /foo.html with a 200-OK response.

But I'm also not clear on your testing status: Please describe what happens when you use the code you posted in your first message above: Does it do anything? What is the server response code when/if the redirect is followed? (Use the Live HTTP Headers Firefox extension for a complete and 'trustable' server headers report.)

That code came straight off one of my working servers in my "Fix Yahoo" section.

Or does your server just do nothing with that code in place? If so, have you got any other working rewrite rules at this point? -- You may need the requisite preamble of

Options +FollowSymLinks
RewriteEngine on

if you don't already have that in your code.

Also, as stated in the comments of that code, it is intended only to remove a trailing slash from a URL-path that contains a filetype (as indicated by the presence of a "." in the URL-path) before that trailing slash. It won't remove an attached 'extra' local-URL-path. To do that, the code needs to be modified by removing the pattern end-anchor and making the trailing slash optional:


# Remove extra URL-path info if filetype present in URL
RewriteRule ^([^.]+\.[^/]+)/? http://www.example.com/$1 [R=301,L]

The more details you post about how you tested, what your results were, and how those results differed from your expectations, the fewer possibilities will need to be discussed, and the faster you'll get this resolved.

Jim

[edited by: jdMorgan at 4:45 pm (utc) on July 18, 2006]

LunaC

10:29 pm on Jul 18, 2006 (gmt 0)

10+ Year Member



Yup, I test with Live HTTP Headers too.. great extension!

I have this in my htaccess to correct an older problem:

# Remove trailing slash if filetype present in URL
RewriteRule ^(.+\.[^/]+)/$ http://www.example.com/$1 [R=301,L]

When I test the headers for real-file.shtml/ I get a smooth 301 to real-file.shtml (exactly as I'd expect). Pages are slowly disappearing from search results that had the wrong url (this is very, very good), correct urls are remaining (Whew, excellent!). Working exactly as I'd hoped. This is the result I want for real-file.shtml/junk/more.shtml .. bad urls dropped.

When I test real-file.html/something-wrong.shtml I get a 200 OK.

The problem is that the page shows the content of real-file.html, but at the location of real-file.html/something-wrong.shtml
(i.e. the bad URL stays in the address bar and any non-absolute links, such as CSS and images, are broken, but the content is readable and exactly the same as real-file.shtml). The engines are apparently seeing it as a perfectly valid URL.. but with 100% duplicate content.

When I check site:www.example.com I see more and more urls like:
www.example.com/real-file.shtml/something-wrong.html
www.example.com/another-file.shtml/folder-that-doesnt-exist/more-wrong.html

I had this happen on another site, I decided to let it go, thought the engines would sort it out. That site never recovered, 1 1/2 years later, still tanked in all engines. I can't know 100% this was the cause.. but it's eerily similar to what's starting now.

So what I'm trying to do is somehow let the bots know that those URLS are wrong, either by sending a 404 or 301ing them to the correct URL.

----------
I just tested the code you last posted in a totally empty htaccess, cache was cleared, browser restarted. Not what I expected at all.

It 301'd example.com/ to example.com/index.shtml ... weird! Firefox popped up the error page saying that the redirect will never finish. The page wouldn't load, no more headers sent after the 301. I'm guessing the server's rules to force a trailing slash overwrote that and added the default file?
----------

apache.org and webmasterworld.com both serve a 404 for real-file.html/junk. Seems like that should be the default for a bad URL that never existed. I want my websites to do the same, or at least 301 to get bad URLs dropped.

And thank you for taking the time to try to help here.. I know you get asked a ton of questions.

jdMorgan

1:50 am on Jul 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, I can't explain the problem with that rule redirecting "/" to index.shtml without some other factor involved. More likely, it was mod_dir doing that URL-substitution, which then matched the rule. Since it's an external redirect, that would 'expose' the index.shtml path in the browser address bar. Then the new index.shtml request would have matched the rule again, resulting in a loop.

I did find a problem in the code I posted... Yes, we get a lot of questions here, so mistakes are inevitable... :)
Try this instead:


# Remove extra URL-path info if filetype present in URL
RewriteRule ^([^.]+\.[^/]+)/ http://www.example.com/$1 [R=301,L]

That will require additional path info following /<something>.<something> to invoke the rule.

So, the only practical difference between this and the originally-posted rule is that the end-anchor ($) has been removed, allowing this rule to 'drop' the trailing slash and anything that follows it.
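
To make the difference concrete, here is a rough trace of how the three patterns posted in this thread treat a few sample paths. It assumes the rules live in a .htaccess file in the site root, where mod_rewrite tests the pattern against the path without its leading slash. The sample paths are illustrative only, not taken from anyone's logs.

```apache
# Sketch only: sample paths are illustrative. Assumes a .htaccess file in the
# site root, so patterns are tested against the path without its leading slash.
#
# Pattern 1 (original, end-anchored):   ^(.+\.[^/]+)/$
#   real-file.shtml/                -> match, 301 to /real-file.shtml
#   real-file.shtml/junk/more.html  -> no match, server answers 200
#
# Pattern 2 (optional slash, faulty):   ^([^.]+\.[^/]+)/?
#   real-file.shtml/junk/more.html  -> match, 301 to /real-file.shtml
#   real-file.shtml                 -> also a match, 301 to itself: redirect loop
#
# Pattern 3 (corrected, slash required, no end-anchor):   ^([^.]+\.[^/]+)/
#   real-file.shtml/                -> match, 301 to /real-file.shtml
#   real-file.shtml/junk/more.html  -> match, 301 to /real-file.shtml
#   real-file.shtml                 -> no match, served normally
RewriteRule ^([^.]+\.[^/]+)/ http://www.example.com/$1 [R=301,L]
```

One caveat, as far as I can tell: a directory name containing a dot (say /v1.2/page.html) would also match pattern 3 and be redirected to /v1.2, so a site with such directories would need a narrower pattern.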

Casual readers can take this thread as an object lesson in the fact that regular-expressions are subtle but extremely powerful. Just one little typo and... big trouble.
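
For readers who would rather send a 404 than a redirect: on Apache 2.x the core AcceptPathInfo directive controls whether extra path info after a real file is accepted. The following is a minimal sketch, assuming Apache 2.0.30 or later (the directive does not exist in 1.3), that FileInfo overrides are allowed in .htaccess, and that nothing on the site relies on PATH_INFO. It was not tested on the servers discussed in this thread.

```apache
# Assumption: Apache 2.0.30+ with AllowOverride FileInfo (not available in Apache 1.3)
# Reject requests that carry extra path info after a real file, so that
# /foo.html/bar.php returns 404 Not Found instead of the content of /foo.html
AcceptPathInfo Off
```

Note that this yields a 404 rather than a 301, so visitors arriving on a bad URL see the error page instead of being sent on to the correct one.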

Jim

LunaC

6:52 pm on Jul 19, 2006 (gmt 0)

10+ Year Member



:) I could hug you! Thank you!

That's working perfectly. A nice bonus I didn't expect: the rule for .html/ is covered by the code you wrote above, with a smooth 301 as well.. so not even a larger .htaccess file.

Tested with the other rules I'd added before and all are smooth, 1 step, clean 301 redirect.

This couldn't have happened at a better time. I was having a burst of traffic from Yahoo.. I looked and it was to one of the bad urls. Fingers are crossed Slurp and Gbot see the redirect and get the right addresses listed eventually. For now though, the visitors are sent perfectly to where they wanted to go.

Again, a huge thanks!

proboscis

12:49 am on Jul 20, 2006 (gmt 0)

10+ Year Member



Hi,

I am having the exact same problem as LunaC and so I am going to try this fix...I don't want to do that bad thing with a single typo so just to be clear is this how you would put it all together: (I am not sure if the last lines were meant to replace part of the code or be added in.)

Options +FollowSymLinks
RewriteEngine on
# Remove multiple slashes anywhere in URL
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://www.example.com%1/%2 [R=301,L]
#
# Remove trailing slash if filetype present in URL
RewriteRule ^(.+\.[^/]+)/$ http://www.example.com/$1 [R=301,L]
# Remove extra URL-path info if filetype present in URL
RewriteRule ^([^.]+\.[^/]+)/ http://www.example.com/$1 [R=301,L]

Plus I have this in my .htaccess already so do I just leave that alone or should it be merged with the code above:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^example.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

And again a huge thanks for your time, I could not have figured this out on my own..

jdMorgan

2:55 am on Jul 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Put your 'old' code at the end just as you've posted it above, but you should remove the redundant second "RewriteEngine on".
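
Put together, the file would look roughly like the sketch below. It is just the rules posted above in the order described, with the second RewriteEngine line dropped. Nothing new is added, and www.example.com stands in for the real hostname.

```apache
Options +FollowSymLinks
RewriteEngine on
#
# Remove multiple slashes anywhere in URL
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://www.example.com%1/%2 [R=301,L]
#
# Remove trailing slash if filetype present in URL
RewriteRule ^(.+\.[^/]+)/$ http://www.example.com/$1 [R=301,L]
#
# Remove extra URL-path info if filetype present in URL
RewriteRule ^([^.]+\.[^/]+)/ http://www.example.com/$1 [R=301,L]
#
# Existing rule, kept last as posted: redirect the non-www host to www
RewriteCond %{HTTP_HOST} ^example.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

Going by LunaC's report earlier in the thread, the trailing-slash rule is already covered by the extra-path-info rule, so it could probably be dropped, though keeping it should be harmless.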

Jim

Wibfision

3:09 pm on Jul 20, 2006 (gmt 0)

10+ Year Member



Fantastic, thanks! My site has suffered from this same problem for ages and I didn't know what to do about it. I stumbled across a message from LunaC in another thread and now have this "problem" solved for me in 5 minutes flat. Hugs all round :-)