Server not reporting 404 error for malformed URLS

I'm sure there's an answer

         

jonrichd

6:55 pm on May 2, 2005 (gmt 0)

10+ Year Member



Both Yahoo and MSN have listed a non-existent page on my site in the form of

www.domain.com/file.htm/file

I do have a www.domain.com/file.htm on the site, and what I see when I go to the malformed URL is file.htm's contents, although the style sheet and other relatively linked resources don't load. A server header checker reports a 200 OK response. (Oddly enough, this will also happen if you try it with a file on WebmasterWorld.)

I tried the same kind of request on a site running a different Apache server, and got the proper 404 error.

Does anybody have any ideas for configuring Apache to get a proper 404 error in this situation?

jdMorgan

2:00 am on May 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, this is a quirk of how the server maps URLs to files. Once the path resolves to an existing file such as file.htm, the slash and anything after it are not interpreted as a continuation of the filepath; Apache treats them as extra path information (PATH_INFO) and still serves the file, so the response is 200 OK.

The work-around is fairly ugly, and relies on using Apache mod_rewrite [httpd.apache.org] to rewrite the request to another file that is known not to exist.


Options +FollowSymLinks
RewriteEngine on
RewriteRule \.[^/]+/ /file_that_does_not_exist [L]

The RewriteRule pattern looks for any occurrence of a "." followed by one or more non-slash characters and then a slash, and so should catch any URL of the form <anything>.<something>/<anything>. The request is then rewritten to a file that does not exist, which should invoke your 404 error handler. Since this is an internal rewrite (as opposed to an external redirect), the URL-path to the nonexistent file is not "published," so the search engine won't index it.
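To see which requests the pattern would catch before touching the server, it can be tried out offline. The following is only a rough sketch in Python, not part of the Apache configuration: the function name and the sample paths are made up for illustration, and it assumes the rule lives in a top-level .htaccess, where the path is matched without its leading slash.

```python
import re

# Same pattern as the RewriteRule above: a ".", then one or more
# non-slash characters, then a "/".
MALFORMED = re.compile(r"\.[^/]+/")


def is_malformed(url_path: str) -> bool:
    """Return True if the rule's pattern would match this URL-path.

    Hypothetical helper for illustration; the path is assumed to have
    no leading slash, as in per-directory (.htaccess) context.
    """
    return MALFORMED.search(url_path) is not None


if __name__ == "__main__":
    # Sample paths are assumptions, modeled on the URLs in this thread.
    for path in ["file.htm", "file.htm/file", "dir/file.htm",
                 "dir/file.htm/extra/bits"]:
        print(path, "->", "caught" if is_malformed(path) else "left alone")
```

Note that a bare trailing slash, as in file.htm/, is enough to trigger the pattern; nothing has to follow it.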

You might also consider doing a "fix-up" redirect on this type of URL:


Options +FollowSymLinks
RewriteEngine on
RewriteRule ^(.+\.[^/]+)/ http://www.example.com/$1 [R=301,L]

This will tell any search engine that requests these malformed URLs to remove them from their index and replace them with the correct URL, rather than responding with a 404. This might be useful if the malformed URL is from a "popular" site and you don't want to lose the credit for the link.
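To check where the fix-up redirect would send a given request, the capture group can be exercised the same way. Again, this is just a sketch under assumptions: the helper name is hypothetical, www.example.com is the placeholder host from the rule above, and paths are given without a leading slash as in .htaccess context.

```python
import re
from typing import Optional

# Same pattern as the fix-up RewriteRule: capture everything up to and
# including the ".<something>" segment that is followed by a slash.
FIXUP = re.compile(r"^(.+\.[^/]+)/")

# Placeholder host taken from the example rule.
BASE = "http://www.example.com/"


def fixup_target(url_path: str) -> Optional[str]:
    """Return the 301 target for a malformed path, or None if the
    rule would not fire. Hypothetical helper for illustration."""
    match = FIXUP.search(url_path)
    if match is None:
        return None
    return BASE + match.group(1)


if __name__ == "__main__":
    for path in ["file.htm/file", "dir/file.htm/extra/bits", "file.htm"]:
        print(path, "->", fixup_target(path))
```

One thing the sketch makes visible: because ".+" is greedy, a path with more than one dotted segment, such as file.htm/a.b/c, is trimmed only at the last one (to file.htm/a.b), so it would take a second redirect to reach file.htm.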

Whether either of these works without interfering with your other URLs depends on whether you have any legitimate URLs where a slash follows a filetype. Since you didn't mention any, I assume not.
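If you are not sure, one way to find out is to run your list of legitimate URL-paths through the same pattern and see what gets flagged. A minimal sketch, assuming you can export such a list (from a sitemap, say); the function name and the sample paths below are invented for illustration.

```python
import re

# Pattern shared by the 404 and 410 rules above.
MALFORMED = re.compile(r"\.[^/]+/")


def conflicting_paths(known_good_paths):
    """Return the legitimate paths that the rule would wrongly catch.

    Hypothetical helper; paths are assumed to have no leading slash.
    """
    return [p for p in known_good_paths if MALFORMED.search(p)]


if __name__ == "__main__":
    # Invented sample site; directory names containing a "." are the
    # usual source of false positives.
    site_paths = [
        "index.html",
        "docs/v1.2/guide.html",
        "images/logo.png",
        "blog.old/post.htm",
    ]
    for path in conflicting_paths(site_paths):
        print("would be blocked:", path)
```

Any path this prints would need an exclusion (for example, a RewriteCond ahead of the rule) before the rule is safe to deploy.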

You may not need one or both of the first two lines of either example; they're only needed if these configurations are not already set in httpd.conf or your current .htaccess file.

[added]
A third alternative is to use the 410-Gone response, but this is only useful for HTTP/1.1 (and later) clients:


Options +FollowSymLinks
RewriteEngine on
RewriteCond %{HTTP_HOST} .
RewriteRule \.[^/]+/ - [G]

[/added]

Jim

jonrichd

12:00 am on May 4, 2005 (gmt 0)

10+ Year Member



Jim, you are on the money as usual. I've learned a lot from your posts; if I see you in New Orleans, I'll buy the beer.

- Jon