Problems with SEs 404 checking

         

Robber

10:32 am on Oct 12, 2004 (gmt 0)

10+ Year Member



Hi,

Just been looking through our access log and was surprised to see an entry:

/a-file-without-the-extension/qwer/erty/dfgh/zxcv/hjk/cvbn/rty/yuoi.htm

I figured this was just msnbot checking that our server is configured to give 404s properly, which I thought it was, so I wasn't surprised to see this entry until I saw it had a 200 status.

I checked other pages that don't work and they 404 properly.

It looks like this is causing problems because the first folder in the request actually exists as an .htm file, so it's as if this is confusing the server. Is there some way to tell Apache not to try to be clever and assume something is a folder if it doesn't have an extension?

Cheers

[edited by: Robber at 10:34 am (utc) on Oct. 12, 2004]

Robber

10:34 am on Oct 12, 2004 (gmt 0)

10+ Year Member



I should have also added that when tested in a browser I can see that it is trying to load the page
/a-file-without-the-extension.htm

It doesn't display properly because all the relative paths are screwed.

Thanks

mincklerstraat

10:45 am on Oct 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My guess is that you're using mod_rewrite and it's directing all requests matching a not-specific-enough pattern (like this one) back to yuoi.htm. I've had this problem before, too. You might want to look into making your matching more specific: instead of (.*)/yuoi.htm, use ([^/]*)/yuoi.htm, where [^/]* matches any run of characters except the forward slash.
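To make the difference concrete, here is a rough sketch. It assumes the rule lives in a per-directory .htaccess file (where the leading slash is stripped before matching), and the target page.php and its section parameter are made up for illustration, not taken from this site:

```apache
RewriteEngine On

# Too loose: (.*) also matches slashes, so a deep path such as
# a/b/c/yuoi.htm is rewritten as well.
# RewriteRule ^(.*)/yuoi\.htm$ /page.php?section=$1 [L]

# Tighter: [^/]* cannot cross a slash, so only a single path
# segment before /yuoi.htm matches (e.g. a/yuoi.htm).
RewriteRule ^([^/]*)/yuoi\.htm$ /page.php?section=$1 [L]
```

With the tighter pattern, a request for a/b/c/yuoi.htm no longer matches the rule and falls through to whatever the server would normally do with it.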

This probably isn't msnbot checking your server to see if it does good 404s - I doubt msnbot really gives a whee. SE spiders get confused easily and try to hit wrong URLs all the time. The problem is that they can end up indexing whole sub-sites of your site that are identical to the main site, with zillions of duplicate content flags.

Robber

10:46 am on Oct 12, 2004 (gmt 0)

10+ Year Member



Looks like it's to do with content negotiation; just trying to find out how to disable this.

Robber

10:55 am on Oct 12, 2004 (gmt 0)

10+ Year Member



Hi mincklerstraat,

Thanks for the input.

We don't use much mod_rewrite on this site; this is it really:
RewriteCond %{HTTP_HOST} !^www\.abc\.co\.uk
RewriteRule ^(.*)$ http://www.abc.co.uk$1 [R=301,L]

I am pretty sure the SEs have admitted to running 404 checkers in the past.

I think this content negotiation is the culprit, but I don't know how to turn it off yet!

Thanks

jdMorgan

12:51 pm on Oct 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The directive

Options -MultiViews

will disable content negotiation via MultiViews.
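As a minimal sketch of where it could go: the first form assumes the site uses .htaccess files and that the server allows overriding Options there; the second assumes access to the main server config, and the directory path in it is purely hypothetical.

```apache
# Option A: in the site's .htaccess file
# (only takes effect if AllowOverride includes Options or All)
Options -MultiViews

# Option B: in the main server config instead
# (the path below is a placeholder, not this site's real docroot)
<Directory "/path/to/docroot">
    Options -MultiViews
</Directory>
```

The leading minus removes MultiViews from whatever options are already in effect for that directory, leaving the others untouched.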

Jim

Robber

1:44 pm on Oct 12, 2004 (gmt 0)

10+ Year Member



Thanks jd, that is exactly what I was hoping someone would post. I had tried things like Options MultiViews Off

But obviously I made that up! Works like a dream, thanks again.