What is the request and why is it different?

www.example.com/ and www.example.com-

         

StupidScript

8:25 pm on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have a PHP script that detects certain scrapers and sends a 404 error in response to their requests.

In my logs (Apache 2, RH 7.3), I see the script is effective, but I also see a potential problem.

Here is my question:

Why does this logged request:

www.example.com/

return a 404, as I intend, while this logged request:

www.example.com-

returns a 302?

What causes the second type of request log entry? Is it simply the difference between requests for http://www.example.com/ (first example) and http://www.example.com (second example)?

The problem is that the PHP script apparently does not run for the second kind of request. How can I cover that request too, without blocking every such request regardless of its origin?
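
One way to see how the two kinds of entries differ is to tally them by status code. Here is a minimal sketch in Python (not the PHP script itself). It assumes a simplified, hypothetical log layout of `<logged URL> <status> <user agent>`; the real LogFormat is not shown here, so the sample lines and the name `tally_by_ending` are made up purely for illustration:

```python
from collections import Counter

# Hypothetical sample lines: "<logged URL> <status> <user agent>".
# This layout is an assumption; adjust the field positions to the real LogFormat.
SAMPLE_LOG = """\
www.example.com/ 404 bot-a
www.example.com- 302 bot-a
www.example.com/ 404 bot-b
www.example.com- 302 bot-c
www.example.com/ 200 browser
"""


def tally_by_ending(log_text):
    """Count (last character of the logged URL, status) pairs."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        url, status = fields[0], fields[1]
        counts[(url[-1], status)] += 1
    return counts


if __name__ == "__main__":
    for (ending, status), n in sorted(tally_by_ending(SAMPLE_LOG).items()):
        print(f"ends with {ending!r}: status {status} x{n}")
```

Running the same tally against the real log (with the field positions adjusted) would show whether the trailing hyphen always pairs with a 302.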

TIA.

jdMorgan

8:58 pm on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The problem is basically that the second request is invalid -- it should have a trailing slash. However, since it's a common error, Apache's mod_dir is typically used to 'fix' these requests. You might want to check whether mod_dir is installed and loaded on your server.

Also, there are many threads here concerning the trailing slash problem -- most resolved, but some not. They may give you some ideas to test and narrow down the problem.

Jim

StupidScript

11:06 pm on Jan 19, 2006 (gmt 0)


Thank you for the response.

I've been reading many of the 'trailing slash' problem threads, but I don't understand *why* there is an issue at all.

I type in http://www.example.com and receive http://www.example.com/index.php. The same is true for entering http://www.example.com/. In my log, both requests end up being for the one with the trailing slash. Also, in either case, as the page is loading, my browser shows the URL with the trailing slash (no file name in either case) ... whether I leave it off or not.

What kind of request would generate an error that results in a request for http://www.example.com-?
Is my browser 'correcting' the address before sending the request? (IE6 & FF1.5)

<edit>Sorry ... I meant to say that my server has no problem handling either request, trailing slash or not; I'm just trying to understand why any request would end with the hyphen, as noted. The hyphen only appears when the request is for the root domain address.

I am wondering what kind of request causes the trailing hyphen in my log.

Note that the request with the trailing hyphen results in a 302, whereas the request with the trailing slash gets processed by my PHP script and results in a 404. Thanks!</edit>
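
On the browser question: as far as I understand HTTP/1.1, the path on the request line cannot be empty, so a client asked for http://www.example.com sends "GET /" just as it does for http://www.example.com/. If that is right, the two URLs are identical by the time they reach the server. A minimal sketch in Python of that normalization (the helper names `request_target` and `request_line` are my own, not part of any library):

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the path (plus query, if any) a client puts on the request line."""
    parts = urlsplit(url)
    path = parts.path or "/"  # an empty path is sent as "/"
    if parts.query:
        path += "?" + parts.query
    return path


def request_line(url, method="GET", version="HTTP/1.1"):
    """Build the first line of the HTTP request for the given URL."""
    return f"{method} {request_target(url)} {version}"


if __name__ == "__main__":
    for url in ("http://www.example.com", "http://www.example.com/"):
        print(url, "->", request_line(url))
```

Both URLs produce the same request line, so a hyphen in the log is unlikely to come from a missing slash typed into a browser.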

jdMorgan

11:33 pm on Jan 19, 2006 (gmt 0)


Actually, I didn't realize that the hyphen was part of the requested URL -- I thought it was part of your message mark-up.

You might want to investigate the IP addresses and hostnames of those user-agents requesting a hyphenated domain -- I'm actually surprised it resolves to your site at all -- and see if they look like they might be in IP ranges known for foul play.
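
A rough sketch in Python of that kind of check, assuming a forward-confirmed reverse DNS lookup is what you want: look up the hostname for the IP, then confirm the hostname resolves back to the same IP. The function name `forward_confirmed_rdns` and its `reverse`/`forward` parameters are my own invention; they default to the standard `socket` lookups and exist only so the logic can be tried without a network.

```python
import socket


def forward_confirmed_rdns(ip,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return (hostname, confirmed) for an IP address.

    confirmed is True only when the hostname found by the reverse lookup
    resolves back to the same IP. Lookup failures give (None, False) or
    (hostname, False).
    """
    try:
        hostname = reverse(ip)[0]  # (hostname, aliases, addresses)
    except OSError:
        return None, False  # no reverse (PTR) record or lookup error
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return hostname, False  # hostname does not resolve forward
    return hostname, ip in addresses
```

A bot claiming to be a well-known crawler whose IP does not confirm this way deserves a closer look; calling `forward_confirmed_rdns("192.0.2.10")` with an address from your own log performs the live lookups.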

Jim

StupidScript

11:50 pm on Jan 19, 2006 (gmt 0)


Thank you, once again.

These log entries (with the trailing hyphen) come exclusively from bots of some kind. I see them from known bots (Googlebot et al.) and from automatons of a different nature (voyager, cfetch, et al.).

I suspect you are correct in that the Apache log is indicating a bit of bad data, as it does for other elements in the log string (like %{SID}e). My guesses are currently leaning toward a bad bot configuration ... maybe it does that with a blank line at the end of an old array of file names to target, or something.

Since 302 means 'The requested resource resides temporarily under a different URI', I'm guessing the trailing hyphen log entry is the result of an outdated URL being probed, and the server tries to resolve to the current URI ... which ends up giving the bot a 404 when they finally get there.

I dunno. Always learning ...

Thanks again!