Forum Moderators: phranque


Blocking bad robots.

         

bhuether

1:05 am on Mar 13, 2008 (gmt 0)

10+ Year Member



I keep seeing entries like this in my access log:

208.36.144.9 - - [12/Mar/2008:17:46:28 -0700] "GET /index.php?module=Members%20List&func=view&startnum=13601 HTTP/1.0" 500 590 "-" "Mozilla/5.0 (Twiceler-0.9 [cuill.com...]

But in htaccess I have

RewriteCond %{HTTP_USER_AGENT} Twiceler*
RewriteRule ^.* - [F,L]

How is it possible that this isn't working? When I go here:

[wannabrowser.com...]

and paste in the above user-agent string, it does indeed trigger the 403 error. Just doesn't make sense...

jdMorgan

2:38 am on Mar 13, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You are getting a 500 Server Error rather than a 403. The most likely cause is that you're using a custom 403 ErrorDocument, but you have made no provision to allow it to be served to a bad bot. Because of this, the request for the error document itself generates a second 403, which makes the server try to serve the custom 403 error document again, generating yet another 403. Once this recursion happens about ten times in a row, the server gives up and errors out with a 500 Server Error.

I suggest that you always allow both your robots.txt and your custom 403 error page to be served, no matter what the requesting IP address, hostname, referrer, or user-agent is. This can be arranged most simply by prefacing your access-control code with an exception like this (I optimized your code slightly as well):


# Skip all following rules for robots.txt and 403 error page requests
RewriteRule (robots\.txt|403-error-page\.html)$ - [L]
#
RewriteCond %{HTTP_USER_AGENT} Twiceler
RewriteRule .* - [F]

The reason it's desirable to allow all comers to fetch robots.txt is that some primitive 'bots -- even those that might be "good bots" -- may interpret a 403 error on robots.txt as carte blanche to spider your entire site. So they keep trying to fetch things and get denied again and again, which wastes a whole lot of bandwidth and pollutes your log and stats files. You might also say, "Fair is fair": if you're going to deny access to robots that read and obey robots.txt, at least give them the courtesy of letting them read the robots.txt file to find that out. :)
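For context, here is a rough sketch of how the pieces might fit together in a single .htaccess file. The ErrorDocument path and the error-page filename are illustrative only (not from the original post); adjust them to match your own setup:

```apache
# Tell Apache which page to serve on a 403 (path is an example)
ErrorDocument 403 /403-error-page.html

RewriteEngine On

# Always allow robots.txt and the 403 page itself through, so the
# ErrorDocument can be delivered without triggering another 403
RewriteRule (robots\.txt|403-error-page\.html)$ - [L]

# Deny everything else to the bad bot
RewriteCond %{HTTP_USER_AGENT} Twiceler
RewriteRule .* - [F]
```

The key point is the ordering: the exception rule with [L] must come before the blocking rule, so requests for the error page stop processing there and never reach the [F] rule.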

Jim

wilderness

4:05 am on Mar 13, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteCond %{HTTP_USER_AGENT} Twiceler*

What's the trailing asterisk for?
Likely the reason for your 500, which, BTW, takes your entire site down.

jdMorgan

4:34 pm on Mar 13, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No, that won't cause a server error. It will simply allow the pattern to match Twicele, Twiceler, Twicelerr, or Twicelerrrrrrrrrrr, since the "r*" sequence means, "match zero or more r's".
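To illustrate that quantifier outside Apache, here is a quick sketch in Python's re module (mod_rewrite patterns are PCRE-style regular expressions, and this simple pattern behaves the same way in both):

```python
import re

# "r*" means "zero or more r's", so the trailing asterisk only makes
# the final "r" optional/repeatable -- it is not a shell-style wildcard.
pattern = re.compile(r"Twiceler*")

print(bool(pattern.search("Twicele")))        # True (zero r's)
print(bool(pattern.search("Twiceler")))       # True
print(bool(pattern.search("Twicelerrrrrr")))  # True
print(bool(pattern.search("Twicel")))         # False (missing the "e")
```

In other words, the asterisk is harmless but pointless here; a bare "Twiceler" matches exactly the same user-agent strings.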

The server error is most likely the result of a 403-Forbidden loop, as described in some detail above, and is a very common problem.

Jim