another .htaccess/bot question

amerrychase

7:06 am on Feb 7, 2010 (gmt 0)

10+ Year Member



Hi-
Newbie here, trying to get some respect from the bots. (And failing.) I've set up my .htaccess like so:
RewriteCond %{HTTP_USER_AGENT} ^**bot**
RewriteRule ^.* - [F,L]


and my robots.txt thusly:
User-agent: *
Disallow:/public_html/botsv/
Disallow:/cgi-bin/
Disallow:/forms/
etc.


I think that's the right way to do it. Anyway, my actual question is about the access logs. If a bot I've disallowed tries to access something, should it generate a 403 error? I've been getting 500 errors instead. One of them tried again a few hours after I updated .htaccess, and here's the log:
Http Code: 500
Date: Feb 06 23:56:48
Http Version: HTTP/1.1
Size in Bytes: 788
Referer: -
Agent: **robot I've banned**


Good, bad, indifferent?
Thanks!

SteveWh

7:57 am on Feb 7, 2010 (gmt 0)

10+ Year Member



In robots.txt, I think you should have at least a space after each colon, but I don't know whether that's critical.
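
(For example, Disallow: /cgi-bin/ rather than Disallow:/cgi-bin/.)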

Your regular expression ^**bot**, depending on the name of the bot, might not need the ^, and it certainly shouldn't contain the **, but I'm assuming those are just placeholders for the bot's name.

The RewriteRule's pattern doesn't need the ^ either, but that's also noncritical.

It may be possible to get a 500 instead of a 403, depending on the server's configuration in places you have no access to (if you're on a shared server). Maybe try this line near the top of your .htaccess as an experiment:

ErrorDocument 403 "Forbidden"

What it will do is serve a plain-text page (its content being just the word Forbidden) instead of searching the server for a 403 error page. If that gets rid of the 500s, it would suggest that the server's search for an appropriate 403.shtml or equivalent error page is what's causing the 500 errors.
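
A sketch of the placement, with "badbot" standing in as a placeholder for the real user-agent name:

# near the top of .htaccess: serve a literal one-word 403 body,
# so the server never goes searching for a custom 403 error page
ErrorDocument 403 "Forbidden"

# existing blocking rules ("badbot" is just a placeholder)
RewriteCond %{HTTP_USER_AGENT} badbot
RewriteRule ^.* - [F,L]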

Your cPanel > Error Log might or might not give some additional info about how the 500s are getting generated.

wilderness

10:20 am on Feb 7, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



amerrychase,
If you're getting 500 errors because of invalid syntax, then it's likely that ALL visitors are seeing this error (and not gaining access to your site(s)), rather than just the bots.

You really need to analyze your error logs, as Steve suggested, to determine the cause of the 500s.

BTW, I'm sure you only provided an abbreviated version of your .htaccess as an example, and as a result you have Steve a bit confused.
I would suggest caution in copying and pasting large portions of another's .htaccess file into one on your own site(s):
1) many people who provide examples on a web page don't even understand the procedures themselves;
2) these files may be tightly focused on individual configurations. (Using the file of somebody who wants to filter out more traffic than you do could certainly prove detrimental to your site or its customers.)

Furthermore, robots.txt cannot cause 500 errors; those come from .htaccess or the server configuration.

jdMorgan

2:16 pm on Feb 7, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The most likely problem is that you have implicitly banned the robot from fetching both your robots.txt file and your custom 403 error document (if you have one).

If it tries to fetch robots.txt, it'll get a 403. When the server tries to serve the custom 403 error document, that will result in another 403. So, as a result of this second 403, it will try again to serve the custom 403 error document, resulting in yet another 403... this will continue until the server gives up and throws a 500-Server Error.

You should allow *all* user-agents to fetch these two documents -- and any objects (css files, images, etc.) that are "included" in your custom 403 error page as well -- the number of which should be kept to an absolute minimum, preferably zero, BTW.

robots.txt:
User-agent: bad-bot-nickname
Disallow: /

User-agent: *
Disallow: /public_html/botsv/
Disallow: /cgi-bin/
Disallow: /forms/


Include the trailing blank line as shown.


.htaccess:

Options +FollowSymLinks -MultiViews
RewriteEngine on
#
# Deny access to all resources except robots.txt and custom 403 error document
RewriteCond %{HTTP_USER_AGENT} bad-bot-user-agent-string
RewriteCond %{REQUEST_URI} !^/(robots\.txt|path-to-your-custom-403-error-page\.html)$
RewriteRule ^ - [F]

I included the two 'setup' directives usually required for use of mod_rewrite. If these are already present in your .htaccess file, then do not include this 'extra copy' of them.

[L] used with [F] is redundant.

Note the use of two different terms for the bad bot's "name": the "identifier" required in the robots.txt file is almost always different from the full user-agent string seen in your raw access log file, so I wanted to at least hint at that fact in the code.

You could put the exclusions for robots.txt and the custom 403 error document in the RewriteRule's pattern. I showed this exclusion as a separate RewriteCond primarily for reasons of clarity. You could delete that second RewriteCond and change the rule to

RewriteRule !^(robots\.txt|path-to-your-custom-403-error-page\.html)$ - [F]

which would be a bit faster.

Note that some Webmasters inadvertently declare a custom 403 error page by ticking a box in their control panel. This typically results in the file "/403.shtml" being declared as the custom 403 error page. However, ticking the box does not create this document, so a 404 will result if any attempt is made to fetch it. Therefore, we often see 403-404-403-404-403-500 error loops as well.
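
If that's what's happening, the simplest fix is to point the 403 handler at a page that actually exists. A sketch (the path below is a hypothetical example):

# declare a custom 403 page that really exists on disk
ErrorDocument 403 /errors/403.html

And remember to list that same path in the RewriteCond exclusion shown above, so that all user-agents can fetch it.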

Jim

amerrychase

5:29 pm on Feb 7, 2010 (gmt 0)

10+ Year Member



I guess I'm not clear on the difference between the name and the user-agent string. GingerCrawler/1.0 has been giving me some trouble. I had it named the same in both robots.txt and .htaccess. Do I actually need to include the whole string from the log file, which is this?
gingercrawler/1.0 (Language Assistant for Dyslexics; www.gingersoftware.com/crawler_agent.htm; support at ginger software dot com)

And if I have to include the whole string, what can I do about robots with really generic ones?

It was sort of working before, in that only the bots I put in .htaccess were getting a 500, but now I'm getting one, too. Mmph.
Anyway, thanks for all your help!

wilderness

6:44 pm on Feb 7, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And if I have to include the whole string


#to stop this pest and numerous others
RewriteCond %{HTTP_USER_AGENT} Crawl [NC]
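
(A RewriteCond does nothing by itself; a sketch of how that line would pair with a blocking rule, assuming you aren't also excluding robots.txt or an error page:)

# "Crawl" matches GingerCrawler and many other crawlers, case-insensitively
RewriteCond %{HTTP_USER_AGENT} Crawl [NC]
RewriteRule ^ - [F]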

jdMorgan

10:00 pm on Feb 7, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> do I actually need to include the whole string from the log file...?

No, you just need to be aware that they are different.

You may Disallow Googlebot, for example. In robots.txt, that would be
User-agent: Googlebot
But if you wanted to Deny access in .htaccess, it would be wise to include more than just that, so that some other user-agent which contains "Googlebot" as a substring, such as "Better-than-Googlebot/1.0", wouldn't also get blocked. (I'm stretching to provide a "familiar" example at the expense of a realistic one, here.)
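
A hypothetical sketch of that idea: require the name to start the string or follow whitespace, so "Better-than-Googlebot/1.0" wouldn't match:

# match "Googlebot/" only at the start of the UA string or after whitespace
RewriteCond %{HTTP_USER_AGENT} (^|\s)Googlebot/
RewriteRule ^ - [F]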

Most 'good' robots will tell you what their 'nicknames' are on their "Webmaster Help" pages describing robots.txt. And of course you see the full user-agent string in your raw server logs.

Jim