In the spirit of this classic thread on the methods used to prevent crawler activity on a site:
[webmasterworld.com]
What do you guys do?
Clearly, the bots you want least are the least likely to behave themselves and do what a robots.txt file says, so do you exclude them all in .htaccess as well?
If so, why do both?
What are the pros and cons of each approach?
Help much appreciated.
Steve
Here's one way to do it:
Use robots.txt for "good" 'bots to control access to certain resources, e.g. the /cgi-bin directory (see the robots.txt sketch after these steps).
Use robots.txt for unknown 'bots to Disallow access.
Use .htaccess for "bad" 'bots and unknowns to block access (an .htaccess sketch appears at the end of this post).
If an unknown 'bot violates robots.txt, remove that 'bot from robots.txt, leaving it in .htaccess.
If an unknown 'bot behaves itself, remove it from .htaccess, and allow it controlled access using robots.txt if you so desire.
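To make the robots.txt half of this concrete, here is a minimal sketch; the User-agent names are made up for illustration:

    # Known "good" 'bot: may crawl, but is asked to stay out of /cgi-bin
    User-agent: ExampleGoodBot
    Disallow: /cgi-bin/

    # Unknown 'bot on probation: asked to stay out entirely
    User-agent: ExampleUnknownBot
    Disallow: /

    # Everyone else: keep out of /cgi-bin
    User-agent: *
    Disallow: /cgi-bin/

A compliant 'bot obeys the record whose User-agent line best matches its own name, falling back to the "*" record, so the unknown 'bot above is asked to stay out entirely while everyone else only loses /cgi-bin.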
You may choose to initially allow unknown 'bots, classing them with the "good" 'bots. The above assumes that the unknown 'bots are truly unknown, and that you can't find out anything about them. In that case, it may be better to assume they are bad until you see them fetch robots.txt and obey it by staying out.
Think of robots.txt directives as "requests" that only good 'bots will respect. Think of access restrictions in .htaccess as imperatives -- the .htaccess rules are enforced by the server on every request, and so cannot be ignored by the given User-agent, no matter how it behaves.
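For the .htaccess half, a minimal sketch using mod_setenvif plus the standard allow/deny directives; the User-agent strings are again hypothetical:

    # Flag requests whose User-agent matches a known-bad or unknown 'bot
    SetEnvIfNoCase User-Agent "ExampleBadBot" block_bot
    SetEnvIfNoCase User-Agent "ExampleUnknownBot" block_bot

    # Deny flagged requests; everyone else gets through
    Order Allow,Deny
    Allow from all
    Deny from env=block_bot

The Order/Allow/Deny syntax is Apache 2.2-style; on Apache 2.4 the same effect comes from Require directives, but the principle is identical: the server refuses the request before any content is served, whether or not the 'bot ever looked at robots.txt.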
Jim