Forum Moderators: phranque


block user-agents from robots.txt

except googlebot

         

bubster119

11:32 pm on Oct 21, 2007 (gmt 0)

10+ Year Member



I've been searching through the posts in the forum, trying to work out an .htaccess ruleset which blocks Mozilla & Opera user-agents (specifically browsers) from accessing my robots.txt file directly.

Here's what I've got so far:

RewriteCond %{HTTP_USER_AGENT} ^(Mozilla|Opera)
RewriteCond %{HTTP_USER_AGENT} !^(Googlebot/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$ [OR]
RewriteCond %{HTTP_USER_AGENT} !^(Yahoo!\ )?Slurp;
RewriteRule ^robots\.txt$ /someotherfile [L]

My intention is to block all mozilla & opera user agents EXCEPT the bots which I choose to allow access to the robots file (yahoo and google).

Currently the code blocks EVERY user agent and bot, and I'm not sure why!

If anybody has any ideas I'd appreciate it.

Cheers

jdMorgan

1:45 pm on Oct 22, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your combinatorial logic is incorrect: Remove the [OR] flag from the second RewriteCond.

You want (Mozilla OR Opera) AND NOT (Googlebot) AND NOT (Slurp)

I would, however, throw an [NC] flag on the end of the user-agent RewriteConds, just in case they accidentally (or intentionally) change their capitalization.
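Putting those two changes together, the conditions then chain as a logical AND, like this (a sketch only: the unanchored Googlebot and Slurp patterns here are simplified stand-ins, not the full user-agent strings you'd use in production):

```apache
# Match browsers: (Mozilla or Opera) AND NOT Googlebot AND NOT Slurp
RewriteCond %{HTTP_USER_AGENT} ^(Mozilla|Opera) [NC]
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{HTTP_USER_AGENT} !Slurp [NC]
RewriteRule ^robots\.txt$ /someotherfile [L]
```

With no [OR] flag, all three conditions must be true at once for the RewriteRule to fire.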

What you're doing may be dangerous; I'd recommend adding *all* of the top first- and second-tier robots to the "allowed" list, unless your alternate robots.txt file is a completely-valid robots.txt file which is designed to exclude them.

Overall, this approach is marginally useful; the people you need to worry about are well aware of how to spoof user-agents and won't be much slowed down by this code, while at the same time you take on the work of keeping it properly maintained.

Generally, it is best practice to allow *all* clients to access robots.txt, even those whose IP addresses are banned from every other page on your site. The reason for this is simply that most robots treat a blank or inaccessible robots.txt file as carte blanche to spider your entire site. This exception can generally be included in the same code that is used to always allow access to your custom 403 error page (if you use one).
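For example, an unconditional pass-through rule placed *above* any banning rules takes care of both files (a sketch; /403.html is a hypothetical ErrorDocument path, so substitute your own):

```apache
# Always let robots.txt and the custom 403 page through,
# no matter what banning rules follow ("-" means no substitution)
RewriteRule ^(robots\.txt|403\.html)$ - [L]

# ... user-agent and IP banning rules go below this point ...
```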

Jim

bubster119

10:34 pm on Oct 22, 2007 (gmt 0)

10+ Year Member



Thanks Jim.

I hear what you're saying about the "carte-blanche" issues regarding blocking access to the actual robots.txt itself from all spiders. So I've just reinstated the original robots.txt file and made it accessible to all.

The aim of this exercise initially was just to prevent people from accessing the robots file directly - to provide an obstacle (albeit small) to discourage site leeching from the average Joe.

The more I've researched, the more I've come to realise that there is no surefire way to prevent this, and that if someone really wants to leech your site, there is very little you can do about it.

I'm a graphic designer by trade and have limited knowledge of this aspect of web development - I think I was looking for a "quick fix - ban everybody except x,y,z" solution for my site, but it's beginning to dawn on me that this may be unrealistic.

From what I can establish, it is more a case of keeping an eye on your logs and banning suspect bots as they come along: more a long-term maintenance thing than one blanket solution.

Do you find that this is the case?

jdMorgan

12:02 am on Oct 23, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Log-watching and banning by IP is one way to do it, but it's only practical for small sites. There's also a danger of becoming somewhat obsessive about it; in the long run, it can be a huge time-waster.

In addition to monitoring your logs and banning IP addresses and address ranges, you can also lay bad-bot traps, detect bad bots by behaviour, and screen for known-problematic user-agents, invalid user-agents, and problematic IP address ranges. Our Perl scripting forum library contains several posts about a bad-bot trap script, and our PHP forum library has several about a behavioural detection and banning script.

To the extent possible, my advice is to make the computers do the work... :)

Jim

bubster119

9:21 am on Oct 23, 2007 (gmt 0)

10+ Year Member



Thanks again Jim,

The site is for personal and self-promotion purposes, and I'm happy for visitors to find it only through Google or Yahoo.
Realistically, most of my traffic will probably come from direct input of the URI anyway, so I'm quite happy for it not to be listed in all search engines.

Could you see any problems with me creating code which bans all bots from my whole site (except robots.txt and error pages) apart from Slurp and Googlebot, sending all the banned bots a 404? (I think 404 would be the right one.)

I know I would have to be vigilant about any changes in the Googlebot or Slurp naming conventions, but that seems like less work for a site which doesn't need to be widely indexed.

I believe that some bots disguise themselves as browsers, so my bot ban wouldn't stop those, but surely this kind of approach would block the bulk of them?
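
In case it helps to be concrete, the sort of ruleset I have in mind is something like this (an untested sketch; the bot-detection pattern is deliberately crude, 403.html is a placeholder for my own error page, and I gather [R=404] needs a reasonably recent Apache, with [F] sending a 403 instead):

```apache
RewriteEngine On

# Always allow robots.txt and the error page through
RewriteRule ^(robots\.txt|403\.html)$ - [L]

# Skip the ban for the bots I want to keep (Googlebot and Slurp)
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp) [NC]

# Crude robot test: most bots announce a URL, an e-mail address,
# or a tell-tale word in their user-agent string
RewriteCond %{HTTP_USER_AGENT} (http://|@|bot|crawl|spider) [NC]

# Answer everything else that looks like a bot with a 404
RewriteRule .* - [R=404,L]
```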