
Sitemaps, Meta Data, and robots.txt Forum

disallow vs ban in .htaccess

 6:12 pm on Dec 9, 2007 (gmt 0)

I would like to ban a bot which also has a specific IP address. Would the effects of disallowing the bot in robots.txt and banning the IP address in htaccess be the same? As I don't know whether this bot plays by the rules of robots.txt I would prefer to use htaccess.



 6:31 pm on Dec 9, 2007 (gmt 0)

I would do the polite thing and disallow them in robots.txt and then block the user agent in .htaccess as well just to make sure they pay attention.
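For reference, a minimal .htaccess sketch of that belt-and-braces approach, using Apache 2.2-style access control. The user-agent string "BadBot" and the address 192.0.2.1 are placeholders; substitute the real bot's UA substring and IP:

```apache
# Block by user-agent and by IP (Apache 2.2 mod_access / mod_authz_host syntax)
# "BadBot" and 192.0.2.1 are placeholders for the actual bot's UA and address
SetEnvIfNoCase User-Agent "BadBot" bad-bot
Order Allow,Deny
Allow from all
Deny from env=bad-bot
Deny from 192.0.2.1
```

With `Order Allow,Deny`, the Deny directives are evaluated last, so anything matching either the environment variable or the IP gets a 403 while all other visitors are allowed.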

Lord Majestic

 6:39 pm on Dec 9, 2007 (gmt 0)

If you block the bot by user-agent in .htaccess, then the bot won't be able to fetch robots.txt either. So, if it is a good bot that obeys robots.txt, then instead of fetching robots.txt and going away, it will try to fetch robots.txt, fail, and assume it is okay to crawl your other URLs. Those requests will fail as well, but your server still has to handle every one of them, so the load stays up.

Therefore the best way to deal with this problem is as follows:

1) disallow bot in robots.txt and allow anyone to take this file
2) ban requests to all other urls from those bots that should have obeyed robots.txt
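As a sketch of step 1, assuming the bot announces itself with the (placeholder) user-agent token "BadBot", the robots.txt entry would be:

```
# robots.txt -- politely disallow the bot; "BadBot" is a placeholder UA token
User-agent: BadBot
Disallow: /
```

Step 2 then catches the bots that ignore this, using the .htaccess techniques discussed below.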


 6:50 pm on Dec 9, 2007 (gmt 0)

Thanks very much for your replies.
That's more complex than I thought. I will start with the robots.txt as you suggested and then I'll have to work out how to ban requests to all files EXCEPT robots.txt.


 7:54 pm on Dec 9, 2007 (gmt 0)

If you're using Apache mod_access, then it's not difficult:

SetEnvIf Request_URI "^(403\.html|robots\.txt)$" allow-it
SetEnvIf User-Agent "larbin" bad-bot
SetEnvIf User-Agent "psycheclone" bad-bot
SetEnvIf User-Agent "Leacher" bad-bot
<Files *>
Order Deny,Allow
Allow from env=allow-it
Deny from env=bad-bot
</Files>

If you're using mod_rewrite, then simply add a RewriteCond to your RewriteRule:

RewriteCond %{REQUEST_URI} !^(403\.html|robots\.txt)$
RewriteCond %{HTTP_USER_AGENT} larbin [OR]
RewriteCond %{HTTP_USER_AGENT} psycheclone [OR]
RewriteCond %{HTTP_USER_AGENT} Leacher
RewriteRule .* - [F]

In both examples, all user-agents (including banned ones) are allowed to fetch robots.txt and the custom 403 error page "403.html".




 10:08 pm on Dec 9, 2007 (gmt 0)

Thanks for this, Jim. Just to make sure I have this right - for a specific address would this be ok? (00.000... being the IP address):
RewriteCond %{REQUEST_URI}!^(403\.html|robots\.txt)$
RewriteCond %{HTTP_USER_AGENT}
RewriteRule .* - [F]
Thanks again!
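Since the question is about matching a specific IP address rather than a user-agent, the usual mod_rewrite variable for that is REMOTE_ADDR. A sketch, using the documentation-reserved placeholder address 192.0.2.1 (escape the dots, since the pattern is a regular expression):

```apache
RewriteCond %{REQUEST_URI} !^/(403\.html|robots\.txt)$
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.1$
RewriteRule .* - [F]
```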


 10:31 pm on Dec 9, 2007 (gmt 0)

Yes, except that there must be a space between "}" and "!" (which gets deleted by this forum unless steps are taken to prevent it) and also the REQUEST_URI pattern must start with a slash (which I forgot), i.e.

RewriteCond %{REQUEST_URI} !^/(403\.html|robots\.txt)$

The leading slash should also be added to the pattern in the example above using SetEnvIf.
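Applying that correction, the first line of the earlier SetEnvIf example would become:

```apache
SetEnvIf Request_URI "^/(403\.html|robots\.txt)$" allow-it
```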



 12:52 pm on Dec 10, 2007 (gmt 0)

Thanks for spelling it all out for me! I have now set it up but I'm not sure how to test if it works (apart from waiting for the bot to return). I disallowed my own IP address but I could still access my site - does this only work for bots or have I done something wrong?


 4:58 pm on Dec 10, 2007 (gmt 0)

OK, all you need to do is use some tool like this:

That lets you specify the page on your site and the user agent you wish to test.

Try this with both robots.txt and your index.html page and see what happens!
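One way to test this without a third-party tool is curl, which can send an arbitrary User-Agent header. The UA string and URLs below are placeholders; substitute the blocked bot's UA substring and your own domain:

```shell
# Request a page while pretending to be the blocked bot; print only the HTTP status
curl -s -o /dev/null -w "%{http_code}\n" -A "larbin" "http://www.example.com/index.html"
# robots.txt should still be reachable for the same UA if the exclusion works
curl -s -o /dev/null -w "%{http_code}\n" -A "larbin" "http://www.example.com/robots.txt"
```

If the rules are working, the first request should return 403 and the second 200.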


 5:41 pm on Dec 10, 2007 (gmt 0)

Fantastic! Looks like the bot is now blocked.
Only problem is that the custom error document does not seem to work here (works in other cases like forbidden directories). I'm getting this message:
Any idea what could cause this?

My error documents in htaccess look like this:
ErrorDocument 403 /errors/403.htm
ErrorDocument 404 /errors/404.htm
ErrorDocument 500 /errors/500.htm
ErrorDocument 410 /errors/404.htm
(and I did change 'html' to 'htm' in Jim's code).


 8:00 pm on Dec 10, 2007 (gmt 0)

maybe the RewriteCond should look more like this:
RewriteCond %{REQUEST_URI} !^/(errors/403\.htm|errors/404\.htm|errors/410\.htm|errors/500\.htm|robots\.txt)$

(assuming Document 410 was supposed to be /errors/410.htm)


 11:10 am on Dec 16, 2007 (gmt 0)

That did the trick! Thanks again to everyone.
