
disallow vs ban in .htaccess

     
6:12 pm on Dec 9, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:May 23, 2005
posts: 93
votes: 0


I would like to ban a bot which also has a specific IP address. Would the effect of disallowing the bot in robots.txt and the effect of banning its IP address in .htaccess be the same? Since I don't know whether this bot plays by the rules of robots.txt, I would prefer to use .htaccess.
6:31 pm on Dec 9, 2007 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14650
votes: 94


I would do the polite thing and disallow them in robots.txt and then block the user agent in .htaccess as well just to make sure they pay attention.
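For the robots.txt half, a minimal sketch, assuming the bot announces itself with the User-agent token "BadBot" (a placeholder name):

# robots.txt: a polite request, honored only by bots that choose to obey it
User-agent: BadBot
Disallow: /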
6:39 pm on Dec 9, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


If you block a bot by user-agent in .htaccess, that bot won't be able to fetch robots.txt either. So if it is a good bot that obeys robots.txt, then instead of fetching robots.txt and going away, it will try to fetch robots.txt, fail, assume it is okay to crawl your other URLs, and fail to get those as well. The result is increased load on your server, because it still has to deal with every one of the bot's requests.

Therefore the best way to deal with this problem is as follows:

1) disallow the bot in robots.txt and allow anyone to fetch that file
2) ban requests to all other URLs from bots that should have obeyed robots.txt (a rough sketch of this follows)
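A minimal sketch of step 2, assuming mod_rewrite is available and using "BadBot" as a placeholder user-agent:

RewriteEngine On
# let the bot fetch robots.txt, forbid everything else
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{HTTP_USER_AGENT} BadBot
RewriteRule .* - [F]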

6:50 pm on Dec 9, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:May 23, 2005
posts: 93
votes: 0


Thanks very much for your replies.
That's more complex than I thought. I will start with the robots.txt as you suggested and then I'll have to work out how to ban requests to all files EXCEPT robots.txt.
7:54 pm on Dec 9, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


If you're using Apache mod_access, then it's not difficult:

SetEnvIf Request_URI "^(403\.html|robots\.txt)$" allow-it
SetEnvIf User-Agent "larbin" bad-bot
SetEnvIf User-Agent "psycheclone" bad-bot
SetEnvIf User-Agent "Leacher" bad-bot
#
<Files *>
Order Deny,Allow
Allow from env=allow-it
Deny from env=bad-bot
Deny from 38.0.0.0/8
</Files>

If you're using mod_rewrite, then simply add a RewriteCond to your RewriteRule:

RewriteCond %{REQUEST_URI} !^(403\.html|robots\.txt)$
RewriteCond %{HTTP_USER_AGENT} larbin [OR]
RewriteCond %{HTTP_USER_AGENT} psycheclone [OR]
RewriteCond %{HTTP_USER_AGENT} Leacher
RewriteRule .* - [F]

In both examples, all user-agents (including banned user-agents) are allowed to fetch robots.txt and the custom 403 error page "403.html".


Jim

10:08 pm on Dec 9, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:May 23, 2005
posts: 93
votes: 0


Thanks for this, Jim. Just to make sure I have this right: for a specific IP address, would this be OK? (00.000... being the IP address):
RewriteCond %{REQUEST_URI}!^(403\.html|robots\.txt)$
RewriteCond %{HTTP_USER_AGENT} 00.000.00.00
RewriteRule .* - [F]
Thanks again!
10:31 pm on Dec 9, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Yes, except that there must be a space between "}" and "!" (which gets deleted by this forum unless steps are taken to prevent it) and also the REQUEST_URI pattern must start with a slash (which I forgot), i.e.

RewriteCond %{REQUEST_URI} !^/(403\.html|robots\.txt)$

The leading slash should also be added to the pattern in the example above using SetEnvIf.
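In other words, the first SetEnvIf line from the earlier example would become:

SetEnvIf Request_URI "^/(403\.html|robots\.txt)$" allow-it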

Jim
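If the goal is to match the client's IP address itself rather than the User-Agent string, mod_rewrite exposes it as %{REMOTE_ADDR}; a minimal sketch, with 192.0.2.1 as a placeholder address:

RewriteCond %{REQUEST_URI} !^/(403\.html|robots\.txt)$
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.1$
RewriteRule .* - [F]

With the mod_access approach, the equivalent is a Deny from line (e.g. Deny from 192.0.2.1) inside the <Files *> block.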

12:52 pm on Dec 10, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:May 23, 2005
posts:93
votes: 0


Thanks for spelling it all out for me! I have now set it up, but I'm not sure how to test whether it works (apart from waiting for the bot to return). I blocked my own IP address, but I could still access my site. Does this only work for bots, or have I done something wrong?
4:58 pm on Dec 10, 2007 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14650
votes: 94


OK, all you need to do is use some tool like this:
[rexswain.com...]

That lets you specify the page on your site and the user agent you wish to test.

Try this with both robots.txt and your index.html page and see what happens!
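If you have shell access, curl can run the same check from the command line; a quick sketch, assuming the blocked bot identifies itself as "BadBot" and your site is www.example.com (both placeholders):

# robots.txt should still return 200 OK for the blocked user agent
curl -I -A "BadBot" http://www.example.com/robots.txt
# any other page should return 403 Forbidden
curl -I -A "BadBot" http://www.example.com/index.html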

5:41 pm on Dec 10, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:May 23, 2005
posts:93
votes: 0


Fantastic! Looks like the bot is now blocked.
The only problem is that the custom error document does not seem to work here (it works in other cases, like forbidden directories). I'm getting this message:
Additionally, a 403 Forbidden error was encountered while trying to use an ErrorDocument to handle the request.
Any idea what could cause this?

My error documents in htaccess look like this:
ErrorDocument 403 /errors/403.htm
ErrorDocument 404 /errors/404.htm
ErrorDocument 500 /errors/500.htm
ErrorDocument 410 /errors/404.htm
(and I did change 'html' to 'htm' in Jim's code).

8:00 pm on Dec 10, 2007 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10553
votes: 12


maybe the RewriteCond should look more like this:
RewriteCond %{REQUEST_URI} !^/(errors/403\.htm|errors/404\.htm|errors/410\.htm|errors/500\.htm|robots\.txt)$


(assuming Document 410 was supposed to be /errors/410.htm)
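Putting the pieces together, the whole ruleset might look roughly like this (a sketch, reusing the error documents above and the user-agent names from the earlier examples, with a slightly looser pattern that exempts anything under /errors/):

ErrorDocument 403 /errors/403.htm
ErrorDocument 404 /errors/404.htm
ErrorDocument 500 /errors/500.htm
ErrorDocument 410 /errors/404.htm
#
RewriteEngine On
# let everyone fetch robots.txt and the custom error pages
RewriteCond %{REQUEST_URI} !^/(errors/.+\.htm|robots\.txt)$
RewriteCond %{HTTP_USER_AGENT} larbin [OR]
RewriteCond %{HTTP_USER_AGENT} psycheclone [OR]
RewriteCond %{HTTP_USER_AGENT} Leacher
RewriteRule .* - [F]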

11:10 am on Dec 16, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:May 23, 2005
posts:93
votes: 0


That did the trick! Thanks again to everyone.