disallow vs ban in .htaccess

   
6:12 pm on Dec 9, 2007 (gmt 0)

10+ Year Member



I would like to ban a bot that also has a specific IP address. Would disallowing the bot in robots.txt and banning its IP address in .htaccess have the same effect? Since I don't know whether this bot plays by the rules of robots.txt, I would prefer to use .htaccess.
6:31 pm on Dec 9, 2007 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I would do the polite thing and disallow them in robots.txt, and then block the user-agent in .htaccess as well, just to make sure they pay attention.
6:39 pm on Dec 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you block the bot by user-agent in .htaccess, then the bot won't be able to fetch robots.txt either. So if it is a good bot that obeys robots.txt, then instead of reading robots.txt and going away, it will try to fetch robots.txt, fail, and assume it is okay to crawl your other URLs. It will fail to get those too, but your server still has to deal with every one of its requests, so the load on your server goes up.

Therefore the best way to deal with this problem is as follows:

1) disallow the bot in robots.txt, and allow everyone to fetch that file (a minimal example follows below)
2) ban requests to all other URLs from bots that should have obeyed robots.txt
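
For step 1, the robots.txt entry would look something like this, with "ExampleBot" standing in for the bot's actual user-agent token:

# "ExampleBot" is a placeholder; use the token the bot sends in its User-Agent header
User-agent: ExampleBot
Disallow: /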

6:50 pm on Dec 9, 2007 (gmt 0)

10+ Year Member



Thanks very much for your replies.
That's more complex than I thought. I will start with the robots.txt as you suggested and then I'll have to work out how to ban requests to all files EXCEPT robots.txt.
7:54 pm on Dec 9, 2007 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



If you're using Apache mod_access, then it's not difficult:

# Mark the files that everyone, even banned clients, may fetch
SetEnvIf Request_URI "^(403\.html|robots\.txt)$" allow-it
# Flag bad bots by a substring of their User-Agent
SetEnvIf User-Agent "larbin" bad-bot
SetEnvIf User-Agent "psycheclone" bad-bot
SetEnvIf User-Agent "Leacher" bad-bot
#
<Files *>
# With Deny,Allow, Deny is tested first and a matching Allow overrides it
Order Deny,Allow
Allow from env=allow-it
Deny from env=bad-bot
Deny from 38.0.0.0/8
</Files>

If you're using mod_rewrite, then simply add a RewriteCond to your RewriteRule:

# Skip the rule entirely for the whitelisted files
RewriteCond %{REQUEST_URI} !^(403\.html|robots\.txt)$
# Any one of these user-agents triggers the rule
RewriteCond %{HTTP_USER_AGENT} larbin [OR]
RewriteCond %{HTTP_USER_AGENT} psycheclone [OR]
RewriteCond %{HTTP_USER_AGENT} Leacher
# [F] answers 403 Forbidden
RewriteRule .* - [F]

In both examples, all user-agents -- including banned user-agents -- are allowed to fetch robots.txt and the custom 403 error page "403.html".

Jim

10:08 pm on Dec 9, 2007 (gmt 0)

10+ Year Member



Thanks for this, Jim. Just to make sure I have this right: for a specific IP address, would this be OK? (00.000... being the IP address):
RewriteCond %{REQUEST_URI} !^(403\.html|robots\.txt)$
RewriteCond %{REMOTE_ADDR} ^00\.000\.00\.00$
RewriteRule .* - [F]
Thanks again!
10:31 pm on Dec 9, 2007 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Yes, except that the REQUEST_URI pattern must start with a slash (which I forgot), i.e.

RewriteCond %{REQUEST_URI} !^/(403\.html|robots\.txt)$

The leading slash should also be added to the pattern in the example above using SetEnvIf.
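
In other words, the first line of the mod_access example becomes:

SetEnvIf Request_URI "^/(403\.html|robots\.txt)$" allow-it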

Jim

12:52 pm on Dec 10, 2007 (gmt 0)

10+ Year Member



Thanks for spelling it all out for me! I have now set it up, but I'm not sure how to test whether it works (apart from waiting for the bot to return). I blocked my own IP address, but I could still access my site - does this only work for bots, or have I done something wrong?
4:58 pm on Dec 10, 2007 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



OK, all you need to do is use some tool like this:
[rexswain.com...]

That lets you specify the page on your site and the user agent you wish to test.

Try this with both robots.txt and your index.html page and see what happens!
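
If you have shell access, curl can fake the user-agent too; a quick sketch, assuming curl is installed, with "larbin" and www.example.com standing in for a blocked agent and your own domain:

# -A sets the User-Agent header; -I asks for the response headers only
curl -I -A "larbin" http://www.example.com/robots.txt
curl -I -A "larbin" http://www.example.com/index.html

If the block is working, the first request should come back "200 OK" and the second "403 Forbidden".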

5:41 pm on Dec 10, 2007 (gmt 0)

10+ Year Member



Fantastic! It looks like the bot is now blocked.
The only problem is that the custom error document does not seem to work here (it works in other cases, like forbidden directories). I'm getting this message:

Additionally, a 403 Forbidden error was encountered while trying to use an ErrorDocument to handle the request.

Any idea what could cause this?

My error documents in htaccess look like this:
ErrorDocument 403 /errors/403.htm
ErrorDocument 404 /errors/404.htm
ErrorDocument 500 /errors/500.htm
ErrorDocument 410 /errors/404.htm
(and I did change 'html' to 'htm' in Jim's code).

8:00 pm on Dec 10, 2007 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



maybe the RewriteCond should look more like this:

RewriteCond %{REQUEST_URI} !^/(errors/403\.htm|errors/404\.htm|errors/410\.htm|errors/500\.htm|robots\.txt)$

(assuming ErrorDocument 410 was supposed to point at /errors/410.htm)
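
Putting that together with the user-agent conditions from earlier in the thread, the whole block would end up something like this (the pattern below is just a condensed but equivalent form of the line above):

RewriteCond %{REQUEST_URI} !^/(errors/(403|404|410|500)\.htm|robots\.txt)$
RewriteCond %{HTTP_USER_AGENT} larbin [OR]
RewriteCond %{HTTP_USER_AGENT} psycheclone [OR]
RewriteCond %{HTTP_USER_AGENT} Leacher
RewriteRule .* - [F]

A REMOTE_ADDR condition for an IP ban slots in the same way as the user-agent lines.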

11:10 am on Dec 16, 2007 (gmt 0)

10+ Year Member



That did the trick! Thanks again to everyone.
 
