Sitemaps, Meta Data, and robots.txt Forum

disallow vs ban in .htaccess
joergnw10 - 6:12 pm on Dec 9, 2007 (gmt 0)

I would like to ban a bot that also has a specific IP address. Would disallowing the bot in robots.txt and banning its IP address in .htaccess have the same effect? Since I don't know whether this bot plays by the rules of robots.txt, I would prefer to use .htaccess.

 

incrediBILL - 6:31 pm on Dec 9, 2007 (gmt 0)

I would do the polite thing and disallow them in robots.txt, and then block the user agent in .htaccess as well, just to make sure they pay attention.

Lord Majestic - 6:39 pm on Dec 9, 2007 (gmt 0)

If you block the bot by user-agent in .htaccess, then it won't be able to fetch robots.txt. So if it is a good bot that obeys robots.txt, then instead of fetching robots.txt once and going away, it will try to fetch robots.txt, fail, and assume it is okay to crawl your other URLs. It will fail to get those too, but your server still has to deal with every one of its requests, so the load on your server goes up.

Therefore the best way to deal with this problem is as follows:

1) disallow the bot in robots.txt and allow everyone to fetch that file (a minimal robots.txt record is sketched below)
2) ban requests to all other URLs from those bots that should have obeyed robots.txt
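
For step 1, the robots.txt record simply disallows everything for that one bot. A minimal sketch, assuming "BadBot" stands in for the bot's actual user-agent token:

User-agent: BadBot
Disallow: /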

joergnw10 - 6:50 pm on Dec 9, 2007 (gmt 0)

Thanks very much for your replies.
That's more complex than I thought. I will start with robots.txt as you suggested, and then I'll have to work out how to ban requests to all files EXCEPT robots.txt.

jdMorgan - 7:54 pm on Dec 9, 2007 (gmt 0)

If you're using Apache mod_access, then it's not difficult:

SetEnvIf Request_URI "^(403\.html|robots\.txt)$" allow-it
SetEnvIf User-Agent "larbin" bad-bot
SetEnvIf User-Agent "psycheclone" bad-bot
SetEnvIf User-Agent "Leacher" bad-bot
#
<Files *>
Order Deny,Allow
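# With "Order Deny,Allow", Deny directives are evaluated before Allow directives, and a request
# matching both is allowed - so allow-it lets even the bad bots fetch robots.txt and the 403 page.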
Allow from env=allow-it
Deny from env=bad-bot
Deny from 38.0.0.0/8
</Files>

If you're using mod_rewrite, then simply add a RewriteCond to your RewriteRule:

RewriteCond %{REQUEST_URI} !^(403\.html|robots\.txt)$
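# The three user-agent conditions below are ORed together, and that group is ANDed with
# the URI exception above, so banned bots can still fetch robots.txt and the 403 page.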
RewriteCond %{HTTP_USER_AGENT} larbin [OR]
RewriteCond %{HTTP_USER_AGENT} psycheclone [OR]
RewriteCond %{HTTP_USER_AGENT} Leacher
RewriteRule .* - [F]

In both examples, all user-agents (including banned user-agents) are allowed to fetch robots.txt and the custom 403 error page "403.html".

Jim

joergnw10 - 10:08 pm on Dec 9, 2007 (gmt 0)

Thanks for this, Jim. Just to make sure I have this right - for a specific IP address, would this be OK? (00.000... being the IP address):
RewriteCond %{REQUEST_URI}!^(403\.html|robots\.txt)$
RewriteCond %{REMOTE_ADDR} ^00\.000\.00\.00$
RewriteRule .* - [F]
Thanks again!

jdMorgan - 10:31 pm on Dec 9, 2007 (gmt 0)

Yes, except that there must be a space between "}" and "!" (which gets deleted by this forum unless steps are taken to prevent it) and also the REQUEST_URI pattern must start with a slash (which I forgot), i.e.

RewriteCond %{REQUEST_URI} !^/(403\.html|robots\.txt)$

The leading slash should also be added to the pattern in the example above using SetEnvIf.
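
Applied to the SetEnvIf line, the same correction gives:

SetEnvIf Request_URI "^/(403\.html|robots\.txt)$" allow-it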

Jim

joergnw10 - 12:52 pm on Dec 10, 2007 (gmt 0)

Thanks for spelling it all out for me! I have now set it up, but I'm not sure how to test whether it works (apart from waiting for the bot to return). I denied my own IP address, but I could still access my site - does this only work for bots, or have I done something wrong?

incrediBILL - 4:58 pm on Dec 10, 2007 (gmt 0)

OK, all you need to do is use some tool like this:
[rexswain.com...]

That lets you specify the page on your site and the user agent you wish to test.

Try this with both robots.txt and your index.html page and see what happens!
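
If you have shell access, curl can run the same test from the command line (a sketch: "larbin" is one of the sample user-agents from above, and www.example.com stands in for your own host):

curl -I -A "larbin" http://www.example.com/
# expect: HTTP/1.1 403 Forbidden
curl -I -A "larbin" http://www.example.com/robots.txt
# expect: HTTP/1.1 200 OK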

joergnw10 - 5:41 pm on Dec 10, 2007 (gmt 0)

Fantastic! Looks like the bot is now blocked.
The only problem is that the custom error document does not seem to work here (it works in other cases, like forbidden directories). I'm getting this message:

Additionally, a 403 Forbidden error was encountered while trying to use an ErrorDocument to handle the request.

Any idea what could cause this?

My error documents in htaccess look like this:
ErrorDocument 403 /errors/403.htm
ErrorDocument 404 /errors/404.htm
ErrorDocument 500 /errors/500.htm
ErrorDocument 410 /errors/404.htm
(and I did change 'html' to 'htm' in Jim's code).

phranque - 8:00 pm on Dec 10, 2007 (gmt 0)

maybe the RewriteCond should look more like this:
RewriteCond %{REQUEST_URI} !^/(errors/403\.htm|errors/404\.htm|errors/410\.htm|errors/500\.htm|robots\.txt)$

(assuming ErrorDocument 410 was supposed to point to /errors/410.htm)
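
Combined with the user-agent conditions from Jim's earlier example, the whole ruleset would then read (a sketch - the three user-agents are just the earlier samples; substitute your actual bot's token):

RewriteCond %{REQUEST_URI} !^/(errors/403\.htm|errors/404\.htm|errors/410\.htm|errors/500\.htm|robots\.txt)$
RewriteCond %{HTTP_USER_AGENT} larbin [OR]
RewriteCond %{HTTP_USER_AGENT} psycheclone [OR]
RewriteCond %{HTTP_USER_AGENT} Leacher
RewriteRule .* - [F]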

joergnw10 - 11:10 am on Dec 16, 2007 (gmt 0)

That did the trick! Thanks again to everyone.
