
Sitemaps, Meta Data, and robots.txt Forum

    
disallow vs ban in .htaccess
joergnw10
posted 6:12 pm on Dec 9, 2007 (gmt 0)

I would like to ban a bot that comes from a specific IP address. Would disallowing the bot in robots.txt and banning the IP address in .htaccess have the same effect? Since I don't know whether this bot plays by the rules of robots.txt, I would prefer to use .htaccess.

 

incrediBILL
posted 6:31 pm on Dec 9, 2007 (gmt 0)

I would do the polite thing and disallow them in robots.txt, and then block the user-agent in .htaccess as well, just to make sure they pay attention.

Lord Majestic
posted 6:39 pm on Dec 9, 2007 (gmt 0)

If you block a bot by user-agent in .htaccess, that bot won't be able to fetch robots.txt. So if it is a good bot that obeys robots.txt, then instead of reading robots.txt and going away, it will try to fetch robots.txt, fail, and assume it is okay to crawl your other URLs. It will fail to get those too, but your server still has to handle each request, so the bot keeps adding load.

Therefore the best way to deal with this problem is as follows:

1) disallow the bot in robots.txt, and let everyone fetch that file (a minimal sketch follows below)
2) deny requests for all other URLs from those bots that should have obeyed robots.txt
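
For step 1, a minimal robots.txt sketch; "BadBot" is a hypothetical stand-in for whatever token the bot sends in its User-Agent header:

User-agent: BadBot
Disallow: /

Good bots that read this will stop crawling on their own; step 2 catches the ones that don't.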

joergnw10
posted 6:50 pm on Dec 9, 2007 (gmt 0)

Thanks very much for your replies.
That's more complex than I thought. I will start with the robots.txt as you suggested and then I'll have to work out how to ban requests to all files EXCEPT robots.txt.

jdMorgan
posted 7:54 pm on Dec 9, 2007 (gmt 0)

If you're using Apache mod_access, then it's not difficult:

# Let every client, including banned bots, fetch robots.txt and the 403 page
SetEnvIf Request_URI "^(403\.html|robots\.txt)$" allow-it
# Flag the bad bots by user-agent substring
SetEnvIf User-Agent "larbin" bad-bot
SetEnvIf User-Agent "psycheclone" bad-bot
SetEnvIf User-Agent "Leacher" bad-bot
#
<Files *>
Order Deny,Allow
# With Order Deny,Allow, an Allow match overrides a Deny match
Allow from env=allow-it
Deny from env=bad-bot
# An entire IP range can be denied as well
Deny from 38.0.0.0/8
</Files>

If you're using mod_rewrite, then simply add a RewriteCond to your RewriteRule:

# Exempt robots.txt and the custom 403 page from the ban
RewriteCond %{REQUEST_URI} !^(403\.html|robots\.txt)$
# Match any one of the bad user-agents
RewriteCond %{HTTP_USER_AGENT} larbin [OR]
RewriteCond %{HTTP_USER_AGENT} psycheclone [OR]
RewriteCond %{HTTP_USER_AGENT} Leacher
# Return 403 Forbidden for everything else they request
RewriteRule .* - [F]

In both examples, all user-agents, including banned user-agents, are allowed to fetch robots.txt and the custom 403 error page "403.html".

Jim

joergnw10
posted 10:08 pm on Dec 9, 2007 (gmt 0)

Thanks for this, Jim. Just to make sure I have this right: for a specific IP address, would this be ok? (00.000... being the IP address):
RewriteCond %{REQUEST_URI}!^(403\.html|robots\.txt)$
RewriteCond %{REMOTE_ADDR} ^00\.000\.00\.00$
RewriteRule .* - [F]
Thanks again!

jdMorgan
posted 10:31 pm on Dec 9, 2007 (gmt 0)

Yes, except that there must be a space between "}" and "!" (which gets deleted by this forum unless steps are taken to prevent it) and also the REQUEST_URI pattern must start with a slash (which I forgot), i.e.

RewriteCond %{REQUEST_URI} !^/(403\.html|robots\.txt)$

The leading slash should also be added to the pattern in the SetEnvIf example above.

Jim

joergnw10
posted 12:52 pm on Dec 10, 2007 (gmt 0)

Thanks for spelling it all out for me! I have now set it up, but I'm not sure how to test whether it works (apart from waiting for the bot to return). I blocked my own IP address, but I could still access my site. Does this only work for bots, or have I done something wrong?

incrediBILL
posted 4:58 pm on Dec 10, 2007 (gmt 0)

OK, all you need to do is use some tool like this:
[rexswain.com...]

That lets you specify the page on your site and the user agent you wish to test.

Try this with both robots.txt and your index.html page and see what happens!
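
If you have shell access, you can sketch the same check with curl; here "larbin" stands in for one of the blocked user-agents and www.example.com for your own site:

curl -I -A "larbin" http://www.example.com/robots.txt
# expect 200 OK: robots.txt should stay fetchable
curl -I -A "larbin" http://www.example.com/index.html
# expect 403 Forbidden: everything else is blocked for that user-agent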

joergnw10
posted 5:41 pm on Dec 10, 2007 (gmt 0)

Fantastic! Looks like the bot is now blocked.
The only problem is that the custom error document does not seem to work here (it works in other cases, like forbidden directories). I'm getting this message:
Additionally, a 403 Forbidden error was encountered while trying to use an ErrorDocument to handle the request.
Any idea what could cause this?

My error documents in htaccess look like this:
ErrorDocument 403 /errors/403.htm
ErrorDocument 404 /errors/404.htm
ErrorDocument 500 /errors/500.htm
ErrorDocument 410 /errors/404.htm
(and I did change 'html' to 'htm' in Jim's code).

phranque
posted 8:00 pm on Dec 10, 2007 (gmt 0)

maybe the RewriteCond should look more like this, since the rule was also denying the error documents themselves, so the internal request for /errors/403.htm was failing with its own 403:
RewriteCond %{REQUEST_URI} !^/(errors/403\.htm|errors/404\.htm|errors/410\.htm|errors/500\.htm|robots\.txt)$

(assuming Document 410 was supposed to be /errors/410.htm)
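
An equivalent, slightly more compact pattern that just groups the shared parts (same behaviour, purely a style choice):

RewriteCond %{REQUEST_URI} !^/(errors/(403|404|410|500)\.htm|robots\.txt)$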

joergnw10
posted 11:10 am on Dec 16, 2007 (gmt 0)

That did the trick! Thanks again to everyone.
