
Sitemaps, Meta Data, and robots.txt Forum

    
Blocking sherlock
quick question to confirm format
Reno
WebmasterWorld Senior Member 10+ Year Member

Msg#: 533 posted 3:44 am on Jan 19, 2005 (gmt 0)

I have a Perl script installed that records the visits of bots to my site. Here is how a typical entry looks in the .dat file:

1/10/2005¦4.79.40.170¦sherlock/1.0¦¦

If I want to ban that specific bot but no others, is this the correct format for my robots.txt file:

User-agent: sherlock
Disallow: /

Thanks...

 

GaryK
WebmasterWorld Senior Member 10+ Year Member

Msg#: 533 posted 3:54 am on Jan 19, 2005 (gmt 0)

Based on my experience, Sherlock and its partner Holmes (an add-on for Sherlock) do not read robots.txt. However, assuming they did, that would be the correct entry in robots.txt.

Reno
WebmasterWorld Senior Member 10+ Year Member

Msg#: 533 posted 4:11 am on Jan 19, 2005 (gmt 0)

Thanks, Gary, for the feedback. So if they refuse to honor the robots.txt protocol, is it fair to say there is nothing that can be done?

GaryK
WebmasterWorld Senior Member 10+ Year Member

Msg#: 533 posted 4:16 am on Jan 19, 2005 (gmt 0)

There is still plenty you can do. I'm more familiar with Windows, but both Windows and *nix have ways to forcibly block user agents. On *nix I think you use the .htaccess file. On Windows you can use ISAPI_Rewrite with Classic ASP to do basically the same thing the *nix folks do. That's what I do. I'm not sure if the built-in rewrite engine in ASP.NET lets you do this, but I'll bet it does.
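[On Apache, one alternative to a rewrite-based block is mod_setenvif. A sketch only, assuming mod_setenvif and the standard access module are loaded on an Apache 1.3/2.x server; the variable name `block_bot` is arbitrary:]

```apache
# Sketch: match "sherlock" case-insensitively anywhere in the
# User-Agent header and deny those requests with a 403.
SetEnvIfNoCase User-Agent "sherlock" block_bot
Order Allow,Deny
Allow from all
Deny from env=block_bot
```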

jdMorgan
WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Msg#: 533 posted 4:20 am on Jan 19, 2005 (gmt 0)

Robots.txt is a voluntary-compliance protocol. If the 'bots respect it, we respect them. If not, then there is blocking by user-agent, IP address, or behaviour. And blocking at the firewall/router as well.

Which of these you might use depends on what server you're running and how much control of it you have, i.e. a shared hosting account vs. the machine sitting right next to you...

Jim
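[Blocking by IP address, one of the options Jim lists, can also be done in .htaccess. A sketch for Apache; the address below is simply the one from the log sample earlier in the thread, and crawlers often rotate addresses, so treat it as illustrative:]

```apache
# Sketch: deny a single IP address in .htaccess.
# 4.79.40.170 is the address from the .dat entry quoted above.
Order Allow,Deny
Allow from all
Deny from 4.79.40.170
```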


Reno
WebmasterWorld Senior Member 10+ Year Member

Msg#: 533 posted 5:11 am on Jan 19, 2005 (gmt 0)

Thanks, Gary and Jim. This site is hosted on a *nix server, so perhaps the best approach is .htaccess.

Jim -- I read your detailed response about "bad bots" at:
[webmasterworld.com...]

Do you recommend that approach only for blocking an occasional unwelcome bot (such as sherlock), or is there a simpler .htaccess format?

For example, since posting my question I've read the following at:
[tedpavlic.com...]

### Forbid access from certain known-malicious browsers/bots
RewriteCond %{HTTP_USER_AGENT} nhnbot [NC,OR]
# Allow access to robots.txt and forbidden message
RewriteCond %{REQUEST_URI} !^/robots\.txt$

So would I just substitute "sherlock" for "nhnbot" and put all that into the htaccess file?

Edouard_H
10+ Year Member

Msg#: 533 posted 5:26 am on Jan 19, 2005 (gmt 0)

Sherlock & Holmes seem to be good candidates for banning, from what I've seen. This is what I have to outright forbid them (403):

RewriteCond %{HTTP_USER_AGENT} ^Holmes [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sherlock [NC]
RewriteRule .* - [F,L]

I've noticed that they return with the caps reversed, e.g. first as Sherlock and holmes, then sherlock and Holmes; hence the [NC].
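[If you also want banned bots to still be able to fetch robots.txt, as in the tedpavlic.com excerpt quoted earlier, a sketch combining the two. RewriteCond lines are ANDed by default and [OR] only joins adjacent lines, so this reads "not robots.txt AND (Holmes OR Sherlock)":]

```apache
# Sketch: same ban, but exempt robots.txt so blocked
# bots can still read the exclusion rules.
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{HTTP_USER_AGENT} ^Holmes [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sherlock [NC]
RewriteRule .* - [F,L]
```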

Reno
WebmasterWorld Senior Member 10+ Year Member

Msg#: 533 posted 12:25 am on Jan 20, 2005 (gmt 0)

Thanks very much Edouard -- I'll use that format.

Quick question -- if other "bad bots" show up, do I just stack them onto what you've started? As in:

RewriteCond %{HTTP_USER_AGENT} ^Holmes [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sherlock [NC]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_1 [NC]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_2 [NC]
RewriteRule .* - [F,L]

Edouard_H
10+ Year Member

Msg#: 533 posted 1:21 am on Jan 20, 2005 (gmt 0)

For all but the last you'll need the "OR":

RewriteCond %{HTTP_USER_AGENT} ^Holmes [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sherlock [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_1 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_2 [NC]
RewriteRule .* - [F,L]

Reno
WebmasterWorld Senior Member 10+ Year Member

Msg#: 533 posted 2:03 am on Jan 20, 2005 (gmt 0)
Thank you again Edouard -- I'm happy to see that there is a fairly easy way to repel these nasty little critters!
