Forum Moderators: goodroi

Blocking sherlock

quick question to confirm format

     
3:44 am on Jan 19, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 9, 2001
posts:1307
votes: 0


I have a perlscript installed that records the visits of bots to my site. Here is how a typical entry looks in the .dat file:

1/10/2005¦4.79.40.170¦sherlock/1.0¦¦

If I want to ban that specific bot but no others, is this the correct format for my robots.txt file:

User-agent: sherlock
Disallow: /

Thanks...

3:54 am on Jan 19, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


Based on my experience, Sherlock and its partner Holmes (an add-on for Sherlock) do not read robots.txt. However, assuming they did, that would be the correct entry in robots.txt.
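If they did read it, something along these lines (just a sketch) would single out Sherlock while leaving every other robot free to crawl:

# robots.txt - block only Sherlock, leave all other robots unrestricted
User-agent: sherlock
Disallow: /

User-agent: *
Disallow: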
4:11 am on Jan 19, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 9, 2001
posts:1307
votes: 0


Thanks Gary for the feedback. So if they refuse to honor the robots.txt protocol, then is it fair to say there is nothing that can be done?
4:16 am on Jan 19, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


There is still plenty you can do. I'm more familiar with Windows, but both Windows and *nix have ways to forcibly block user agents. On *nix, I believe you use the .htaccess file. On Windows, you can use ISAPI_Rewrite with ASP Classic to do essentially the same thing the *nix folks do; that's what I do. I'm not sure whether the built-in rewrite support in ASP.NET lets you do this, but I'd bet it does.
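On the *nix side, something like this in .htaccess should do it (untested on my end since I'm on Windows, so treat it as a sketch):

# Tag any request whose user agent contains "sherlock", then deny it
SetEnvIfNoCase User-Agent "sherlock" bad_bot
<Limit GET POST HEAD>
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Limit>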
4:20 am on Jan 19, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Robots.txt is a voluntary-compliance protocol. If the 'bots respect it, we respect them. If not, then there is blocking by user-agent, IP address, or behaviour. And blocking at the firewall/router as well.

Which of these you might use depends on what server you're running and how much control you have over it, i.e. a shared hosting account vs. the machine sitting right next to you...
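As a concrete example, if you do have .htaccess available, blocking by IP address is only a few lines (a sketch; substitute whatever addresses actually show up in your logs):

# Deny one specific address and one partial range
Order Allow,Deny
Allow from all
Deny from 4.79.40.170
Deny from 4.79.40.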

Jim

5:11 am on Jan 19, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 9, 2001
posts:1307
votes: 0


Thanks Gary and Jim. This site is hosted on a *nix server, so perhaps the best approach is .htaccess.

Jim -- I read your detailed response about "bad bots" at:
[webmasterworld.com...]

Do you recommend that approach only for blocking an occasional unwelcome bot (such as sherlock), or is there a simpler .htaccess format?

For example, since posting my question I've read the following at:
[tedpavlic.com...]

### Forbid access from certain known-malicious browsers/bots
RewriteCond %{HTTP_USER_AGENT} nhnbot [NC,OR]
# Allow access to robots.txt and forbidden message
RewriteCond %{REQUEST_URI} !^/robots\.txt$

So would I just substitute "sherlock" for "nhnbot" and put all that into the htaccess file?

5:26 am on Jan 19, 2005 (gmt 0)

Full Member

10+ Year Member

joined:Oct 9, 2002
posts:245
votes: 0


From what I've seen, Sherlock & Holmes qualify as good candidates for a ban. This is what I use to forbid them outright (403):

RewriteCond %{HTTP_USER_AGENT} ^Holmes [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sherlock [NC]
RewriteRule .* - [F,L]

I've noticed that they come back with the capitalization reversed, e.g. first as Sherlock and holmes, then as sherlock and Holmes, hence the [NC].
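One other note: those three lines assume the rewrite engine is already switched on earlier in the same .htaccess. If it isn't, the full block would look something like this (a sketch; adjust to your own setup):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Holmes [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sherlock [NC]
RewriteRule .* - [F,L]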

12:25 am on Jan 20, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 9, 2001
posts:1307
votes: 0


Thanks very much Edouard -- I'll use that format.

Quick question -- if other "bad bots" show up, do I just stack them on top of what you've started? As in:

RewriteCond %{HTTP_USER_AGENT} ^Holmes [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sherlock [NC]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_1 [NC]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_2 [NC]
RewriteRule .* - [F,L]

1:21 am on Jan 20, 2005 (gmt 0)

Full Member

10+ Year Member

joined:Oct 9, 2002
posts:245
votes: 0


For all but the last you'll need the "OR":

RewriteCond %{HTTP_USER_AGENT} ^Holmes [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sherlock [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_1 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_2 [NC]
RewriteRule .* - [F,L]
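And if you want the blocked bots to still be able to fetch robots.txt, as in the snippet you quoted earlier, one extra condition on the request URI should cover it (a sketch, untested):

# Let everything reach robots.txt; forbid the listed bots everywhere else
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{HTTP_USER_AGENT} ^Holmes [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sherlock [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_1 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_2 [NC]
RewriteRule .* - [F,L]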

2:03 am on Jan 20, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 9, 2001
posts:1307
votes: 0



Thank you again Edouard -- I'm happy to see that there is a fairly easy way to repel these nasty little critters!