
Forum Moderators: goodroi


Blocking sherlock

quick question to confirm format

     

Reno

3:44 am on Jan 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have a Perl script installed that records the visits of bots to my site. Here is how a typical entry looks in the .dat file:

1/10/2005¦4.79.40.170¦sherlock/1.0¦¦

If I want to ban that specific bot but no others, is this the correct format for my robots.txt file:

User-agent: sherlock
Disallow: /

Thanks...

GaryK

3:54 am on Jan 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Based on my experience, Sherlock and its partner Holmes (an add-on for Sherlock) do not read robots.txt. However, assuming they did, that would be the correct entry in robots.txt.

Reno

4:11 am on Jan 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Gary for the feedback. So if they refuse to honor the robots.txt protocol, then is it fair to say there is nothing that can be done?

GaryK

4:16 am on Jan 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There is still plenty you can do. I'm more familiar with Windows but both Windows and *nix have ways to forcibly block user agents. In *nix I think you use the .htaccess file. In Windows you can use ISAPI_Rewrite with ASP Classic to do basically the same thing as *nix folks use. That's what I do. I'm not sure if the built-in rewrite engine in ASP .NET lets you do this but I'll bet it does.

jdMorgan

4:20 am on Jan 19, 2005 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Robots.txt is a voluntary-compliance protocol. If the 'bots respect it, we respect them. If not, then there is blocking by user-agent, IP address, or behaviour. And blocking at the firewall/router as well.

Which of these you might use depends on what server you're running, and how much control of it you have, i.e. shared hosting account vs. the machine's sitting right next to you...
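As an illustration of the IP-address option mentioned above, here is a minimal .htaccess sketch. It assumes an Apache server where the host allows access-control directives in .htaccess, and it uses the address from the log entry at the top of this thread. Whether that address is the only one the bot uses is not known, so treat it as an example rather than a complete block list.

```apache
# Sketch only: deny a single IP address (taken from the log entry in
# the first post) while leaving the site open to everyone else.
# Apache 2.4 syntax; requires AllowOverride AuthConfig on the host.
<RequireAll>
    Require all granted
    Require not ip 4.79.40.170
</RequireAll>

# Rough equivalent for older Apache (2.2 and earlier), shown commented out:
# Order Allow,Deny
# Allow from all
# Deny from 4.79.40.170
```

IP blocking only helps while the bot keeps the same address, which is why user-agent matching is often used alongside it.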

Jim

pendanticist

4:48 am on Jan 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



...

[edited by: pendanticist at 5:16 am (utc) on Jan. 19, 2005]

Reno

5:11 am on Jan 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Gary and Jim. This site is hosted on a *nix server, so perhaps the best approach might be .htaccess.

Jim -- I read your detailed response about "bad bots" at:
[webmasterworld.com...]

Do you recommend that approach for blocking only an occasional unwelcome bot (such as "sherlock"), or is there a simpler .htaccess format?

For example, since posting my question I've read the following at:
[tedpavlic.com...]

### Forbid access from certain known-malicious browsers/bots
RewriteCond %{HTTP_USER_AGENT} nhnbot [NC,OR]
# Allow access to robots.txt and forbidden message
RewriteCond %{REQUEST_URI} !^/robots\.txt$

So would I just substitute "sherlock" for "nhnbot" and put all that into the htaccess file?

Edouard_H

5:26 am on Jan 19, 2005 (gmt 0)

10+ Year Member



Sherlock & Holmes seem to qualify as good candidates to ban from what I've seen. This is what I have to outright forbid them (403):

RewriteCond %{HTTP_USER_AGENT} ^Holmes [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sherlock [NC]
RewriteRule .* - [F,L]

I've noticed that they return with caps reversed, e.g. first as Sherlock and holmes, then as sherlock and Holmes, hence the [NC].
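For anyone pasting this into an otherwise empty .htaccess file, a slightly fuller sketch may help. It assumes mod_rewrite is available and that the host permits rewrite directives in .htaccess. The `RewriteEngine On` line and the robots.txt exemption (borrowed from the snippet quoted earlier in the thread) are additions, not part of the three lines above.

```apache
# Sketch only: return 403 Forbidden to Sherlock and Holmes,
# but still let them fetch robots.txt.
RewriteEngine On

# Skip the block when the request is for robots.txt itself
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# Match either user-agent, case-insensitively; [OR] links these two lines
RewriteCond %{HTTP_USER_AGENT} ^Sherlock [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Holmes [NC]
# "-" means no substitution; [F] sends 403 and stops processing
RewriteRule .* - [F]
```

The conditions read as "not robots.txt AND (Sherlock OR Holmes)", since mod_rewrite groups a run of [OR] conditions together before applying the implicit AND.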

Reno

12:25 am on Jan 20, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks very much Edouard -- I'll use that format.

Quick question -- if other "bad bots" show up, do I just stack them on top of what you've started? As in:

RewriteCond %{HTTP_USER_AGENT} ^Holmes [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sherlock [NC]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_1 [NC]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_2 [NC]
RewriteRule .* - [F,L]

Edouard_H

1:21 am on Jan 20, 2005 (gmt 0)

10+ Year Member



For all but the last you'll need the "OR":

RewriteCond %{HTTP_USER_AGENT} ^Holmes [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sherlock [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_1 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot_2 [NC]
RewriteRule .* - [F,L]
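An alternative that sidesteps the [OR] bookkeeping is to fold the names into one pattern with regex alternation. This is a sketch under the same assumptions (mod_rewrite enabled and allowed in .htaccess); `bad_bot_1` and `bad_bot_2` are the placeholder names from the question, not real user-agents.

```apache
# Sketch only: one condition covering several user-agents.
# Add or remove names inside the parentheses, separated by "|".
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^(Holmes|Sherlock|bad_bot_1|bad_bot_2) [NC]
RewriteRule .* - [F]
```

With a single condition there is no last line to keep free of [OR], so adding a new bot cannot accidentally break the chain.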

Reno

2:03 am on Jan 20, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




Thank you again Edouard -- I'm happy to see that there is a fairly easy way to repel these nasty little critters!
 
