


robots.txt to stop the BAD bots

Was there a past post with the best robots.txt file?

   
11:32 am on Dec 11, 2002 (gmt 0)




I remember somewhere on WebmasterWorld there was a thread showing the 'ideal' robots.txt file, which aimed to stop all the bad spiders etc. Does anyone know where it is? I have searched high and low but to no avail!

Googly

12:04 pm on Dec 11, 2002 (gmt 0)




robots.txt is the voluntary stopper: it will stop legitimate spiders that observe it, but not those trying to scrape email addresses, etc. For those you need .htaccess (and a good idea of what you are doing). Not sure where a list of those that do observe it is; perhaps someone will have something to add.
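
For anyone who hasn't written one before, a minimal robots.txt in that voluntary spirit might look something like this (the user-agent names below are only illustrative examples, not a vetted bad-bot list):

# sketch only - substitute the agents you actually want to exclude
User-agent: EmailCollector
Disallow: /

User-agent: WebCopier
Disallow: /

# everything else may crawl, except the cgi-bin
User-agent: *
Disallow: /cgi-bin/

Well-behaved crawlers read this file and stay out; anything else simply ignores it, which is where .htaccess comes in.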
12:24 pm on Dec 11, 2002 (gmt 0)




Anytime you want to check something out beforehand, simply do a site search.

In this case:

[webmasterworld.com...]

Just be sure to read it/them all the way through in case any problems or additional information came up.

Pendanticist.

2:27 pm on Dec 11, 2002 (gmt 0)




Thanks, yeah, I did a site search before; it's just that I searched for robots.txt instead of .htaccess.

Whoops
Googly

3:01 pm on Dec 11, 2002 (gmt 0)




That's ok. In retrospect, I see you did mention that. <duh on my part>

Pendanticist.

5:22 am on Dec 12, 2002 (gmt 0)

bill (WebmasterWorld Administrator)



You'll probably get your best example of a robots.txt file by looking at the one for WebmasterWorld [webmasterworld.com]. Brett also has a section on Robots.txt Exclusion Standard Information [searchengineworld.com] over on SEW.
1:11 pm on Dec 12, 2002 (gmt 0)




bill,

Thanks for posting those links. However, it brings up an issue I've wondered about for a while...


You'll probably get your best example of a robots.txt file by looking at the one for WebmasterWorld. Brett also has a section on Robots.txt Exclusion Standard Information over on SEW.

What's the difference between:


RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^Buddy [OR]
RewriteCond %{HTTP_USER_AGENT} ^bumblebee [OR]

and the method used here?


[searchengineworld.com...] - Robots.txt Exclusion Standard Information

I'm still a little new at this and sometimes it gets just a tad confusing.

Thanks.

Pendanticist.

2:28 pm on Dec 12, 2002 (gmt 0)




Actually, I think the WebmasterWorld robots.txt file is one of the worst examples I've seen.

pendanticist,

robots.txt asks for compliance, but robots can choose to honor or ignore it. A .htaccess file forces compliance, whether the robot likes it or not.
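
To make that concrete, here is a minimal .htaccess sketch in the same style as the excerpt quoted above (it assumes mod_rewrite is enabled; the user-agent names are only examples):

# sketch only - list whatever agents you want to refuse
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [NC]
# any matching agent gets a 403 Forbidden for every request
RewriteRule .* - [F]

The server refuses the request outright, so it makes no difference whether the robot ever bothered to read robots.txt.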

3:05 pm on Dec 12, 2002 (gmt 0)




Thanks Key_Master,

I appreciate the clarification.

(It's such a lovely thing when the light brightens in the somewhat clouded world of webmastery.) :)

10:54 pm on Dec 12, 2002 (gmt 0)




For some bots I use a combination. I allow any request for robots.txt to be completed, but then block the bot by user agent further along in the .htaccess file.

For example, ia_archiver is disallowed in my robots.txt. If it obeys, I see one 200 in my log instead of multiple 403s as it tries to access content on the site. And if it ever decides to disobey the robots protocol, I'm still protected.
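
A sketch of that combination, using ia_archiver as in the example above (mod_rewrite assumed; everything here is illustrative):

# in robots.txt - politely ask ia_archiver to stay out
User-agent: ia_archiver
Disallow: /

# in .htaccess - let anything fetch robots.txt, but refuse ia_archiver everywhere else
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC]
RewriteRule .* - [F]

If the bot obeys, the only hit in the log is a 200 for robots.txt; if it doesn't, everything else comes back 403.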

11:18 pm on Dec 12, 2002 (gmt 0)

jdmorgan (WebmasterWorld Senior Member)



I agree with Key_Master to a great extent. The WebmasterWorld robots.txt has a lot of Disallows in it for user-agents that won't obey robots.txt anyway, thus adding to "code bloat" in that file.

However, referring to what Finder said, there's a good reason to have a double-check in some cases, and the reason is to guard against UA spoofing - the use of a legitimate UA by a malicious program. I also have several agents that may be good or may be bad (e.g. Python urllib) disallowed in robots.txt from accessing certain files. If such a UA is used in a malicious way and disobeys robots.txt, it gets blocked by IP address automatically, thanks to K_M's bad-bot script. ...Works great!
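
The bad-bot script itself isn't reproduced in this thread, but the general shape is a URL that robots.txt disallows, whose handler appends the offending visitor's IP to deny rules in .htaccess. A sketch only - the /trap/ path, the deny entry, and the IP below are hypothetical:

# in robots.txt - no compliant robot should ever request this path
User-agent: *
Disallow: /trap/

# in .htaccess - lines a trap script might append for each IP caught requesting /trap/
<Limit GET POST>
order allow,deny
allow from all
deny from 192.0.2.55
</Limit>

Anything that ignores the Disallow and wanders into the trap ends up denied by IP from then on, regardless of what user-agent it claims to be.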

Jim