


robots.txt to stop the BAD bots

Was there a past post with the best robots.txt file?

   
11:32 am on Dec 11, 2002 (gmt 0)




I remember somewhere on WebmasterWorld there was a thread showing the 'ideal' robots.txt file, which aimed to stop all the bad spiders etc. Does anyone know where it is? I have searched high and low but to no avail!

Googly

12:04 pm on Dec 11, 2002 (gmt 0)




robots.txt is the voluntary stopper: it will stop legitimate spiders that observe it, but not those trying to scrape email addresses, etc. For those you need .htaccess (and a good idea of what you are doing). Not sure where a list of those that do observe it is; perhaps someone will have something to add.
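
For anyone who hasn't written one before, a minimal robots.txt in that voluntary spirit might look something like this (the user-agent names below are only illustrative examples, not a vetted bad-bot list):

# sketch only - substitute the agents you actually want to exclude
User-agent: EmailCollector
Disallow: /

User-agent: WebCopier
Disallow: /

# everything else may crawl, except the cgi-bin
User-agent: *
Disallow: /cgi-bin/

Well-behaved crawlers read this file and stay out; anything else simply ignores it, which is where .htaccess comes in.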
12:24 pm on Dec 11, 2002 (gmt 0)




Anytime you want to check something out beforehand, simply do a site search.

In this case:

[webmasterworld.com...]

Just be sure to read it/them all the way through in case any problems or additional information came up.

Pendanticist.

2:27 pm on Dec 11, 2002 (gmt 0)




Thanks, yeah, I did a site search before; it's just that I searched for robots.txt instead of .htaccess.

Whoops
Googly

3:01 pm on Dec 11, 2002 (gmt 0)




That's ok. In retrospect, I see you did mention that. <duh on my part>

Pendanticist.

5:22 am on Dec 12, 2002 (gmt 0)

bill (WebmasterWorld Administrator)



You'll probably get your best example of a robots.txt file by looking at the one for WebmasterWorld [webmasterworld.com]. Brett also has a section on Robots.txt Exclusion Standard Information [searchengineworld.com] over on SEW.
1:11 pm on Dec 12, 2002 (gmt 0)




bill,

Thanks for posting those links. However, it brings up an issue I've wondered about for a while...


You'll probably get your best example of a robots.txt file by looking at the one for WebmasterWorld. Brett also has a section on Robots.txt Exclusion Standard Information over on SEW.

What's the difference between:


RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^Buddy [OR]
RewriteCond %{HTTP_USER_AGENT} ^bumblebee [OR]

and the method used here?


[searchengineworld.com...] - Robots.txt Exclusion Standard Information

I'm still a little new at this and sometimes it gets just a tad confusing.

Thanks.

Pendanticist.

2:28 pm on Dec 12, 2002 (gmt 0)




Actually, I think the WebmasterWorld robots.txt file is one of the worst examples I've seen.

pendanticist,

robots.txt asks for compliance, but robots can choose to honor or ignore it. A .htaccess file forces compliance, whether the robot likes it or not.
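
To make that concrete, here is a minimal .htaccess sketch in the same style as the excerpt quoted above (it assumes mod_rewrite is enabled; the user-agent names are only examples):

# sketch only - list whatever agents you want to refuse
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [NC]
# any matching agent gets a 403 Forbidden for every request
RewriteRule .* - [F]

The server refuses the request outright, so it makes no difference whether the robot ever bothered to read robots.txt.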

3:05 pm on Dec 12, 2002 (gmt 0)




Thanks Key_Master,

I appreciate the clarification.

(It's such a lovely thing when the light brightens in the somewhat clouded world of webmastery.) :)

10:54 pm on Dec 12, 2002 (gmt 0)




For some bots I use a combination. I allow any request for robots.txt to be completed, but then block the bot by user agent further along in the .htaccess file.

For example, ia_archiver is disallowed in my robots.txt. If it obeys, I see one 200 in my log instead of multiple 403s as it tries to access content on the site. And if it ever decides to disobey the robots protocol, I'm still protected.
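
A sketch of that combination, using ia_archiver as in the example above (mod_rewrite assumed; everything here is illustrative):

# in robots.txt - politely ask ia_archiver to stay out
User-agent: ia_archiver
Disallow: /

# in .htaccess - let anything fetch robots.txt, but refuse ia_archiver everywhere else
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC]
RewriteRule .* - [F]

If the bot obeys, the only hit in the log is a 200 for robots.txt; if it doesn't, everything else comes back 403.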

11:18 pm on Dec 12, 2002 (gmt 0)

jdmorgan (WebmasterWorld Senior Member)



I agree with Key_Master to a great extent. The WebmasterWorld robots.txt has a lot of Disallows in it for user-agents that won't obey robots.txt anyway, thus adding to "code bloat" in that file.

However, referring to what Finder said, there's a good reason to have a double-check in some cases, and the reason is to guard against UA spoofing - the use of a legitimate UA by a malicious program. I also have several agents that may be good or may be bad (e.g. Python urllib) disallowed in robots.txt from accessing certain files. If such a UA is used in a malicious way and disobeys robots.txt, it gets blocked by IP address automatically, thanks to K_M's bad-bot script. ...Works great!
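
The bad-bot script itself isn't reproduced in this thread, but the general shape is a URL that robots.txt disallows, whose handler appends the offending visitor's IP to deny rules in .htaccess. A sketch only - the /trap/ path, the deny entry, and the IP below are hypothetical:

# in robots.txt - no compliant robot should ever request this path
User-agent: *
Disallow: /trap/

# in .htaccess - lines a trap script might append for each IP caught requesting /trap/
<Limit GET POST>
order allow,deny
allow from all
deny from 192.0.2.55
</Limit>

Anything that ignores the Disallow and wanders into the trap ends up denied by IP from then on, regardless of what user-agent it claims to be.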

Jim