
robots.txt to stop the BAD bots

Was there a past post with the best robots.txt file?


Googly

11:32 am on Dec 11, 2002 (gmt 0)

Inactive Member
Account Expired



I remember there was a thread somewhere on WebmasterWorld showing the 'ideal' robots.txt file, which aimed to stop all the bad spiders, etc. Does anyone know where it is? I have searched high and low, but to no avail!


12:04 pm on Dec 11, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 20, 2001
posts:478
votes: 0


robots.txt is the voluntary stopper: it will stop the legitimate spiders that observe it, but not the ones trying to scrape email addresses, etc. For those you need .htaccess (and a good idea of what you are doing). I'm not sure where a list of the spiders that do observe it is; perhaps someone will have something to add.
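
To illustrate the difference, a minimal sketch of both approaches might look like this (ExampleBadBot is just a placeholder name, not a real spider):

# robots.txt - a polite request that only well-behaved spiders will honor
User-agent: ExampleBadBot
Disallow: /

# .htaccess - enforced by the server itself (assumes mod_rewrite is available)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^ExampleBadBot [NC]
RewriteRule .* - [F]
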
12:24 pm on Dec 11, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 27, 2002
posts:1685
votes: 0


Anytime you want to check something out beforehand, simply do a site search.

In this case:

[webmasterworld.com...]

Just be sure to read it (or them) all the way through in case there are any problems or additional information.

Pendanticist.

Googly

2:27 pm on Dec 11, 2002 (gmt 0)

Inactive Member
Account Expired



Thanks. Yeah, I did a site search before; it's just that I searched for robots.txt instead of .htaccess.

Whoops

3:01 pm on Dec 11, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 27, 2002
posts:1685
votes: 0


That's ok. In retrospect, I see you did mention that. <duh on my part>

Pendanticist.

5:22 am on Dec 12, 2002 (gmt 0)

Administrator from JP 

WebmasterWorld Administrator bill: Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month

joined:Oct 12, 2000
posts:14980
votes: 131


You'll probably get your best example of a robots.txt file by looking at the one for WebmasterWorld [webmasterworld.com]. Brett also has a section on Robots.txt Exclusion Standard Information [searchengineworld.com] over on SEW.
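
For anyone who hasn't seen the exclusion standard before, the basic format it describes is just User-agent records followed by Disallow lines, for example (the paths here are only illustrations):

# applies to all robots
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

# a single robot can also be addressed by name
User-agent: ExampleBot
Disallow: /
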
1:11 pm on Dec 12, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 27, 2002
posts:1685
votes: 0


bill,

Thanks for posting those links. However, they bring up an issue I've wondered about for a while...


You'll probably get your best example of a robots.txt file by looking at the one for WebmasterWorld. Brett also has a section on Robots.txt Exclusion Standard Information over on SEW.

What's the difference between:


RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^Buddy [OR]
RewriteCond %{HTTP_USER_AGENT} ^bumblebee [OR]

and the method used here?


[searchengineworld.com...] - Robots.txt Exclusion Standard Information

I'm still a little new at this and sometimes it gets just a tad confusing.

Thanks.

Pendanticist.

Key_Master

2:28 pm on Dec 12, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Actually, I think the WebmasterWorld robots.txt file is one of the worst examples I've seen.

pendanticist,

robots.txt asks for compliance, but robots can choose whether or not to obey it. A .htaccess file forces compliance, whether the robot likes it or not.
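
To make that concrete: RewriteCond lines like the ones quoted above only take effect once they are finished off with a RewriteRule. A rough sketch (not a complete or recommended bad-bot list) might be:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow
# everything matching one of the conditions above gets a 403 Forbidden
RewriteRule .* - [F]

robots.txt, by contrast, is just a text file the robot is expected to fetch and read; nothing on the server enforces it.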

3:05 pm on Dec 12, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 27, 2002
posts:1685
votes: 0


Thanks Key_Master,

I appreciate the clarification.

(It's such a lovely thing when the light brightens in the somewhat clouded world of webmastery.) :)

Finder

10:54 pm on Dec 12, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Aug 18, 2002
posts:131
votes: 0


For some bots I use a combination: I allow any request for robots.txt to be completed, but then block the bot by user agent further along in the .htaccess file.

For example, ia_archiver is disallowed in my robots.txt. If it obeys, I see one 200 in my log instead of multiple 403s as it tries to access content on the site. And if it ever decides to disobey the robots protocol, I'm still protected.
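
A rough sketch of that combination, assuming mod_rewrite (the exact rules will depend on what else is in the file):

# robots.txt - ask ia_archiver to stay out entirely
User-agent: ia_archiver
Disallow: /

# .htaccess - let anything fetch robots.txt, but refuse ia_archiver everywhere else
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC]
RewriteRule .* - [F]

If the bot obeys robots.txt it only ever requests /robots.txt and gets a 200; if it ignores it, every other request is met with a 403.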

11:18 pm on Dec 12, 2002 (gmt 0)

Senior Member

jdmorgan: WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


I agree with Key_Master, to a great extent. The WebmasterWorld robots.txt has a lot of Disallows for user-agents in it that won't obey robots.txt anyway, thus adding to "code bloat" in that file.

However, referring to what Finder said, there's a good reason to have a double-check in some cases, and the reason is to prevent UA spoofing - the use of a legitimate UA by a malicious program. I also have several agents that may be good or may be bad (e.g. Python urllib) disallowed in robots.txt from accessing certain files. If such a UA is used in a malicious way and disobeys robots.txt, it gets blocked by IP address automatically, thanks to K_M's bad-bot script. ...Works great!
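
As a simplified sketch of that double-check (the /members/ path is made up here, and the automatic IP blocking from K_M's script is left out):

# robots.txt - ask this particular agent to stay away from a protected area
User-agent: Python-urllib
Disallow: /members/

# .htaccess - back that request up with an enforced rule for the same area
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/members/
RewriteCond %{HTTP_USER_AGENT} Python-urllib [NC]
RewriteRule .* - [F]
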

Jim