Sitemaps, Meta Data, and robots.txt Forum

    
robots.txt to stop the BAD bots
There was a past post with the best robots.txt file?
Googly (10+ Year Member) - posted 11:32 am on Dec 11, 2002 (gmt 0)

I remember somewhere on WebmasterWorld there was a thread showing the 'ideal' robots.txt file, which aimed to stop all the bad spiders, etc. Does anyone know where it is? I have searched high and low but to no avail!

Googly

 

SmallTime (10+ Year Member) - posted 12:04 pm on Dec 11, 2002 (gmt 0)

robots.txt is the voluntary stopper: it will stop those legitimate spiders that observe it, not those trying to scrape email addresses and the like. For them you need .htaccess (and a good idea of what you are doing). Not sure where a list of those that do observe it is; perhaps someone will have something to add.
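For what it's worth, a minimal robots.txt looks something like this (the user-agent names here are only placeholders for whichever spiders you want to address) - and only bots that choose to read and honor the file will act on it:

# robots.txt sits in the site root and is purely advisory
User-agent: BadSpider
Disallow: /

# Everyone else: please stay out of the cgi-bin
User-agent: *
Disallow: /cgi-bin/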

pendanticist (WebmasterWorld Senior Member, 10+ Year Member) - posted 12:24 pm on Dec 11, 2002 (gmt 0)

Anytime you want to check something out beforehand, simply do a site search.

In this case:

[webmasterworld.com...]

Just be sure to read it/them all the way through in case there are any problems or additional information.

Pendanticist.

Googly (10+ Year Member) - posted 2:27 pm on Dec 11, 2002 (gmt 0)

Thanks, yeah I did a site search before; it's just that I searched for robots.txt instead of .htaccess.

Whoops
Googly

pendanticist (WebmasterWorld Senior Member, 10+ Year Member) - posted 3:01 pm on Dec 11, 2002 (gmt 0)

That's ok. In retrospect, I see you did mention that. <duh on my part>

Pendanticist.

bill (WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member) - posted 5:22 am on Dec 12, 2002 (gmt 0)

You'll probably get your best example of a robots.txt file by looking at the one for WebmasterWorld [webmasterworld.com]. Brett also has a section on Robots.txt Exclusion Standard Information [searchengineworld.com] over on SEW.

pendanticist (WebmasterWorld Senior Member, 10+ Year Member) - posted 1:11 pm on Dec 12, 2002 (gmt 0)

bill,

Thanks for posting those links. However, it brings up an issue I've wondered about for a while...


You'll probably get your best example of a robots.txt file by looking at the one for WebmasterWorld. Brett also has a section on Robots.txt Exclusion Standard Information over on SEW.

What's the difference between:


RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^Buddy [OR]
RewriteCond %{HTTP_USER_AGENT} ^bumblebee [OR]

and the method used here?


[searchengineworld.com...] - Robots.txt Exclusion Standard Information

I'm still a little new at this and sometimes it gets just a tad confusing.

Thanks.

Pendanticist.

Key_Master (WebmasterWorld Senior Member, 10+ Year Member) - posted 2:28 pm on Dec 12, 2002 (gmt 0)

Actually, I think the WebmasterWorld robots.txt file is one of the worst examples I've seen.

pendanticist,

robots.txt asks for compliance, but robots can choose to honor or ignore it. A .htaccess file forces compliance, whether the robot likes it or not.
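Roughly speaking, the .htaccess version of the idea looks like this (assuming Apache with mod_rewrite available; the user-agent names are only examples) - the request is refused outright rather than politely declined:

RewriteEngine On
# Any request whose User-Agent matches one of these patterns...
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon
# ...gets a 403 Forbidden, no matter what robots.txt says
RewriteRule .* - [F]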

pendanticist (WebmasterWorld Senior Member, 10+ Year Member) - posted 3:05 pm on Dec 12, 2002 (gmt 0)

Thanks Key_Master,

I appreciate the clarification.

(It's such a lovely thing when the light brightens in the somewhat clouded world of webmastery.) :)

Finder (10+ Year Member) - posted 10:54 pm on Dec 12, 2002 (gmt 0)

For some bots I use a combination. I allow any request for robots.txt to be completed, but then block the bot by user agent further along in the .htaccess file.

For example, ia_archiver is disallowed in my robots.txt. If it obeys, I see one 200 in my log instead of multiple 403s as it tries to access content on the site. And if it ever decides to disobey the robots protocol, I'm still protected.
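Along these lines - ia_archiver is the real user-agent, the rest is just a sketch of how I'd set it up on Apache with mod_rewrite:

# robots.txt - the polite request
User-agent: ia_archiver
Disallow: /

# .htaccess - the backstop if the request is ignored
RewriteEngine On
# Always let robots.txt itself be fetched (one clean 200 in the log)
RewriteRule ^robots\.txt$ - [L]
# Anything else from that user-agent is refused
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC]
RewriteRule .* - [F]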

jdMorgan (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member) - posted 11:18 pm on Dec 12, 2002 (gmt 0)

I agree with Key_Master, to a great extent. The WebmasterWorld robots.txt has a lot of Disallows for user-agents in it that won't obey robots.txt anyway, thus adding to "code bloat" in that file.

However, referring to what Finder said, there's a good reason to have a double-check in some cases, and the reason is to prevent UA spoofing - the use of a legitimate UA by a malicious program. I also have several agents that may be good or may be bad (e.g. Python urllib) disallowed in robots.txt from accessing certain files. If such a UA is used in a malicious way and disobeys robots.txt, it gets blocked by IP address automatically, thanks to K_M's bad-bot script. ...Works great!
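The general shape of that setup, with the usual caveat that this is only a sketch and not K_M's actual script (the paths and the trap.pl name are made up for illustration):

# robots.txt - polite agents, including Python urllib, are kept out of the sensitive files
User-agent: Python-urllib
Disallow: /members/

# Nobody legitimate should ever enter the trap directory
User-agent: *
Disallow: /trap/

# .htaccess - anything that ignores the Disallow and wanders into the trap
# is handed to a small script that records and blocks its IP address
RewriteEngine On
RewriteRule ^trap/ /cgi-bin/trap.pl [L]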

Jim
