
Sitemaps, Meta Data, and robots.txt Forum

    
robots.txt to stop the BAD bots
There was a past post with the best robots.txt file?
Googly
11:32 am on Dec 11, 2002 (gmt 0)

I remember somewhere on WebmasterWorld there was a thread showing the 'ideal' robots.txt file, which aimed to stop all the bad spiders, etc. Does anyone know where it is? I have searched high and low but to no avail!

Googly

 

SmallTime
12:04 pm on Dec 11, 2002 (gmt 0)

robots.txt is the voluntary stopper: it will stop the legitimate spiders that observe it, but not the ones trying to scrape email addresses and the like. For those you need .htaccess (and a good idea of what you are doing). I'm not sure where a list of the spiders that do observe it is; perhaps someone will have something to add.
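
To illustrate the voluntary side, a minimal robots.txt looks something like this (the bot name below is just a placeholder, not a real crawler):

# robots.txt - advisory only; polite spiders read it, scrapers ignore it
User-agent: *
Disallow: /cgi-bin/

# ask one specific (hypothetical) bot to stay out entirely
User-agent: SomeBadBot
Disallow: /

A compliant spider fetches this file first and skips the listed paths; an email harvester simply never asks, which is why any forced blocking has to happen in .htaccess.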

pendanticist
12:24 pm on Dec 11, 2002 (gmt 0)

Anytime you want to check something out beforehand, simply do a site search.

In this case:

[webmasterworld.com...]

Just be sure to read it (or them) all the way through, in case any problems or additional information were noted.

Pendanticist.

Googly
2:27 pm on Dec 11, 2002 (gmt 0)

Thanks, yeah, I did a site search before; it's just that I searched for robots.txt instead of .htaccess.

Whoops
Googly

pendanticist
3:01 pm on Dec 11, 2002 (gmt 0)

That's ok. In retrospect, I see you did mention that. <duh on my part>

Pendanticist.

bill
5:22 am on Dec 12, 2002 (gmt 0)

You'll probably get your best example of a robots.txt file by looking at the one for WebmasterWorld [webmasterworld.com]. Brett also has a section on Robots.txt Exclusion Standard Information [searchengineworld.com] over on SEW.

pendanticist
1:11 pm on Dec 12, 2002 (gmt 0)

bill,

Thanks for posting those links. However, they bring up an issue I've wondered about for a while....


You'll probably get your best example of a robots.txt file by looking at the one for WebmasterWorld. Brett also has a section on Robots.txt Exclusion Standard Information over on SEW.

What's the difference between:


RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^Buddy [OR]
RewriteCond %{HTTP_USER_AGENT} ^bumblebee [OR]

and the method used here?


[searchengineworld.com...] - Robots.txt Exclusion Standard Information

I'm still a little new at this and sometimes it gets just a tad confusing.

Thanks.

Pendanticist.

Key_Master
2:28 pm on Dec 12, 2002 (gmt 0)

Actually, I think the WebmasterWorld robots.txt file is one of the worst examples I've seen.

pendanticist,

robots.txt asks for compliance, but robots can choose whether or not to honor it. A .htaccess file forces compliance, whether the robot likes it or not.
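
To put that in concrete terms: the RewriteCond lines quoted above are only the condition half of a mod_rewrite block, and on their own they do nothing. A minimal sketch of how such a block is usually completed (assuming Apache with mod_rewrite enabled, placed in .htaccess; the agent names are just the ones from the quoted excerpt):

# .htaccess - forced compliance via mod_rewrite
RewriteEngine On
# match any one of the listed user-agents (note: no [OR] on the last condition)
RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Buddy
# refuse the request outright with a 403 Forbidden
RewriteRule .* - [F,L]

The robots.txt method described on searchengineworld.com is the polite request; the RewriteRule above is the server refusing to answer at all, so the robot never gets a choice.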

pendanticist
3:05 pm on Dec 12, 2002 (gmt 0)

Thanks Key_Master,

I appreciate the clarification.

(It's such a lovely thing when the light brightens in the somewhat clouded world of webmastery.) :)

Finder
10:54 pm on Dec 12, 2002 (gmt 0)

For some bots I use a combination. I allow any request for robots.txt to be completed, but then block the bot by user agent further along in the .htaccess file.

For example, ia_archiver is disallowed in my robots.txt. If it obeys, I see one 200 in my log instead of multiple 403s as it tries to access content on the site. And if it ever decides to disobey the robots protocol, I'm still protected.
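
A rough sketch of that layered setup, assuming Apache with mod_rewrite (the directives will differ on other servers):

# robots.txt - the polite request
User-agent: ia_archiver
Disallow: /

# .htaccess - the backstop: let the bot fetch robots.txt, refuse everything else
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [F,L]

If the bot obeys, the log shows a single 200 for robots.txt; if it ever disobeys, every other request is answered with a 403.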

jdMorgan
11:18 pm on Dec 12, 2002 (gmt 0)

I agree with Key_Master, to a great extent. The WebmasterWorld robots.txt has a lot of Disallows for user-agents in it that won't obey robots.txt anyway, thus adding to "code bloat" in that file.

However, referring to what Finder said, there's a good reason to have a double-check in some cases, and that reason is to prevent UA spoofing - the use of a legitimate UA by a malicious program. I also have several agents that may be good or may be bad (e.g. Python urllib) disallowed in robots.txt from accessing certain files. If such a UA is used in a malicious way and disobeys robots.txt, it gets blocked by IP address automatically, thanks to K_M's bad-bot script. ...Works great!

Jim
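
The robots.txt half of that arrangement is just a scoped disallow, something along these lines (the paths are placeholders; Python-urllib is the usual token that library sends):

User-agent: Python-urllib
Disallow: /cgi-bin/
Disallow: /private/

The automatic IP-blocking half depends on the bad-bot script itself and isn't reproduced here.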
