WebmasterWorld: Sitemaps, Meta Data, and robots.txt Forum

Robots.txt strategy: allow only good, or disallow individual bad?
Should my robots.txt file say "only Google, Yahoo, and Bing are allowed," or "everyone is OK except these"?

 7:06 pm on May 10, 2013 (gmt 0)

There seem to be three different approaches by the bigger websites out there when it comes to writing robots.txt:
  • Allow all (Google, nbcnews)
  • Allow all except for certain "known bad bots" (Wikipedia)
  • Allow only the "best of the best" search engines and disallow any other bot (Facebook, LinkedIn, Nike)

Over the years, I've built up a robots.txt file with more than 60 "known bad bots". It's obnoxious to maintain it this way - always adding/removing/modifying bot version numbers, etc. So I'm considering moving to the "allow only the 'best of the best'" model. Is anyone else doing this currently? I have my cherry-picked bots that I'd like to add now, but I'm shy of pulling the trigger until the idea is vetted by some other folks.
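Concretely, the whitelist I have in mind would look something like this (a sketch - the crawler tokens are the usual ones for Google, Yahoo, and Bing, and an empty Disallow means "crawl everything"):

```
# Named crawlers may crawl everything
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: bingbot
Disallow:

# Everyone else is shut out entirely
User-agent: *
Disallow: /
```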




 1:28 am on May 11, 2013 (gmt 0)

Why are you futzing around with robots.txt? Bad bots probably don't read it and certainly don't obey it; the only option is to block 'em at the source. Lotsa ways to do this depending on server and personal preference.
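E.g. on Apache, a couple of lines in .htaccess will 403 a bad bot no matter what robots.txt says (the "badbot" token here is made up - substitute whatever actually shows up in your logs):

```apache
# Sketch: deny any request whose User-Agent contains "badbot" (hypothetical name)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} badbot [NC]
RewriteRule .* - [F,L]
```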

"allow" unlike "disallow" is not a cast-in-stone part of the robots exclusion protocol, so robots can legitimately ignore it and still retain their halos. (The same applies to "Crawl-Delay".)

Version number should have no effect. I asked about this recently and someone-- phranque, I think-- pointed to a passage in The Rules that says "user-agent" should be interpreted broadly. If you're not sure a rule applies to you, assume it does.
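In other words, a record like this is supposed to catch any UA that contains the token, version number or no:

```
# "SomeBot" is a hypothetical name; the record applies to SomeBot/1.0, SomeBot/2.3, etc.
User-agent: SomeBot
Disallow: /private/
```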

If you have a big, high-traffic site you can start doing fancy things like serving each robot a custom robots.txt that names only itself, so the robot can't sneak off, change clothes and come back disguised as the googlebot to get wider access. Not that this would do the robot any good: It's more likely to net it a swift 403. For ordinary mortals a custom robots.txt isn't worth the trouble.
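For the record, the Apache incantation for the per-robot version is only a few lines (the target filename is hypothetical):

```apache
# Sketch: when the UA claims to be the googlebot, serve it a robots.txt
# that names only the googlebot; all other UAs get the ordinary file
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^robots\.txt$ /robots-googlebot.txt [L]
```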


 1:22 pm on May 13, 2013 (gmt 0)

Good info - thank you. I'm definitely in the "ordinary mortals" category.

My thinking was that a solidified robots.txt that says "only these bots can do these specific things" justifies a super-fast banning policy. I can look at my morning reports, find anyone scraping, and poof - you're out. I like having the "backup" that says, "I see you crawling, I've said you can't crawl, so now you are banned." And I say that to myself, haha - I know "they" aren't listening. It just seemed to make more sense and make things easier for me.


 4:04 pm on May 13, 2013 (gmt 0)

The trick to robots.txt is that you have to make it easy and tempting for robots to ignore it, or you may never notice which ones are bad. For example-- by the usual yawn-provoking coincidence I've only just posted about this--

In robots.txt:
User-Agent: *
Disallow: /honey

On your main page:
<p class="honeypot"><a href="/honey/">Go here! Good stuff! Mm-mm yum!</a></p>

In the CSS:
.honeypot {display: none;}

It has to be on the main page or most robots will never even get there.
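Then it's just a matter of seeing who took the bait. A quick sketch against an Apache combined-format access log (the log file name, IPs, and bot name are all made up for illustration):

```shell
# Sketch: spot the bots that fetched the disallowed honeypot path.
# Assumes Apache "combined" log format; sample data stands in for a real log.
cat > access.log <<'EOF'
1.2.3.4 - - [13/May/2013:16:04:00 +0000] "GET /honey/ HTTP/1.1" 200 512 "-" "BadBot/1.0"
5.6.7.8 - - [13/May/2013:16:05:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
EOF
# The user-agent is the 6th quote-delimited field of a combined-format line
grep ' /honey/ ' access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn
```

Anything that shows up in that list asked for a URL it was explicitly told not to fetch, which makes the banning decision easy.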

© Webmaster World 1996-2014 all rights reserved