
Forum Moderators: goodroi


Robots.txt strategy: allow only good, or disallow individual bad?

Should my robots.txt say "only Google, Yahoo, and Bing are allowed" or "all are OK except these"?



7:06 pm on May 10, 2013 (gmt 0)

10+ Year Member

There seem to be three different approaches by the bigger websites out there when it comes to writing robots.txt:
  • Allow all (Google, nbcnews)
  • Allow all except for certain "known bad bots" (Wikipedia)
  • Allow only the "best of the best" search engines and disallow any other bot (Facebook, LinkedIn, Nike)

Over the years, I've built up a robots.txt file w/ more than 60 "known bad bots". It's obnoxious to try to maintain the file this way - always adding/removing/modifying bot version numbers, etc. So I'm considering moving to the "allow only the 'best of the best'" model. Anyone else doing this currently? I have my cherry-picked bots that I'd like to add now, but I'm sort of shy of pulling the trigger until the idea is vetted by some other folks.
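[For reference, the "allow only the best of the best" model can be written without the non-standard Allow directive: each named bot gets an empty Disallow (meaning "nothing is off-limits to you"), and the wildcard record shuts out everyone else. The bot tokens below are illustrative, not a vetted list.]

```
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
```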



1:28 am on May 11, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24

Why are you futzing around with robots.txt? Bad bots probably don't read it and certainly don't obey it; the only option is to block 'em at the source. Lotsa ways to do this depending on server and personal preference.
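[One common way to "block 'em at the source" - assuming Apache with mod_rewrite enabled, e.g. in .htaccess - is to refuse requests by user-agent string. The UA names here are placeholders, not a recommended blocklist:]

```
RewriteEngine On
# Return 403 Forbidden to any UA matching these patterns (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F]
```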

"allow" unlike "disallow" is not a cast-in-stone part of the robots exclusion protocol, so robots can legitimately ignore it and still retain their halos. (The same applies to "Crawl-Delay".)

Version number should have no effect. I asked about this recently and someone-- phranque, I think-- pointed to a passage in The Rules that says "user-agent" should be interpreted broadly. If you're not sure a rule applies to you, assume it does.

If you have a big, high-traffic site you can start doing fancy things like serving each robot a custom robots.txt that names only itself, so the robot can't sneak off, change clothes and come back disguised as the googlebot to get wider access. Not that this would do the robot any good: It's more likely to net it a swift 403. For ordinary mortals a custom robots.txt isn't worth the trouble.
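[A sketch of the custom-robots.txt trick, again assuming Apache + mod_rewrite; the filename and UA test are illustrative. When the googlebot asks for /robots.txt, it silently gets a file that names only itself:]

```
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^robots\.txt$ /robots/googlebot.txt [L]
```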


1:22 pm on May 13, 2013 (gmt 0)

10+ Year Member

Good info - thank you. I'm definitely in the "ordinary mortals" category.

My thinking was that a solidified robots.txt that says "only these bots can do these specific things" justifies a super-fast banning policy. I can look at my morning reports, find anyone scraping, and poof - you're out. I like having the "backup" that says, "I see you crawling, I've said you can't crawl, so now you are banned." And I say that to myself haha - I know "they" aren't listening. I was just thinking it makes more sense and makes things easier for me.


4:04 pm on May 13, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24

The trick to robots.txt is that you have to make it easy and tempting for robots to ignore it, or you may never notice which ones are bad. For example-- by the usual yawn-provoking coincidence I've only just posted about this--

In robots.txt:
User-Agent: *
Disallow: /honey

On your main page:
<p class="honeypot"><a href="/honey/">Go here! Good stuff! Mm-mm yum!</a></p>

In the CSS:
.honeypot {display: none;}

It has to be on the main page or most robots will never even get there.
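[The follow-up step - spotting who fell into the trap - is just a log scan for hits on the disallowed path. A minimal sketch in Python, assuming Apache "combined" log format; the sample lines and bot name are made up:]

```python
import re

# Illustrative sample of an Apache "combined" format access log.
SAMPLE_LOG = """\
203.0.113.7 - - [13/May/2013:04:01:00 +0000] "GET /honey/ HTTP/1.1" 200 512 "-" "SneakyBot/1.0"
198.51.100.2 - - [13/May/2013:04:02:00 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
"""

def honeypot_hits(log_text, trap_path="/honey/"):
    """Return (ip, user_agent) pairs for requests that touched the trap path."""
    pattern = re.compile(
        r'^(\S+) .*?"[A-Z]+ (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
    )
    hits = []
    for line in log_text.splitlines():
        m = pattern.match(line)
        if m and m.group(2).startswith(trap_path):
            hits.append((m.group(1), m.group(3)))
    return hits

print(honeypot_hits(SAMPLE_LOG))  # → [('203.0.113.7', 'SneakyBot/1.0')]
```

Anything this turns up is, by definition, a robot that read (or ignored) robots.txt and crawled the forbidden path anyway - a candidate for the server-level ban.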
