Forum Moderators: goodroi


disallow ALL bots and then allow the good ones?


tokey666

7:49 pm on Feb 25, 2008 (gmt 0)

10+ Year Member



I just heard from an SEO wiz that you should disallow ALL robots:

User-agent: *
Disallow: /

Then after that, you can allow the good ones:
User-agent: Googlebot
Allow: /

User-agent: Slurp
Allow: /

etc etc.

This keeps ALL the bad bots out (well the ones who use the robots.txt) but allows access to the good ones.

Thoughts on this?

Lord Majestic

7:57 pm on Feb 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A better directive to allow a given bot to crawl everything is an empty Disallow:

Disallow:

ie:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

---

Named bots can be listed in any order, though it may be better to mention them before the User-agent: * record.

---

More important for you is to understand the consequences of your actions: by allowing only a select few bots you help maintain their monopoly and prevent startups with good bots - which generally obey robots.txt - from having a level playing field. Bad bots won't care about robots.txt at all, and people who want to scrape your site for republishing won't care about it either. So really, you are not protecting yourself in any meaningful way, yet you reinforce the existing status quo of a handful of search engines driving traffic to your site.

My advice - allow all good robots.txt obeying bots to crawl your site.

tokey666

8:04 pm on Feb 25, 2008 (gmt 0)

10+ Year Member



Sound advice. I appreciate it.

I think I will go your route. What about filtering out bad robots that DO use the robots.txt? Is there a list of known bad bots out there so I can disallow them?

Lord Majestic

8:36 pm on Feb 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Good! :)

There are indeed some bots that obey robots.txt but are known to be bad in terms of making too many requests or the like - if you check the robots.txt on a site like Wikipedia you will see a fair few of those. That could be a quick approach, but really, if you don't notice a bot then just let it go about its business - the worst offenders (site scrapers) won't care about robots.txt anyway.

physics

8:40 pm on Feb 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



tokey666, make sure to test whatever robots.txt you want to try with the Google robots.txt validator (or whatever they call that).
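For a quick offline sanity check, Python's standard urllib.robotparser module applies the same matching rules as most compliant crawlers. A minimal sketch, using the example rules from earlier in this thread:

```python
from urllib import robotparser

# The rules from the example above: block everyone, then an empty
# Disallow for Googlebot, which means "allow everything" for that bot.
rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot matches its own record with the empty Disallow
print(rp.can_fetch("Googlebot", "/some/page.html"))   # True
# An unlisted bot falls under the wildcard block
print(rp.can_fetch("SomeOtherBot", "/some/page.html")) # False
```

This won't catch quirks specific to Google's own parser, so it complements rather than replaces their validator.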
To block the bad bots you'll need to do that with .htaccess (if you're using Apache) as they can just ignore robots.txt.
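A minimal .htaccess sketch of that idea, assuming mod_rewrite is enabled - the bot names here are hypothetical placeholders, not a vetted list, so substitute the user-agents you actually see misbehaving in your logs:

```apache
# Return 403 Forbidden for requests whose User-Agent header matches
# any of these patterns. "BadBot" and "EvilScraper" are made-up examples.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F,L]
```

Keep in mind that determined scrapers can fake their User-Agent string, so treat this as a first filter, not a guarantee.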