I'm not a fan of robots.txt files. I am timid and afraid of them. I've earned that phobia from a real-world education.
There are problems with the ambiguous robots exclusion standard [info.webcrawler.com]. There are robots that ignore the standard, and there are even search engines that take liberties with it. Combined, these problems can cause long-term damage to a website.
When there is a syntax error in a robots.txt, all a robot can do is choose to ignore the file or ignore the site. Most search engines choose the latter. I've run into people in deep angst wondering why their site can't get indexed while they have a bogus robots.txt online. I spent three weeks in '97 trying to figure out why a site had been dropped from all the search engines - yep, a bad robots.txt.
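For reference, a standard-conforming robots.txt is nothing more than User-agent records followed by Disallow lines (the paths below are illustrative, not from any real site). The classic fatal errors are things like a Disallow line with no preceding User-agent record, or cramming several paths onto one Disallow line:

```
# A minimal, valid robots.txt: one record, applying to all robots.
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
```

A file that instead opened with a bare `Disallow: /cgi-bin/` and no User-agent line would violate the standard, and a robot is within its rights to give up on the whole file.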
Bad, Bad Search Engine Spider
In the past there have been search engines that incorrectly read robots.txt. In late '96 and early '97, if Infoseek found a robots.txt, it simply turned around and never indexed the site at all. Early robots often worked this way because they didn't even contain the logic to read a robots.txt.
A Growing Cancer: The Rogue Spider
To this day, I don't care for robots.txt because of those bad first experiences with the standard. However, after running this site for a year, where 50%-75% of the hits are from spiders, it became clear that something had to be done. Imagine how much faster this site would be if spiders weren't connecting day in and day out.
SE's Taking Liberties
Another strike against robots.txt is the search engines themselves. I am a cloaker. After seeing some of the bigger engines wandering around with stock agent names, it became clear that protecting sites and content via robots.txt was not going to do the job.
Banning user agents, IPs, and problem users can only go so far. Last week I looked at the logs here and found nearly half a million hits in a two-day period from a rogue spider. Strangely enough, that spider actually requested robots.txt. It was an epiphany - I broke down and surrendered to putting a robots.txt online sometime in the near future.
Therapy for Robo'phobia
I wouldn't have done it without some prep work to ease my phobia. We all know you can't throw a claustrophobic into a closet and expect them to be cured, so some therapy was in order. I was able to put the phobia to rest this spring when I did a long analysis of 2.1 million sites and the robots.txt files they contained.
The ODP site robots.txt crawl was a fascinating exercise, to say the least. I found that roughly 10% of the robots.txt files on the net violated the standard in some way. Testing many of those sites showed that search engines would read what they could of the robots.txt and ignore the errors. Most of the sites with bad robots.txt files were still found in the SE's. That went a long way toward easing the fears.
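That forgiving read-what-you-can behavior is easy to demonstrate today with Python's standard-library parser. This is just a sketch - it is not the code any search engine actually runs - but it skips malformed lines the same way those engines did:

```python
# A sketch, not any engine's actual crawler code: Python's standard-library
# robots.txt parser skips a line it cannot parse instead of treating the
# whole file as fatal.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Dissallow: /typo/",  # misspelled directive: silently ignored
])

# The well-formed rule still applies...
print(parser.can_fetch("*", "http://example.com/private/page"))  # False
# ...while the misspelled rule is dropped, so /typo/ stays crawlable.
print(parser.can_fetch("*", "http://example.com/typo/page"))     # True
```

In other words, one bad line costs you that one rule, not the whole file - which matches what the crawl data showed.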
After seeing so many sites with bad robots.txt files still being indexed by the search engines, I still wasn't convinced robots.txt was safe to use. The next step was to reanalyze the robots exclusion protocol itself. After doing so, I created the robots.txt validator [searchengineworld.com] at SEW just to ease the final trepidations.
Let it all out
I feel like I should issue a press release because I've just created the most comprehensive robots.txt I've ever put up.
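The general shape of a "comprehensive" file like that (the bot names here are placeholders, not my actual file) is one record per problem spider, banning it outright, plus a catch-all record for everyone else:

```
# Placeholder bot names - one full-ban record per rogue spider.
User-agent: BadBot
Disallow: /

User-agent: AnotherBot
Disallow: /

# Default record for all other robots.
User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
```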
I know some will think I'm kidding. When you've had sites and client sites wiped off the SE's in one fell swoop because of a robots.txt error, you'd appreciate my hard-earned fear. A robots.txt error isn't just a single-engine disaster, it's an ALL-engine disaster. One error can rip your site from every SE on the net in short order.
*sigh* I feel better now.