Fear and Loathing of Robots.txt

Coming Clean - Confessions of a robo'phobic



8:46 am on Jun 21, 2001 (gmt 0)

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

I'm not a fan of robots.txt files. I am timid and afraid of them. I've earned that phobia from a real-world education.

There are problems with the ambiguous robots exclusion standard [info.webcrawler.com]. There are robots that ignore the standard. There are even search engines that take liberties with the standard. All of these problems combined can cause long-term damage to a website.

The Horror:
When there is a syntax error in a robots.txt file, all a robot can do is choose to ignore the file or ignore the site. Most search engines choose to do the latter. I've run into people in deep angst wondering why their site can't get indexed while having a bogus robots.txt online. I spent 3 weeks in '97 trying to figure out why a site was dropped from all search engines - yep, a bad robots.txt.

Bad Bad Search Engine Spider:
In the past there have been search engines that incorrectly read robots.txt. From late '96 to early '97, if Infoseek found a robots.txt, it just turned around and never indexed the site at all. Early robots would often work this way because they did not contain the logic to even read the robots.txt.

A growing Cancer, The Rogue Spider
To this day, I don't care for a robots.txt based on those bad first experiences with the standard. However, having run this site for a year, where 50%-75% of the hits are from spiders, it has become clear that something had to be done. Imagine how much faster this site would be if spiders weren't connecting day-in-day-out.

SE's taking Liberties
Another strike against robots.txt is the search engines themselves. I am a cloaker. After seeing some of the bigger engines wandering around with stock agent names, it became clear that protecting sites and content via robots.txt was not going to do the job.

Banning user agents, IPs, and problem users can only go so far. Last week I looked at the logs here to find nearly a half million hits in a two-day period from a rogue spider. Strangely enough, that spider actually requested robots.txt. It was an epiphany - I broke down and surrendered to putting a robots.txt online sometime in the near future.

Therapy for Robo'phobia
I wouldn't have done it without some prep work to ease my phobia. We all know you can't throw a claustrophobic in a closet and expect them to be cured. So some therapy was in order. I was able to put to rest the phobia this spring when I did a long analysis of 2.1 million sites and those that contained a robots.txt.

The ODP site robots.txt Crawl was a fascinating exercise to say the least. I found that an average of 10% of the robots.txt files on the net violated the standard in some way. A test of many of those sites found that search engines would read what they could of the robots.txt, and ignore the errors. Most of the sites with bad robots.txt's were still found in SE's. This went a long way to easing the fears.

The Validator
After seeing so many sites with bad robots.txt and the search engines still indexing them, I still wasn't convinced robots.txt was now safe to use. The next step was to reanalyze the robots.txt exclusion protocol itself. After doing so, I created the robots.txt validator [searchengineworld.com] at SEW just to ease the final trepidations.
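
To give a feel for what a syntax check involves, here is a minimal sketch in Python. It is not the SEW validator and makes no claim about how that tool works. It assumes that only the User-agent and Disallow fields of the original standard are valid and that a blank line ends a record; the name lint_robots is made up for this example.

```python
import re

# Assumption: only the two fields of the original exclusion standard count as valid.
KNOWN_FIELDS = {"user-agent", "disallow"}

def lint_robots(text):
    """Return a list of (line_number, message) problems found in robots.txt text."""
    problems = []
    seen_agent = False  # has the current record named a User-agent yet?
    for number, raw in enumerate(text.split("\n"), start=1):
        if not raw.strip():
            seen_agent = False          # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue                    # comment-only line
        match = re.match(r"^([A-Za-z-]+)\s*:\s*(.*)$", line)
        if not match:
            problems.append((number, "not in 'field: value' form"))
            continue
        field = match.group(1).lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, "unknown field"))
        elif field == "user-agent":
            seen_agent = True
        elif not seen_agent:
            problems.append((number, "Disallow before any User-agent"))
    return problems

print(lint_robots("Disallow: /tmp/\nUser agent *\n"))
```

A clean file returns an empty list; the sample above reports one problem on each of its two lines.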

Let it all out
I feel like I should issue a press release because I've just created the most comprehensive robots.txt I've ever put up.

I know some will think I'm kidding. When you've had sites and client sites wiped off the SE's in one fell swoop because of a robots.txt error, you'd appreciate my hard-earned fear. When there is a robots.txt error, it isn't just a single engine disaster, it's an ALL engine disaster. One error can rip your site from every SE on the net in short order.

*sigh* I feel better now.


10:22 am on Jun 21, 2001 (gmt 0)

10+ Year Member

just a small contribution - a couple of months ago the FAST spider was totally ignoring my robots.txt on a number of sites and the robots.txt was NOT in error. It seems to have corrected this more recently...


12:57 pm on Jun 21, 2001 (gmt 0)

WebmasterWorld Administrator rogerd is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Great post, Brett. I had a site banned by AV after Scooter went nuts and scoured multiple directories he was supposedly excluded from. It took more than a year for the aftereffects of this to be completely reversed.

I hope you report on how well your extensive robots.txt works. I have never put things like e-mail harvesters in my robots.txt files because I assumed they would ignore them anyway. These guys aren't at the high end of the ethics spectrum, and there aren't any robot police. I suppose one could complain to the robot operator's ISP, but this is kind of an obscure technical issue compared to someone sending porn spam. If a spider is requesting pages that are public, I doubt if there is any legal support for them being obligated to honor the robots.txt file. BTW, all newer versions of Zeus WILL obey it, according to the author. Let us know how it works...


1:56 pm on Jun 21, 2001 (gmt 0)

10+ Year Member

Great post. It'd be very interesting to get views on how successful the e-mail harvester exclusions are.


1:16 am on Jun 22, 2001 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Brett, I too have been doing a lot of research into the robots.txt file and have a question for you. At the end of your file you have a section that uses the "wildcard" character and then disallows a list of directories.

You have a trailing slash after the directory names. After hours upon hours of research, my understanding is this...

When you add a trailing forward slash (/) to the end of the directory name like this:

Disallow: /private/

You are telling the spider that it cannot index the default.htm or index.htm for that directory. The robot will index everything else within that /private directory.

Is this true?

P.S. How do I get rid of this warning?

Robots.txt contains DOS or Mac line enders. Although it is so common most spiders will deal correctly with the file, it is not a valid format.

I'm using Notepad to create the robots.txt files. Should I be using something else?


5:23 am on Jun 22, 2001 (gmt 0)

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html

A trailing slash disallows anything from that directory:
Disallow: /
Goes for the whole server.

Disallow: /help/
Does the entire help directory and all files within it.
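
The prefix rule can be checked with Python's standard urllib.robotparser module, which applies the same "starts with" matching. This is only a small sketch; the helper name is_allowed and the two sample rule sets are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(rules, path, agent="*"):
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)

# Hypothetical rule sets: the same directory name without and with a trailing slash.
NO_SLASH = "User-agent: *\nDisallow: /help\n"
WITH_SLASH = "User-agent: *\nDisallow: /help/\n"

for path in ("/help.html", "/help/index.html"):
    print(path, is_allowed(NO_SLASH, path), is_allowed(WITH_SLASH, path))
```

Without the trailing slash both paths are blocked. With it, /help/index.html is still blocked but /help.html is allowed.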

>Robots.txt contains DOS or Mac line enders.

Yes, it should be in Unix file format with Unix line enders. A good editor with a Unix mode is required (NoteTab, Emacs, EditPlus, or a good DOS editor with a Unix mode). Notepad only does DOS/Windows line enders.

It is so common that search engines just have to deal with it. I found at least 20% of the robots.txt's have DOS line enders. Clearly, the search engines can handle it. However, since it is a validator, I have to hit the standard 100% on target with no exceptions - thus, the warning has to stay.
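
If no editor with a Unix mode is at hand, the line enders can also be checked and converted with a few lines of Python. This is a rough sketch, not part of the validator; the function names are made up, and the file should be opened in binary mode ("rb" and "wb") so Python does not translate the line enders itself.

```python
def line_ending_styles(data: bytes) -> set:
    """Report which line-ender styles occur in the raw bytes of a file."""
    styles = set()
    crlf = data.count(b"\r\n")
    if crlf:
        styles.add("dos")
    if data.count(b"\r") > crlf:   # at least one CR that is not part of a CRLF pair
        styles.add("mac")
    if data.count(b"\n") > crlf:   # at least one LF that is not part of a CRLF pair
        styles.add("unix")
    return styles

def to_unix(data: bytes) -> bytes:
    """Convert DOS (CRLF) and old Mac (CR) line enders to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

sample = b"User-agent: *\r\nDisallow: /private/\r\n"
print(line_ending_styles(sample), line_ending_styles(to_unix(sample)))
```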


3:54 pm on Jun 22, 2001 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Thanks for clearing that up Brett. I downloaded NoteTab Pro and it took me a few tries but I finally got the robots.txt file to validate with no warnings! If using NoteTab, you choose the export feature from the file menu and export in UNIX format.


5:56 pm on Jun 22, 2001 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Brett, I would recommend putting some sort of statement on the validator page that alerts users to the DOS and Mac line enders issue. I'm sure there are a lot of people out there wondering why they are seeing that warning and how to eliminate it. As you stated, more than 20% of the ones you checked showed that warning which tells me that a lot of people are using Notepad which does not support the UNIX format.


6:06 pm on Jun 22, 2001 (gmt 0)

WebmasterWorld Senior Member mivox is a WebmasterWorld Top Contributor of All Time 10+ Year Member

For Mac users, BBEdit and BBEdit Lite can be set to save everything in Unix format. My BBEdit-created robots.txt just validated with flying colors.


11:04 am on Jun 23, 2001 (gmt 0)

10+ Year Member

Great thread Brett

I noticed the part about using Notepad and it not being Unix compatible. There is another program I have started using instead of Notepad. It is called TextPad, and you can get it at that name .com. When you save a file, choose Unix instead of PC in the file format option.

I have not used notepad in almost two years now, and never had a problem ....


2:43 pm on Jun 23, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Just my .02 here.

I've been using Notepad and WS_FTP. When WS_FTP is set to ASCII mode when I send my robots.txt or any HTML file, evidently WS_FTP strips out the \r characters (carriage returns) and only sends the \n newline characters (or vice versa).

You can see this if you're using WS_FTP because the file size on my Win box is x bytes more than the file size reported by the *NIX server, where x is the number of lines in the file.
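
The byte arithmetic is easy to confirm: dropping the carriage return from each CRLF pair removes exactly one byte per line. A minimal Python sketch, assuming the local file uses DOS line enders on every line (the sample content and the function name are made up):

```python
def ascii_mode_size_difference(dos_bytes: bytes) -> int:
    """Bytes saved when every CRLF pair is reduced to a single LF."""
    unix_bytes = dos_bytes.replace(b"\r\n", b"\n")
    return len(dos_bytes) - len(unix_bytes)

# Hypothetical three-line robots.txt as Notepad would save it.
local = b"User-agent: *\r\nDisallow: /images/\r\nDisallow: /cgi-bin/\r\n"
print(ascii_mode_size_difference(local), local.count(b"\r\n"))
```

Both printed numbers are the line count, which matches the one-byte-per-line difference between the two copies.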


1:23 am on Jun 25, 2001 (gmt 0)

WebmasterWorld Senior Member macguru is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Is Google a little slow to dig the file?

- - [22/Jun/2001:18:15:41 -0400] "GET /robots.txt HTTP/1.0" 200 92
- - [22/Jun/2001:18:15:41 -0400] "GET /images/Logo.gif HTTP/1.0" 200 1017
==
- - [22/Jun/2001:19:13:39 -0400] "GET /robots.txt HTTP/1.0" 200 92
- - [22/Jun/2001:19:13:40 -0400] "GET /images/titre_colette_matte.jpg HTTP/1.0" 200 4209
==
- - [22/Jun/2001:19:30:47 -0400] "GET /robots.txt HTTP/1.0" 200 92
- - [22/Jun/2001:19:30:48 -0400] "GET /images/raquette_des_vents_petit.jpg HTTP/1.0" 200 1552

And so on...

Extract from my robots.txt file :

User-agent: *
Disallow: /images/
Disallow: /cgi-bin/
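
A quick way to spot requests like these in a log is to pull the request path out of each line and compare it with the Disallow prefixes. This is a rough Python sketch; the regular expression assumes the common log format shown above, and the function name is made up for the example.

```python
import re

# Matches the quoted request, e.g. "GET /images/Logo.gif HTTP/1.0"
REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

def disallowed_hits(log_lines, disallow_prefixes):
    """Yield request paths from the log that start with a Disallow prefix."""
    prefixes = tuple(disallow_prefixes)
    for line in log_lines:
        match = REQUEST.search(line)
        if match and match.group(1).startswith(prefixes):
            yield match.group(1)

log = [
    '- - [22/Jun/2001:18:15:41 -0400] "GET /robots.txt HTTP/1.0" 200 92',
    '- - [22/Jun/2001:18:15:41 -0400] "GET /images/Logo.gif HTTP/1.0" 200 1017',
]
print(list(disallowed_hits(log, ["/images/", "/cgi-bin/"])))
```

For the two sample lines, only the /images/Logo.gif request is flagged.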