There are problems with the ambiguous robots exclusion standard [info.webcrawler.com]. There are robots that ignore the standard. There are even search engines that take liberties with the standard. All of these problems combined can cause long term damage to a website.
The Horror:
When there is a robots.txt format syntax error, all a robot can do is choose to ignore the file or ignore the site. Most search engines choose to do the latter. I've run into people in deep angst wondering why their site can't get indexed while having a bogus robots.txt online. I spent 3 weeks in 97 trying to figure out why a site was dropped from all search engines - yep, a bad robots.txt.
Bad Bad Search Engine Spider:
In the past there have been search engines that incorrectly read robots.txt. In late 96-early 97, if Infoseek found a Robots.txt, it just turned around and never indexed the site at all. Early robots would often work this way because they did not contain the logic to even read the robots.txt.
A growing Cancer, The Rogue Spider
To this day, I don't care for a robots.txt based on those bad first experiences with the standard. However, having run this site for a year, where 50%-75% of the hits are from spiders, it has become clear that something had to be done. Imagine how much faster this site would be if spiders weren't connecting day-in-day-out.
SE's taking Liberties
Another strike against robots.txt is the search engines themselves. I am a cloaker. After seeing some of the bigger engines wandering around with stock agent names, it became clear that protecting sites and content via robots.txt was not going to do the job.
Banning user agents, IPs, and the problem users can only go so far. Last week I looked at the logs here to find nearly a half million hits in a two day period from a rogue spider. Strangely enough, that spider actually requested robots.txt. It was an epiphany - I broke down and surrendered to putting a robots.txt online sometime in the near future.
Therapy for Robo'phobia
I wouldn't have done it without some prep work to ease my phobia. We all know you can't throw a claustrophobic in a closet and expect them to be cured. So some therapy was in order. I was able to put to rest the phobia this spring when I did a long analysis of 2.1 million sites and those that contained a robots.txt.
The ODP site robots.txt Crawl was a fascinating exercise to say the least. I found that an average of 10% of the robots.txt files on the net violated the standard in some way. A test of many of those sites found that search engines would read what they could of the robots.txt and ignore the errors. Most of the sites with bad robots.txt files were still found in SE's. This went a long way to easing the fears.
The Validator
After seeing so many sites with bad robots.txt and the search engines still indexing them, I still wasn't convinced robots.txt was now safe to use. The next step was to reanalyze the robots.txt exclusion protocol itself. After doing so, I created the robots.txt validator [searchengineworld.com] at SEW just to ease the final trepidations.
Let it all out
I feel like I should issue a press release because I've just created the most comprehensive robots.txt I've ever put up.
[searchengineworld.com...]
[webmasterworld.com...]
I know some will think I'm kidding. When you've had sites and client sites wiped off the se's in one fell swoop because of a robots.txt error, you'd appreciate my hard earned fear. When there is a robots.txt error, it isn't just a single engine disaster, it's an ALL engine disaster. One error can rip your site from every SE on the net in short order.
*sigh* I feel better now.
I hope you report on how well your extensive robots.txt works. I have never put things like e-mail harvesters in my robots.txt files because I assumed they would ignore them anyway. These guys aren't at the high end of the ethics spectrum, and there aren't any robot police. I suppose one could complain to the robot operator's ISP, but this is kind of an obscure technical issue compared to someone sending porn spam. If a spider is requesting pages that are public, I doubt if there is any legal support for them being obligated to honor the robots.txt file. BTW, all newer versions of Zeus WILL obey it, according to the author. Let us know how it works...
You have a trailing slash after the directory names. After hours upon hours of research, my understanding is this...
When you add a trailing forward slash (/) to the end of the directory name like this: Disallow: /private/
You are telling the spider that it cannot index the default.htm or index.htm for that directory. The robot will index everything else within that /private directory.
Is this true?
P.S. How do I get rid of this warning?
Robots.txt contains DOS or Mac line enders. Although it is so common most spiders will deal correctly with the file, it is not a valid format.
I'm using Notepad to create the robots.txt files. Should I be using something else?
Disallow:
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html
A trailing slash disallows anything from that directory:
Disallow: /
Goes for the whole server.
Disallow: /help/
Does the entire help directory and all files within it.
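The prefix rule quoted above can be sketched in a few lines of Python. This is only an illustration of the "any URL that starts with this value" wording, not how any particular spider implements it; the function name is_disallowed is made up for the example.

```python
def is_disallowed(path, disallow_values):
    """Return True if path starts with any non-empty Disallow value.

    Hypothetical helper: plain prefix matching as described in the quoted
    text. An empty Disallow value blocks nothing, so it is skipped.
    """
    return any(bool(value) and path.startswith(value) for value in disallow_values)


# Disallow: /help  blocks both the file and the directory
print(is_disallowed("/help.html", ["/help"]))
print(is_disallowed("/help/index.html", ["/help"]))

# Disallow: /help/  blocks only what lives inside the directory
print(is_disallowed("/help.html", ["/help/"]))
print(is_disallowed("/help/index.html", ["/help/"]))

# Disallow: /  blocks the whole server
print(is_disallowed("/anything/at/all.html", ["/"]))
```

So a trailing slash narrows the rule to the directory's contents; it does not limit it to the default page.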
>Robots.txt contains DOS or Mac line enders.
Yes, it should be in Unix file format with Unix line enders. A good editor with a Unix mode is required. (NoteTab, Emacs, EditPlus, or a good Dos editor with a Unix mode). Notepad only does DOS/Windows line enders.
It is so common that search engines just have to deal with it. I found at least 20% of the robots.txt files have DOS line enders. Clearly, the search engines can handle it. However, since it is a validator, I have to hit the standard 100% on target with no exceptions - thus, the warning has to stay.
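If you don't have an editor with a Unix mode handy, a short script can tell you which line enders a file uses and rewrite them. This is a rough sketch, not the validator's actual check; the function names classify_line_endings and to_unix are invented for the example.

```python
def classify_line_endings(data: bytes) -> str:
    """Report which line-ender style a file's raw bytes use."""
    crlf = data.count(b"\r\n")          # DOS/Windows: CR followed by LF
    cr = data.count(b"\r") - crlf       # old Mac: bare CR
    lf = data.count(b"\n") - crlf       # Unix: bare LF
    kinds = [name for name, n in (("dos", crlf), ("mac", cr), ("unix", lf)) if n]
    if not kinds:
        return "none"
    return kinds[0] if len(kinds) == 1 else "mixed"


def to_unix(data: bytes) -> bytes:
    """Convert DOS and old-Mac line enders to Unix ones."""
    # Replace CRLF first so a CRLF pair does not turn into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


sample = b"User-agent: *\r\nDisallow: /images/\r\n"
print(classify_line_endings(sample))
print(classify_line_endings(to_unix(sample)))
```

To use it on a real file, read and write in binary mode (open(name, "rb") and open(name, "wb")) so Python does not translate the line enders for you.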
I noticed the part about Notepad not being Unix compatible. There is another program I have started using instead of Notepad, called TextPad; you can get it at that name .com. When you save a file, choose Unix instead of PC in the file format option.
I have not used Notepad in almost two years now, and never had a problem.
I've been using Notepad and WS_FTP. When WS_FTP is set to ASCII mode when I send my robots.txt or any HTML file, evidently WS_FTP strips out the \r characters (carriage returns) and only sends the \n newline characters (or vice versa)
You can see this if you're using WS_FTP because the file size on my Win Box is x bytes more than the file size reported by the *NIX server, where x is the number of lines in the file.
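The one-byte-per-line difference is easy to reproduce. The sketch below uses a made-up three-line robots.txt and simply drops the carriage returns, which is what an ASCII-mode transfer to a Unix server is assumed to do here.

```python
# Sample file as Notepad would save it: every line ends in \r\n.
dos_copy = b"User-agent: *\r\nDisallow: /images/\r\nDisallow: /cgi-bin/\r\n"

# Same file with the carriage returns removed, as it would sit on the server.
unix_copy = dos_copy.replace(b"\r\n", b"\n")

line_count = dos_copy.count(b"\r\n")
size_difference = len(dos_copy) - len(unix_copy)

# The two numbers match: one \r is dropped for every line.
print(line_count, size_difference)
```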
216.239.46.14 - - [22/Jun/2001:18:15:41 -0400] "GET /robots.txt HTTP/1.0" 200 92
216.239.46.14 - - [22/Jun/2001:18:15:41 -0400] "GET /images/Logo.gif HTTP/1.0" 200 1017
==
216.239.46.91 - - [22/Jun/2001:19:13:39 -0400] "GET /robots.txt HTTP/1.0" 200 92
216.239.46.91 - - [22/Jun/2001:19:13:40 -0400] "GET /images/titre_colette_matte.jpg HTTP/1.0" 200 4209
==
216.239.46.31 - - [22/Jun/2001:19:30:47 -0400] "GET /robots.txt HTTP/1.0" 200 92
216.239.46.31 - - [22/Jun/2001:19:30:48 -0400] "GET /images/raquette_des_vents_petit.jpg HTTP/1.0" 200 1552
And so on...
Extract from my robots.txt file :
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/
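One way to double-check what these rules mean is to run them through the robots.txt parser in Python's standard library (urllib.robotparser). This only shows how that one parser reads the extract; it says nothing about how any search engine's spider reads it. The agent name ExampleBot and the path /index.html are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The extract from the robots.txt above.
rules = """User-agent: *
Disallow: /images/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A path from the log above, under the disallowed /images/ directory.
print(parser.can_fetch("ExampleBot", "/images/Logo.gif"))

# A path outside both disallowed directories (placeholder).
print(parser.can_fetch("ExampleBot", "/index.html"))
```

The first check comes back False and the second True, so under a plain reading of the standard the image requests in the log fall under Disallow: /images/.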