Adjusted the URL scheme so that the file name "robots.txt" is _not_ forced. This allows people to check any robots.txt-formatted file. Thus, you can check development copies of robots.txt files without the chance of a robot running into the real one while it is in an invalid state.
Duplicate Agent Fields
Added several checks and warnings for duplicate agent fields, including wildcard parsing. Using duplicate wildcard agent names for multiple disallows is very common. However, I have been informed by one search engine that they may have a problem with duplicate agent wildcards.
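For anyone curious what a duplicate-agent check might look like, here is a minimal sketch in Python. It is not the validator's actual code. The function name `duplicate_agents` and the sample file are made up for illustration, and it assumes agent names are compared case-insensitively.

```python
from collections import Counter


def duplicate_agents(robots_text):
    """Return the User-agent values that appear more than once (hypothetical helper)."""
    names = []
    for line in robots_text.splitlines():
        # Drop comments and surrounding white space.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        # Assumption: field names and agent names are compared case-insensitively.
        if sep and field.strip().lower() == "user-agent":
            names.append(value.strip().lower())
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical sample file with the wildcard agent used twice.
sample = """User-agent: *
Disallow: /cgi-bin/

User-agent: *
Disallow: /tmp/
"""

print(duplicate_agents(sample))
```

A checker along these lines would only warn, since the duplicate form is common and many spiders may well accept it.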
Although not specifically addressed by the robots.txt standard, formats such as the following may be a problem with some spiders:
Several more case check errors have been added for agent and disallow field names. There is some controversy about how the standard should be interpreted, so I felt the stricter interpretation should be used.
You have two terms. One is shorter than the other. The short term is the one you have in the robots.txt file. The long one is the one that has the URL that the spider is thinking about. Actually it's a non-issue which is which, because you are considering the situation only to the depth of the shortest term.
The URL that the spider is thinking about always starts with a slash, because the domain is stripped off by the spider's httpd daemon, per standard practice. A single slash represents your website's root directory. That's why a single slash represents a total Disallow.
You want a "leading letters" match to the depth of the shorter term. It's like a strncmp(A,B,X) where X is the length of the shorter term, and A and B are the two terms.
If you use /help/ then the match must be perfect 6 characters into the URL. If you use /help then the match must be perfect only 5 characters into the URL.
So the 6-character example is necessarily a directory disallow. But the 5-character match could be either a directory or something else.
A /help* is a no-no because you will never get a leading letters match on a URL this way. No URL has an asterisk in it.
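The leading-letters match described above can be sketched in a few lines of Python. This is only an illustration under the simple prefix reading of the standard. The function name `is_disallowed` and the example paths are made up. Note that the sketch compares to the length of the Disallow value, so a URL shorter than the rule simply cannot match.

```python
def is_disallowed(url_path, disallow_value):
    """Leading-letters match: the URL path must begin with the Disallow value."""
    # Roughly strncmp(url_path, disallow_value, strlen(disallow_value)) == 0 in C.
    return url_path.startswith(disallow_value)


# Hypothetical paths, for illustration only.
print(is_disallowed("/help/index.html", "/help/"))  # 6-character rule: directory only
print(is_disallowed("/help.html", "/help/"))        # not under the directory
print(is_disallowed("/help.html", "/help"))         # 5-character rule also catches files
print(is_disallowed("/help/index.html", "/help*"))  # literal asterisk never matches here
```

Under this reading the first and third calls report a match and the other two do not.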
Now this is why I don't like the apparently acceptable format of telling a spider that it's okay to spider everything by using this:
Disallow: [ nothing after the Disallow ]
It worries me for this reason:
Assuming that the spider properly disregards all white space after the colon that follows Disallow on that line, and that includes space and/or carriage return and/or line feed, it means that the shorter term now has a length of zero. Guess what? A standard string comparison to depth zero will return a match for any two strings (the null set is a true set) using strncmp(). That means the spider has to be smart enough to add a second test and eliminate the null set from the comparison. This makes me nervous.
Let me state it more succinctly: in most programming languages, the null set is a true set, and comparisons with a null in them will not throw an error, but will return true. In the robots.txt standard, a Disallow: with nothing after it is a null set, but the standard says that this should return false: the spider considers all URLs a mismatch with the null and proceeds to spider the entire site. My confidence that all spiders are testing for the null condition is not very high.
It's much better to think of the robots.txt as ONLY an exclusion standard. I cannot conceive of a case where it's necessary to use nothing after the "Disallow:". Much better to leave it out.
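To make the worry about the empty Disallow concrete, here is a small Python sketch of the two behaviors. The names `naive_match` and `guarded_match` are made up, and neither is taken from any real spider. The first does a bare compare to the depth of the Disallow value. The second adds the extra null test the standard requires.

```python
def naive_match(url_path, disallow_value):
    """Compare only to the depth of the Disallow value, with no null test."""
    return url_path[:len(disallow_value)] == disallow_value


def guarded_match(url_path, disallow_value):
    """Same compare, but an empty Disallow value never matches anything."""
    value = disallow_value.strip()  # disregard spaces, CR and LF after the colon
    if not value:
        return False  # empty Disallow: nothing is excluded
    return url_path.startswith(value)


# Value taken from a hypothetical line "Disallow: \r\n" after splitting on the colon.
raw_value = "Disallow: \r\n".split(":", 1)[1].strip()

print(naive_match("/any/page.html", raw_value))    # depth-zero compare matches everything
print(guarded_match("/any/page.html", raw_value))  # null test keeps the site open
```

A spider written like `naive_match` would treat the empty Disallow as a total Disallow, which is the opposite of what the webmaster intended.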
The standard says that the first User-Agent that a particular spider encounters in a robots.txt, that applies to that particular spider via either a direct name or a wildcard, is the User-Agent that should apply to that spider. At that point, the spider has what it needs and should not be consulting the rest of the robots.txt.
That's the second thing I see in robots.txt that is done improperly lots of times. The arrangement of the various sections is important. It makes no sense to have a User-Agent: * on top and then a long list of specific bots below that, with their own Disallows.
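Here is a rough Python sketch of the first-match reading described above, to show why the order of sections would matter under it. The function `select_record`, the record layout, and the bot name "ExampleBot" are all made up for illustration, and actual spiders may resolve this differently.

```python
def select_record(records, spider_name):
    """Return the Disallow list of the first record that applies to spider_name."""
    for agent, disallows in records:
        # Assumption: a record applies if it is the wildcard or its name
        # appears (case-insensitively) in the spider's name.
        if agent == "*" or agent.lower() in spider_name.lower():
            return disallows  # stop here; the rest of the file is not consulted
    return []


# Hypothetical file with the wildcard section on top.
wildcard_first = [("*", ["/cgi-bin/"]), ("ExampleBot", ["/"])]
# Same sections with the specific bot listed first.
specific_first = [("ExampleBot", ["/"]), ("*", ["/cgi-bin/"])]

print(select_record(wildcard_first, "ExampleBot/1.0"))
print(select_record(specific_first, "ExampleBot/1.0"))
```

With the wildcard on top, the hypothetical bot never reaches its own section, so its specific Disallow is ignored under this reading.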
Is my way of understanding this stuff reasonable?
Apparently it fell through the cracks.
A couple of months ago, on a minor freebie site I have, I was trying to get Google to remove everything. The site has a URL that results from the fact that it was free web space from Sprint (now part of Earthlink). It looked like this:
Of course, I don't have access to anything above my directory. When removing stuff, you have to put in a robots.txt before you click on the Google remove. I can't remember how it turned out, but I do remember that I had a lot of trouble getting Google to understand that my robots.txt WAS in my root directory. Google's respondbot kept insisting that it was too far down in the site.