I've never found it confusing; maybe I do too much C programming and need to get out more. But there are two other aspects to the robots.txt that I think need to be mentioned. First, this is how I see the slash thing:
You have two terms. One is shorter than the other. The short term is the one you have in the robots.txt file. The long one is the URL that the spider is thinking about. Actually it's a non-issue which is which, because you are considering the situation only to the depth of the shorter term.
The URL that the spider is thinking about always starts with a slash, because the spider strips off the domain and keeps only the path, per standard practice. A single slash represents your website's root directory. That's why a single slash represents a total Disallow.
You want a "leading letters" match to the depth of the shorter term. It's like a strncmp(A,B,X) where X is the length of the shorter term, and A and B are the two terms.
If you use /help/ then the match must be perfect 6 characters into the URL. If you use /help then the match must be perfect only 5 characters into the URL.
So the 6-character example is necessarily a directory disallow. But the 5-character match could be either a directory or something else.
A /help* is a no-no because you will never get a leading letters match on a URL this way. No URL has an asterisk in it.
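Here is a minimal sketch of that leading-letters comparison in C, assuming the robots.txt term is the shorter one and the spider does a plain strncmp() with no wildcard support. The function name disallow_matches is made up for illustration; it is not taken from any real spider.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `path` is blocked by one Disallow
   value `rule`, comparing leading characters to the length of the rule. */
int disallow_matches(const char *rule, const char *path)
{
    assert(rule != NULL && path != NULL);

    size_t depth = strlen(rule);   /* how many characters must match exactly */

    /* strncmp stops at the end of the shorter string, so a path that is
       shorter than the rule cannot match. */
    return strncmp(path, rule, depth) == 0;
}
```

With this sketch, /help/ blocks /help/index.html but not /help.html, while /help blocks both. A rule of /help* blocks neither, because the sixth character of a real URL is never an asterisk.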
Now this is why I don't like the apparently acceptable format of telling a spider that it's okay to spider everything by using this:
Disallow: [ nothing after the Disallow ]
It worries me for this reason:
Assuming that the spider properly disregards all white space after the colon that follows Disallow on that line, and that includes space and/or carriage return and/or line feed, it means that the shorter term now has a length of zero. Guess what? A standard string comparison to depth zero will return a match for any two strings (the null set is a true set) using strncmp(). That means the spider has to be smart enough to add a second test and eliminate the null set from the comparison. This makes me nervous.
Let me state it more succinctly: in most programming languages, the null set is a true set, and comparisons with a null in them will not throw an error, but will return true. In the robots.txt standard, a Disallow: with nothing after it is a null set, but the standard says that this should return a false by considering all URLs a mismatch with the null, and proceeding to spider the entire site. My confidence that all spiders are testing for the null condition is not very high.
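To make the worry concrete, here is a small C sketch of the two ways a spider might code this. Both function names are hypothetical, and both assume the whitespace after the colon has already been stripped, so an empty Disallow arrives as an empty string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive version: an empty rule gives a depth of zero, and strncmp()
   with a count of zero reports the strings as equal. */
int naive_is_disallowed(const char *rule, const char *path)
{
    assert(rule != NULL && path != NULL);
    return strncmp(path, rule, strlen(rule)) == 0;
}

/* Guarded version: the extra test treats an empty rule as
   "nothing is disallowed", which is what the standard asks for. */
int guarded_is_disallowed(const char *rule, const char *path)
{
    assert(rule != NULL && path != NULL);

    size_t depth = strlen(rule);
    if (depth == 0)
        return 0;               /* empty Disallow: never a match */

    return strncmp(path, rule, depth) == 0;
}
```

The naive version treats every URL on the site as disallowed when the rule is empty. The guarded version spiders everything, but only because somebody remembered to add the second test.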
It's much better to think of the robots.txt as ONLY an exclusion standard. I cannot conceive of a case where it's necessary to use nothing after the "Disallow:". Much better to leave it out.
The standard says that the first User-Agent line in a robots.txt that applies to a particular spider, whether by direct name or by wildcard, is the one that spider should obey. At that point, the spider has what it needs and should not be consulting the rest of the robots.txt.
That's the second thing I see in robots.txt that is done improperly lots of times. The arrangement of the various sections is important. It makes no sense to have a User-Agent: * on top and then a long list of specific bots below that, with their own Disallows.
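Here is a rough C sketch of that first-match reading, just to show why the order matters. It assumes the spider walks the User-Agent lines in file order and stops at the first one that fits; the function name and the bot name ExampleBot are invented for illustration, and the exact, case-sensitive name comparison is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: index of the first User-Agent value that applies
   to `bot`, or -1 if none does. `agents` holds the values in file order. */
int first_matching_record(const char *const agents[], size_t count,
                          const char *bot)
{
    assert(agents != NULL && bot != NULL);

    for (size_t i = 0; i < count; i++) {
        /* A wildcard applies to every spider; otherwise the name must match. */
        if (strcmp(agents[i], "*") == 0 || strcmp(agents[i], bot) == 0)
            return (int)i;
    }
    return -1;
}
```

With the sections ordered as { "*", "ExampleBot" }, a spider coded this way stops at index 0 and never sees its own section. Ordered as { "ExampleBot", "*" }, ExampleBot gets its own rules and every other spider falls through to the wildcard.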
Is my way of understanding this stuff reasonable?