Peerbot or I misunderstood the robot exclusion protocol
| 6:55 am on Oct 6, 2004 (gmt 0)|
I'm sure questions like this have been covered, but I couldn't find an exact answer to a dispute I'm having with peerbot.
Here's the situation:
Let's say I have a file on a domain, and in the root folder of that domain there's a robots.txt with these lines:

User-agent: *
Disallow: /bots/
So, on Sat, 04 Sep 2004 10:36:32 GMT+0100, peerbot visited the file
/bots/file.htm. I sent an email message to peerbot saying that, in my opinion, their bot wasn't supposed to visit that file.
The answer I got was:
|[...]first off, the protocol is working correctly for all of our services, the problem is that you missunderstood the protocol. |
> User-agent: *
> Disallow: /bots/
means that all bots are disallowed to index the directory /bots/ which peerbot does NOT do. To learn more about the robots exclusion protocol check the page www.robotstxt.org.[...]
Everything I know about the robots exclusion protocol is from robotstxt.org. From their documentation and from other posts here, I take it that the bot is not allowed to retrieve any document from the /bots/ directory.
Who is correct?
| 3:55 am on Oct 12, 2004 (gmt 0)|
robots.txt uses prefix-matching: "This can be a full path, or a partial path; any URL that starts with this value will not be retrieved." [emphasis added]
Peerbot's response is incorrect, as you have disallowed any resource whose local URL-path begins with /bots/.
Their alternative view would require you to Disallow each file individually, which is ridiculous.
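The prefix-matching rule can be checked with Python's standard-library urllib.robotparser, which implements the protocol described at robotstxt.org (the domain and the user-agent name below are placeholders, not peerbot's actual identifiers):

```python
from urllib import robotparser

# Parse the same rules as in the original post's robots.txt
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /bots/",
])

# Any URL whose path begins with /bots/ is off-limits to every robot;
# everything else remains fetchable.
print(rp.can_fetch("peerbot", "http://www.example.com/bots/file.htm"))   # False
print(rp.can_fetch("peerbot", "http://www.example.com/other/page.htm"))  # True
```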
As a further example, let's take this:

User-agent: *
Disallow: /
Now, do they claim that this means only that they should not fetch the index file of the domain, and are free to fetch any other resource below it? No; every other robot interprets that as saying, "Do not fetch any resources from this site."
I think you were corresponding with someone who didn't know what they were talking about, or whose job it is to simply deflect any reports of problems with their 'bot. Obviously, they didn't read the Standard they directed you to, or they would have found the quote I cited above.
A 403-Forbidden response to their user-agent coded into your .htaccess file is an alternative approach they won't be able to ignore...
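A minimal sketch of that approach, assuming mod_rewrite is available and that the bot sends "peerbot" somewhere in its User-Agent header (check your access logs for the actual string):

```apache
# .htaccess: return 403 Forbidden to any request whose
# User-Agent contains "peerbot" (case-insensitive match)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} peerbot [NC]
RewriteRule .* - [F]
```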
| 9:11 pm on Nov 5, 2004 (gmt 0)|
Thank you jdMorgan for your response. So I interpreted (and translated) the definitions correctly.
As my domains are on a Win/IIS server, I don't have the power of .htaccess, but I created something similar in ASP (which of course uses more resources) and it does the job just as well for me.
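For other IIS readers, such an ASP workaround presumably looks something like the following (a hypothetical sketch, not the poster's actual code; the "peerbot" match string is an assumption), included at the top of each page:

```asp
<%
' Classic ASP equivalent of a .htaccess user-agent ban: inspect the
' User-Agent header and return 403 Forbidden before rendering the page.
Dim ua
ua = LCase(Request.ServerVariables("HTTP_USER_AGENT"))
If InStr(ua, "peerbot") > 0 Then  ' assumed match string
    Response.Status = "403 Forbidden"
    Response.End
End If
%>
```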