Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Protocol interpretation
Peerbot or I misunderstood the robot exclusion protocol
WebJoe

10+ Year Member



 
Msg#: 459 posted 6:55 am on Oct 6, 2004 (gmt 0)

Hi

I'm sure questions like this have been covered before, but I could not find an exact answer to a dispute I am having with Peerbot.

Here's the situation:

Let's say, I have a file
www.somedomain.tld/bots/file.htm

in the root folder of that domain, there's a robots.txt with these lines:
User-agent: *
Disallow: /bots/

So, on Sat, 04 Sep 2004 10:36:32 GMT+0100, peerbot visited the file /bots/file.htm. I sent an email message to peerbot saying that, in my opinion, their bot wasn't supposed to visit /bots/file.htm.
The answer I got was:
[...]first off, the protocol is working correctly for all of our services, the problem is that you missunderstood the protocol.

> User-agent: *
> Disallow: /bots/

means that all bots are disallowed to index the directory /bots/ which peerbot does NOT do. To learn more about the robots exclusion protocol check the page www.robotstxt.org.[...]

Everything I know about the robots exclusion protocol is from robotstxt.org. From their documentation and from other posts here, I take it that the bot is not allowed to retrieve any document from the /bots/-folder.

Who is correct?

 

jdMorgan

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 459 posted 3:55 am on Oct 12, 2004 (gmt 0)

robots.txt uses prefix-matching: "This can be a full path, or a partial path; any URL that starts with this value will not be retrieved." [emphasis added]

Peerbot's response is incorrect, as you have disallowed any resource whose local URL-path begins with /bots/
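
To make the prefix-matching rule concrete, here is a minimal Python sketch of the check a compliant robot would perform before fetching a URL. The function name is_disallowed is my own, and the sketch only covers the simple case quoted above (plain path prefixes under one User-agent record), not any later extensions.

```python
def is_disallowed(url_path, disallow_values):
    """Return True if url_path starts with any Disallow value.

    An empty Disallow value matches nothing (it means everything
    may be retrieved), so empty strings are skipped.
    """
    return any(
        value != "" and url_path.startswith(value)
        for value in disallow_values
    )


# The record from the original post: Disallow: /bots/
rules = ["/bots/"]

print(is_disallowed("/bots/file.htm", rules))  # True: starts with /bots/
print(is_disallowed("/index.htm", rules))      # False: different prefix
```

Under this reading there is no distinction between "the directory" and "files in the directory": any URL-path that starts with the Disallow value is off limits.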

Their alternative view would require you to Disallow each file individually, which is ridiculous.

As a further example, let's take this:

User-agent: *
Disallow: /

Now, do they claim that this means only that they should not fetch the index file of the domain, and should feel free to fetch any other resources below that? No, any other robot interprets that as saying, "Do not fetch any resources from this site."
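
If you want a second opinion from an independent implementation, Python's standard library ships a robots.txt parser. The sketch below feeds it both records from this thread. The helper name can_fetch and the index.htm URL are just for illustration, and "peerbot" is used as the user-agent token on the assumption that it resembles what the bot actually sends.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_lines, user_agent, url):
    """Parse robots.txt content given as a list of lines and ask
    whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


folder_only = ["User-agent: *", "Disallow: /bots/"]
whole_site = ["User-agent: *", "Disallow: /"]

base = "http://www.somedomain.tld"

# Disallow: /bots/ blocks everything under that prefix, nothing else.
print(can_fetch(folder_only, "peerbot", base + "/bots/file.htm"))  # False
print(can_fetch(folder_only, "peerbot", base + "/index.htm"))      # True

# Disallow: / blocks the whole site, not just the index file.
print(can_fetch(whole_site, "peerbot", base + "/"))                # False
print(can_fetch(whole_site, "peerbot", base + "/bots/file.htm"))   # False
```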

I think you were corresponding with someone who didn't know what they were talking about, or whose job it is to simply deflect any reports of problems with their 'bot. Obviously, they didn't read the Standard they directed you to, or they would have found the quote I cited above.

A 403-Forbidden response to their user-agent coded into your .htaccess file is an alternative approach they won't be able to ignore...
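
For anyone who cannot use .htaccess, the same idea can be applied at the application level. What follows is only a rough Python (WSGI) sketch of it, not an Apache or IIS configuration: the names block_user_agents and BLOCKED_AGENT_SUBSTRINGS are made up for the example, and the "peerbot" substring is an assumption about the bot's User-Agent header that you should check against your own logs.

```python
# Substrings to look for in the User-Agent header (lower-case).
# "peerbot" is an assumption; verify the real string in your logs.
BLOCKED_AGENT_SUBSTRINGS = ("peerbot",)


def block_user_agents(app, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Wrap a WSGI app so that requests from blocked user-agents
    receive 403 Forbidden instead of reaching the app."""

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked):
            body = b"Forbidden"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Everyone else is passed through unchanged.
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Placeholder application standing in for the real site."""
    body = b"Hello"
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]


protected_app = block_user_agents(hello_app)
```

Unlike robots.txt, this does not rely on the robot's cooperation, at the cost of running a check on every request.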

Jim

WebJoe

10+ Year Member



 
Msg#: 459 posted 9:11 pm on Nov 5, 2004 (gmt 0)

Thank you jdMorgan for your response. So I interpreted (and translated) the definitions correctly.

As my domains are on a Win/IIS server, I do not have the power of .htaccess, but I created something similar in ASP (which of course uses more resources) that does the job just as well for me.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved