
Sitemaps, Meta Data, and robots.txt Forum

    
Do proxies and Internet caches respect robots.txt?
Scooter24




msg:1525260
 6:34 pm on Feb 2, 2003 (gmt 0)

Will an Internet cache or a proxy load pages in directories disallowed in robots.txt?

 

bird




msg:1525261
 6:45 pm on Feb 2, 2003 (gmt 0)

A proxy/cache is not a robot, so robots.txt doesn't apply to it. All requests made through those types of systems are supposedly caused by a human clicking on a link somewhere.

Robots are programs that crawl through pages and follow links without (or with very little) human interaction. Robots.txt is one of the tools that keep them from running into loops and other traps that a human would avoid intuitively.

jdMorgan




msg:1525262
 6:55 pm on Feb 2, 2003 (gmt 0)

...And to cover the other half of the question, try this tutorial [mnot.net].

HTH,
Jim

andreasfriedrich




msg:1525263
 7:42 pm on Feb 2, 2003 (gmt 0)

Now compare what's been said here with that [webmasterworld.com] ;).

Andreas

Scooter24




msg:1525264
 8:12 pm on Feb 2, 2003 (gmt 0)

Thanks for all the replies. The reason I asked is that I have a robot trap in place. It should be triggered by download agents such as Teleport or wget, for instance, but not by an (innocent) proxy.

In the HTML resources provided I couldn't find an explicit answer to my question: will an Internet proxy load pages in disallowed directories? These are pages a normal user would never request (they are reachable only through hidden links).
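
For reference, the trap idea can be sketched roughly like this in Python. This is only an illustration under assumptions: the `/trap/` directory name, the `find_trapped_clients` helper, and the sample addresses are all made up, and a real trap would read the server's access log rather than a hard-coded list.

```python
# Assumed name: TRAP_PREFIX stands for a directory disallowed in robots.txt
# and reachable only through hidden links.
TRAP_PREFIX = "/trap/"

def find_trapped_clients(requests):
    """Return the set of client addresses that requested a trap URL.

    `requests` is an iterable of (client_address, path) pairs, for example
    extracted from a server access log.
    """
    return {addr for addr, path in requests if path.startswith(TRAP_PREFIX)}

if __name__ == "__main__":
    # Illustrative sample data only (documentation address range).
    sample = [
        ("192.0.2.10", "/index.html"),
        ("192.0.2.20", "/trap/page1.html"),
        ("192.0.2.10", "/about.html"),
    ]
    print(find_trapped_clients(sample))
```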

bird




msg:1525265
 8:29 pm on Feb 2, 2003 (gmt 0)

A proxy doesn't do anything on its own initiative. It just forwards requests for its users, whether those are bots or humans. Checking robots.txt is the responsibility of the robots among the proxy's users, and the proxy itself really shouldn't care about it.
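
To make that concrete, here is a minimal Python sketch of the check a well-behaved robot performs before requesting a page, using the standard library's `urllib.robotparser`. The robots.txt content, the `/trap/` directory, the `ExampleBot` name, and the `allowed` helper are assumptions for illustration. A proxy sitting between the robot and the site runs no such check; it simply forwards whatever it is asked for.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the /trap/ directory is an assumed name.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

def allowed(user_agent, url):
    """Return True if the sample robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    # A polite robot asks first and skips anything that is disallowed.
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/trap/page.html"):
        print(url, allowed("ExampleBot", url))
```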

jdMorgan




msg:1525266
 8:33 pm on Feb 2, 2003 (gmt 0)

Scooter24,

Technically speaking, a proxy is not an agent, and neither is a cache. From your site's perspective, you can view them as "pipes" with an active agent - a browser or 'bot - on the other end. A request from a caching or non-caching proxy will only be seen in your logs when there is a 'bot, a browser, or some other active agent requesting the resource on your site.

So the answer is that a cache or a proxy will only request a resource when it has received a request from a 'bot or a browser to do so. There may be exceptions to this, but in practical terms, not many. They will not respect robots.txt or any other content, because they don't read it - they just pass it through to the requestor, and in the case of caches, keep a copy in case the same resource is requested again soon. A cache will read and interpret the HTTP response headers, but not the content of any requested resource. So, you can control caching with the Cache-Control directives in the HTTP response headers, but because the "robot control information" is the content of robots.txt, a cache or proxy will never even look at it.
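
As a rough illustration of the header-versus-content distinction, the Python sketch below decides whether a shared cache may keep a response by looking only at the Cache-Control response header. The function name is made up, and the logic is deliberately simplified: real caches also weigh the status code, Expires, Vary, and more.

```python
def shared_cache_may_store(headers):
    """Simplified check: may a shared cache (e.g. a proxy) keep this response?

    `headers` maps response header names to values. Only Cache-Control is
    inspected; the response body is never passed in or examined.
    """
    value = ""
    for name, header_value in headers.items():
        if name.lower() == "cache-control":  # header names are case-insensitive
            value = header_value
    # Split "private, max-age=60" into directive names: {"private", "max-age"}
    directives = {
        part.strip().split("=", 1)[0].lower()
        for part in value.split(",")
        if part.strip()
    }
    # "no-store" forbids storing at all; "private" forbids shared caches.
    return not ({"no-store", "private"} & directives)

if __name__ == "__main__":
    print(shared_cache_may_store({"Cache-Control": "public, max-age=3600"}))
    print(shared_cache_may_store({"Cache-Control": "private, max-age=60"}))
```

Note that the function never sees the body of the response, which is why anything written inside a file such as robots.txt cannot influence it.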

That explanation could be clearer, I'm sure, but it's the best I could come up with right now.

Jim

Scooter24




msg:1525267
 11:52 pm on Feb 2, 2003 (gmt 0)

Ok, so if I understand you properly a cache or proxy will not automatically download an entire site. Should this happen, it would be caused by a robot or download agent.

But isn't there something like "pre-fetch" in the cache world, i.e. if a user requests a page and the page is not in the cache, the cache will not just fetch that particular page from a site, but an entire "contiguous block" of pages - whatever that means?

brotherhood of LAN




msg:1525268
 12:06 am on Feb 3, 2003 (gmt 0)

>>Ok, so if I understand you properly a cache or proxy will not automatically download an entire site. Should this happen, it would be caused by a robot or download agent.

Yes, it will be a user or a "spider"... I'm not one for explaining it concisely, but it's all to do with the web being a "stateless" environment. The only protocol proxies speak, AFAIK, is HTTP x.x... but then again I'm no server buff :)

This "stateless" environment will only react once a user requests something across it.

