Do proxies and Internet caches respect robots.txt?

Scooter24

6:34 pm on Feb 2, 2003 (gmt 0)

Will an Internet cache or a proxy load pages in directories disallowed in robots.txt?

bird

6:45 pm on Feb 2, 2003 (gmt 0)

A proxy/cache is not a robot, so robots.txt doesn't apply to it. All requests made through those types of systems are supposedly caused by a human clicking on a link somewhere.

Robots are programs that crawl through pages and follow links without (or with very little) human interaction. Robots.txt is one of the tools that keep them from running into loops and other traps that a human would avoid intuitively.
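To illustrate the point that checking robots.txt is something the robot itself does, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `/trap/` directory name, the `ExampleBot` user agent and the `allowed` helper are all assumptions made up for the example, not anything from this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; "/trap/" is an assumed directory name.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A well-behaved robot would run a check like this before each request.
print(allowed("ExampleBot", "http://example.com/index.html"))
print(allowed("ExampleBot", "http://example.com/trap/page.html"))
```

Nothing forces a client to run such a check; a proxy or cache simply never does, because it only relays requests made by someone else.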

jdMorgan

6:55 pm on Feb 2, 2003 (gmt 0)

...And to cover the other half of the question, try this tutorial [mnot.net].

HTH,
Jim

andreasfriedrich

7:42 pm on Feb 2, 2003 (gmt 0)

Now compare what's been said here with that [webmasterworld.com] ;).

Andreas

Scooter24

8:12 pm on Feb 2, 2003 (gmt 0)

Thanks for all the replies. The reason I asked is that I have a robot trap in place. It should be triggered by download agents such as Teleport or wget, for instance, but not by an (innocent) proxy.

In the HTML resources provided I couldn't find an explicit answer to my question - will an Internet proxy load pages in disallowed directories? These are pages a normal user would never request (they are reachable only through hidden links).

bird

8:29 pm on Feb 2, 2003 (gmt 0)

A proxy doesn't do anything on its own initiative. It just forwards requests for its users, whether those are bots or humans. Checking robots.txt is the responsibility of the robots among the proxy's users, and the proxy itself really shouldn't care about it.

jdMorgan

8:33 pm on Feb 2, 2003 (gmt 0)

Scooter24,

Technically speaking, a proxy is not an agent, and neither is a cache. From your site's perspective, you can view them as "pipes" with an active agent - a browser or 'bot - on the other end. A request from a caching or non-caching proxy will only be seen in your logs when there is a 'bot, a browser, or some other active agent requesting the resource on your site.

So the answer is that a cache or a proxy will only request a resource when it has received a request from a 'bot or a browser to do so. There may be exceptions to this, but in practical terms, not many. They will not respect robots.txt or any other content, because they don't read it - they just pass it through to the requestor, and in the case of caches, keep a copy in case the same resource is requested again soon. A cache will read and interpret the HTTP response header, but not the content of any requested resource. So, you can control caching by use of the cache-control directives in the HTTP response header, but because the "robot control information" is the content of robots.txt, a cache or proxy will never even look at it.
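As a rough illustration of a cache acting on the response header rather than on the content, the sketch below parses a `Cache-Control` header and makes a deliberately simplified storage decision. The function names are hypothetical, and a real cache applies many more rules (expiry, validation, `Vary`, and so on); this only assumes that `no-store` and `private` stop a shared cache from keeping a copy.

```python
def parse_cache_control(header_value: str) -> dict:
    """Split a Cache-Control header value into {directive: value or None}."""
    directives = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        # Directives without a value (e.g. "no-store") map to None.
        directives[name.strip().lower()] = value.strip().strip('"') or None
    return directives


def shared_cache_may_store(headers: dict) -> bool:
    """Simplified check: only the header is inspected, never the body."""
    cc = parse_cache_control(headers.get("Cache-Control", ""))
    return "no-store" not in cc and "private" not in cc


print(shared_cache_may_store({"Cache-Control": "max-age=3600"}))
print(shared_cache_may_store({"Cache-Control": "private, max-age=60"}))
```

Note that the body of the response, which for /robots.txt is the robot control information itself, never enters the decision.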

That explanation could be clearer, I'm sure, but it's the best I could come up with right now.

Jim

Scooter24

11:52 pm on Feb 2, 2003 (gmt 0)

Ok, so if I understand you properly a cache or proxy will not automatically download an entire site. Should this happen, it would be caused by a robot or download agent.

But isn't there something like "pre-fetch" in the cache world, i.e. if a user requests a page and the page is not in the cache, the cache will not just fetch that particular page from a site, but an entire "contiguous block" of pages - whatever that means?

brotherhood of LAN

12:06 am on Feb 3, 2003 (gmt 0)

>>Ok, so if I understand you properly a cache or proxy will not automatically download an entire site. Should this happen, it would be caused by a robot or download agent.

Yes, it will be a user or a "spider"... I'm not one for explaining it concisely, but it all has to do with the web being a "stateless" environment. AFAIK the only protocol proxies speak is HTTP x.x... but then again I'm no server buff :)

This "stateless" environment will only react once a user requests something across it.