Robots are programs that crawl through pages and follow links with little or no human interaction. Robots.txt is one of the tools that keep them from running into loops and other traps that a human would avoid intuitively.
In the HTML resources provided I couldn't find an explicit answer to my question - will an Internet proxy load pages in disallowed directories? These are pages a normal user would never call (they are only reachable through hidden links).
Technically speaking, a proxy is not an agent, and neither is a cache. From your site's perspective, you can view them as "pipes" with an active agent - a browser or 'bot - on the other end. A request from a caching or non-caching proxy will only be seen in your logs when there is a 'bot, a browser, or some other active agent requesting the resource on your site.
So the answer is that a cache or a proxy will only request a resource when it has received a request from a 'bot or a browser to do so. There may be exceptions to this, but in practical terms, not many. They will not respect robots.txt or any other content, because they don't read it - they just pass it through to the requestor, and in the case of caches, keep a copy in case the same resource is requested again soon. A cache will read and interpret the HTTP response header, but not the content of any requested resource. So, you can control caching by use of the cache-control directives in the HTTP response header, but because the "robot control information" is the content of robots.txt, a cache or proxy will never even look at it.
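To make the difference concrete, here is a rough Python sketch, not a description of how any particular proxy or 'bot is built. The first function shows what a well-behaved robot does: it reads the content of robots.txt (using the standard library's urllib.robotparser) before requesting a URL. The second shows, in a very simplified way, what a shared cache does: it looks only at the Cache-Control response header and never at the body. The robots.txt rules, the /trap/ directory, the "ExampleBot" name, the example.com URLs and both function names are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; /trap/ stands in for the disallowed directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def robot_may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """What a polite robot does: parse robots.txt, then decide whether to request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def cache_may_store(response_headers: dict) -> bool:
    """What a shared cache does (simplified): inspect headers only, never the body.

    Assumption: only the no-store and private directives are considered here;
    real caches apply many more rules.
    """
    value = response_headers.get("Cache-Control", "")
    directives = {
        part.strip().lower().split("=")[0]
        for part in value.split(",")
        if part.strip()
    }
    return not ({"no-store", "private"} & directives)


if __name__ == "__main__":
    # The robot consults robots.txt and skips the disallowed directory.
    print(robot_may_fetch(ROBOTS_TXT, "ExampleBot", "http://example.com/trap/page.html"))
    print(robot_may_fetch(ROBOTS_TXT, "ExampleBot", "http://example.com/index.html"))
    # The cache never sees robots.txt rules; it only obeys response headers.
    print(cache_may_store({"Cache-Control": "no-store"}))
    print(cache_may_store({"Cache-Control": "public, max-age=3600"}))
```

Notice that nothing in cache_may_store refers to robots.txt at all. If you want to influence what a cache keeps, the place to do it is the response header, not robots.txt.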
That explanation could be clearer, I'm sure, but it's the best I could come up with right now.
Jim
But isn't there something like "pre-fetch" in the cache world, i.e. if a user requests a page and the page is not in the cache, the cache will not just fetch that particular page from a site, but an entire "contiguous block" of pages - whatever that means?
Yes, it will be a user or a "spider". I'm not one for explaining it concisely, but it's all to do with the web being a "stateless" environment. AFAIK the only protocol proxies follow is HTTP x.x, but then again I'm no server buff :)
This "stateless" environment will only react once a user requests something across it.