


Do proxies and Internet caches respect robots.txt?

     
6:34 pm on Feb 2, 2003 (gmt 0)

Full Member

10+ Year Member

joined:Aug 2, 2002
posts:212
votes: 0


Will an Internet cache or a proxy load pages in directories disallowed in robots.txt?
6:45 pm on Feb 2, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 10, 2001
posts:1550
votes: 10


A proxy/cache is not a robot, so robots.txt doesn't apply to it. All requests made through those kinds of systems are supposedly triggered by a human clicking on a link somewhere.

Robots are programs that crawl through pages and follow links with little or no human interaction. Robots.txt is one of the tools that keep them from running into loops and other traps that a human would avoid intuitively.
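
For reference, robots.txt is just a plain text file at the site root. A minimal example that steers compliant robots away from two directories (the paths here are only illustrative):

  User-agent: *
  Disallow: /trap/
  Disallow: /cgi-bin/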

6:55 pm on Feb 2, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


...And to cover the other half of the question, try this tutorial [mnot.net].

HTH,
Jim

7:42 pm on Feb 2, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 22, 2002
posts:1782
votes: 0


Now compare what's been said here with that [webmasterworld.com] ;).

Andreas

8:12 pm on Feb 2, 2003 (gmt 0)

Full Member

10+ Year Member

joined:Aug 2, 2002
posts:212
votes: 0


Thanks for all the replies. The reason I asked is that I have a robot trap in place. It should be triggered by download agents such as Teleport or wget, but not by an (innocent) proxy.

In the HTML resources provided I couldn't find an explicit answer to my question: will an Internet proxy load pages in disallowed directories? These are pages a normal user would never request (they are reachable only through hidden links).
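
For context, the trap is set up roughly like this (paths hypothetical): the trap directory is disallowed in robots.txt, and the only link into it is one no human would ever see or click, so anything that requests it must have ignored robots.txt:

  # robots.txt - compliant robots stay out of the trap
  User-agent: *
  Disallow: /trap/

  <!-- hidden link buried in a normal page -->
  <a href="/trap/bait.html"><img src="/1x1.gif" width="1" height="1" border="0" alt=""></a>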

8:29 pm on Feb 2, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 10, 2001
posts:1550
votes: 10


A proxy doesn't do anything on its own initiative. It just forwards requests for its users, whether those are bots or humans. Checking robots.txt is the responsibility of the robots among the proxy's users; the proxy itself really shouldn't care about it.
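
To illustrate where that responsibility lives, here is a minimal Python sketch of a polite robot doing its own robots.txt check before fetching (the hostname and user-agent are made up). A proxy sitting in between would simply relay both requests unchanged:

  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url("http://www.example.com/robots.txt")
  rp.read()  # the robot itself fetches and parses robots.txt

  url = "http://www.example.com/private/page.html"
  if rp.can_fetch("MyBot/1.0", url):
      print("allowed by robots.txt, fetching", url)
  else:
      print("disallowed by robots.txt, skipping", url)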
8:33 pm on Feb 2, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Scooter24,

Technically speaking, a proxy is not an agent, and neither is a cache. From your site's perspective, you can view them as "pipes" with an active agent - a browser or 'bot - on the other end. A request from a caching or non-caching proxy will only be seen in your logs when there is a 'bot, a browser, or some other active agent requesting the resource on your site.

So the answer is that a cache or a proxy will only request a resource when it has received a request from a 'bot or a browser to do so. There may be exceptions to this, but in practical terms, not many.

They will not respect robots.txt or any other content, because they don't read it - they just pass it through to the requestor and, in the case of caches, keep a copy in case the same resource is requested again soon. A cache will read and interpret the HTTP response headers, but not the content of any requested resource. So you can control caching by using the Cache-Control directives in the HTTP response header, but because the "robot control information" is the content of robots.txt, a cache or proxy will never even look at it.
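
As a concrete example, a response header along these lines tells a well-behaved HTTP/1.1 cache not to store or reuse the page (the values are illustrative, and which directives you actually want depends on the resource):

  HTTP/1.1 200 OK
  Date: Sun, 02 Feb 2003 20:30:00 GMT
  Cache-Control: no-store, no-cache, must-revalidate
  Expires: Thu, 01 Jan 1970 00:00:00 GMT
  Content-Type: text/html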

That explanation could be clearer, I'm sure, but it's the best I could come up with right now.

Jim

11:52 pm on Feb 2, 2003 (gmt 0)

Full Member

10+ Year Member

joined:Aug 2, 2002
posts:212
votes: 0


Ok, so if I understand you properly, a cache or proxy will not automatically download an entire site. Should this happen, it would be caused by a robot or download agent.

But isn't there something like "pre-fetch" in the cache world? I.e., if a user requests a page that is not in the cache, the cache doesn't just fetch that particular page from the site, but an entire "contiguous block" of pages - whatever that means?

12:06 am on Feb 3, 2003 (gmt 0)

Moderator from GB 

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4842
votes: 1


>>Ok, so if I understand you properly, a cache or proxy will not automatically download an entire site. Should this happen, it would be caused by a robot or download agent.

Yes, it will be a user or "spider"... I'm not one for explaining it concisely, but it's all to do with the web being a "stateless" environment. The only protocol proxies speak, AFAIK, is HTTP x.x... but then again I'm no server buff :)

This "stateless" environment only reacts once a user requests something across it.
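
To make that concrete: each request a proxy relays is a self-contained HTTP message. A browser configured to use a proxy sends something like this (hostname hypothetical), with the full URL in the request line so the proxy knows where to forward it - and until such a request arrives, the proxy does nothing at all:

  GET http://www.example.com/page.html HTTP/1.0
  Host: www.example.com
  User-Agent: Mozilla/4.0 (compatible; MSIE 6.0)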