|HTTP Headers, Google's cache and international caches|
How best to set your HTTP headers for both
| 9:51 pm on Sep 12, 2005 (gmt 0)|
I have an issue with an ISP cache in South Africa. It seems that they keep caching my home page for two weeks, and their clients complain to us that our site is outdated.
IIS allows you to set a content expiration on the site; when you do, it adds two headers to the response.
I want to set the cache limit for the South African cache to six hours, but I am worried that Google will apply this to its own cache too, and refresh their cache of our site every six hours. That would be crazy. The question is: which HTTP content expiration headers does Google honour for its cache?
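For reference, a six-hour lifetime is usually expressed with a Cache-Control max-age plus a matching Expires date. A minimal Python sketch of generating that pair (the header names are standard HTTP; the code itself is only illustrative, not anything from IIS):

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def six_hour_cache_headers(now=None):
    """Build a Cache-Control/Expires pair for a six-hour cache lifetime."""
    now = now or datetime.now(timezone.utc)
    return {
        "Cache-Control": "public, max-age=21600",  # 6 * 3600 seconds
        "Expires": format_datetime(now + timedelta(hours=6), usegmt=True),  # RFC 1123 date
    }

headers = six_hour_cache_headers()
print(headers["Cache-Control"])  # public, max-age=21600
```

Note that a well-behaved shared cache should honour either header, with Cache-Control taking precedence when both are present.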
| 8:39 am on Sep 13, 2005 (gmt 0)|
I deliberately set relatively short cache times (~1hr for most dynamic pages) to get browsers to reload from time to time, and it does not seem to hurt (have been doing it since 1997!).
1) IE6 caching is completely broken. Sometimes it respects the Cache-Control: and Expires: headers (BTW, they are more complex than you think; look up their correct use on www.w3.org), and sometimes it resurrects versions of pages months older than the one it just retrieved. Just had a major problem with an IE6 browser on ME last night, as it happens. So, for IE users, recite after me: "when something funny happens, clear the cache, close *ALL* IE windows, and restart IE".
2) When the visitor is *clearly* a bot, I set a much longer expiry time (30 days+) with the Cache-Control and Expires: headers, and I add a "Revisit-After" header too (though I doubt many bots take any notice). I *also* add text to the foot of the page in some cases warning the user that they may have a stale page and to hit RELOAD.
3) You may also want to set the Vary: header if the content depends on something like the user's locale or browser, to help keep aggressive caches in check.
4) You can resort to cache-busting techniques such as appending "?rnd=largerandomnumber" to your URLs to force refetches (though turn it off for SE bots).
5) Your client's ISP needs to get its cache fixed or removed; reverse or "transparent" caches almost never are fixed. In the past I've threatened an upstream ISP with legal action when they installed one without notice: it interfered with security and tracking, amounted to a form of "attack", and was not contractually permitted (they removed the cache forthwith!).
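Points (2) to (4) above can be sketched together in a few lines. This is only an illustration, assuming a precomputed is_bot flag; the durations, the Vary value, and the rnd parameter name are placeholders, not the poster's actual configuration:

```python
import random
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
from urllib.parse import urlencode

def cache_headers(is_bot, now=None):
    """Point 2: much longer expiry for bots (30 days) than for humans (~1 hour)."""
    now = now or datetime.now(timezone.utc)
    max_age = 30 * 24 * 3600 if is_bot else 3600
    return {
        "Cache-Control": f"public, max-age={max_age}",
        "Expires": format_datetime(now + timedelta(seconds=max_age), usegmt=True),
        "Vary": "Accept-Language",  # point 3: content varies by the user's locale
    }

def cache_busted(url, is_bot):
    """Point 4: append a random query parameter for humans, but not for SE bots."""
    if is_bot:
        return url
    return url + "?" + urlencode({"rnd": random.randrange(10**9)})

print(cache_headers(is_bot=True)["Cache-Control"])  # public, max-age=2592000
```

The cache-buster makes every fetch look like a new URL to an intermediate cache, which is exactly why it must be switched off for search-engine bots: each random URL would otherwise be indexed as a separate page.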
| 2:02 pm on Sep 13, 2005 (gmt 0)|
What particular cache headers do you use that work for you?
| 2:22 pm on Sep 13, 2005 (gmt 0)|
Here's one set (the set I serve to obvious spiders, such as G):
Expires: Mon, 12 Dec 2005 14:20:20 GMT
Date: Tue, 13 Sep 2005 14:20:20 GMT
I reduce the expiry (i.e. Expires and Cache-Control) to ~1hr for other visitors for most pages.
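The gap between the Date and Expires values in the set above works out to 90 days. A sketch of generating such a pair (the 90-day figure is inferred from the example, not stated by the poster):

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def spider_headers(now=None, days=90):
    """Date/Expires pair like the example above: expiry ~90 days out for spiders."""
    now = now or datetime.now(timezone.utc)
    return {
        "Date": format_datetime(now, usegmt=True),
        "Expires": format_datetime(now + timedelta(days=days), usegmt=True),
    }

h = spider_headers(datetime(2005, 9, 13, 14, 20, 20, tzinfo=timezone.utc))
print(h["Date"])     # Tue, 13 Sep 2005 14:20:20 GMT
print(h["Expires"])  # Mon, 12 Dec 2005 14:20:20 GMT
```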
| 2:40 pm on Sep 13, 2005 (gmt 0)|
What if the bot changes its host name? Is this something you always have to manage? I have a list of bots. Would you mind posting yours, and perhaps a link to a site with a comprehensive list?
| 3:50 pm on Sep 13, 2005 (gmt 0)|
There are at least three ways of detecting a bot:
1) IP address (or DNS name, by reverse lookup).
2) User-Agent string.
3) Referer string (usually absent for bots, though sometimes absent for human visitors too).
I don't use (2) since it is easily forged one way or another, or may simply change.
I use a combination of (1) and (3), with almost no manual maintenance except to react to warnings in my logs every few months.
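A minimal sketch of combining (1) and (3): reverse-resolve the client IP and treat a known crawler domain plus an absent Referer as "bot". The domain suffixes here are illustrative, not the poster's actual list:

```python
import socket

# Illustrative crawler reverse-DNS suffixes (not the poster's list)
CRAWLER_DOMAINS = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")

def reverse_dns(ip):
    """Point 1: reverse-lookup the client IP; return '' if it doesn't resolve."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ""

def is_crawler_host(host):
    """True when the reverse-DNS name ends in a known crawler domain."""
    return host.lower().endswith(CRAWLER_DOMAINS)

def looks_like_bot(ip, referer):
    """Combine reverse DNS (1) with an absent Referer (3); ignore User-Agent (2)."""
    return is_crawler_host(reverse_dns(ip)) and not referer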
| 4:06 pm on Sep 13, 2005 (gmt 0)|
Thanks! Would you mind posting your list of IPs, and perhaps a link to a site with a comprehensive list?