I'm thinking perhaps of JavaScript on the page... something to "phone home" even if the page is cached. Or... would a system like WebTrends Live (which, as I understand it, does use JavaScript on the page) report cached page hits?
I use Apache mod_headers directives in my .htaccess file to mark various files as long-term
cacheable, short-term cacheable, or non-cacheable. Since the cache control tagging is done in
the http response header, there are no issues with not being able to include cache control
directives in non-html files.
# Set http header cache expiry dates
ExpiresActive On
# Default - Expire everything 1 week (604800 seconds) from last access
ExpiresDefault "A604800"
Header append Cache-Control "must-revalidate"
# Apply a customized Cache-Control header to frequently-updated files
<Files index.html>
# Expire 4 hours from last access
Header unset Cache-Control
ExpiresDefault "A14400"
Header append Cache-Control "must-revalidate"
</Files>
<Files phonehome.gif>
# Expire 1 second from last access, and forbid caching outright
Header unset Cache-Control
ExpiresDefault "A1"
Header append Cache-Control "no-cache, must-revalidate"
</Files>
The above snippet has been modified to add the "phonehome" entry; I have not personally used
this approach to make AOL-cached page accesses visible, but the underlying technique does work
for my purposes. (Note that the header name in mod_headers directives takes no trailing colon -
"Header append Cache-Control", not "Header append Cache-Control:".)
After changing something in the headers like this, I use the WebmasterWorld header checker and
also the cacheability checker tool at ircache.net to make sure it's correct.
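For reference, with the snippet above in place, a response for phonehome.gif should come back
with headers roughly like these (the dates are illustrative, and the exact Cache-Control value
may vary with your Apache version):
HTTP/1.1 200 OK
Date: Tue, 14 May 2002 12:00:00 GMT
Expires: Tue, 14 May 2002 12:00:01 GMT
Cache-Control: max-age=1, no-cache, must-revalidate
Content-Type: image/gif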
Jim
On jd's reply, I'm not server-literate, but I can sort of follow the code. If it came to implementing this, I'd hope to be working with a qualified IT person. Am I correct that the link to phonehome.gif on the page would be to an absolute URL (i.e., to the file on the server)?
On both replies, I realize I don't understand the mechanics of how AOL caches a page. My phrasing of the following question is also going to indicate how little I know about the routing of the web.
I assumed that caching by AOL involved a kind of redirect or interception... and that if your url request came through their system, they'd deliver the page from their cache before the request ever got to your server. If so, I'm not understanding how either of these schemes to disable the cache would work... though I'm not disputing that they do. I just don't understand what's happening.
>>It forces (or at least is supposed to) the browser to get the HTML from the original server.<<
If the "expired" tag worked, wouldn't it simply be preventing the browser from caching the page... and wouldn't a request for the page from AOL simply end up going to the same AOL cache that was storing the page in the first place? Again, I'm asking the question just for clarification of my own very imperfect understanding.
Also, would this 0/expires tag cause the page to have to reload in the browser each time the page is accessed during a visitor session (in other words, is it a browser no-cache)? I can see where this could be a significant problem, whereas a few extra seconds to access the page on the web might not be.
Yes, "phonehome.gif" refers to the 1x1 GIF that you use as a flag that someone has visited,
even if all (other) page objects were served from a proxy cache. In the environment that .htaccess
"runs in", it is just a local filename; there is no need for a canonical URL of the form
"http://www.mydomain.com/phonehome.gif". In other words, if you name the GIF file phonehome.gif and
put it in the server directory with the .htaccess file containing the mod_headers directives, it will
work as shown.
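On the page side, an ordinary image reference is all that's needed - for example (a sketch,
assuming the GIF sits in the same directory as the page):
<img src="phonehome.gif" width="1" height="1" alt="">
A relative reference like this works fine; an absolute URL would work too, but isn't required.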
There can be many caches in the path between your site and the user. There's the user's browser
cache, perhaps his corporate gateway has a cache, possibly another one at the ISP facility that
provides internet access service to his company, then there's the AOL cache, and maybe even more...
All of them simply save copies of requested pages (and other page-included objects) that pass
through them. If the same page is requested again, it can be served from the cache rather than
continuing the process of actually connecting to your server. This speeds things up and reduces
internet backbone traffic. [I'm just going to use "page" instead of object, OK? - It's a much more
familiar term. I mean a page and all external scripts, images, etc. that it includes. Although
each object is cached separately, it's just easier to write and read "page".]
However, some mechanism must be provided to flush old pages out of the cache. Otherwise the user
could be served with a stale copy of the requested page. In most cases, this happens anyway, as
the capacity of the cache is finite; newly-requested pages (different pages) replace the old, and
so the old ones get flushed out over time. However, that still leaves a hole. What if a file gets
requested 1000's of times per day? It might not seem to be "old" in the simple system described so
far. So, cached files must be tagged with various control fields to indicate how fresh they are,
and to an extent, how critical it is that the end-user be served fresh content. Some of this
"tagging" takes place in the cache itself, and some of it can be controlled by the webmaster. In
the cache, stored objects are simply tagged with the time they were first saved in the cache, and
they then "age out" of the cache over a period of minutes to days - as determined by the cache
administrator. In the absence of any information provided by the stored object itself, this is
what happens - they hang around until the cache decides to flush them or until they get overwritten
by other requested objects. Thus, the fate of your page is decided by someone else unless you
provide the caches with information needed to make a better decision. Sites which do not are the
ones where you always have to manually do a forced browser refresh to get a non-stale page - the
old "Shift-Reload" or "Control-Reload" trick...
Better control can be accomplished with the cache-control http headers shown in the mod_headers
example I provided. Set the expires header to the longest possible period of time that you are
willing to have an old copy of the page hanging around in someone's cache. Longer is good to
make your page appear to load fast - instantaneously if it's in the user's browser's cache - and
not so fast if it's "further away" but still in a cache somewhere down the line. A shorter time
should be specified for pages (or objects) which MUST be kept fresh, at the cost of requiring
them to be loaded from your server every time. However, be careful not to set it too short, or
your users WILL see an apparent slow-down accessing your site!
Anyway, the example mod_headers directives will mark the original phonehome.gif file on your server
with an expiration time of one second after it was last accessed, and specify that all caches must
check to make sure that its contents have not been modified since the original file was last
requested. If it is older than one second, or if a conditional "GET" indicates that it was modified,
then a fresh copy will be fetched, and all caches along the line will (should) update.
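On the wire, that conditional "GET" exchange looks roughly like this (illustrative only):
GET /phonehome.gif HTTP/1.1
Host: www.mydomain.com
If-Modified-Since: Tue, 14 May 2002 12:00:00 GMT

HTTP/1.1 304 Not Modified    (unchanged - the cache may re-use its stored copy)
- or -
HTTP/1.1 200 OK              (modified - a fresh copy of the file follows)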
So, the end result is that you will see every single time the phonehome.gif file gets loaded from
your server, and in almost all cases, that will indicate every single time the page it is included
on is requested from any properly working and configured cache. This will go a long way to improve
the accuracy of your AOL-user hits. What you won't see in your logs is useful referer info, since
the referer will always be the page that the image is included on.
A good resource is the cacheability checker, available here [ircache.net] under
"Tools". Also, read the "How to" article cited on the cache tester page itself.
Sorry for the long post - I hope it's useful.
Jim
Am I correct in thinking that the following, placed in .htaccess, would prevent Google (& others) from caching content more than a day old, and would force a refresh of ALL files from the server?
# Set http header cache expiry dates
ExpiresActive On
# Default - Expire everything 1 day (86400 seconds) from last access
ExpiresDefault "A86400"
Header append Cache-Control "must-revalidate"
# Apply a Cache-Control header to all files
<Files *.html>
Header append Cache-Control "must-revalidate"
</Files>
Does it matter if there is other code before this in the .htaccess file, and if so, what's the syntax to separate the different code segments?
Also, do the individual html pages (or anything else?) need to reference the .htaccess file, or does a search engine check .htaccess every time it requests a page?
Thx
J
Since your <Files *.html> section merely repeats the default "must-revalidate" setting above it,
you really don't even need the <Files> section - the default settings already take care of *.html.
You can put the cache control section anywhere in your .htaccess file.
The ExpiresActive On statement effectively demarcates this section.
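In other words, this alone would do the same job:
# Set http header cache expiry dates
ExpiresActive On
# Default - Expire everything 1 day from last access
ExpiresDefault "A86400"
Header append Cache-Control "must-revalidate"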
Robots don't "read" .htaccess files, but your server does. The server looks at each .htaccess
in the path to the requested file before serving that file. So, it starts at your root directory,
reads .htaccess there, and then if the requested file is in a subdirectory, the server will
read the .htaccess file in the next directory level down from root, continuing until it reads
the .htaccess file in the same directory as the requested file. Directory-specific settings in
.htaccess files in lower-level directories can override more general settings in the .htaccess
files in higher-level directories, or they can "inherit" access settings from them. It is also
OK if there is no .htaccess file in any or all of the subdirectories - lots of sites get by with
only one .htaccess file in root. But if you want to "change the rules" on a per-directory basis,
you can.
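For example (a sketch - the subdirectory name and times here are arbitrary):
# /.htaccess - site-wide default: expire 1 week after last access
ExpiresActive On
ExpiresDefault "A604800"
Header append Cache-Control "must-revalidate"

# /news/.htaccess - override for a frequently-updated subdirectory: 4 hours
ExpiresDefault "A14400"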
The most important point is that the server reads all the .htaccess files in the path of each
requested file, whether that file is an html page, an included graphic, or whatever. So a
request for one page may result in the server reading many .htaccess files, and it doesn't
matter if the page is requested by a human using a browser, or a search engine robot.
I hope that answers your questions.
Jim
I just realized we probably posted at the same time last month! Any luck?
>>Also, would this 0/expires tag cause the page to have to reload in the browser each time
the page is accessed during a visitor session (in other words, is it a browser no-cache)? I can
see where this could be a significant problem, whereas a few extra seconds to access the
page on the web might not be.<<
Yes, the expires tag is going to force a reload of the tagged object each time. That's why you
want to use a very small file (like a 1x1 GIF) as your "beacon" and only load it once per page
- in other words, if you already use a 1x1 GIF as a spacer, use a different file name for your
phone-home 1x1 GIF.
Jim
jd - Because of the simultaneity, I had missed your post entirely until today. Thanks... it's a great post, not too long at all. I haven't gotten any input beyond this thread, and nothing's been implemented yet.
Some more questions... if the phone-home gif were not cacheable, would I need a unique phone-home gif for every page so that I could track page hits? Or is there log info that identifies the page without having unique gifs?
Also, would the network and server access time to grab the phone-home gif slow down the page load? I assume it wouldn't be anything like the load time of the whole page, but I'm guessing that there's a bunch of network and server overhead that's independent of file size. Any idea how much?... or am I splitting hairs.
A way I have done this in the past is to attach a random querystring key/value pair (which your web server will ignore) to the end of each 1x1 gif request.
If you are using 1x1 pixel gifs, then some simple JavaScript (a la WebTrends Live) is a perfect approach.
Let me know if you want a fuller example, but the rough shape of the JavaScript might be something like this:
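(A minimal sketch of the idea only - this is not WebTrends Live's actual code, and the gif
filename is just a placeholder:)
<script type="text/javascript">
<!--
// request the 1x1 beacon with the page name and a timestamp appended,
// so each request is unique and busts any caches along the way
var beacon = new Image(1, 1);
beacon.src = "/img1x1.gif?page=" + escape(location.pathname)
           + "&t=" + (new Date()).getTime();
// -->
</script>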
Well, you MAY need a unique GIF per page you want to track. But this depends on your logging set-up
and how you "like" to analyze your logs. The referer for this GIF will be the page that requests
it - not, for example, a search engine referer and its associated search query string. So, you can
tell which page called the common GIF file, and therefore this tells you which page was requested
and possibly served to the user from a cache. So, you lose info about the real referer, but you do
get an indication of which page the user loaded, so the method is still useful to get more-accurate
visitor counts.
As jm_uk points out, you can also use unique request strings to do cache-busting if you cannot get
control of the cache-control headers due to server issues. One way to "share" a GIF file between
pages and do cache-busting is to append a query string to a common GIF filename, like:
img1x1.gif?page_name=my_index.html&time=1021167298 - using the raw server time string as a
cache buster, and also specifying the referring page name if you like. This might make sorting out
your log files easier, and since the time will be different on each request, also busts caches. If
possible, use SSI rather than JavaScript to generate the unique query string - just in case your
visitor has disabled JS.
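With Apache SSI (mod_include) enabled and the page parsed for includes, that might look
something like this (a sketch - "%s" for epoch seconds depends on your server's strftime
supporting it):
<!--#config timefmt="%s" -->
<img src="/img1x1.gif?page_name=my_index.html&time=<!--#echo var="DATE_LOCAL" -->"
     width="1" height="1" alt="">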
As far as slowing things down, yes, there will be a slowdown - but probably only detectable on
satellite ISP services. In this case, there would be about a 0.5 second delay to "finish" loading
the page, since the GIF would always be loaded from the server. This delay is almost 100% air-time,
by the way, since the data has to travel 89,200 miles to/from the satellite - at the speed of light
(roughly 186,000 miles per second), that round trip alone takes about 0.48 seconds. The other
delays, and delays on terrestrial internet connections, are likely to be 0.2 seconds or less. And
remember that this is a 1x1 transparent GIF, and so will not be visible on the page anyway. Thus,
I doubt your users will even notice it.
I'd suggest picking a couple of low-traffic pages to experiment with this method and see how you
want to do it and whether it is useful to you. It's really fairly easy to do.
Jim
jd - Needing a unique gif per page means that this won't work on large sites, particularly database-driven sites, but I'm hoping it will work well on some smaller sites where tracking is an issue. Thanks....