|Do web caches invalidate web logs?|
Web logs are powerful, but can they ever be truly accurate?
I am extremely skeptical of some of these packages like WebTrends, which try to give you counts of 'visitors' and 'average session length.' Their techniques must be hokey at best. But after an experience last year, I've become highly skeptical of ALL logfile analysis systems.
About a year ago my network admin (the guy who runs the place that I co-locate at) told me that Bell South was upgrading all of their routers, and the new routers would have huge caches, and this should cut down on overall bandwidth usage.
Damn, he was RIGHT. Just about every busy site we had dropped off about 20-30% in terms of overall traffic. The more obscure sites, well, they remained obscure. This router took the liberty of grabbing our HTML pages and wouldn't bother my server when somebody wanted the page.
This incident put the whole meaning of web logs into a different perspective for me. To see that the site usage dropped off at a certain point is really meaningless! So any graph showing you "number of accesses per day" should be taken with a grain of salt. Even if it shows a clear trend upward, for all you know you're actually losing visitors. Or it might show your traffic get cut in half when in reality you've just hit a new record high in terms of page views.
To make matters worse, it's not just one cache you're up against. Routers all over the globe use cache tricks, not to mention firewalls, and of course, browsers.
And so, while I still rely heavily on web logs, my new attitude is to interpret them as what they really are: an account of the server's workload, not the end-user's experience. You don't know how many people came to your site, and you don't know which page was seen the most often, and you don't know whether traffic is getting bigger or smaller.
One of the reasons behind this, imho, is broadband - sure, the ISP can give you fast access between your house and their network, but traffic beyond their network costs them big time, hence the caches.
Sadly, the caches sometimes work against the visitor as well, either showing them old pages or, if the cache fails, suddenly leaving a whole lot of people unable to surf. Some really badly behaved caches even cache dynamic pages!
added: oops sorry missed the punchline to my point - as broadband becomes more prevalent caching will become more of a problem :(
If that is the case... what is there to do? Drop the WebTrends reports that we have or try to go in there and reconfigure?
Invalidate is a strong word - I think weaken is more realistic. You can't know the absolute number of visitors or the actual time spent online, for a number of reasons including ISP and browser caching, the back button, etc. But you can derive good comparative info about a site: customer preferences for certain topics, seasonal variation in usage of certain pages, relative amount of time spent on certain pages, etc. I also use log analysis to identify pages accessed from bookmarks, identify directories where favicon.ico was grabbed, identify and measure SE and other referrers, list most common errors, gauge browser usage, identify common keywords used to access the site from SE's, list common keywords used to access pages from our internal SE, identify peak usage times, and other things.
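A few of those measures are easy to pull straight out of a raw access log. Here's a minimal sketch in Python, assuming Apache "combined" log format (the regex and the favicon-as-bookmark-proxy heuristic are my assumptions, not anything from a particular analysis package):

```python
import re
from collections import Counter

# Apache "combined" log format (an assumption -- adjust for your server)
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

def summarize(lines):
    """Count page requests, referrers, and favicon.ico grabs."""
    pages, referrers, favicons = Counter(), Counter(), 0
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # skip lines in other formats
        path = m.group('path')
        if path.endswith('favicon.ico'):
            favicons += 1  # rough proxy for bookmarking activity
        else:
            pages[path] += 1
        ref = m.group('referrer')
        if ref not in ('-', ''):
            referrers[ref] += 1
    return pages, referrers, favicons
```

From `pages.most_common(10)` and `referrers.most_common(10)` you get the "most requested pages" and "top referrers" lists; the same relative-measure caveats discussed in this thread apply.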
This info is useful to reorganize a site to emphasize certain topics and at various times of year, improve weak content, design and language, improve site exposure of difficult to find but sought after pages, prioritize site repairs, target SE's with weak referrals, time announcements and news, and other activities that require priority setting based on a feedback loop. It also provides a decent measure, on average, of the general effectiveness of a site and the interest in it.
Caching should lop off peaks on relative page usage, but from my experience, there's always a correlation between most referred, most bookmarked pages and highest page usage. There's also always a correlation between the most compelling content and highest average time spent on a page-- despite caching and the back button, which weaken absolute measurements, composite results from numerous visits provide very good information. From what I've seen personally.
I agree mostly with what you are saying, but I think you still may place too much faith in the statistics. When you say "good comparative information" I assume you mean, for example, that you know whether people visit page abc.html or xyz.html more frequently.
Aha! But that's exactly where caches can confound the issue. Perhaps on your site, the page xyz.html is extremely popular and the page abc.html is only moderately so. A web cache may decide to cache your xyz.html and not abc.html. Therefore, your logfiles show that MORE people came for abc.html, even though xyz.html's popularity is soaring. So much for 'comparative information' in that case.
Still, the web logs are priceless. And what's more, there are tactics around this issue. E.g., an ecommerce system that has complete user tracking (the user gets a number that gets passed from request to request) can know exact numbers of visitors, page views, etc.
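A bare-bones sketch of that tracking tactic: hand each visitor a number and thread it through every URL. Because each visitor then sees unique URLs, a shared cache can't serve one visitor's copy to another, so every request reaches the server and shows up in the logs. (The `sid` parameter name and the in-memory counter are my assumptions for illustration; a real system would use a cookie or session store.)

```python
import itertools
import urllib.parse

_counter = itertools.count(1)

def tag_url(path, visitor_id=None):
    """Attach a visitor id to an internal link, minting a new id if needed."""
    if visitor_id is None:
        visitor_id = next(_counter)  # stand-in for a real session store
    sep = '&' if '?' in path else '?'
    return visitor_id, f'{path}{sep}sid={visitor_id}'

def unique_visitors(logged_paths):
    """Count distinct sid= values seen in the request log."""
    sids = set()
    for p in logged_paths:
        query = urllib.parse.urlparse(p).query
        sid = urllib.parse.parse_qs(query).get('sid')
        if sid:
            sids.add(sid[0])
    return len(sids)
```

The trade-off, of course, is that you've deliberately made your pages uncacheable, which is exactly the tension discussed later in this thread.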
Well, I guess what I mean is that I measure usage from several different angles. For example, pages that rank high on google, yahoo, av, and aol SERPs also list high in their referrals, meaning that many (or most) of these requests aren't intercepted by caches. Also, sites with lots of favicon requests show more usage than sites with fewer, as do sites that show more bookmark referrals. When a url is listed in a newspaper article, usage on the page spikes like crazy, telling me plainly that caching isn't affecting usage much, or if it is, then usage is actually much higher than I'm measuring.
Other cross references include seasonal usage changes-- for example, usage on subsites that provide water temperature measurements and bacterial pollution at swimming beaches climbs steeply in summer. Usage of a farmers market & u-pick farms directory increases during our local harvest season. During flood season, usage of the flooding pages spikes way high. During spring, lots of visits to recycling and materials exchange pages. Surprise surprise. If caching were such a strong influence, I wouldn't see these increases -- I'd actually be more likely to see decreases in usage when pages are getting more visits.
I understand mathematics very well, and I know these are independent measurements that correlate well. They tell me that indeed, the pages that show up in my statistics as getting more business do get more business. Based on personal observation, I do have a certain amount of faith in the numbers.
That said, I do know caching is going on and that it does affect the numbers. The visits are not an absolute count. I do think that usage spikes get chopped off-- but they're just smaller spikes as a result; the spikes don't disappear entirely, and they still show more usage relative to flat pages.
Along these lines, one thing to consider in how caching affects usage logging is that actual usage is always greater than what's measured. This isn't a trivial thing-- with caching, visit counts underrepresent actual usage.
One area where I think WebTrends data is totally off the wall is the user session length. One big real estate site I work with routinely records an average user session of 10-15 minutes, with about half the visits being single-page sessions.
I can't think of a passive-type site where the average visitor would spend 15 minutes (which, with so many single-page sessions, translates into more like 30 minutes for most people who stay). A content site or forums like WMW or Internet.com I could see, but not a real estate site, where the average visitor session for over 300,000 visitors a month is 10-15 minutes.
You can use "no cache" headers if you want. All the good caches follow that standard (I use it here for anything with proxy or aol in the name). I don't use it system-wide because some search engines will see that and not index you.
Logs are best viewed as a "snapshot" of what is occurring. You can use counters with aggressive no-cache tags (including expires) that work pretty well. You can see them in action with any banner code or good counter code.
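The selective no-cache approach described above might look something like this sketch. The header names are standard HTTP/1.1 (plus the HTTP/1.0 `Pragma` fallback); the hostname sniffing for "proxy" or "aol" is my guess at what that filtering looks like, not anyone's actual config:

```python
from email.utils import formatdate

def anti_cache_headers(remote_host=''):
    """Return response headers that tell well-behaved caches to skip
    this response -- applied selectively, only when the client's
    hostname looks like a shared cache (a hypothetical heuristic)."""
    headers = {}
    host = remote_host.lower()
    if 'proxy' in host or 'aol' in host:
        headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
        headers['Pragma'] = 'no-cache'                   # HTTP/1.0 caches
        headers['Expires'] = formatdate(0, usegmt=True)  # a date in the past
    return headers
```

Sending these only on counter images or suspect clients, rather than site-wide, sidesteps the indexing problem mentioned above.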
I've done a great deal of research on logs, and if you take the right steps, you can get 95-99% of your visitors. I've done some tests on sites where every url, page, and link is dynamically generated for each page. That alone will tell you who is caching and the depth of their caching.
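One way to run that kind of test (a sketch under my own assumptions about the technique, not Brett's actual setup) is to stamp each generated link with a one-time token. No cache anywhere can already hold that exact URL, so every request for it must hit the origin server; comparing hit counts on tokenized versus plain URLs gives a rough estimate of how much traffic the caches are absorbing:

```python
import uuid

def bust(url):
    """Append a one-time token so this exact URL has never been cached,
    forcing the request through to the origin server."""
    sep = '&' if '?' in url else '?'
    return f'{url}{sep}v={uuid.uuid4().hex}'
```

The cost is the same as any cache-busting scheme: you're deliberately defeating the caches you test against, so it's a diagnostic, not something to leave on everywhere.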
Brett, you are 100% correct. I usually see about 95-99% accuracy as well, but that's overall. I have seen cases of individual pages that were heavily cached and whose results were grossly out of proportion; e.g., a splash page was cached and the home page was not.
If a webmaster tells me "This month was the busiest month ever for my site," I have to believe him. (In theory even that statement might be wrong, but come on, let's get real.) However... if he says "Page xyz.html is the most requested page on my server," or "Exactly 800 unique visitors came to my homepage today" then I think he's just fooling himself.
Also, I'm against destroying the cache just for the purpose of logging. If your site has a real reason why it needs to block caches, e.g. it's 100% dynamic, then fine, but otherwise I say let your site do its thing, and let the caches do theirs.
Caches are great! Do you know how slow your computer would be without a CPU cache? Do you have any idea what a drag the net would be without your local hard drive cache? And router caches are becoming increasingly important too, as a means of taking the bandwidth load off the net. If somebody in Australia wants a web page, and the data they request is already over on that side of the globe, why should my server be bothered?
As time goes on, I hope to see more caching. It doesn't bother me in the least that the logfiles don't tell the whole story.