Why webserver stats are questionable

Brett_Tabke

2:54 pm on Jun 20, 2001 (gmt 0)

[goldmark.org...]

A nice reality check for those of us that put too much faith in log file reading.

Bentler

12:30 am on Jun 22, 2001 (gmt 0)

Good article and I agree with the author's main point that log statistics are inaccurate for a number of reasons. I think it might be a little outdated though-- the note at the bottom says it was published in 1995. Now, stats are probably even less accurate than when it was published--for example, AOL now uses caching to serve up pages to its customers, and a number of caching servers are available to reduce network traffic for many sites.

However, I disagree that stats are useless for anything other than measuring server load-- to assert that relative importance of pages can't be inferred ignores probability. Probability does not require exactness, but it does require logic. The suggestion that one visitor might go to the same page 400 times, for example, is extremely unlikely, much less month after month after month. Likewise any suggestion that pages indicating more visits might be cached less often than those indicating fewer visits is illogical.

As I see it, because of caching, it's a certainty that log statistics undercount actual visitors-- the author glossed over this overwhelming probability by suggesting 1 person could visit a single page 400 times. Sorry, but that example is just not realistic.

evinrude

12:53 am on Jun 22, 2001 (gmt 0)

>the author glossed over this overwhelming probability by suggesting 1 person
>could visit a single page 400 times. Sorry, but that example is just not realistic.

I don't think the author was suggesting that. If I'm reading the part of the page you are referring to, what he is suggesting is that you can infer AT LEAST one user hit the page. Perhaps one of them hit it 20 or so times. You can't infer all 400 hits came from 400 different people.

Bentler

1:10 am on Jun 22, 2001 (gmt 0)

I agree that you can't infer 400 hits came from 400 different people-- what I understood the author's point to be, though, was that you can't infer 400 hits came from any more than one person.

<<If by minimum you mean "at least one" then yes.>>

In purely mathematical terms, his is a true statement-- you can't know that any more than one person visited. My point was that, considering probability, this scenario is extremely unlikely. The author uses this premise to justify his conclusion that usage statistics are only useful to understand server load and nothing else, when really the conclusion should be (I think) that usage statistics are useful to calculate server load exactly, but that some other useful, informative, decision-feeding measures can be derived inexactly.

Maybe I read it wrong though-- it wouldn't be the first time. The piece did seem a little harsh and maybe that affected how I interpreted the author's words.

In any case, I'd be interested in a statistical analysis comparing cookie-based visits to IP-based visits for a large number of pages. Anyone here know of such a comparison?

chiyo

5:51 am on Jun 22, 2001 (gmt 0)

It's a good reminder. On the Analog site they do have what I feel is a more useful page on what log programs can and cannot do. (I have referred to this before on WMW somewhere.)

Overall, the article seems a bit gloomy. We use log stats for:

1. Checking up on what pages/sites are linking to us

2. Relative growth in popularity of certain pages over time. (We disagree with the author here that the stats are useless on this point. They are flawed but still indicative.)

3. Rough unique visitor hits (also aware of caching problems)

4. Agree that actual hits are useless. We all know that. But using these stats is not much worse than the "subscription" or "readership" figures used for years in traditional publishing as an accepted measure. They too are subject to enormous margins of error.

That said, most of our hits come from spiders, hits to our XML news feeds, email harvesters, etc. The actual number of pages "read" is really an unknown. People claiming "page reads" need to be asked how many are coming from these automated programs, and how many are for actual content (and not auto-loading pages, pages with little or no content, etc.).
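As a rough illustration of the question above (how many hits come from automated programs), here is a minimal sketch. It assumes the log is in Combined Log Format, where the user agent is the last double-quoted field on each line. The `BOT_MARKERS` list and the function names are hypothetical, made up for this example; a crawler that does not identify itself will still be counted as a person.

```python
# Hypothetical substrings that often appear in crawler user agents.
# This list is an assumption for illustration, not a complete catalogue.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_automated(user_agent):
    """Return True if the user agent contains one of the marker substrings."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def split_hits(log_lines):
    """Return (other_hits, automated_hits) for Combined Log Format lines."""
    other = automated = 0
    for line in log_lines:
        # The user agent is the last double-quoted field on the line.
        fields = line.rsplit('"', 2)
        if len(fields) < 3:
            continue  # line has no quoted user agent; skip it
        if is_automated(fields[1]):
            automated += 1
        else:
            other += 1
    return other, automated
```

Even with a filter like this, the "other" bucket only means "did not announce itself as a robot", so it is an upper bound on human reads at best.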

Bolotomus

7:13 pm on Aug 18, 2001 (gmt 0)

I think this author is right on the mark.

Web stats don't even begin to do what we'd like them to do.

Bentler you say,
> As I see it, because of caching, it's a certainty that log statistics undercount actual visitors

Wait a minute--what's all this about visitors? The author cleverly did not use the word 'visitor' once in the document. And for good reason! The term 'visitor' has no good definition in terms of web-stats. When you're talking visitors, then the situation gets even worse, unless you have some special mechanism to track your users like generating a unique session ID for them.

Suppose you see that 10,000 requests came in to your server for page xyz.html. Of these requests, there are 7,000 unique IP addresses. You have no idea how many real visitors saw that page. It might be less than 10,000 ... or it might be less than 7,000 ... or it might be over 20,000. In theory, the true number of visitors could be as few as "1" and as many as "the population of the earth." How's that for an error factor?

In the case of visitor tracking you have other factors that affect it the other way, e.g., proxy servers. On one hand you have 100 people who appear, to your server, as a single user. On the other hand you have 1 person who appears, to your server, to be 100 people. So in the end, you don't even know if your webstats are above or below the truth.
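For anyone who wants to see where the two figures in the xyz.html example come from, here is a minimal sketch, assuming the server writes Common Log Format (client address first, request path as the seventh whitespace-separated token). The function name `summarize_page` is made up for this example. Note what it does and does not give you: a request count and a distinct-IP count, neither of which is a visitor count.

```python
from collections import Counter


def summarize_page(log_lines, page):
    """Return (total_requests, distinct_ips) for one page.

    Assumes Common Log Format:
    host ident user [date tz] "METHOD path protocol" status bytes
    """
    hits_per_ip = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 7:
            continue  # not a well-formed log line; skip it
        ip, path = parts[0], parts[6]
        if path == page:
            hits_per_ip[ip] += 1
    return sum(hits_per_ip.values()), len(hits_per_ip)
```

A proxy that fronts 100 people adds 1 to the second number, and one person on a rotating address pool can add 100 to it, so the distinct-IP figure can be off in either direction.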

Bolot

Bolotomus

7:15 pm on Aug 18, 2001 (gmt 0)

Related thread
[webmasterworld.com...]

Bolotomus

7:29 pm on Aug 18, 2001 (gmt 0)

An anecdote: I did work on a site for a few years and the owner was planning on selling it to a big corporation that wanted to get in on the online-news biz. So the owner was concerned, when he made his sales pitch, that we weren't lying about the stats. You could go to jail for fraud if you said the site had 10,000 visitors a day when it was really only 5,000.

He wanted me to sign an affidavit saying something like "The site averaged 9,820 visitors per day during 1999" as being "absolutely 100% correct in my professional opinion."

I told him, "No way--that figure could be totally false." Then he suggested that we modify the number, so that it reads "9,820 (+/- 5%) per day." I told him, "Still no way. You don't know whether to say 5% accuracy or 25% accuracy, you just don't know." That brought up the question: with what accuracy do you know for sure that your visitor count is correct?

The site owner, no doubt, concluded I was just being obstinate. "How could this guy be right," he thought, "if he was, then these companies like Webtrends are just selling snake-oil?" Snake oil indeed, the kind that can grow hair on a turnip.

But it's only dangerous snake-oil if you place too much faith in it. If you understand what the software is really doing, and know which portions of its output to take with a grain of salt, then it's quite useful information.

Bolot

reflex

1:45 pm on Aug 23, 2001 (gmt 0)

Since we are talking a little bit about word usage, I'm not a big fan of the word 'hit' and the way it's used so loosely in web terminology.

But in web stats, don't the individual IP addresses count for something?

zoidberg

9:04 am on Sep 1, 2001 (gmt 0)

Something I do to corroborate stats is put a simple javascript call on the home page for a site. It calls a cgi program that logs the IP and the time the page was loaded. Remove any duplicate IPs and it should give you a pretty accurate count of "visitors", (I think?).

Usually it's about 60% of what Webtrends reports.
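The "remove any duplicate IPs" step could look something like the sketch below. It assumes the cgi program writes one line per page load in the form IP, a tab, then the time; that file layout and the function name `count_unique_ips` are assumptions for illustration, since the post does not say how the log is stored.

```python
def count_unique_ips(beacon_lines):
    """Count distinct IPs in lines of the form 'IP<TAB>timestamp'."""
    seen = set()
    for line in beacon_lines:
        # Everything before the first tab is taken to be the IP address.
        ip, _, _ = line.strip().partition("\t")
        if ip:
            seen.add(ip)
    return len(seen)
```

As discussed earlier in the thread, proxies and changing addresses mean this is still a count of IPs that ran the script, not of people.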