We all know that it is almost impossible to reliably re-identify a (unique) visitor by IP and timestamp alone. So AFAIK, these programs generally assume a "new" visitor whenever there is a pause in activity on the client's side of more than half an hour or so.
So my own script necessarily deviates from normal logfile analysis. The major difference: if the visitor sends no PHP session variable, I check the last THREE HOURS (instead of just 1440 seconds, which is the default value of PHP session management). We have a bit more than 2k uniques according to the official stats, so I believe the probability is very low that a new visitor receives the full shopping cart of a predecessor who happened to arrive just two hours earlier from the same IP ;)
The difference between the figures nearly knocked me off my feet: my own stats say we have just 1.5-1.6k visitors a day instead of 2.2-2.3k. That is roughly only 70 per cent.
All this is quite new; the data covers only a week or so. It may also be that something is completely wrong with my script, but I doubt it, because we receive orders as usual and have had no complaints yet, which would not be the case if there were any serious inconsistencies in the script.
Can anyone confirm or refute this data? Do most statistics overstate the number of uniques because far more visitors than assumed leave their browser window open over lunchtime?
Research has shown a gap of several weeks from initial research to final purchase.
I can't disbelieve your figures for the discrepancies, though.
I wonder why the sessions are normally limited to 1440 seconds. There are 1440 minutes in a day. Are you sure the limit isn't 1440 minutes?
If you want to identify individual people's long-term behaviour, yes, you should. But I am not aiming at such a sophisticated level yet, and I think the stats programs just tell you your uniques per day. They don't work with sessions but with logfile analysis, so there is no chance these programs work over months. (Logfiles have to be deleted after six weeks by law here in Germany.)
> Are you sure the limit isn't 1440 minutes?
No, I am not sure. Maybe, but the German translation of the PHP manual says this php.ini entry defines 1440 seconds as the default value for garbage collection, which is a bit less than half an hour. And I vaguely recall reading explanations of logfile analysis where a similar figure was mentioned: half an hour is the timespan after which these programs assume a new visitor has arrived.
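For reference, the value is easy to check on any installation; a minimal sketch (the default shipped with PHP is indeed 1440 seconds, i.e. 24 minutes):

    <?php
    // Print the session garbage-collection lifetime in seconds.
    // A stock installation prints 1440, i.e. 24 minutes.
    echo ini_get('session.gc_maxlifetime'), "\n";

    // To let idle sessions survive longer, raise the value before
    // session_start() (here, as an example, to three hours).
    ini_set('session.gc_maxlifetime', 3 * 60 * 60);
    session_start();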
I am far from being an expert on this. But 30 per cent is a lot.
But I must correct and explain myself a bit:
The main reason why I compare timestamps at all is that I did not want bots with cookies disabled to create a new entry in my database on every request. The logic roughly runs as follows (a rough sketch in code follows the list):
1) ONLY IF no session ID is sent, compare timestamps; otherwise take the session ID, query the data (shopping cart) and carry on. This supposedly works for more than 99% of all returning human visitors.
If no session ID is sent by the visitor's browser, this may be a new visitor, someone with cookies disabled, or someone who left the browser window open for longer than the PHP session expiration time.
2) IF the last request from this IP was within the last THREE HOURS, query the referrer from the stored entries.
3) IF this last request was less than TEN MINUTES ago, take the session ID from the entries and increase the pageview count by one; otherwise generate a new session ID. This mainly helps to identify bots and, as a side effect, lets human "paranoids" with cookies disabled put things into my shopping cart ;) For instance, googlebot from 66.249.65.136 has a bit more than 200 entries with 3300 'pageviews' in total over the past seven days, so the number of true uniques should actually be reduced a bit further.
4) IF the referrer queried comes from my own site, this is either someone who left his browser window open over lunchtime or someone who faked the referer for the purpose of a hacker attack. So I fetch the old session ID and present him the full shopping cart. So far about five to six out of 1500 per day on average.
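To make the four steps concrete, here is the rough sketch mentioned above. It is not the actual script; the table and column names (visits, ip, last_seen, session_id, referrer, pageviews), the PDO connection and www.example.com as a stand-in for my own domain are all placeholders:

    <?php
    // Rough sketch of the four steps above, NOT the original script.
    // Table/column names, the PDO connection and www.example.com
    // are placeholders only.
    $pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');
    $ip  = $_SERVER['REMOTE_ADDR'];
    $ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    $now = time();

    if (isset($_COOKIE[session_name()])) {
        // 1) A session ID was sent: resume session and shopping cart.
        session_start();
    } else {
        // No session ID: fall back to the IP/timestamp comparison.
        $stmt = $pdo->prepare('SELECT session_id, last_seen, referrer
                               FROM visits WHERE ip = ?
                               ORDER BY last_seen DESC LIMIT 1');
        $stmt->execute([$ip]);
        $row = $stmt->fetch(PDO::FETCH_ASSOC);

        if ($row && $now - $row['last_seen'] <= 3 * 3600) {        // step 2
            if ($now - $row['last_seen'] <= 10 * 60) {             // step 3
                // Same IP within ten minutes: same visit, one more pageview.
                session_id($row['session_id']);
                $pdo->prepare('UPDATE visits
                               SET pageviews = pageviews + 1, last_seen = ?
                               WHERE session_id = ?')
                    ->execute([$now, $row['session_id']]);
            } elseif (strpos($ref, 'www.example.com') !== false) { // step 4
                // Referrer is my own site: window left open over lunch,
                // hand back the old session (and with it the old cart).
                session_id($row['session_id']);
            } else {
                session_id(bin2hex(random_bytes(16)));  // new visit
            }
        } else {
            session_id(bin2hex(random_bytes(16)));      // new visitor
        }
        session_start();
    }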
Now, having re-read the logic of my script once again, I come to the conclusion that the number of uniques is even smaller than the 70% given above. I assume that general logfile analysers use perhaps five minutes instead of ten for returning IPs, which would be in accordance with your remarks on random distribution.
But this way googlebot alone would show up as probably 100 visitors every day in my hoster's statistics. And bots make up a really large percentage of total traffic on smaller websites like mine.
I am quite sure at the moment that my analysis contains no serious mistakes or logical inconsistencies. The question is:
What figures will such an analysis reveal on sites with far more traffic?
My data says something different, and I am still well able to take a very close look at single entries, be it in my database or in the raw logfiles. Giving details would violate the TOS, but I doubt anyone reads articles in the New York Times or works through a regex tutorial page in less than five minutes.
A tremendous amount of traffic nowadays comes from search engines and other bots, particularly Google, and I am not the only one to have reported this. These machines show very particular behaviour: they don't go out for lunch, they do not pass session IDs, and they DO seem to read a 20 KByte report in a few seconds. This traffic skews statistics by up to fifty per cent and sometimes even more.
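If you only want a rough way to keep the commonest crawlers out of the visitor count, something along these lines will do; the pattern list is illustrative, not complete:

    <?php
    // Very rough bot check based on the User-Agent header.
    // The pattern list is illustrative, not exhaustive.
    function looks_like_bot($userAgent) {
        $patterns = ['Googlebot', 'Slurp', 'msnbot', 'spider', 'crawler', 'bot'];
        foreach ($patterns as $p) {
            if (stripos($userAgent, $p) !== false) {
                return true;
            }
        }
        return false;
    }

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (looks_like_bot($ua)) {
        // Count the hit for a bot report, but not as a unique visitor.
    }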
> You need to deal with average page viewers, not those with your profile (or mine).
What I need is data about my customers that is as accurate as possible. The interesting ones among them (i.e. those with money in their pockets) are searching for something to escape averageness.
I've run another analysis of a section of my data, recording intervals for each IP address separately. I get 10,000 IPs over a period of 344 hours before I run out of memory in this program version. The result shows that the intervals are random beyond 20 minutes for each IP separately.

Most requestors, of course, don't have static IPs, but get a new one assigned from their host's block each time they log on. And hosts have to recycle users' IPs after a dead time so they don't run out when users stay connected permanently.

So, my result means that after an interval of 20 minutes from the last request I can't tell the difference between the same user on the IP and someone else re-using it, nor between one user on one IP and the same user on a different IP.
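For anyone who wants to repeat this kind of interval analysis on a raw access log, a minimal sketch in PHP; the log path and the regex for the log format are assumptions, adjust them to your own setup:

    <?php
    // Sketch: inter-request intervals per IP from an Apache-style log.
    $lastSeen  = [];   // ip => timestamp of the previous request
    $intervals = [];   // all observed gaps in seconds

    foreach (file('/var/log/apache2/access.log') as $line) {
        if (!preg_match('/^(\S+) \S+ \S+ \[([^\]]+)\]/', $line, $m)) {
            continue;
        }
        $dt = DateTime::createFromFormat('d/M/Y:H:i:s O', $m[2]);
        if (!$dt) {
            continue;
        }
        $ip = $m[1];
        $ts = $dt->getTimestamp();
        if (isset($lastSeen[$ip])) {
            $intervals[] = $ts - $lastSeen[$ip];
        }
        $lastSeen[$ip] = $ts;
    }

    // Crude histogram in 5-minute buckets: where the counts flatten out,
    // gaps stop looking like one continuous visit.
    $buckets = [];
    foreach ($intervals as $gap) {
        $b = intdiv($gap, 300);
        $buckets[$b] = isset($buckets[$b]) ? $buckets[$b] + 1 : 1;
    }
    ksort($buckets);
    foreach ($buckets as $b => $count) {
        printf("%4d-%4d min: %d\n", $b * 5, ($b + 1) * 5, $count);
    }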
No one composes messages on my site, nor do I have any forms to fill out. If your site does either, then your random point may be a bit longer. But this is how you tell what a unique user is on a site.
Then you need cookies.
I'll try another analysis to see how often an IP address changes its agent field, which of course almost always means a change of user.
To remove bots, see [sankey.ws...] for the commonest ones I get hit with.
The rate of replacement is less than I expected. Using IP address plus agent as your ID seems to allow a fairly long connect period.
Now all I have to figure out is how to bypass my compiler's 16-bit string space limit so I can use it here for real.
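For anyone doing this in PHP rather than in a compiled program, the combined ID can be as simple as the following sketch; the hash only keeps the key at a fixed length:

    <?php
    // Sketch: a visitor key from IP plus User-Agent. Good enough for
    // rough uniques, not for identifying a person.
    $ip  = $_SERVER['REMOTE_ADDR'];
    $ua  = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $key = sha1($ip . '|' . $ua);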
As I said, that was only a first glimpse of the potential of detailed referer logging, a side note to my findings.
Meanwhile I have elaborated the code a bit further; it now shows me the exact number of visitors coming in via Google organic, Google AdWords and other search engines, as well as ordering visitors typing in URLs directly (i.e. no referer at all) or searching for my company name. I did not think we had so many returning customers. Makes me quite happy.
I also have details for each of my landing pages and the search phrases the visitors used (per page), which gives me an interesting basis for keyword analysis and the AdWords campaign I have just started. I implemented search functions for the landing pages and can have the results sorted by visitors, number or value of orders, average order, orders per visit or euro per visitor, the latter being the most interesting one for calculating ROI and the limits of AdWords campaigns.
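The reporting side can be little more than one aggregate query. A sketch, assuming a PDO connection $pdo as in the earlier example and made-up table and column names (visits, orders, landing_page, visitor_id, order_id, total_eur):

    <?php
    // Hypothetical query: visitors, orders, revenue and euro per visitor,
    // grouped by landing page. All names here are made up.
    $sql = "SELECT v.landing_page,
                   COUNT(DISTINCT v.visitor_id)   AS visitors,
                   COUNT(o.order_id)              AS orders,
                   COALESCE(SUM(o.total_eur), 0)  AS revenue_eur,
                   COALESCE(SUM(o.total_eur), 0)
                       / COUNT(DISTINCT v.visitor_id) AS eur_per_visitor
            FROM visits v
            LEFT JOIN orders o ON o.visitor_id = v.visitor_id
            GROUP BY v.landing_page
            ORDER BY eur_per_visitor DESC";

    foreach ($pdo->query($sql) as $row) {
        printf("%-40s %5d visitors  %8.2f EUR/visitor\n",
               $row['landing_page'], $row['visitors'], $row['eur_per_visitor']);
    }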
> ...data that doesn't need to be collected.
Collecting data can hardly ever be wrong. Look at Google: knowledge means power. If standard statistics programs (at least the one my hosting company uses) are so utterly wrong about the number of unique visitors on websites, this has severe consequences for many companies and projects.
Whether people want to hear this is a different matter.
Now, after a fortnight of data, I'm absolutely convinced my log script has no serious mistakes.
Oliver is right: knowledge means power - in my case the power of understanding the world I live in. But I think of it as:
science is curiosity.
[sankey.ws...] now has the data to show that.