This 54 message thread spans 2 pages.
How to track visitors
An intermediate guide
| 3:05 pm on Dec 3, 2003 (gmt 0)|
It seems to me that there are far too many questions repeated in this forum, as people come in, usually finding their ISP's log analysis is one of the following:
1) non-existent, 2) wildly inaccurate, or 3) incomprehensible.
The first question they then ask is "what package should I use?" and, without naming names, this thread should tell you what to look for and why.
Logs vs. client side (JavaScript tagging).
Let us be clear. Log files were originally intended to measure how much a machine was being used by other computers. The idea of measuring human behaviour is a relatively new one, and not an activity ideally suited to log files. By comparison, client side tracking (usually a piece of JavaScript on your web page which runs when a user loads the page) is disastrous at tracking robots, or at working out whether your webserver is getting too busy to cope.
Why do all tracking systems give different results?
Here are five reasons why logfile tracking systems screw up measuring users (I could think of many more, but these will be enough to confuse most bosses):
1) They identify "unique users" (generally) by the IP number that the user arrives on. However, many people can come to the website in the same day on the same IP if they go through a dial-up provider. This error will be more prominent on busy sites.
2) Some ISPs (in fact, most) use proxy servers to cache pages. This means user A requests your home page. Ten minutes later, person B requests your home page, but it is delivered by the proxy server, not your website, so there is no record of this event in your logfile.
3) Even conversion data is ruined by point 2. This is because person B may then click on a link and lo... he appears to have started on an inner page on your site, when he didn't.
4) Point 2 also wrecks search engine tracking, since the user links to an inner page and lo... the referring domain appears to be an internal referral.
5) When a person returns to a website the next day, they will usually arrive on a separate IP number and thus be identified differently. Log file analysis cannot tell you how many times a person comes before they buy. Nor can it track returning visitors back to their original referrer.
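Points 1 and 5 can be illustrated with a minimal sketch (the log hits and IPs below are invented). A naive log analyser counts distinct IPs: three people behind one dial-up/proxy IP collapse into one "user", while one returning visitor on two IPs is counted twice.

```python
from collections import defaultdict

# Hypothetical log hits: (ip, page_requested).
hits = [
    ("212.1.1.5", "/home"),      # person A via a shared dial-up IP
    ("212.1.1.5", "/products"),  # person B via the same IP
    ("212.1.1.5", "/contact"),   # person C via the same IP
    ("81.2.3.4",  "/home"),      # person D, first visit
    ("81.9.9.9",  "/home"),      # person D again, new dial-up IP next day
]

pages_by_ip = defaultdict(list)
for ip, page in hits:
    pages_by_ip[ip].append(page)

# What a naive log analyser reports as "unique users":
unique_by_ip = len(pages_by_ip)
print(unique_by_ip)  # 3 -- but the real number of people is 4
```

Three IPs, four people: the IP-based count is wrong in both directions at once, and neither error is visible from the logs themselves.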
So... JavaScript tagging is better, right? Well... not necessarily...
Here are seven reasons why JavaScript tagging systems screw up measuring users:
1) JavaScript tags must go into HTML code, which means it requires ingenuity to track pictures, sound files, pdf files, or any other types of downloads.
2) The tag tracks browsers, not users. So if a person looks at your site from work, then from home, they are counted twice.
3) If a person looks at your site in Netscape, then in Internet Explorer (because you wrecked their user experience in Netscape), then they will be counted twice.
4) JavaScript tags do not record spiders, such as googlebot.
5) The user may block all cookies if they choose.
6) If the server called up by the JavaScript tag is slow to respond, then the user may never be tracked, since they move from the page before the page load is complete.
7) As I sit writing this from a fancy web phone, I have to add that some people do not see your site using JavaScript enabled browsers... they will not be tracked.
This said, JavaScript tagging is MUCH better for most people to use, probably correctly analysing 90-95% of human users, whilst (IMHO) even the best log tracking can miss 35% of users on busy sites (though it is much more accurate on quieter sites).
My advice, therefore, has to be the latter approach from a marketing perspective. So... even BEFORE saying how to select a third party product or not, I had better explain what you must first do to make it legal to even use such a system... (yep... it will be illegal if you don't do the following)
On December 11th (2003) a new EU law comes into force that says you MUST tell users when you are using cookies to identify them, and you MUST tell them how they can opt out. My advice (and I am no lawyer, get legal advice yada yada) is to have a link on all pages saying "how we track" then have a page explaining that you track using cookies and that they can disable this in IE settings (explaining how). Actually, we track other things in the event that we cannot lay a cookie, so I intend to suggest that if they want to opt out, they better go through a site that disguises their real identity.
Choosing a JavaScript tracking system:
I cannot recommend any here, but consider these factors:
1) The BEST is a system that sits on your own server, thus eliminating any problems with third party cookies. For busy sites, this may also be the most cost effective, but is certainly the hardest to integrate and requires you to have (probably) a dedicated server.
2) The best THIRD PARTY systems will also be the most expensive. What is important is not whether it tracks visitors, but whether it can show TRENDS over time and CONVERSIONS by search engine, keyword or campaign.
3) Even the best third party software is dependent on its own server load and server downtime. So, the most popular may also be the least accurate. (Not saying it is... but saying it could be a victim of its own success). Also consider whether the stats server is 7,000 miles from your own users, as this could be a factor.
Hope this helps a number of people in this forum.
[edited by: Receptional at 3:34 pm (utc) on Dec. 3, 2003]
| 3:20 pm on Dec 3, 2003 (gmt 0)|
That's a great post Receptional :)
Here are a few useful posts with pointers I've compiled earlier:
Msg #9 in: Difference between a log analyzer and a stats software [webmasterworld.com]
Msg #9 in: Learning more about Googlebots behaviour on my site. [webmasterworld.com]
| 3:57 pm on Dec 3, 2003 (gmt 0)|
Nice post. A few comments.
> it requires ingenuity to track pictures, sound files,
> pdf files, or any other types of ownloads
> Java[script] tags do not record spiders, such as googlebot
This could be considered a weakness, but I personally consider it a strength. All the "noise" from robots, harvesters etc. is filtered resulting in increased accuracy.
However, this does suggest that a combination of client side and server side tracking is preferable.
> people do not see your site using java enabled browsers...
> they will not be tracked.
Actually, even under these circumstances, the 3rd party tool would theoretically be able to track the same data as a log analyser (if implemented correctly).
> most popular may also be the least accurate
This would indicate a bad pricing policy more than anything.
> that sits on your own server, thus eliminating
> any problems with third party cookies
Some 3rd party systems are able to use 1st party cookies. I think this will be more and more common among the web analytics providers.
| 4:37 pm on Dec 3, 2003 (gmt 0)|
Valid additions Claus and lundsfryd. Thanks.
| 4:46 pm on Dec 3, 2003 (gmt 0)|
There is a long long discussion about cookies and privacy in this usenet thread [groups.google.com] which I started there last week.
My conclusion is definitely that I will continue to block cookies, except from specific (mostly login) sites that I choose to allow cookies from.
However, I think most users will not be informed enough to make a choice at all for quite some time, if ever. IE6 allows cookies by default; the question is what will later versions do?
[edited by: DaveAtIFG at 3:27 pm (utc) on Dec. 4, 2003]
[edit reason] Fixed side scrolling [/edit]
| 5:08 pm on Dec 3, 2003 (gmt 0)|
My opinion on cookies is that the reputation is much worse than the reality. There are many reasons for my opinion, but it is based on the fact that cookies can only store information that you give them. Therefore, the focus should be on the general treatment of your personal information and not on the cookies themselves (they are just one of many ways of storing data).
> IE6 allows cookies by default, question is what will later versions do?
It's hard to tell, but personally I think that the next IE will have a similar way of handling cookies, but with a higher degree of warnings and information.
If anything is blocked by default in the next version, it is HTTP_REFERER (which is bad enough to lose from an ebiz point of view).
| 5:22 pm on Dec 3, 2003 (gmt 0)|
>> If anything is blocked by default in the next version, it is HTTP_REFERER
- could you elaborate on that? Any evidence or mainly a theory?
Imho, that would be a seriously bad thing.
| 5:30 pm on Dec 3, 2003 (gmt 0)|
I think Norton Privacy already does this type of block, doesn't it? And yes - it is a seriously bad thing for trackers, but much can be got around by any link in using ?source=soandso, I guess.
| 6:28 pm on Dec 3, 2003 (gmt 0)|
> Any evidence or mainly a theory?
*Only* a theory :-)
> much can be got around by any link in using?source=soandso
Indeed. But in natural search this would be quite hard to do.
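The ?source= workaround discussed above can be sketched as follows. This assumes your own campaign links are tagged with a hypothetical source parameter (the name is just an example): trust the tag first, and fall back to the referrer host only when it is actually sent - which, as noted, natural search traffic cannot be forced to do.

```python
from urllib.parse import urlparse, parse_qs

def campaign_source(url, referrer=None):
    """Best-effort traffic source: an explicit ?source= tag beats the
    referrer, because the referrer may be blocked by privacy software."""
    qs = parse_qs(urlparse(url).query)
    if "source" in qs:
        return qs["source"][0]
    if referrer:
        return urlparse(referrer).netloc or None
    return None  # referrer blocked and no tag: origin unknown

print(campaign_source("http://example.com/landing?source=newsletter"))
# newsletter
print(campaign_source("http://example.com/landing",
                      "http://www.google.com/search?q=widgets"))
# www.google.com
print(campaign_source("http://example.com/landing"))
# None
```

The last case is the gap being discussed: an untagged link plus a blocked referrer leaves no origin at all.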
| 7:14 pm on Dec 3, 2003 (gmt 0)|
I'm a bit new to this whole bit of tracking.
I use PHP and cookies to track users, the data being dumped into mySQL.
Why such little mention of server side tracking?
(I feel that a mix of both server and client side would be a good idea)
| 11:13 pm on Dec 3, 2003 (gmt 0)|
I am a big fan of server side tracking. It allows a much closer look at what is happening, through customized response logging.
For instance if on page X a user gets a validation error, record a response with a validation error code. Same deal for your login page, record a different code for each possible login outcome: login success, no user name, bad password, etc...
Marry this with user cookie and session tables and you have a very robust system capable of giving you a lot of detailed information about your site and what people do on it.
Another great benefit is if you write this out directly to a DB table it is right there to point at the rest of your data.
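A minimal sketch of that idea, using sqlite3 as a stand-in database (the table name, columns, and response codes below are invented for illustration, not anyone's actual schema): each page hit is logged with an outcome code, so failed and successful logins are distinguishable in the same table and can be aggregated with plain SQL.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One row per tracked event; `response` records the outcome of the action.
db.execute("""
    CREATE TABLE hit_log (
        session_id TEXT,
        url        TEXT,
        response   TEXT   -- e.g. LOGIN_OK, LOGIN_BAD_PASSWORD, VALIDATION_ERR
    )""")

def log_hit(session_id, url, response="OK"):
    db.execute("INSERT INTO hit_log VALUES (?, ?, ?)",
               (session_id, url, response))

log_hit("s1", "/login", "LOGIN_BAD_PASSWORD")
log_hit("s1", "/login", "LOGIN_OK")
log_hit("s1", "/checkout", "VALIDATION_ERR")

# Because it lives in a database, outcomes aggregate with one query:
rows = db.execute(
    "SELECT response, COUNT(*) FROM hit_log GROUP BY response ORDER BY response"
).fetchall()
print(rows)
# [('LOGIN_BAD_PASSWORD', 1), ('LOGIN_OK', 1), ('VALIDATION_ERR', 1)]
```

A raw web server log would show three requests to two URLs; the response codes are what turn that into "one failed login, one success, one validation error".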
| 1:47 pm on Dec 4, 2003 (gmt 0)|
I got 123LogAnalyzer to work, although it would be a right pain for me to use on a regular basis across 60 sites.
It did confirm my concerns about the accuracy of logfiles. I chose a day and site that had 11 sales that day. I then filtered the logs by the IP numbers of the buyers (which we record on confirmation of their purchase).
Lo - 123LogAnalyzer only found 9 in this filter (and it also doesn't work in Netscape...)
But even with these 9, NONE showed me referring URLs or domains outside my own sites. In fact, most just showed the return URL from the credit card authorising location, which presumably would be the same with all commerce sites trying to track sale completion.
I think there are two issues here. The first is that very few systems have done their utmost to define a user properly. I blame marketing people for this - accurately identifying a user is not a great USP, they think... well, it SHOULD be. Bad data in is equalling bad data out on every system I see. The second is that very few systems (hey, none) have made it their mission in life to measure exactly what we want. True, you can set up campaigns and other things on tagging systems, but it is fiddly. If a person in one department sets up a new AdWords campaign, they shouldn't have to talk to techies in another department to get the campaign listed on marketing reports.
It is not too much to ask, to KNOW what your buyers typed in to find you and where they came from and for the tracking to match the number of credit card transactions. Not too much at all.
| 2:03 pm on Dec 4, 2003 (gmt 0)|
>The first is that very few systems have done their utmost to define a user properly. I blame marketing people for this -
Don't blame marketing people - blame users, who keep going over their log files with a fine tooth comb looking for information that isn't there and obsessing about things like which IP address visited their site using some weird ID.
>It is not too much to ask, to KNOW what your buyers typed in to find you and where they came from and for the tracking to match the number of credit card transactions. Not too much at all.
I fully agree, and have posted here a couple of times in the past year looking for advice on this issue. You were one of the handful of people who seem interested in that information. Which tells me again - the problem isn't the marketers, it is the users. To me, there is no single more important piece of information than where the people who book my homes came from (referrer-wise). But either not too many people are interested or they aren't sharing their secrets ;)
| 6:33 pm on Dec 4, 2003 (gmt 0)|
Suggest you try the product on my site, it does some of what you want - see profile for link.
Mods here told me that since I'm talking about my own product, it's against the TOS to discuss it or answer your questions further. Unless of course I was to pretend to be a user of my product. But that's not my way; I value integrity over money. They haven't yet answered the question I asked about whether the sentence above is allowed or not, so I guess I'll discover this way... ;-)
| 7:08 pm on Dec 4, 2003 (gmt 0)|
Build it yourself. If you want any sort of information beyond the standard canned reports then you need to build it yourself. I say this for a couple of reasons:
1. Websites are unique creatures and you are not going to find a packaged solution that matches up perfectly.
2. Advanced analytics are expensive to buy, but cheaper to build.
3. You are going to have to open up the hood anyway to get data from your DB into any solution's DB so why not keep it simple and just build something that writes to your database.
Here is some background on the site I work on. The site handles a couple of million uniques and about 50 million page views a month. We are an ASP site with an Oracle DB. We are a portal for a subject, so we do ecommerce, have user accounts, tools people can use, and a bunch of content. I am also lucky enough to have some good developers to build our stuff in house.
We issue two cookies to users: a user cookie, which is persistent, and a session cookie that gets issued for every session. We have two database tables that record every cookie that gets issued. One table is for the user cookies, the other is for session cookies.
The user cookie table stores the IP, user agent, initial create date, first visit referring_url, first visit initial_url, and the account ID, as well as the cookie value and a count of visits to the site.
The session cookie table stores the create date of the session, the initial URL of that session and the referring_url that brought them to the site for that session along with the session ID value.
These two tables on their own are fairly powerful in that you can figure out how many times each person visits your site and where they come from. Because they record the account ID you can also join them back to user tables to get an idea of what the user is and does.
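The two-table layout described above might look something like this sketch (sqlite3 standing in for Oracle, and every table and column name is my guess from the description, not the actual schema): a persistent row per person, a row per session joinable back to it, so visit counts and first referrers fall out of one query.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Persistent per-person cookie record (column names invented):
db.execute("""
    CREATE TABLE user_cookie (
        cookie_value   TEXT PRIMARY KEY,
        ip             TEXT,
        user_agent     TEXT,
        created        TEXT,
        first_referrer TEXT,   -- first visit referring_url
        first_url      TEXT,   -- first visit initial_url
        account_id     INTEGER,
        visit_count    INTEGER DEFAULT 0
    )""")
# One row per session, joinable back to the user cookie:
db.execute("""
    CREATE TABLE session_cookie (
        session_id   TEXT PRIMARY KEY,
        cookie_value TEXT REFERENCES user_cookie(cookie_value),
        created      TEXT,
        initial_url  TEXT,
        referrer     TEXT
    )""")

db.execute("INSERT INTO user_cookie VALUES "
           "('c1','1.2.3.4','Mozilla','2003-12-01','google.com','/home',42,2)")
db.execute("INSERT INTO session_cookie VALUES "
           "('s1','c1','2003-12-01','/home','google.com')")
db.execute("INSERT INTO session_cookie VALUES "
           "('s2','c1','2003-12-04','/pricing','msn.com')")

# How often has each person visited, and where did they first come from?
row = db.execute("""
    SELECT u.first_referrer, COUNT(s.session_id)
    FROM user_cookie u JOIN session_cookie s USING (cookie_value)
    GROUP BY u.cookie_value
""").fetchone()
print(row)  # ('google.com', 2)
```

Note how the original referrer survives across sessions: the second visit arrived from msn.com, but the person is still attributed to google.com - exactly the "track returning visitors back to their original referrer" ability that plain log analysis lacks.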
In addition to these tables, we have a custom server side logging system in place that writes directly to an Oracle DB.
This system can either just act as a simple logger where the user hits a page and we record the URL, the user cookie, session cookie, account ID, date, referring URL, a section code (what part of the site you are in), or we can customize the logging to record more detailed information.
The customized portion allows us to store more detailed information, such as response codes to certain actions (like whether a login was successful, or why it failed) and certain object values, for instance if they placed an order, what the order number is.
Because all of this is written to a database it can be joined over to our other data like the cookie tables, or our user/order/other tables. This allows us to follow the path a user took, do debugging, and aggregate site behavior.
Finally we also have a click tracking system in place where we can record clicks like PPC buys and tie it to certain actions like account creation or order placement.
This whole setup has given us enormous visibility into what our users do, both within a session and over the course of their life with us. Because everything is already in one database, it allows me to slice and dice the data easily. I create a lot of summary tables off of this data and a lot of one-off temp tables where I am examining one particular thing in detail.
The one thing that it didn't allow me to do was path analysis which I thought was next to impossible with SQL, but we just figured out how to do that using Oracle's nifty sys_connect_by_path command. Now we can create custom user sets and apply them to either our logging table at large or to smaller custom built traffic tables. For instance I can create a table of pages viewed immediately after an account was created and look at paths from there.
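Without Oracle's hierarchical queries, the simple version of that path analysis can be sketched in plain code (the hit rows below are hypothetical): order each session's hits, build the path, then pull the page viewed immediately after a chosen event.

```python
from collections import defaultdict

# Hypothetical hit rows: (session_id, seq, url). `seq` could be a timestamp.
hits = [
    ("s1", 1, "/home"),    ("s1", 2, "/signup"), ("s1", 3, "/welcome"),
    ("s2", 1, "/pricing"), ("s2", 2, "/signup"), ("s2", 3, "/docs"),
]

# Rebuild each session's click path in order.
paths = defaultdict(list)
for sid, seq, url in sorted(hits, key=lambda h: (h[0], h[1])):
    paths[sid].append(url)

# "Pages viewed immediately after an account was created": the page that
# follows /signup in each session's path (here /signup is never last).
after_signup = [p[p.index("/signup") + 1]
                for p in paths.values() if "/signup" in p]
print(after_signup)  # ['/welcome', '/docs']
```

This is the brute-force equivalent of what sys_connect_by_path does inside the database; pulling the rows out and concatenating in application code works fine at small scale, while the in-database approach avoids shipping millions of rows over the wire.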
It took almost two years on and off to get this system built entirely but it probably cost a third of what it would have cost to buy something similar and is twice as useful as anything I have looked at because we built it to do exactly what we wanted.
I am not sure this would be worthwhile for smaller sites or for someone who can't code or does not have developers on staff, but I am a convert to DIY tracking.
| 11:14 pm on Dec 4, 2003 (gmt 0)|
What a golden post! Thanks for sharing your efforts.
I'm working on a DIY tracking system that fits in nicely with the CMS that we use in-house. The ideas that you have shared have saved my tired brain a lot of thinking!
|The one thing that it didn't allow me to do was path analysis which I thought was next to impossible with SQL |
I guess you mean sophisticated path analysis like:
|create[ing] a table of pages viewed immediately after an account was created and look at paths from there. |
Could this be done by storing an action (account creation in this case) that refers to a session, then querying for that action, getting the sessions and concatenating the URLs to make a path?
| 9:38 am on Dec 5, 2003 (gmt 0)|
I agree - building our own is exactly what we are doing, but I am a marketing man, not a programmer, so we have different problems - the first being that I have to pay someone else, so the end product damn well better be sellable. What is inexpensive to you is expensive to me. The other issues are:
1) We are marketing 60 sites over several servers, so I need to find a solution that crosses different types of business and works from any platform.
2) I do not accept that if I can't lay down a cookie the tracking goes to pot. If cookies give 90% accuracy, then it is quite possible that the other 10% screws up the data massively, if they are recorded at all, and that has to be treated with much more seriousness than I see from publicly available software right now. I think our guy has found a very interesting backup way to identify the other 10%.
3) Managers want to see very standard reports, really - whether it is for a web site, a shop or a Steel Mill. They EXPECT accuracy - especially from the web - and they are not getting it. What is worse, their reports are not ESTIMATING the level of inaccuracy, so they aren't even aware that it is inaccurate.
Question: If a user doesn't accept cookies, and blocks the HTTP referrer, how do you identify him as the same person on page A as on page B? We are planning to use an algo combining browser settings with his clock (can we do that?), but what else can we use to give us an edge?
| 11:40 am on Dec 5, 2003 (gmt 0)|
Hmm.. I went to quote someone for you in this reply and then realized the person I was going to quote was you! Ha!
|I do not accept that if I can't lay down a cookie the tracking goes to pot. |
This may be the reality of the situation (though I'm still learning).
|Managers want to see very standard reports, really - whether it is for a web site, a shop or a Steel Mill. They EXPECT accuracy - especially from the web - and they are not getting it. What is worse, their reports are not ESTIMATING the level of inaccuracy, so they aren't even aware that it is inaccurate. |
From my experience managers will have to have some things explained to them - in simple steps without too much technical jargon. They need to understand that what they are asking for may be too much, but that there is a reward to using the internet effectively.
Check out this article: http://www.inc.com/magazine/20031101/workingwonders.html
There is success to be had on the web, but unfortunately user tracking is not an exact science. You can apply statistics and error analysis etc., but in the end you learn the most from hard work - looking at your logs and your tracking data for trends and patterns.
[edited by: tedster at 12:08 pm (utc) on June 1, 2004]
[edit reason] make link active [/edit]
| 12:23 pm on Dec 5, 2003 (gmt 0)|
|realized the person I was going to quote was you |
LOL - I talk too much.
| 2:27 pm on Dec 5, 2003 (gmt 0)|
>If Cookies gives 90% accuracy
Probably more like 97%
>then it is quite possible that the 10% screws up the data massively if they are recorded at all
If the user does not accept a cookie, you could still sessionize based on IP address. So then your "unknown" range drops from the 3%-5% range to a fraction of that (those users who don't accept cookies and come via proxy servers).
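That fallback sessionizing can be sketched like this (the hits are invented and the 30-minute inactivity gap is just a common convention, not a standard): hits from the same IP separated by less than the timeout are treated as one session.

```python
def sessionize(hits, timeout=1800):
    """Group (ip, timestamp_seconds) hits into sessions: same IP, with
    gaps under `timeout` seconds. A crude cookie-less fallback -- shared
    proxy IPs will still merge different people into one session."""
    last_seen = {}          # ip -> (last_timestamp, session_id)
    session_of, next_id = [], 0
    for ip, ts in sorted(hits, key=lambda h: h[1]):
        prev = last_seen.get(ip)
        if prev is None or ts - prev[0] > timeout:
            next_id += 1    # new session: unseen IP, or too long a gap
            last_seen[ip] = (ts, next_id)
        else:
            last_seen[ip] = (ts, prev[1])
        session_of.append((ip, ts, last_seen[ip][1]))
    return session_of

hits = [("1.2.3.4", 0), ("1.2.3.4", 600), ("1.2.3.4", 10000), ("5.6.7.8", 50)]
for row in sessionize(hits):
    print(row)
# 1.2.3.4 at t=0 and t=600 share a session; t=10000 starts a new one.
```

The weaknesses discussed in this thread are visible in the docstring: a proxy merges users, and an ISP that rotates IPs mid-session (as reported for AOL below) splits one user into several sessions.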
Like you, I am a marketer, not a tech guy. But I agree that most people don't understand how inaccurate typical visitor stats are once you get beyond basic page views.
| 2:36 pm on Dec 5, 2003 (gmt 0)|
|those users who don't accept cookies and come via proxy servers |
Hey Mardi_Gras, can you explain a bit more this business about proxy servers?
I do a bit of everything: marketing, coding, web design, usability, and this week, proposal writing. My brain is mush; I know that we've touched on this (proxy servers) before. Here's how I see it:
- If a user has cookies disabled, you can get them with a session.
- If a user comes via a proxy server, he/she isn't even seen in the logs, no? I had the impression that the user wouldn't even be seen on your server, am I wrong?
So I figured that for users that come via a proxy, you won't even know it. Maybe I don't understand proxy servers very well!
| 2:40 pm on Dec 5, 2003 (gmt 0)|
>most people don't understand how inaccurate typical visitor stats are
Spot on. I did my thesis on sessionization, and I did a lot of research, including products.
Users don't want to know this, vendors won't talk about it; everyone just wants pretty reports :)
| 2:45 pm on Dec 5, 2003 (gmt 0)|
|everyone just wants pretty reports |
Good! I want conversions, let them have their reports!
Either I'm really tired or you edited this into your post Receptional:
|Question: If a user doesn't accept cookies, and blocks http:referrer How do you identify him as the same person on page A as is on page B? We are planning to use an algo combining browser settings with his clock (can we do that?) but what else can we use to give us an edge? |
For this you would have to settle with sessions, in which case you can't track return visits, or have them login to use your page.
| 3:48 pm on Dec 5, 2003 (gmt 0)|
"Question: If a user doesn't accept cookies, and blocks http:referrer How do you identify him as the same person on page A as is on page B? We are planning to use an algo combining browser settings with his clock (can we do that?) but what else can we use to give us an edge? "
If the page is being fetched from your server, using the IP address will almost always ID the pages the user visited.
If he blocks http:referrer that just means you won't know where he came from in the first place.
If you are not allowed cookies, I don't see how any unique ID helps you; you can *only* use an ID if you are sending it to your server. In theory, proxies can be forced *not* to return cached pages, but in practice I am told that ISPs sometimes return their cached copy of a page even when they should not according to the HTTP standards.
Perhaps a script that adds a ?****x string to the end of each HREF when a link is clicked would allow the ID to be passed back and logged in your server logs, as well as changing the URL and decreasing the likelihood that the request is resolved with a cached copy before it reaches your server.
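That HREF-rewriting idea might look like the sketch below (the `vid` parameter name is an invention standing in for the ?****x string above; in practice the rewriting would happen client-side, but the URL mechanics are the same):

```python
from urllib.parse import urlparse, urlunparse, urlencode, parse_qsl

def tag_link(href, visitor_id):
    """Append a hypothetical ?vid= parameter to a link so the ID travels
    with the click and the changed URL is less likely to be answered
    from a cache before it reaches the server."""
    parts = urlparse(href)
    query = parse_qsl(parts.query)
    query.append(("vid", visitor_id))
    return urlunparse(parts._replace(query=urlencode(query)))

print(tag_link("/products.html", "abc123"))
# /products.html?vid=abc123
print(tag_link("/p?x=1", "id7"))
# /p?x=1&vid=id7
```

The trade-off, raised earlier in the thread for session variables, applies here too: if people copy and share the tagged URL, the ID leaks into other users' visits.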
Elsewhere I asked about changing IPs and people reported that AOL do this (some said always). I am temporarily using AOL in London now and so tested it by looking at my own server logs, they don't change my IP or use multiple IPs at all.
So the answer to your question about how to ID the user is IP address. The problem then becomes how to ensure that all requests for pages are logged on your server and not resolved from a cache.
And there are services that will constantly change your IP although my understanding is that they are used mostly by hackers and other crims who have a reason to hide. Perhaps also by porn seekers in some jurisdictions. If you're not a porn site this is then perhaps not an issue.
At the end of the day, the answer to your requests for EXACT stats is some management-speak explaining that the internal structure of the www will always make totally exact stats impossible.
| 3:53 pm on Dec 5, 2003 (gmt 0)|
>I am temporarily using AOL in London now and so tested it by looking at my own server logs, they don't change my IP or use multiple IPs at all.
I track visitor activity using AXS and I can assure you that virtually every page request from an AOL user (in the same session) comes from a different IP address than the previous request from what I know to be the same user. Perhaps AOL/UK is different - I have no experience there.
| 3:57 pm on Dec 5, 2003 (gmt 0)|
>using the IP address will almost always ID the pages the user visited
At worst it IDs the proxy/domain controller the user accessed the web through. At best it IDs the workstation; an IP address never IDs the actual user.
| 4:05 pm on Dec 5, 2003 (gmt 0)|
I have been actively tracking for the last 36 hours on two of our sites and can assure you that IP is a poor method for establishing a unique ID. For some users it switches.
| 4:43 pm on Dec 5, 2003 (gmt 0)|
First off nice to see a decent in depth discussion happening in Tracking and Logging. We need more of them in here.
1. We also run an old version of Accrue that does our standard canned traffic reports. We still need these for our sponsors and to get a general idea of how well our content sections are doing.
We have plenty of standard reports floating around, but for us it's more a matter of "does everything look OK, did we sell what we thought we were going to sell? If yes, great, tell us something we don't know now. If no, then find the problem."
For me I always need to be uncovering the new stuff, and just making sure the old stuff looks right so I am less interested in standard reports. They are still very important, just don't excite me that much.
My managers expect accuracy in $$$, not in traffic. They also expect me to help bring in $$$.
2. What's important for your site, isn't quite so important for mine. You cannot do anything functionally useful on our site if you do not accept cookies. You can read content, but that's it.
That's why I am such a fan of cookie based tracking. We still try to issue a cookie to every request that does not have one, which means bots get a ton of cookies that I need to filter out by UA or IP.
It's not that I have given up trying to be 100% accurate, it's just that I think it is nearly impossible for my site and the web at large. I can tell you where 99.9% of our revenue comes from, and if the remaining 0.1% comes from people disabling cookies, well, my time is much better spent trying to get more out of everyone else than knowing a for-sure count of a fringe case.
Your site may be an entirely different can of worms, where not knowing the X% of cookieless, proxied, stray cats can make or break you.
Why is it so critical for you to be 99.9% accurate with your traffic?
That's why I say DIY. Only you can know what is critical for your business. It sucks to shell out money for a product and find it can't meet the one or two critical issues you have.
Mipapage: I'll throw up my method for pathing using Oracle in another post. Maybe it will give you some ideas.
| 5:03 pm on Dec 5, 2003 (gmt 0)|
I agree with cfx211 - kudos to Receptional for starting this great thread.
|That's why I say DIY. Only you can know what is critical for your business. It sucks to shell out money for a product and find it can't meet the one or two critical issues you have. |
That's what struck me in that article on inc.com that I posted above. Each of those companies:
- Use tracking in one shape or form
- Built their systems themselves
- The last ones (from what I can remember) do a lot of hands on dirty work to 'find success' in their website. (ooh, I like the sounds of THAT! "We help you to find the hidden success in your website"- 0.3cents per utterance ;-] )
cfx211 - Looking forward to the path stuff!
| 10:45 am on Dec 8, 2003 (gmt 0)|
Thanks for everyone's input. Lots to think about here.
|Why is it so critical for you to be 99.9% accurate with your traffic? |
99.9% is not vital... BUT... if non-cookie traffic is ignored altogether when measuring trends, then this is probably better than anything else, and is at least complete in its own right. However, my concern is that the percentage of people rejecting cookies is on the increase - from privacy software shipped with Norton Antivirus, PDAs, or (potentially) future IE products requiring opt-in to cookies rather than opt-out. In November, 96% of visitors had cookie support; so far in December it is only 95.1%... so the trend is not looking favourable for cookies long term on their own.
For this reason, I think it is important to look at trying to measure those that do NOT have cookies, and that is where the 4.9% can start to cause havoc with the stats... are they really 4.9% of my traffic? Or are they less? I do not know, because I am currently using IP number to measure these, which - as those above have shown - has its own errors.
Session variables seem the next logical step - they at least track a visitor for the duration of the session, if not returning visitors. But session variables bring a new set of problems on the search engines, and when (if) people link to interesting content, they may well use the whole link including the session variable - one more possible error that might arise.
Is there any way to use a computer's MAC address? That would do the job just peachy.
Assuming there isn't, then my logic for the way we want to move forward is beginning to look like this:
1) If a user accepts cookies, then fine - record them as such.
2) If they don't, then create an ID based on all the variables that they DO pass to us (but I am thinking of using an IP range, such as the first few blocks in the number, rather than complete IP numbers), combined with the difference between their system clock and the server system clock.
3) Use that ID to identify a returning visitor (as long as they haven't changed their screen resolution etc.).
Of course, this will break down when a site becomes so busy that there are many people using the site at the same time with similar clock time, from the same ISP and identical browser settings, which is why I am trying to work out more ways to make individuals "unique", but we are starting - I hope - to really squeeze the error margin by this stage.
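A sketch of that fallback ID: every input, the choice of the first two IP blocks, and the hash truncation are assumptions for illustration, and as the post says, this is a fingerprint with expected collisions on busy sites, not a guarantee of uniqueness.

```python
import hashlib

def fallback_id(ip, user_agent, screen, clock_skew_sec):
    """Best-effort visitor ID for cookie-less users, built from an IP
    range (not the full IP), browser settings, and the client/server
    clock difference -- the scheme sketched above."""
    ip_range = ".".join(ip.split(".")[:2])   # e.g. "212.1", survives dial-up IP changes
    raw = "|".join([ip_range, user_agent, screen, str(clock_skew_sec)])
    return hashlib.md5(raw.encode()).hexdigest()[:12]

a = fallback_id("212.1.55.9", "Mozilla/4.0", "1024x768", 37)
b = fallback_id("212.1.60.2", "Mozilla/4.0", "1024x768", 37)  # same person, new dial-up IP
c = fallback_id("212.1.55.9", "Mozilla/4.0", "800x600", 37)   # changed screen resolution
print(a == b, a == c)  # True False
```

The third case shows the fragility the post already concedes: change any input (here the screen resolution) and the returning visitor becomes a "new" one.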
Interesting that someone did their thesis on this. Care to share?