How to track visitors
An intermediate guide
Receptional




msg:905184
 3:05 pm on Dec 3, 2003 (gmt 0)

It seems to me that there are far too many questions repeated in this forum as people come in, usually having found that their ISP's log analysis is one of the following:

1) non-existent, 2) wildly inaccurate, or 3) incomprehensible.

The first question they then ask is "what package should I use?" and, without naming names, this thread should tell you what to look for and why.

Logs Vs. Client Side (java tagging).
Let us be clear. Log files were originally intended to measure how much a machine was being used by other computers. The idea of measuring human behaviour is a relatively new one and not an activity ideally suited to log files. By comparison, client side tracking (usually denoted by java code on your web page which runs when a user loads the page) is disastrous at tracking robots, or at working out whether your webserver is getting too busy to cope.
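To make the client-side approach concrete, here is a minimal sketch of how a tag and its collector fit together: a line of javascript in the page requests a tiny image from a logging script, which records whatever the browser reports. The file names, log path and parameters are purely illustrative assumptions, not any particular vendor's product.

<?php
// track.php - minimal collector sketch (illustrative only, not a real product).
// A page being tracked would carry a tag along these lines:
//   <script type="text/javascript">
//     document.write('<img src="/track.php?p=' + escape(location.pathname)
//       + '&r=' + escape(document.referrer) + '" width="1" height="1">');
//   </script>

// Record what the browser tells us about this page view.
$line = implode("\t", array(
    date('Y-m-d H:i:s'),
    $_SERVER['REMOTE_ADDR'],
    isset($_GET['p']) ? $_GET['p'] : '',          // page the tag ran on
    isset($_GET['r']) ? $_GET['r'] : '',          // document.referrer
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : ''
)) . "\n";
error_log($line, 3, '/var/log/site/tag.log');

// Answer with a 1x1 transparent GIF and ask caches not to keep it,
// so the tag fires even when the page itself comes from a proxy cache.
header('Content-Type: image/gif');
header('Cache-Control: no-cache, must-revalidate');
header('Expires: Thu, 01 Jan 1970 00:00:00 GMT');
readfile('pixel.gif');     // any 1x1 transparent GIF kept next to the script
?>

Because the request is made by the visitor's browser, robots that do not run javascript never appear in this log - which is exactly the trade-off discussed below.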

Why do all tracking systems give different results?
Here are 5 reasons why logfile tracking systems screw up when measuring users (I could think of many more, but these will be enough to confuse most bosses):

1) They identify "unique users" (generally) by the IP number that the user arrives on. However, many people can come to the website on the same day from the same IP if they go through a dial-up provider. This error will be more prominent on busy sites.
2) Some ISPs (in fact, most) use proxy servers to cache pages. This means user A requests your home page. Ten minutes later, person B requests your home page, but it is delivered by the proxy server, not your website, so there is no record of this event in your logfile.
3) Even conversion data is ruined by point 2. This is because person B may then click on a link and lo... he appears to have started on an inner page of your site, when he didn't.
4) Point 2 also wrecks search engine tracking, since the user links to an inner page and lo... the referring domain appears to be an internal referral.
5) When a person returns to a website the next day, they will usually arrive on a different IP number and thus be identified as someone new. Log file analysis cannot tell you how many times a person visits before they buy. Nor can it track returning visitors back to their original referrer.

So... java tagging is better, right? Well... not necessarily...

Here are 8 reasons why java tagging systems screw up when measuring users:

1) Java tags must go into HTML code, which means it requires ingenuity to track pictures, sound files, PDF files, or any other types of downloads (one workaround is a small logging redirect script; see the sketch after this list)
2) The tag tracks browsers, not users. So if a person looks at your site from work, then from home, they are counted twice.
3) If a person looks at your site in Netscape, then in Internet Explorer (because you wrecked their user experience in Netscape) then they will be counted twice.
4) Java tags do not record spiders, such as googlebot.
5) If the java tag uses a third party cookie on a server without a valid privacy policy, then even Internet Explorer 6 default settings stop the cookie being laid.
6) The user may block all cookies if they choose.
7) If the server called up by the java tag is slow to respond, then the user may never be tracked, since they move from the page before the page load is complete.
8) As I sit writing this from a fancy web phone I have to add that some people do not see your site using java enabled browsers... they will not be tracked.
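On point 1, one common workaround is to route downloads through a small server-side script that logs the request and then hands the file over, so the download still gets counted even though no tag can run inside a PDF. A rough sketch, assuming the files live in a ./files/ directory and the script is called download.php (both names are made up for illustration):

<?php
// download.php - sketch of a logging redirect for non-HTML downloads.
// Links in the pages would point to /download.php?f=brochure.pdf instead of
// linking to the file directly.

$file = isset($_GET['f']) ? basename($_GET['f']) : '';   // basename() blocks ../ tricks
$path = './files/' . $file;

if ($file === '' || !file_exists($path)) {
    header('HTTP/1.0 404 Not Found');
    exit;
}

// Log the download the same way page views are logged.
error_log(date('Y-m-d H:i:s') . "\t" . $_SERVER['REMOTE_ADDR'] . "\t" . $file . "\n",
          3, '/var/log/site/downloads.log');

// Then hand the file over.
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="' . $file . '"');
readfile($path);
?>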

This said, Java tagging is MUCH better for most people to use, probably correctly analysing 90-95% of human users, whilst (IMHO) even the best log tracking can miss 35% of users on busy sites (though it is much more accurate on quieter sites).

My advice, therefore, has to be the latter approach from a marketing perspective. So... even BEFORE saying how to select a third party product (or not), I had better explain what you must first do to make it legal to even use such a system... (yep... it will be illegal if you don't do the following)

On December 11th (2003) a new EU law comes into force that says you MUST tell users when you are using cookies to identify them, and you MUST tell them how they can opt out. My advice (and I am no lawyer - get legal advice, yada yada) is to have a link on all pages saying "how we track", then have a page explaining that you track using cookies and that they can disable this in their IE settings (explaining how). Actually, we track other things in the event that we cannot lay a cookie, so I intend to suggest that if they want to opt out, they had better go through a site that disguises their real identity.

Choosing a java tracking system:
I cannot recommend any here, but consider these factors:
1) The BEST is a system that sits on your own server, thus eliminating any problems with third party cookies. For busy sites this may also be the most cost effective, but it is certainly the hardest to integrate and (probably) requires you to have a dedicated server.
2) The best THIRD PARTY systems will also be the most expensive. What is important is not whether it tracks visitors, but whether it can show TRENDS over time and CONVERSIONS by search engine, keyword or campaign.
3) Even the best third party software is dependent on its own server load and server downtime. So the most popular may also be the least accurate. (Not saying it is... but saying it could be a victim of its own success.) Also consider whether the stats server is 7,000 miles from your own users, as this could be a factor.
4) Very low priced systems probably do not have a compact privacy policy, so IE6 users on default settings are not tracked (see the sketch below).
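For point 4, the "compact privacy policy" is the P3P header that IE6 checks before it will accept a third party cookie. A minimal sketch of sending one from a PHP tracking script - the CP tokens and cookie name below are placeholders only; a real compact policy has to describe your actual privacy practices:

<?php
// Send a P3P compact policy before setting the tracking cookie; without one,
// IE6's default privacy settings silently drop third party cookies.
// The tokens below are placeholders - a real policy must match your practices.
header('P3P: CP="NOI DSP COR NID CURa ADMa OUR NOR"');

// Now the cookie stands a chance of being accepted (expires in one year).
setcookie('visitor_id', md5(uniqid(rand(), true)), time() + 60 * 60 * 24 * 365);
?>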

Hope this helps a number of people in this forum.

Dixon.

(Edit add: When I say "Java" I mean Javascript - one of my staff just pointed that out.)

[edited by: Receptional at 3:34 pm (utc) on Dec. 3, 2003]

 

aspdaddy




msg:905214
 12:59 pm on Dec 8, 2003 (gmt 0)

>percentage of people rejecting cookies is on the increase

I'd agree with that; people I know are generally getting smarter and more paranoid when it comes to security/privacy issues.

Other things to consider:

Caching is on the increase - approx 30% of all pages served, I read somewhere, are served from a cache of some sort. This makes pathing (topology heuristics) very difficult.

Bots/Spiders are on the increase :) If you are tracking stats for usability or visitor analysis, this is a big problem. Some reports show 40-50% of all traffic is generated by non-humans - that figure surprises me. Most can be filtered out by IP, UA, or behaviour, though not all products do this equally well.
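A rough sketch of the simplest of those filters, matching on the user-agent string - the signature list is illustrative and needs maintaining, and bots that fake a browser UA will still slip through:

<?php
// Skip logging for anything that identifies itself as a crawler.
function looks_like_bot($ua)
{
    $signatures = array('bot', 'crawl', 'spider', 'slurp');   // illustrative list
    $ua = strtolower($ua);
    foreach ($signatures as $sig) {
        if (strpos($ua, $sig) !== false) {
            return true;
        }
    }
    return false;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!looks_like_bot($ua)) {
    // ... record the page view as usual ...
}
?>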

I can't see how sessions help much, as they rely on cookies to implement.

The best approach IMO is to formulate your questions/queries first, then go and find/build a tracking solution to suit.

mipapage




msg:905215
 1:44 pm on Dec 8, 2003 (gmt 0)

This thread is phenomenal!

While I've been trying to write up our script (and sleeping, thanks Receptional!) I've got some questions and ideas(?) from some of what you have all shared:

  1. Caching is on the increase - approx 30% of all pages served, I read somewhere, are served from a cache of some sort

    Can someone explain this a bit? What I get is that if someone receives the page via a proxy cache we'll never know - is that right? Does 'no-cache' in the metas prevent this?

  2. Bots/Spiders are on the increase

    Fwiw - I use a user-agent filter to keep bots out of the tracking system. This is a remnant of not giving sessions to Googlebot.

  3. Use a javascript

    Why not server side (am I missing something here?)

  4. I can't see how sessions help much as they rely on cookies to implement.

    Though I'm a rookie at all of this, I believe that PHP (sorry aspdaddy) can use a session variable even if cookies are disabled, thereby getting around this problem.


Cheers - mipapage

Receptional




msg:905216
 2:08 pm on Dec 8, 2003 (gmt 0)

What I get is that if someone receives the page via a proxy cache we'll never know - is that right? Does 'no-cache' in the metas prevent this?

Not necessarily, to either question. Most likely, a cache user is invisible to the server on the first page he/she lands on, but when the user deviates from the pages held in the cached version, they will appear to START at an inner page, with a previous page as the referrer. Also, "no-cache" (if it does work) will put unnecessary(?) strain on your server and increase load times - especially for people on the other side of the world.

Use a javascript - Why not server side (am I missing something here?)

Server side generally means log-files, which is exactly what gets screwed up due to caching.

I can't see how sessions help much as they rely on cookies to implement.

Though I'm a rookie at all of this, I believe that PHP (sorry aspdaddy) can use a session variable even if cookies are disabled, thereby getting around this problem.

If pages are written in PHP or ASP (not javascript) then there should be little issue with using session variables that store info in a database rather than in cookies. But I am no programmer, so I am happy to be overruled on that.
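For what it's worth, PHP's built-in sessions can indeed survive without cookies by carrying the session id in the URL ("trans sid"); the database-backed store mentioned above would be a custom session save handler on top of this. A minimal sketch, assuming the trans-sid setting is changeable at runtime (otherwise it goes in php.ini):

<?php
// Sketch of cookie-less sessions in PHP: if the browser refuses the session
// cookie, the session id is carried in the URL instead.
ini_set('session.use_trans_sid', 1);   // or set session.use_trans_sid = 1 in php.ini
session_start();

// Count the pages this visitor has seen, cookie or no cookie.
if (!isset($_SESSION['pages'])) {
    $_SESSION['pages'] = 0;
}
$_SESSION['pages']++;

// When building links by hand, the SID constant appends the id only when no
// session cookie came back from the browser.
echo '<a href="next.php?' . SID . '">next page</a>';
?>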

mipapage




msg:905217
 2:17 pm on Dec 8, 2003 (gmt 0)

Server side generally means log-files

Ah, gotcha. What I do is write all of the data to a database, and in truth it looks a lot like a log file!
will put unnecessary(?) strain

Good point.

aspdaddy




msg:905218
 3:23 pm on Dec 8, 2003 (gmt 0)

>cookies

Yep, I was referring to the built-in session objects/session ids in IIS that are written to the log files - they don't work without cookies enabled. What you get reported is a series of 1-page visits, which can really screw up stats for a low traffic site.

You can DIY sessions with a database or a text file, or (even worse) write the id into the URL, but it all comes at a cost.

Receptional




msg:905219
 3:48 pm on Dec 8, 2003 (gmt 0)

objects/session id's in IIS that are written to the log files

Does IIS do that by default? Back when our Microsoft server was NT, we tried to put in a WebTrends tool and a DeepMetrix tool (LiveStats) that were supposed to do this for those two packages, and we managed to crash the server both times, so we haven't tried since. Is it built into IIS these days?

aspdaddy




msg:905220
 4:11 pm on Dec 8, 2003 (gmt 0)

Yes, there is an option that, if set, writes session ids to the log file for every entry.

I believe .NET also has an option to write them to the URL - the woman I spoke to at MS called this "session-munged URLs" - nice :)

WebTrends has an option to override its own sessionizer and read the session id from either the uri-query or cookie field in the log file.

PCInk




msg:905221
 4:23 pm on Dec 8, 2003 (gmt 0)

>I am temporarily using AOL in London now and so tested it by looking at my own server logs, they don't change my IP or use multiple IPs at all.

AOL does change the IP address with every request for some users. Sometimes you log in and they give you a static IP address - you apparently get a static address if you visit a secure https page, for example.

There are ways around caching by proxy servers. Some are to have cgi-bin in your filename, or to end it in the cgi or pl extension. Another: include a ? in the URL. Of course, this may slow the experience for the user, and it will create a greater load on your server and increase your bandwidth usage.
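A tiny sketch of the "include a ? in the URL" trick - a helper that appends a throwaway parameter so proxies treat every request as distinct. The parameter name is arbitrary, and as noted above the price is that every hit now reaches the origin server:

<?php
// Append a unique query parameter so proxy caches cannot reuse an old copy.
function nocache_url($url)
{
    $sep = (strpos($url, '?') === false) ? '?' : '&';
    return $url . $sep . 'nc=' . uniqid('');
}

echo '<a href="' . nocache_url('/products.html') . '">Products</a>';
?>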

Further info: [webmaster.info.aol.com...]

oneworld




msg:905222
 11:50 pm on Dec 8, 2003 (gmt 0)

A feasible way to ensure a visit to a page is logged, even if the page itself is returned from a cache, is to add a very small image which is not important to the content and specify <nocache> for the image. (I am assuming <nocache> can be specified for a specific file; I've never done it, so I'm not sure how this is done.)

A different image for each page will ensure that you always know the page was visited, whilst the user gets the cached copy quickly. The only problem is that the referrer will presumably always be the page containing the image, rather than the referrer of that page, which is what you really want to see.

In theory this should log all visits to each page on your server (without depending on cookies). It does depend on correct handling of caching (according to the w3c spec for HTTP clients and servers), which is not always followed.

Receptional




msg:905223
 9:56 am on Dec 9, 2003 (gmt 0)

I am assuming <nocache> can be specified for a specific <image> file

I don't think so, no. But if it can I am happy to be told otherwise. It is an interesting idea.

Tor




msg:905224
 2:18 pm on Dec 9, 2003 (gmt 0)

This is the best discussion in the Tracking and Logging Forum that I have ever seen! Being mostly a marketing person myself I really can't contribute so much to this discussion, but that doesn't mean that I cannot learn from it.

Thanks for sharing your knowledge and insight with the "rest" of us! :)

killroy




msg:905225
 3:27 pm on Dec 9, 2003 (gmt 0)

I've always done my own tracking, and currently use a library I've written that dumps everything to a normalized set of MySQL tables. This way it doesn't store redundant information, and I can run split-second live queries. Currently the logs take around 10-20% of the space of the equivalent Apache-style logs.
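For anyone curious what "normalized" means here, a rough sketch of the idea: repeated strings such as the user agent or URL are stored once in a lookup table and referenced by id from a compact hits table. The table and column names (and the old mysql_* calls) are purely illustrative, not the actual library:

<?php
// Rough sketch of normalized hit logging. Repeated strings are stored once
// and referenced by id, which is what keeps the log tables small.

mysql_connect('localhost', 'stats', 'secret');
mysql_select_db('stats');

// Return the id for a value in a lookup table, inserting it if it is new.
function lookup_id($table, $value)
{
    $value = mysql_real_escape_string($value);
    $res = mysql_query("SELECT id FROM $table WHERE value = '$value'");
    if ($row = mysql_fetch_row($res)) {
        return (int) $row[0];
    }
    mysql_query("INSERT INTO $table (value) VALUES ('$value')");
    return mysql_insert_id();
}

$ua_id  = lookup_id('user_agents', $_SERVER['HTTP_USER_AGENT']);
$url_id = lookup_id('urls', $_SERVER['REQUEST_URI']);

mysql_query(sprintf(
    "INSERT INTO hits (hit_time, ip, url_id, ua_id) VALUES (NOW(), '%s', %d, %d)",
    mysql_real_escape_string($_SERVER['REMOTE_ADDR']), $url_id, $ua_id
));
?>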

All my pages are highly dynamic, and I think so uncacheable that I capture most hits. In fact, I seem to capture MORE than client-side solutions, as the likelihood of the image or JS access getting lost is greater than that of my pages being missed.

I do pay a price though: as I type this my cursor stutters, as Google is hitting my test server on this machine (around 2000 pages/day indexed).

But CPU time is cheap, and good statistics invaluable.

SN

richmondsteve




msg:905226
 5:36 pm on Dec 9, 2003 (gmt 0)

There is no way to accurately track 100% of users across and within sessions without requiring cookies or authentication via login. Others have noted that with some ISPs (and anonymizers) a user's IP can change during a session, that users may be served cached copies of pages, and that tracking by IP (or subnet) and other variables like UA and timestamp has limitations. Be careful with the assumptions you use to track cookie-less users: 1. IPs can be dynamic within a session, 2. IPs can vary across sessions, 3. multiple humans can share the same IP, 4. multiple humans can share the same physical computer.

Session variables passed via the URL query string are an option I've used, but I use them less and less, because people will link to you leaving the query string with the session id intact, and people will email the URL with the session id to other users. It's prudent to use the session id in conjunction with *something else* (IP, subnet, UA, etc.) and/or to have it expire after a very short period, especially if it gives access to personal, confidential or sensitive info associated with the intended user. On that note, I received four-figure visits to one of my sites over the weekend via web-based email accounts, thanks to the site's URL being mentioned in a high-volume newsletter. As I was analyzing the Apache logs in real time (tail -f) I noticed HTTP_REFERERs with session ids embedded in the query string. Out of curiosity I tested a few (5 or 6), and two gave me access to the visitors' web-based email accounts. I contacted the users and the companies hosting the accounts to let them know about this vulnerability.
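A hedged sketch of the "something else" idea in PHP: store a hash of the user agent when the session starts, and refuse to honour the session if a later request arrives with a different one. It is a coarse check (user agents can be faked), but it defeats the casual case of a session id leaking through a referrer:

<?php
// Bind a session to a browser fingerprint so a leaked session id is less useful.
session_start();

$fingerprint = md5(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '');

if (!isset($_SESSION['fingerprint'])) {
    $_SESSION['fingerprint'] = $fingerprint;   // first request of this session
} elseif ($_SESSION['fingerprint'] !== $fingerprint) {
    // Same session id, different browser signature: treat it as a fresh visitor.
    $_SESSION = array();
    session_regenerate_id();
    $_SESSION['fingerprint'] = $fingerprint;
}
?>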

I rolled my own tracking solution, and I agree that this thread is very informative. There are still plenty of modifications and additions I need to make to improve my solution to meet my needs, and given the limitations of available data, the reliability of that data (UA, referrer and other data can be faked or misleading) and the limitations of the various solutions, I'm well aware that no solution will be perfect. But I'm pretty content with the accuracy and limitations of what I'm measuring; it's allowing me to analyze data and test the effect of changes, increasing revenue and improving the user experience, so I'm pleased.

mipapage




msg:905227
 5:59 pm on Dec 9, 2003 (gmt 0)

So it seems that chasing accuracy may be a diminishing-returns problem - something to make very clear to the clients, who, as Receptional pointed out (in his case managers):
  1. want to see very standard reports

  2. EXPECT accuracy

Off to write a "problems with accurate tracking" disclaimer...

oneworld




msg:905228
 11:54 pm on Dec 9, 2003 (gmt 0)

>>>>
I am assuming <nocache> can be specified for a specific <image> file

I don't think so, no. But if it can I am happy to be told otherwise. It is an interesting idea.

<<<<

Well, the reason I assumed it is possible is that it's part of the HTTP protocol (see RFC 2616, chapter 13, on caching).

However, to use that part the server would need to be able to send an HTTP header for a specific file indicating no caching for that file. I would imagine some servers do, or at least can.
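They can: anything served through a script controls its own response headers, and Apache's mod_headers can also set them per file. A sketch of the per-page image idea, assuming a pixel.php script and a 1x1 GIF on disk (both names are illustrative):

<?php
// pixel.php - serve the per-page tracking image with the RFC 2616 cache-control
// headers, so this one file is never cached even when the page itself is.
// A page would embed it as e.g. <img src="/pixel.php?page=/widgets.html">.

$page = isset($_GET['page']) ? $_GET['page'] : 'unknown';
error_log(date('Y-m-d H:i:s') . "\t" . $_SERVER['REMOTE_ADDR'] . "\t" . $page . "\n",
          3, '/var/log/site/pixel.log');

// Tell browsers and proxies not to cache this response.
header('Cache-Control: no-cache, no-store, must-revalidate');
header('Pragma: no-cache');
header('Expires: Thu, 01 Jan 1970 00:00:00 GMT');

header('Content-Type: image/gif');
readfile('pixel.gif');     // any 1x1 transparent GIF kept next to the script
?>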

Anyway, it only closes one (big) hole (loss of data). I also agree that 100% accuracy is not possible.

cfx211




msg:905229
 5:05 pm on Dec 10, 2003 (gmt 0)

I think there is a correlation between the increase in cookie rejection and search being the savior of the internet. Bots are a lot more active lately, and almost none of them accept cookies.

Issue them one, they turn it down, issue another, they turn that down, and so on and so on.

At this point I think there is a consensus forming that no one is going to be 100% accurate in their tracking, and that there is a point of diminishing returns in trying to get close to 100%.

A couple questions to consider:

1. How do you handle fringe cases like hot linking of images and files hosted by you but being pulled by another site? These can get counted by some tracking tools.

2. What percentage of accuracy is acceptable, and how do you truly know what 100% is? Personally I can't hang with this "leave no visitor behind" philosophy, because it's counterproductive to running the rest of your business.

3. Do combo tracking solutions (cookie, IP/user_agent, other) work and can you really work efficiently with two different metrics serving the same purpose?

I can't imagine trying to tie cookies and IP/user agents together and then making sure there is no cookie present for my IP/user agents. Or making sure that I really had only 10 visitors from Reston, VA - or was that 1 visitor with a dynamic IP from an AOL proxy server? You get the idea.

mipapage




msg:905230
 6:06 pm on Dec 10, 2003 (gmt 0)

Bots are a lot more active lately and almost none of them accept cookies.

Just for fun, I'll pass along the data that I get from our test script, which ignores bots. I'll do a run from, say, Thursday (tomorrow) morning to next Thursday. We get a good diversity of visitors from Europe, N.A., and Central and South America.

What percent of accuracy is acceptable

I just came from a meeting with a client; they are hot on user tracking and on making their site 'work' well. I explained to them the bit about cookies, sessions, bots and all of these data quality issues, and in the end we're looking at:

  1. Determining how many of their (non-bot) visitors have cookies disabled.
  2. Storing both cookie and session data (thanks cfx211).

This way we can tell them how many of their visitors had cookies enabled. The reverse can only be an estimate - thanks AOL!
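One way to measure point 1 is a self-redirecting test: set a cookie on the first hit, bounce the browser straight back, and see whether the cookie comes back with it. A minimal sketch - the parameter, cookie and log names are all illustrative:

<?php
// cookietest.php - measure how many (non-bot) visitors accept cookies.
if (!isset($_GET['cc'])) {
    // First hit: try to set a test cookie, then bounce back to ourselves.
    setcookie('cookie_test', '1');
    header('Location: ' . $_SERVER['PHP_SELF'] . '?cc=1');
    exit;
}

// Second hit: did the cookie survive the round trip?
$accepts = isset($_COOKIE['cookie_test']) ? 'yes' : 'no';
error_log(date('Y-m-d H:i:s') . "\t" . $_SERVER['REMOTE_ADDR'] . "\t" . $accepts . "\n",
          3, '/var/log/site/cookietest.log');
?>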

By storing the two types of data we'll likely categorize sessions attached to users with cookies as reliable - in a sense, hoping that there are enough cookie-enabled users to provide usable data.

The site will also have login functionality, which will be considered reliable data.

What they (and I) understand is that this is hands-on work: the accuracy of the data will be in flux, and tweaking will be necessary.

Bottom line is rather than chasing perfect data, determine what is reliable data and make decisions based on that, 'cause in the end it's all about ROI.

Though trying new ways to get better data is something that I will do with my coffee cup and midnight oil!

mph88888




msg:905231
 7:36 am on Dec 12, 2003 (gmt 0)

>> I can't imagine trying to tie cookies and ip/user agents together and then making sure there is no cookie present for my IP/user agents... >>

Good point, cfx211. Double counting of visitors is something that hasn't been addressed very much in this thread. Even if you find a way to squeeze out that last 5% of accuracy in your web stats, what about those people like me who have both a home and a work computer? I'm sure that BBC.co.uk sees me as 2 different unique visitors (assuming they even track visitors). For online banking and Hotmail it's not a problem, as I am required to log in from both locations, but what about non-login/registration sites? I'd be interested to see if anyone out there has estimates on the number of *unique* visitors who are actually double-counted in this way.

Mike

Receptional




msg:905232
 11:23 am on Dec 12, 2003 (gmt 0)

When people use a website from both home and work, the ONLY way to tie them together is by an enforced log-in, although the log-in can be made simpler for the user by storing the login in cookies on both machines, I guess.

But sticking a forced log-in in front of people gives real usability problems.

cfx211




msg:905233
 9:57 pm on Dec 12, 2003 (gmt 0)

I estimate a 30-40% overlap in home and work usage for our site, but this is a very site dependent trend. Our busiest account segment averages about 2.5 cookies per account over a 6 month period.

We do this by keeping an account_id column on our cookie table and then updating it when someone logs in. Based on that I then create an account to cookie xref table.
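A sketch of what that cross-reference might look like in practice: stamp the visitor's cookie row with the account id at login time, and the xref (and figures like the 2.5 cookies per account above) fall out of simple queries. The table and column names, and the $logged_in_account_id variable, are simplified guesses for illustration, not the real schema:

<?php
// At login time: stamp the visitor's cookie row with the account that used it.
mysql_connect('localhost', 'stats', 'secret');
mysql_select_db('stats');

$cookie_value = mysql_real_escape_string(isset($_COOKIE['visitor_id']) ? $_COOKIE['visitor_id'] : '');
$account_id   = (int) $logged_in_account_id;   // assumed to come from the login code

mysql_query("UPDATE cookies SET account_id = $account_id
             WHERE cookie_value = '$cookie_value'");

// Later, the account-to-cookie xref is just an aggregate over that column,
// e.g. average cookies per account over the last 6 months:
$res = mysql_query("SELECT AVG(n) FROM (
                      SELECT account_id, COUNT(*) AS n
                      FROM cookies
                      WHERE account_id IS NOT NULL
                        AND first_seen > DATE_SUB(NOW(), INTERVAL 6 MONTH)
                      GROUP BY account_id
                    ) AS per_account");
$row = mysql_fetch_row($res);
// $row[0] is roughly the "2.5 cookies per account" figure mentioned above.
?>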

I was actually talking about what happens when you try to use two separate systems to keep track of unique users. I would worry about making sure that you were not double-issuing unique identifiers - once with each method.

Maybe one day the cookie doesn't get issued for whatever reason, so you use the fallback method. The next day they come again and this time get the cookie.

Receptional




msg:905234
 9:34 am on Dec 15, 2003 (gmt 0)

I would worry about making sure that you were not double-issuing unique identifiers - once with each method.

Hmm - I take your point, and potentially that could inflate the errors, like using IP numbers can.

But there is no way most sites can force a login/registration and still convert, which is what I assume you need in order to build the xref table... Hmm - or maybe there is and I just haven't had the courage to try it.

Are there any reliable studies about conversion rates on sites that force a login vs sites that don't? You would need a study that worked over time, since presumably people logging in can then be "kept more loyal" through email.

A 30-40% overlap between home and work use is, in itself, hugely significant. If that replicates across the Internet, it means there are potentially only two-thirds as many real web users in the world as the statistics are telling us. That is certainly significant.

cfx211




msg:905235
 6:21 pm on Dec 15, 2003 (gmt 0)

We don't force login on our site. About a third of our uniques at any time are members who have logged in, and those users are the ones that I do the cookie to account xref table on, and those are the users I know have a significant home/work overlap.

Most casual users of our site don't create accounts and log in, so we can't get a good idea of our true home/work overlap. We just know that, as a baseline, it is X% of uniques, but that is probably a big undercount.

Now...

Another interesting thing to consider is how often people use the site. Our site has something close to a traditional 80/20 split going on, where 80% of activity is done by 20% of people (see Pareto for more on the 80/20 rule). For us it's 70% of visits coming from 30% of uniques.

If we hypothetically had 1 million uniques and 10 million visits over a period of time, then 300k of those uniques accounted for 7 million visits. That leaves us with a lot of one-visit uniques, or two-visit uniques, that may be nice for banner impressions but not much else.

When I hear numbers like blah blah million uniques, I think: OK, how much of that is home/work overlap? Then I think: how many months are they carrying this unique count over? If it's the sum of monthly counts, or even worse weekly counts, then you can slim that number down even more. Finally, after taking that into account, I think: how many of those uniques are active users of the site?

Active really depends on a site's definition, but just about every site has its core group of users that are the heart and soul of the site and provide the majority of the site's revenue no matter what the business model.

How that group provides that revenue is unique to each site, which again is why I am for DIY tracking.

Receptional




msg:905236
 9:31 am on Dec 16, 2003 (gmt 0)

About a third of our uniques at any time are members who have logged in, and those users are the ones that I do the cookie to account xref table on, and those are the users I know have a significant home/work overlap.

I can see that your trend data is probably incredibly accurate, but I guess the downside of ONLY recording data from those allowing cookies goes back to accurately defining how many people are on your site not using cookies. I think that you are probably right not to pay any attention to these people at all, but I still feel that, blue-sky and long term, that will be a problem - especially in Europe (see the new legislation in force at out-law - [out-law.com...]).

Also, how many pages get half loaded, with people moving on before the cookie gets properly recorded?

I guess it is the law and paranoia that are going to limit cookies in the future, not technology.

mipapage




msg:905237
 12:32 pm on Dec 19, 2003 (gmt 0)

Just for fun, I'll pass along the data that I get from our test script, which ignores bots.

One week later, filtering out everything that identifies itself as a bot: 95.6% of visitors accepted cookies.

<disclaimer> I know that this data means nothing to most people! </disclaimer>

It's all about data quality... what counts and what doesn't count...
