Trusted Analytics Tool Proven Wrong

         

incrediBILL

11:10 pm on Mar 21, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In one of the nastiest turns of events I've ever had, my old, supposedly tried-and-trusted server-side log file analytics tool turns out to be over-reporting visitors.

For some odd reason, this month the stats appeared to fluctuate wildly, by up to 50%, for several days, which immediately put me into panic mode that maybe my SE rankings were fluctuating as well.

OK, so I wiped the sweat off my brow and checked Google Analytics, which showed no change. Everything was status quo, though always slightly lower, since Google only counts browsers with JavaScript enabled, so web spiders and visitors with JavaScript disabled don't show up in GA.

Now concerned that maybe my analytics script was crashing or something during processing, I decided to manually analyze my daily log files and get a raw number of actual IPs in the file.

Hopped in my stats directory on the Linux box and started checking each daily log file with the following command just to get a ballpark of total visitors looking at webpages:

grep ".html" -i access_log | grep ".png" -iv | grep ".txt" -iv | grep ".jpg" -iv | grep ".gif" -iv | awk '{ print $1 }' | sort -n | uniq | wc -l

The reason for the various exclusions like 'grep ".jpg" -iv' is to eliminate the images and other files optionally being served off other servers, such as banner ads linked to my site, etc.
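For comparison, here is a minimal sketch of a field-based variant of the same count. It is not the command that was actually run; it assumes the log is in Common or Combined Log Format, so that the client address is field 1 and the request path is field 7, and the function name count_page_ips is made up for illustration. Matching on the path alone means a ".html" in the referrer of an image request no longer needs to be filtered out afterwards.

```shell
#!/bin/sh
# count_page_ips LOGFILE   (hypothetical helper name)
# Prints the number of distinct client IPs that requested a .html page.
# Assumption: Common/Combined Log Format, where $1 is the client address
# and $7 is the request path inside the quoted request line.
count_page_ips() {
    awk '{
        path = $7
        sub(/[?#].*$/, "", path)          # drop any query string or fragment
        if (tolower(path) ~ /\.html$/)    # test the path only, not the referrer
            print $1
    }' "$1" | sort -u | wc -l | tr -d ' '
}
```

Running count_page_ips access_log in the stats directory would give a figure to set against the tool's report. Like the original pipeline, it counts distinct IPs, not sessions.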

Sure enough, the script cranked out a number that almost matched some of the visitor counts but the log analysis script was wildly off on other days. Then I reversed the grep to get a count of files not being served from my box and the total numbers combined still didn't make sense with the discrepancy.
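One possible source of mismatch, offered as a guess and not something confirmed in this thread: as far as I understand Webalizer's documentation, its "Visits" figure is session-based (a new visit starts when a host's previous page request is older than a timeout, 1800 seconds by default), while a distinct-IP count corresponds more closely to its "Sites" figure. The sketch below approximates that kind of visit count so the two numbers can be compared. The function name count_visits, the one-day-per-file assumption, and the choice of .htm/.html as "pages" are all assumptions for illustration.

```shell
#!/bin/sh
# count_visits LOGFILE [TIMEOUT_SECONDS]   (hypothetical helper name)
# Approximates session-style visits: a new visit starts the first time an IP
# requests a page, or when its previous page request is older than the timeout.
# Assumptions: one calendar day per file, Common/Combined Log Format,
# entries in chronological order, "page" = path ending in .htm or .html.
count_visits() {
    awk -v timeout="${2:-1800}" '{
        path = $7
        sub(/[?#].*$/, "", path)                 # drop query string or fragment
        if (tolower(path) !~ /\.html?$/) next    # only page requests count
        split($4, t, ":")                        # t[2..4] = hour, minute, second
        now = t[2] * 3600 + t[3] * 60 + t[4]     # seconds since midnight
        if (!($1 in last) || now - last[$1] > timeout) visits++
        last[$1] = now
    }
    END { print visits + 0 }' "$1"
}
```

If count_visits access_log lands close to the reported visitor figure while the distinct-IP count does not, the two tools may simply be measuring different things.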

I'm not sure what to do at this point because it's obvious my raw log file analytics are a total lie and I'm not sure if it's just a problem with my site/server or if this software is simply buggy.

Not to panic everyone, but it's the default web analytics that's included with Plesk's control panel, so I've had quite a few years of history with this thing that now all appears to be somehow tainted.

Sigh...

Anyone else get similar discrepancies?

[edited by: incrediBILL at 11:14 pm (utc) on Mar. 21, 2008]

coopster

2:31 pm on Mar 22, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



There are at least 2 analytics packages running in Plesk control panels. The oldest is Webalizer, which mashes your logs through its own C programs, IIRC. There are settings in the conf that allow you to specify which logs to process, as well as whether and how much history to keep. You aren't looking at two different sets of data here, are you?
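A quick way to check those settings is to print the active lines for the relevant directives. This is only a sketch: the function name show_webalizer_settings is made up, the conf path has to be supplied because Plesk may keep it somewhere other than the usual location, and the directive list (LogFile, OutputDir, HistoryName, Incremental, IncrementalName, VisitTimeout, PageType) is my selection of standard Webalizer keywords that seem relevant here.

```shell
#!/bin/sh
# show_webalizer_settings CONF   (hypothetical helper name)
# Prints the active (non-comment) lines for a few directives that control
# which log is read and how history and visits are computed.
show_webalizer_settings() {
    grep -Ei '^[[:space:]]*(LogFile|OutputDir|HistoryName|Incremental|IncrementalName|VisitTimeout|PageType)[[:space:]]' "$1"
}
```

For example, show_webalizer_settings /etc/webalizer.conf, where that path is an assumption. Lines beginning with # are skipped because the pattern requires the directive at the start of the line.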

incrediBILL

5:54 pm on Mar 22, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's Webalizer and no, I'm not looking at 2 sets of data.

What's confusing is that the behavior was very consistent until recently, but it appears it's always been over-counting.

coopster

10:10 pm on Mar 22, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



it's obvious my raw log file analytics are a total lie and I'm not sure if it's just a problem with my site/server or if this software is simply buggy

Webalizer builds off the CLF, so it seems unlikely that your raw log is wrong or corrupt, but never say never. As for the software, the last update was April 16, 2002, and there was a note in the fix regarding "mismatched KByte totals", so if you are running anything older than Version 2.01-10, you don't have the "latest" copy.

incrediBILL

11:46 pm on Mar 22, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the heads-up, coopster, but I had already checked: it's the latest, and KByte totals would be the least of my problems.

My local log files are just fine, as I can run other tools on them that appear to crank out the correct information.

Considering it's integrated into Plesk, I don't think there's much more I can do without risking breaking things, so I'll just find some other raw log analysis tool that appears to be more accurate, run the two in tandem, and see what happens.

coopster

6:20 pm on Mar 24, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I wouldn't know what else to do at that point either. I was brainstorming with ya, trying to think logically about what might cause skewed figures. That software is open source (the programming language is C), and its logic builds summary data by compiling stats from the CLF, if I remember correctly. And it does so based on a user-configurable .conf file. Without following the code via breakpoints, etc., I am uncertain how you are going to find any discrepancies. Ugh. I'm interested to know what you discover, though.