
Forum Moderators: DixonJones & mademetop


Need help in interpreting raw log files

I find Webalizer reporting data notably different from raw log files

4:44 pm on Jun 4, 2009 (gmt 0)

Junior Member

10+ Year Member

joined:July 1, 2005
votes: 0

My Webalizer server stats are reporting a number of page views far lower than what I find by looking at my raw log files. As I'm not sure I am interpreting the raw log correctly, I would appreciate your advice on the method I am using. Here's an example of how a line of my raw log files looks:  - -  [20/May/2009:09:28:04 -0700] "GET /index.php/ HTTP/1.1" 200 45626 "http://www.google.ie/search?hl=en&q=..." "Mozilla/4.0 (compatible; ....)"

To make it more readable, here is a list of the values included in the example line (each numbered value corresponding to a column in the line):
2. - -
3. [20/May/2009:09:28:04 -0700]
4. "GET /index.php/ HTTP/1.1"
5. 200 45626
6. "http://www.google.ie/search?hl=en&q=..."
7. "Mozilla/4.0 (compatible; ....)"

To count page views, I take only the requests for files with the extension .php in column 4.
The number of pages I find through this method is hugely bigger than what Webalizer reports.
Also, in order to get rid of log entries caused by robots, I eliminate lines where column 7 (the user-agent/browser info) contains the string "bot". Even with that, I am still finding a number of page views much higher than what Webalizer reports.

Am I missing anything? Also, since most of my page views have a dynamic URL, I wonder: does Webalizer count page views with dynamic URLs at all?
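For what it's worth, the counting method described above can be sketched in Python roughly as follows. This is only an illustration, not how Webalizer itself counts: the regex matches the Apache "combined" log format shown in the example line, and the filtering rules (.php in the path, status 200, no "bot" in the user-agent) are my reading of the poster's method.

```python
import re

# Apache "combined" log format: IP, identd, user, [timestamp],
# "request", status, bytes, "referrer", "user-agent"
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def count_php_views(lines):
    """Count successful .php requests whose user-agent lacks 'bot'
    (an approximation of the method described in the post)."""
    views = 0
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        parts = m.group("request").split()  # e.g. GET /index.php/ HTTP/1.1
        if len(parts) < 2:
            continue
        path = parts[1].split("?")[0]       # drop the query string
        if ".php" not in path:
            continue
        if m.group("status") != "200":
            continue
        if "bot" in m.group("agent").lower():
            continue
        views += 1
    return views
```

Note that dropping the query string before the extension test means dynamic URLs like `/index.php?id=2` still count as page views, which is one place a hand count and Webalizer's defaults can diverge.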


4:15 am on June 10, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Feb 20, 2008
votes: 0

Not all bots have "bot" in their user-agent string.

I take the raw logs and write them to a db table with a script that runs hourly. Because the logs reset daily, I key each row by IP + time + other unique info, so the same row never gets inserted twice.
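A minimal sketch of that dedup-on-insert idea in Python (the field choice is my guess at "IP + time + other unique info", and the "table" here is just a list standing in for the poster's db table):

```python
seen = set()  # keys already inserted this day

def row_key(row):
    """Dedup key so re-reading the same daily log hourly is harmless.
    (Hypothetical fields; pick whatever combination is unique in yours.)"""
    return (row["ip"], row["time"], row["request"])

def insert_once(row, table):
    """Insert a log row unless an identical key was already written."""
    k = row_key(row)
    if k in seen:
        return False
    seen.add(k)
    table.append(row)
    return True
```

In a real setup the same effect comes from a unique index on those columns, letting the database reject duplicates instead of tracking them in memory.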

On the way in, I look up each IP in a robots IP lookup table. If it's a bot, the row goes to a second db table.

(As I find new bots, I add their full 4-segment IP.)

I look up the full 4-segment IP in the robots table first. If a banned range is one those jerk bots hop around in, changing IPs within it, I simply write the first 3 segments to the bot table, which the lookup also checks whenever the full 4 segments aren't found.
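That two-step lookup (exact IP first, then the 3-segment prefix for ranges) might look like this. A sketch only, assuming the tables are loaded into sets; the poster uses db tables, and the names are mine:

```python
def is_known_bot(ip, bot_ips, bot_prefixes):
    """Return True if the full 4-segment IP is in the bot table, or,
    failing that, if its first 3 segments match a banned prefix.
    bot_ips: set of full IPs; bot_prefixes: set of 'a.b.c' strings."""
    if ip in bot_ips:
        return True
    prefix = ".".join(ip.split(".")[:3])
    return prefix in bot_prefixes
```

Checking the exact IP before the prefix keeps single-IP entries cheap while still catching a bot that rotates through the last octet of a known range.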

With the data separated into users (plus any unknown bots not yet in my bot table) and known bots, it's much easier to get counts.

The script displays a continuous hourly count, but I can also run queries off the users table on demand or daily, and if I've since added any entries that were really bots, the query skips them.

(Sometimes you need to look at behaviour to tell whether something is a bot: for example, it issues HEAD requests, which real users' browsers don't, or it scrapes a bunch of pages quickly, etc.)
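Those behavioural signals can be turned into a rough flagging pass. This is my own sketch, not the poster's code: the record shape `(ip, unix_time, method)` is a simplified stand-in for db rows, and the burst/window thresholds are illustrative, not tuned values.

```python
from collections import defaultdict

def suspect_bots(records, burst=30, window=60):
    """Flag IPs that issue HEAD requests, or that make more than
    `burst` requests within `window` seconds (illustrative thresholds).
    records: iterable of (ip, unix_time, method) tuples."""
    by_ip = defaultdict(list)
    flagged = set()
    for ip, t, method in records:
        if method == "HEAD":       # real browsers essentially never HEAD pages
            flagged.add(ip)
        by_ip[ip].append(t)
    for ip, times in by_ip.items():
        times.sort()
        for i in range(len(times)):
            j = i + burst
            # burst+1 hits inside the window => scraper-like pace
            if j < len(times) and times[j] - times[i] <= window:
                flagged.add(ip)
                break
    return flagged
```

Anything this flags would then be a candidate to add to the bot IP table rather than an automatic ban.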

Then you can run whatever queries you want against the table, which tends to grow large.

I can separate out by IP to give a 'visit' trail for any IP, or a chain of IPs in the case of AOL visits and the like.
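A trail like that is just a filter-and-sort over the rows; accepting a set of IPs covers the case where one visitor arrives through several proxy IPs (as AOL users did). Again a sketch with hypothetical row fields, not the poster's actual query:

```python
def visit_trail(rows, ips):
    """Return the time-ordered request trail for one IP or a set of
    related IPs. rows: dicts with 'ip', 'time', 'request' keys
    (hypothetical field names)."""
    hits = [r for r in rows if r["ip"] in ips]
    hits.sort(key=lambda r: r["time"])
    return [(r["ip"], r["time"], r["request"]) for r in hits]
```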

Other interesting analyses can be done.
I wouldn't put too much stock in web-stats figures; they are not as smart as you can be if you analyze the logs yourself in detail.

I keep one log file per day so I can reload the db if something gets corrupted.