Crunching large log files


Vimes

11:43 am on Mar 1, 2006 (gmt 0)

10+ Year Member



Hi,

Ours is a global B2C dot-com company. We generate an average 1.7 GB log file a day, have over 200 thousand pages on our site, and average half a million page views a day. We are using a commercial log analysing tool, but it's having problems with the size of the log files, and crunching that much data is destroying the speed of my server.

Any recommendations on the best analytics software for tracking referrals of all kinds that doesn't use a huge amount of server resources while running?

Thanking you in advance.

Vimes.

nmattheij

1:38 pm on Mar 1, 2006 (gmt 0)

10+ Year Member



Perhaps you can have the server create hourly log files instead of daily ones.

Find your log rotation settings on the server and see if you can change them.

Matt Probert

1:50 pm on Mar 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Two ideas:

1) Transfer the log files off the server before analysing so that the analysis does not clog up the server.

2) Filter any superfluous lines out of the log files. If you are not interested in analysing image transfers, for example, you can filter those requests out, leaving a greatly reduced log file. The same goes for favicon.ico requests.
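
Something along these lines would do it. This is only a rough, untested sketch: it assumes NCSA combined-format log lines, and the file names and extension list are placeholders you'd adjust for your own site.

    import re

    # Rough sketch: copy a combined-format access log, dropping image and
    # favicon.ico requests. File names and the extension list are placeholders.
    REQUEST = re.compile(r'"(?:GET|POST|HEAD) ([^ "]+)')
    SKIP_EXT = (".gif", ".jpg", ".jpeg", ".png", ".ico")

    with open("access.log") as src, open("access_reduced.log", "w") as dst:
        for line in src:
            m = REQUEST.search(line)
            path = m.group(1).split("?")[0].lower() if m else ""
            if not path.endswith(SKIP_EXT):
                dst.write(line)

Run it against a copy of the log, not the file the server is still writing to.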

Matt

carguy84

4:14 pm on Mar 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I found that setting up a zip job to compress the files each night really helped when I'd pull them into Urchin. Urchin seems to handle really large log files well.
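
The nightly job itself can be tiny. For example (a sketch only; the path and naming convention are made up, and it assumes the server has already rolled over to today's file when it runs):

    import gzip
    import os
    import shutil
    from datetime import date, timedelta

    # Compress yesterday's access log once the server has moved on to today's
    # file. The path and naming convention below are placeholders for your setup.
    yesterday = date.today() - timedelta(days=1)
    src_path = "/var/log/httpd/access-%s.log" % yesterday.strftime("%Y%m%d")

    with open(src_path, "rb") as src, gzip.open(src_path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)

    # os.remove(src_path)  # only after verifying the .gz is good

Schedule it from cron (or the Windows task scheduler) shortly after midnight.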

Chip-

Vimes

1:31 am on Mar 2, 2006 (gmt 0)

10+ Year Member



Hi,

Thanks for the feedback; filtering sounds interesting. When I put this to the admin boys they will ask how, so could you give me an example of how I can filter out the images? Do I have to wait for the day's log file to stop recording, or is there a way to stop .jpg and .gif requests from ever being logged in the first place?

Thanks again.

Vimes.

ronburk

7:10 am on Mar 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would do the filtering as part of the process of copying the log files to a separate machine. The thought of simply not logging, say, .jpg files should give the "admin boys" a slightly uncomfortable feeling. Like, hmmm, how would we detect if some BadButDumb Person has copied our website but is using our server to supply the images (at our expense)?

I always like to leave everything in the original logs, in case sometime in the future I decide that something I thought was not important is important after all. But, if you're dying in a flood of log info, you gotta do what you gotta do.

You might also want to filter out HTTP requests that came from within your own company.

Although I leave everything in the original log, before doing log analysis I filter out:

  • Local network requests to the website (based on IP address range)
  • All presentation graphics (which I have conveniently arranged to reside in "/images/", so it's real easy to filter them).
  • All CSS files
  • All JavaScript files.
  • /favicon.ico
  • All crud requests (buffer overflow attempts, Unicode exploits, too-dumb-to-notice-I'm-not-running-IIS hacks, etc.) These can be pretty big requests sometimes.

So, mostly, that's local requests, plus all embedded HTML references, plus that damnable icon from hell.
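
The local-network part of that is easy to bolt onto the same kind of copy-and-filter script as the extension filter sketched earlier. Something like this (the address range, field position and file names are assumptions, so adjust them for your own network and log format):

    import ipaddress

    # Drop requests that came from our own network before shipping the log
    # off for analysis. The range below is an example; use your office/VPN range.
    LOCAL_NET = ipaddress.ip_network("192.168.0.0/16")

    def is_local(line):
        # In common/combined log format the client address is the first field.
        first_field = line.split(" ", 1)[0]
        try:
            return ipaddress.ip_address(first_field) in LOCAL_NET
        except ValueError:  # hostname or garbage instead of an IP address
            return False

    with open("access.log") as src, open("access_filtered.log", "w") as dst:
        for line in src:
            if not is_local(line):
                dst.write(line)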

Unrelated Musing: Hmmm, I wonder how my MySql schema for logfile analysis would perform if I was pumping .5 million entries in per day? Probably not so good after several months. Oh well, the poor little Pentium II it's running on is handling several million entries OK so far, so I'll just hope any great increase in traffic brings enough money for a hardware upgrade.

Vimes

10:42 am on Mar 2, 2006 (gmt 0)

10+ Year Member



Hi

Thanks Ron,

Sorry, I should have made it clearer, I suppose.

The package I use for crunching the stats actually does everything you have mentioned, and that's already implemented. The "Admin boys" have been complaining that the log files are getting too large, taking valuable server resources to compress and too much disc space on the server, and they curse me every time I want the stats zipped and FTP'd down to my desktop. So it's the log file size that they are complaining about. Once I've got it on my desktop that's not much of a problem; I can destroy it however I want. We aren't ready for a dedicated stats server, it just isn't cost effective yet, but they want me to investigate ways of making the files smaller.

They want me to reduce what I collect, but I need everything that's collected; it's an ongoing battle. ;) I've asked them to create maximum-size logs of maybe half a gig and auto-zip them, but they feel that this is still a size problem, as the product that shows the live data needs the logs to be uncompressed.

So there's no way of stopping the log file from getting to the size it gets to, then. Everything is optimized in the logging settings so it only records the bare minimum/essentials.

Any ideas?

Vimes.

Dijkgraaf

7:55 pm on Mar 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It sounds like the Admin boys are agitating for a larger and better server to me :-)

ronburk

12:55 am on Mar 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



an example of how I can filter out the images

Your Admin boys should be able to configure that for you on the server side (you didn't give any clue as to what server or OS is being used, so I assume you're not looking for server config info). You'll make their life easier if you arrange to store all your graphics (and only graphics) beneath a single directory tree.

If you're like most web sites, just filtering out the graphics will chop the log size in two, at least. If that isn't enough, maybe you've got some .css and .js files referenced a lot that you can stop logging as well.
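
Before asking the admins to change the server config, it can be worth measuring how much of the log those requests actually account for. A quick sketch (file name and extension list are placeholders) that reports the share of lines and bytes you'd save:

    import re

    # Count what share of an access log is image/CSS/JS requests, to estimate
    # how much filtering them out would shrink the file.
    REQUEST = re.compile(r'"(?:GET|POST|HEAD) ([^ "]+)')
    STATIC_EXT = (".gif", ".jpg", ".jpeg", ".png", ".ico", ".css", ".js")

    total_lines = static_lines = total_bytes = static_bytes = 0

    with open("access.log", "rb") as log:
        for raw in log:
            total_lines += 1
            total_bytes += len(raw)
            m = REQUEST.search(raw.decode("latin-1"))
            path = m.group(1).split("?")[0].lower() if m else ""
            if path.endswith(STATIC_EXT):
                static_lines += 1
                static_bytes += len(raw)

    print("%d of %d lines (%.0f%% of the bytes) are static-file requests" % (
        static_lines, total_lines, 100.0 * static_bytes / max(total_bytes, 1)))

If that comes out near the "half the log" mark, it makes a good case to take to the admin boys.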