
Reading Raw Server Logs - Most Efficient Methods


captcontent

2:30 pm on Sep 14, 2010 (gmt 0)

10+ Year Member



I am trying to see what the pros use to read raw server logs. Everyone knows it is a tedious process at best but it is a vital web survival skill. Finding the most efficient method/process would be helpful to all.

The process I use, which is probably archaic at best, is as follows:

1. The log analyzer in cPanel allows the Raw Access Logs to be downloaded.
2. Download the current Raw Access Logs for example.com.
3. Extract the .gz file to a directory.
4. Create a text file from the example.com file produced by the extraction (see the one-liner below).
5. Import the text file into a spreadsheet.
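
For what it's worth, steps 3 and 4 collapse into a single command when a shell is available; the filename is just an example:

    # decompress the downloaded raw access log into a plain text file
    gunzip -c example.com.gz > example.com.txt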

The above allows viewing, but it would be nice to hear about a better way, or about a system or piece of software that helps with this process.

Any and all input would be helpful to many, I am sure. Anything that can be done to combat rogue bots and help protect our content is appreciated.

lammert

12:49 am on Sep 15, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My main concern would be number 5. Spreadsheet programs are not designed to handle the large number of rows present in raw website logfiles. Older versions of Excel simply don't allow importing them (don't know about the most recent versions) and OpenOffice gets incredibly slow once the number of rows increases too much.

A database program would be a better option for storing and analyzing the logfiles, with the added benefit that database programs have better tools for sorting and selecting specific rows than spreadsheet programs do.
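
If you go that route, a rough sketch using the sqlite3 command-line shell could look like the following. It assumes the standard combined log format, and the file, database and table names are only examples:

    # convert the combined log to tab-separated fields: IP, request, status, user agent
    awk -F\" '{ split($1, a, " "); split($3, b, " "); print a[1] "\t" $2 "\t" b[1] "\t" $6 }' access.log > access.tsv

    # create a table, load the result into SQLite, then list the twenty busiest IPs
    sqlite3 logs.db "CREATE TABLE hits (ip TEXT, request TEXT, status TEXT, agent TEXT);"
    printf '.mode tabs\n.import access.tsv hits\n' | sqlite3 logs.db
    sqlite3 logs.db "SELECT ip, COUNT(*) AS n FROM hits GROUP BY ip ORDER BY n DESC LIMIT 20;"

The last query is usually a good first place to look for scrapers.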

Another option would be to do some preprocessing of the data directly on your server with a scripting language such as AWK, and only download the suspicious log lines to your local computer. That requires shell access to your server and some knowledge of scripting languages, though.
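
To give an idea of the kind of preprocessing I mean, a one-liner along these lines would do. It assumes the standard combined log format; the file path and the patterns to match are only examples:

    # keep only lines with a 4xx status or a user agent that looks like a script or scraper tool
    awk '$9 ~ /^4/ || tolower($0) ~ /curl|wget|libwww|python/' /var/log/apache2/access.log > suspicious.log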

captcontent

4:59 pm on Sep 15, 2010 (gmt 0)

10+ Year Member



Lammert, thanks for the input. I would like to find the most efficient process. I am in charge of trying to protect a lot of original content and realize that I will be spending a lot of time, daily, going over log files.

1. Database program - where would one find such a program?

2. AWK scripting - I am just beginning to understand regex, so writing my own scripts will not work.

I like the sound of both but personally cannot do either. I do, however, need to get this done and really appreciate your help.

It seems, given the current state of affairs on the web regarding rogue bots, scrapers, etc., that there would be more tools out there. I have looked pretty hard, and they are not easy to find.

I am very thankful for your suggestions.

Dijkgraaf

3:56 am on Sep 20, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



@lammert The latest version of Excel allows 1,048,576 rows, so it should be able to cope with at least a day's worth of logs unless you run a very busy site like Google.

But yes, it does have some limitations, and if your filter matches a large number of rows it can take a while to start responding again.
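
If a log does go past that row limit, one workaround is to split it into spreadsheet-sized chunks before importing, something like this (file names are just examples):

    # break the log into pieces of at most 1,000,000 lines each, below Excel's 1,048,576-row ceiling
    split -l 1000000 access.log access_part_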

phranque

10:36 pm on Sep 20, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i'm with lammert here.
most of the time my log analysis is best accomplished with some pipe and grep, sort, uniq, cut and perhaps sed.
if things get desperate i write a perl or awk script and occasionally i will actually use a spreadsheet.
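
typical one-liners from that grep/sort/uniq toolkit look like this (the file name and the ip are just placeholders):

    # top 20 requesting ip addresses
    cut -d' ' -f1 access.log | sort | uniq -c | sort -rn | head -20
    # what one suspicious ip has been requesting
    grep '^203.0.113.50 ' access.log | cut -d'"' -f2 | sort | uniq -c | sort -rn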

tangor

11:01 pm on Sep 20, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Suggest Access, which handles billions of rows instead of a million and works really well with the space delimiter. You can also set up relation tables to filter out bots, images, etc. Roll your own code. This approach gets your hands dirty with grunt work, but it does give you a more direct look.

Just remarking on the tool in my toolbox... there are others.

YMMV

captcontent

6:50 pm on Sep 25, 2010 (gmt 0)

10+ Year Member



I would really be interested in finding out more about where and how to source, create, and use an efficient method. Luckily, the site I have to monitor is new and its traffic is small, but it is growing daily, and with traffic come the pesky critters. I am trying to take a proactive approach and want to slam the door shut as soon as I can on all uninvited guests.

We have many, many pages of unique, quality content, with 5-10 new pages coming online daily. The problem is only going to get worse, and time is short. Specific techniques would be much appreciated.

In fact, I feel a method should be standard issue for the foot soldiers trying to beat down rogue bots. I have been out of the game for almost 5-6 years and am totally shocked at the sheer number of intruders these days. And you certainly can't rely on search engines to police right from wrong anymore.

All input much appreciated!

enigma1

11:33 am on Sep 27, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



IMO your best bet is to handle them directly from a file and extract the information you may be interested in. If you have full control of the server, you should also have the ability to specify which fields to log.

Depending on the server, the logs can be so big (they run into dozens of gigabytes) that even browsing or editing them with the most popular apps is impossible (although you can write your own tools). So you need to decide how to filter the log entries and extract the ones you need. If it's to extract the user agent, for example, you first break the log into smaller files, since in some cases you will hit the 32-bit integer limit, and then apply a preg_match to each file for the user agent string in question, extracting the matching rows.
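
As an illustration, a shell equivalent of that split-and-match step might look like this (grep stands in for preg_match here, and the bot string and file names are just placeholders):

    # break the huge log into 2,000,000-line pieces, then pull the rows matching one user agent
    split -l 2000000 access.log chunk_
    for f in chunk_*; do
        grep -F 'SomeBot/1.0' "$f" >> somebot_hits.log
    done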

For rogue bots, you could identify them from the logs and then verify them, so you know which IPs they come from. Although this method is reliable, it may have a very short lifetime: many IPs are dynamic or shared by many users, and they can represent compromised systems one day and clean systems the next. Yet the IP is perhaps one of the very few fields on an incoming request you can rely on.
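
For crawlers that claim to be a major search engine, the usual verification is a reverse DNS lookup followed by a forward confirmation; the IP below is just an example from Google's published crawler range:

    # reverse lookup: should resolve to a *.googlebot.com or *.google.com host name
    host 66.249.66.1
    # forward lookup on that host name: should return the original IP, otherwise the "Googlebot" is a fake
    host crawl-66-249-66-1.googlebot.com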

Some of the identification can be done in real time, and other methods can be done later, but each one has its pros and cons.
- Client Headers checks
- Access Patterns
- Timing on access requests
- Server to client checks

So if, say, you wanted to check whether a request attempts to break something, you could immediately validate the parameters of interest. You could also check some headers: whether the client declares a language, whether it supports compression, or whether there is a proxy hint.

But if you wanted to check whether the IP comes from another server that could be compromised at the time, you would need to start scanning ports or querying databases for IP info, which is not practical for real-time request processing, as it consumes server resources needed for your real clients. There is also a chance the scan will backfire: if the client, for instance, gets reports from his firewall about scans coming from your server, he may decide to abandon your site.

Another factor is the type of site you have: personal, forum, business, e-commerce, etc. With some types, like e-commerce, you may want to do very little, as you cannot easily foretell whether spiders will trigger something and cost you sales. In these cases, securing the application is extremely important, and you can then perform back-end analysis of the orders placed from another system.