
Home / Forums Index / WebmasterWorld / Website Analytics - Tracking and Logging
Forum Library, Charter, Moderators: Receptional & mademetop

Website Analytics - Tracking and Logging Forum

Reading and Understanding Raw Logs?
SeK612




msg:892395
 12:50 pm on Jan 6, 2006 (gmt 0)

A short summary of the reason for this post: several days ago, on New Year's Eve, my site was suddenly hit by many error messages, making it virtually unbrowsable.

My host is pretty poor on the support side, but an intermittent line of communication has been set up. So far I've established that the problems are occurring because my site is receiving too many (literal) hits.

I've found this hard to believe, as my site is very small in terms of internet traffic. My host's stats show that my site is receiving around 1 million literal hits a day (and uses about 2 GB of bandwidth).

The problem is my external stat program shows that on a good day I only receive about 1,000 unique visitors (and about 5-7 thousand page views). Since my site is only text and small images, I'm struggling to work out how this equates to the 1 million hits my host is recording.

So far I've begun trying to rule out bad scripts and have removed a single mailing script that was causing a few problems. This hasn't seemed to stem the hits, though. I also tried temporarily excluding all bots from the site; this didn't make a difference in the hour or so they were excluded.

Aside from this, the only other feature I make use of is PHP includes. These are used fairly extensively, but I'm not entirely sure they could be causing the massive hit difference.

This leaves checking my logs to try to find out what's going on. The logs are raw and in two forms: errors generated and account activity. I've attempted to sift through them (sift being the appropriate word, as the mass of hits is causing them to grow at a rapid rate) but can't really understand them.

Essentially I'm looking for advice on how to understand these logs so I can pinpoint exactly what's causing the hit count to rise so quickly.
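(For anyone landing here with the same question: a few standard command-line one-liners will summarize a raw Apache-style access log quickly. The sample lines below are invented stand-ins in the same shape as the account log; in practice you'd point the commands at the real log file.)

```shell
# Invented sample lines in the same shape as an Apache access log;
# substitute the path to your real raw log for access.log below.
cat > access.log <<'EOF'
217.0.0.1 Account - [06/Jan/2006:06:14:55 +0000] "GET /Counter.txt HTTP/1.0" 503 388 "-" "-"
217.0.0.1 Account - [06/Jan/2006:06:14:55 +0000] "GET /Sites.txt HTTP/1.0" 302 293 "-" "-"
68.0.0.1 Account - [06/Jan/2006:06:14:55 +0000] "GET /index.php HTTP/1.0" 200 1865 "-" "-"
EOF

# Which IPs are hitting the site hardest (field 1 = client address)
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head

# Which paths are requested most (field 7 = request path)
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head

# Tally of HTTP status codes (field 9)
awk '{print $9}' access.log | sort | uniq -c | sort -rn
```

The `sort | uniq -c | sort -rn` idiom counts duplicates and puts the biggest offenders at the top, which is usually enough to spot who or what is inflating the hit count.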

 

zCat




msg:892396
 1:27 pm on Jan 6, 2006 (gmt 0)

What format are the logs in? Can you post an example?

SeK612




msg:892397
 3:00 pm on Jan 6, 2006 (gmt 0)

I'm not sure about the file format. It can be read as text, though.

A snapshot of the error logs follows; I've edited out some of the private and domain-specific info. The 217 IP address resolves to my host server.

[Fri Jan 6 11:26:06 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:06 2006] [error] [client 217.***.***.***] File does not exist: Path/Sites.txt
[Fri Jan 6 11:26:06 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:06 2006] [error] [client 217.***.***.***] File does not exist: Path/Sites.txt
[Fri Jan 6 11:26:06 2006] [error] client access to www.domain.com deferred, MaxClients 30 exceeded
[Fri Jan 6 11:26:06 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:06 2006] [error] [client 217.***.***.***] File does not exist: Path/Sites.txt
[Fri Jan 6 11:26:06 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:06 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:06 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:06 2006] [error] [client 217.***.***.***] File does not exist: Path/Sites.txt
[Fri Jan 6 11:26:06 2006] [error] client access to www.domain.com deferred, MaxClients 30 exceeded
[Fri Jan 6 11:26:06 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:06 2006] [error] [217.***.***.***] File does not exist: Path/Advert.txt
[Fri Jan 6 11:26:06 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:06 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:06 2006] [error] [client 217.***.***.***] File does not exist: Path/Advert.txt
[Fri Jan 6 11:26:07 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:07 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:07 2006] [error] client access to www.domain.com deferred, MaxClients 30 exceeded
[Fri Jan 6 11:26:07 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:07 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:07 2006] [error] [client 217.***.***.***] File does not exist: Path/Advert.txt
[Fri Jan 6 11:26:07 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:07 2006] [error] client access to www.domain.com deferred, MaxClients 30 exceeded
[Fri Jan 6 11:26:07 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:07 2006] [error] [client 217.***.***.***] File does not exist: Path/Advert.txt
[Fri Jan 6 11:26:07 2006] [notice] cannot use a full URL in a 401 ErrorDocument directive --- ignoring!
[Fri Jan 6 11:26:07 2006] [error] client access to www.domain.com deferred, MaxClients 30 exceeded
[Fri Jan 6 11:26:07 2006] [error] client access to www.domain.com deferred, MaxClients 30 exceeded
[Fri Jan 6 11:26:07 2006] [error] client access to www.domain.com deferred, MaxClients 30 exceeded
[Fri Jan 6 11:26:07 2006] [error] client access to www.domain.com deferred, MaxClients 30 exceeded

A snapshot of the account log. Again, the 217 entries are all the same IP address, which is my host server; the others appear to be general visitors. MaxClients seems to be a limit to prevent too many connections at one time.


217.***.***.*** Account - [06/Jan/2006:06:14:54 +0000] "GET /Error/Error404.php HTTP/1.0" 200 1865 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Counter.txt HTTP/1.0" 503 388 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Advert.txt HTTP/1.0" 503 388 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Counter.txt HTTP/1.0" 503 388 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Sites.txt HTTP/1.0" 302 293 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Error/Error404.php HTTP/1.0" 200 3460 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Error/Error404.php HTTP/1.0" 503 388 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Counter.txt HTTP/1.0" 200 270 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Error/Error404.php HTTP/1.0" 200 76170 "-" "-"
68.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Image Path/ referr details etc
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Sites.txt HTTP/1.0" 503 388 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Counter.txt HTTP/1.0" 200 270 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Error/Error404.php HTTP/1.0" 200 79900 "-" "-"
68.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Image Path/ referr details etc
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Counter.txt HTTP/1.0" 200 270 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Error/Error404.php HTTP/1.0" 200 1865 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Sites.txt HTTP/1.0" 302 293 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:55 +0000] "GET /Sites.txt HTTP/1.0" 503 388 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:56 +0000] "GET /Sites.txt HTTP/1.0" 302 293 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:56 +0000] "GET /Advert.txt HTTP/1.0" 302 293 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:56 +0000] "GET /Advert.txt HTTP/1.0" 503 388 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:56 +0000] "GET /Advert.txt HTTP/1.0" 302 293 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:56 +0000] "GET /Advert.txt HTTP/1.0" 302 293 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:56 +0000] "GET /Advert.txt HTTP/1.0" 503 388 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:56 +0000] "GET /Advert.txt HTTP/1.0" 302 293 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:56 +0000] "GET /Advert.txt HTTP/1.0" 503 388 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:56 +0000] "GET /Sites.txt HTTP/1.0" 302 293 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:57 +0000] "GET /Advert.txt HTTP/1.0" 302 293 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:57 +0000] "GET /Error/Error404.php HTTP/1.0" 503 388 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:57 +0000] "GET /Sites.txt HTTP/1.0" 503 388 "-" "-"
202.***.***.*** Account - [06/Jan/2006:06:14:57 +0000] "GET /Image Path/ referr details etc"
217.***.***.*** Account - [06/Jan/2006:06:14:57 +0000] "GET /Advert.txt HTTP/1.0" 503 388 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:57 +0000] "GET /Counter.txt HTTP/1.0" 200 270 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:57 +0000] "GET /Error/Error404.php HTTP/1.0" 200 1865 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:57 +0000] "GET /Advert.txt HTTP/1.0" 503 388 "-" "-"
217.***.***.*** Account - [06/Jan/2006:06:14:57 +0000] "GET /Sites.txt HTTP/1.0" 503 388 "-" "-"

larryn




msg:892398
 4:26 pm on Jan 6, 2006 (gmt 0)

Sek,

These are standard Apache log files in what looks like a bastardization of 'combined' format. The meaning of the fields can be found at the Apache.org [apache.org] web site:

Log Format Layout:
[httpd.apache.org...]

Error Log Information:
[httpd.apache.org...]

I said 'bastardized' because, for some reason, your logs are not capturing referrer information, instead putting '-' in the extended fields (referrer & agent). That's very strange, as pretty much every browser and spider will identify itself (see the Combined Log Format info at the Apache site). Blank referrers are common, but 100% blank suggests other issues. The referrer info is really helpful for any analysis, and pretty much the only way to identify whether your site is the target of log spammers [webmasterworld.com].
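(For quick reference, the Combined Log Format fields, and one way to pull the quoted referrer/agent fields out with awk: splitting on the double-quote character keeps quoted fields whole even when they contain spaces. The sample lines and the access.log name are invented for illustration.)

```shell
# Combined Log Format, per the Apache docs:
#   %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
#   host ident user [time] "request" status bytes "referer" "user-agent"
cat > access.log <<'EOF'
1.2.3.4 - - [06/Jan/2006:12:00:00 +0000] "GET / HTTP/1.0" 200 1865 "http://example.com/" "Mozilla/4.0"
5.6.7.8 - - [06/Jan/2006:12:00:01 +0000] "GET /a HTTP/1.0" 200 270 "-" "-"
EOF

# With -F'"' the quoted fields land in fixed positions:
#   $2 = request, $4 = referrer, $6 = user-agent
awk -F'"' '{print $4}' access.log | sort | uniq -c | sort -rn   # referrers
awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn   # user-agents
```

In logs like the ones posted above, both of those fields come back as '-' for every line, which is exactly the oddity being described.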

Also, there is a problem in your configuration that is resulting in all those 500-series errors; fixing that will clean up the logs so you can better identify the external issues.

If you have more questions about your results, feel free to ask/sticky.

Larry

pageoneresults




msg:892399
 4:37 pm on Jan 6, 2006 (gmt 0)

[Fri Jan 6 11:26:06 2006] [error] client access to www.domain.com deferred, MaxClients 30 exceeded

I have a question. What is the above? Do you have some sort of limitation on the number of clients that can connect to your site?

SeK612




msg:892400
 5:04 pm on Jan 6, 2006 (gmt 0)

Well, my first question would be regarding the 217.***.***.*** hits. 217.***.***.*** is the address of the server rather than a visitor's IP. It seems that these are making up the bulk of the bloat in both logs.

Perhaps these are server replies; they seem to relate to problems with the PHP includes I'm making use of. The include method is as follows:

<?php include('http://www.domain.com/filewithcode.txt');?>

Some of the includes are missing the trailing semicolon ( ; ) at the end, if that makes a difference.

Again, I'm looking for reasons why my account usage is so high, given that I only receive a thousand unique visitors a day.

MaxClients is a limitation on the account which is causing the mass of visible errors on the page (which then cripple the site, as a visitor sees an error on every other page). As things stand my site is on a shared hosting account, and the mass of literal hits is triggering whatever limit is in place. When the error appears in the log file, the user accessing the site at that point sees a server error page (a 503 error: "The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.").

pageoneresults




msg:892401
 5:19 pm on Jan 6, 2006 (gmt 0)

As things stand my site is on a shared hosting account and the mass of literal hits is triggering whatever system is in place.

It almost sounds like a bot is wreaking havoc on your site. Maybe someone (sneaky) found the limitation on your account and is now running an automated process to bring down the site. Maybe that is happening to others on the server. Maybe it is all part of a master plan to get you to upgrade to a paid hosting plan? I'm not saying that is the case in your instance, but it is a valid concern. :(

zCat




msg:892402
 5:33 pm on Jan 6, 2006 (gmt 0)

If I understand the logs correctly, this could be part of the problem:

<?php include('http://www.domain.com/filewithcode.txt');?>

i.e. every PHP include of a URL generates an HTTP hit to your server. I'd guess this means every hit on your site is generating multiple other hits to fetch the includes, which your host might be (wrongly) assessing as external traffic.
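(One rough way to test that theory against the access log is to count how many entries originate from the server's own address. Below, 217.0.0.1 stands in for the masked server IP and the log lines are invented samples; run the same grep against the real log and real address.)

```shell
# Invented sample lines; two of the three come from the server's own IP,
# i.e. the server fetching its own include files over HTTP.
cat > access.log <<'EOF'
217.0.0.1 Account - [06/Jan/2006:06:14:55 +0000] "GET /Counter.txt HTTP/1.0" 503 388 "-" "-"
217.0.0.1 Account - [06/Jan/2006:06:14:55 +0000] "GET /Advert.txt HTTP/1.0" 302 293 "-" "-"
68.0.0.1 Account - [06/Jan/2006:06:14:55 +0000] "GET /index.php HTTP/1.0" 200 1865 "-" "-"
EOF

self=$(grep -c '^217\.0\.0\.1 ' access.log)
total=$(awk 'END {print NR}' access.log)
echo "$self of $total requests come from the server itself"
```

If the self-inflicted fraction is anywhere near what the posted snippets suggest, the URL-style includes alone could explain most of the gap between 1,000 visitors and 1 million "hits".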

I don't know whether that would account for all the traffic claimed. However, you could have a bit of a security problem on your hands, as your code in "http://www.domain.com/filewithcode.txt" etc. is visible to the whole world (unless you've blocked it with .htaccess etc.). Try rewriting your includes so the files are fetched from the filesystem, e.g.:

<?php include('/path/to/filewithcode.txt');?>

SeK612




msg:892403
 5:37 pm on Jan 6, 2006 (gmt 0)

I do feel that something is wrong, hence the topic.

I'm not sure about the host. Their reply to my voicing my concern was simply to state that I need to upgrade to a co-location server (an increase in price from 5 to 50 a month).

They are responding, though (albeit very slowly; the problems have been ongoing for about a week :() and seem happy to let my site sit there without pushing the upgrade that much (though the site is pretty useless at the moment, since no one can browse it successfully thanks to the errors).

I have tried to rule out some kind of bot or DDoS attack. The total exclusion of all bots via a robots.txt file didn't seem to have any effect, so it's possibly something else.

Another possibility is that coding errors (like the PHP includes) are simply causing the server to tie itself in knots.

I'm really not sure, though, which is why I'm looking for solutions. I suppose an upgrade would be feasible if I felt it was really needed (though the price increase is high). If something is going wrong, I also don't feel I should simply pay to bury the problem, especially if it creeps back as traffic increases further down the line.

SeK612




msg:892404
 5:48 pm on Jan 6, 2006 (gmt 0)

However, you could have a bit of a security problem on your hands, as your code in "http://www.domain.com/filewithcode.txt" etc. is visible to the whole world (unless you've blocked it with .htaccess etc.).

They are, but the names aren't public (so Advert.txt isn't referenced on a generated page, as the code is included instead of the file).

Try rewriting your includes so the files are fetched from the filesystem, e.g.:

<?php include('/path/to/filewithcode.txt');?>

I can do that, but how does it make a difference? Is it not bad practice to have raw paths in a file (if you mean a reference such as accountname/WWWroot/widget/include.txt)?

Literal URLs are also used to save time and avoid problems, so the include code is kept the same rather than using relative URLs and having ../../../include.txt on one page and ../../include.txt on another page a level higher.

It's possible that these are causing the log inflation. The bandwidth for the site is also high, though.

Is it possible that 1,000 unique visitors and 5,000-7,000 page views could generate 2 GB of bandwidth a day, given that the pages are made up of text and images?

zCat




msg:892405
 5:49 pm on Jan 6, 2006 (gmt 0)

I'd rewrite those includes as I suggested ASAP. I'm not sure that is the cause of the problem, but doing it the way you have will certainly cause some problems somewhere along the line. If nothing else, it will certainly slow down your site, and it looks like it might be causing you to DDoS your own site.

Is it possible that 1,000 unique visitors and 5,000-7,000 page views could generate 2 GB of bandwidth a day, given that the pages are made up of text and images?

It sounds unlikely, but don't quote me on that. How does your own stat software generate its figures, i.e. does it work on the raw logs?

[edited by: zCat at 6:01 pm (utc) on Jan. 6, 2006]

pageoneresults




msg:892406
 5:53 pm on Jan 6, 2006 (gmt 0)

The total exclusion of all bots via a robot.txt file didn't seem to have any effect so it's possibly something else.

The robots.txt file only applies to those bots that adhere to the standard. There are thousands of other bots that ignore it completely and will happily prance around your site eating up bandwidth.

pageoneresults




msg:892407
 6:00 pm on Jan 6, 2006 (gmt 0)

Is it possible that 1,000 unique visitors and 5,000-7,000 page views could generate 2 GB of bandwidth a day, given that the pages are made up of text and images?

Yes. Take a look at those pages that are being requested the most. Look at the file sizes. Multiply that by 5,000-7,000 and you'll see how much bandwidth they may be eating up.
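(Rather than guessing, the same arithmetic can be done against the raw log itself by summing the response-size field, field 10 in combined format; a '-' in that field simply counts as zero in awk. The sample lines and access.log name below are invented for illustration.)

```shell
# Invented sample lines: two page views plus one large image.
cat > access.log <<'EOF'
1.2.3.4 - - [06/Jan/2006:12:00:00 +0000] "GET /page.php HTTP/1.0" 200 76170 "-" "-"
1.2.3.4 - - [06/Jan/2006:12:00:01 +0000] "GET /img.gif HTTP/1.0" 200 1048576 "-" "-"
5.6.7.8 - - [06/Jan/2006:12:00:02 +0000] "GET /page.php HTTP/1.0" 200 76170 "-" "-"
EOF

# Bytes served per URL, biggest first (field 7 = path, field 10 = size)
awk '{bytes[$7] += $10} END {for (u in bytes) print bytes[u], u}' access.log | sort -rn

# Total bytes served across the whole log
awk '{total += $10} END {print total " bytes"}' access.log
```

Sorting the per-URL totals makes the heavy files obvious, which is a quicker route to the 2 GB answer than multiplying average page sizes by page views.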

P.S. These days, bandwidth usage shouldn't be too much of an issue. I'm really surprised that hosts are still limiting bandwidth when there is plenty of it available. I guess it all depends on how the host is set up and where they are set up.

SeK612




msg:892408
 6:15 pm on Jan 6, 2006 (gmt 0)

Amazing. I think it's fixed (or at least pages are now loading).

The log files reference some missing include files, which I've realised is due to changes that were made to the main site but not reflected on some of the deeper pages (specifically the custom error pages).

I simply updated these error pages with the changes and my site is accessible again, so it seems this was the problem (though I've no real idea why these few pages were causing such a large number of hits and problems).

Still, hopefully this will see the server problems go and usage levels fall back to normal; plus it would appear it won't cost me 45 extra each month :)

I'd still be keen for a bit more information about the possible security problems with my existing includes. Is the advice to have relative listings instead of absolute (so ../../include.txt instead of widgets.com/widget1/include.txt), or a server path, as in the path to the file's position on the server (so Username/WWWroot/widget1/include.txt)?

zCat




msg:892409
 6:40 pm on Jan 6, 2006 (gmt 0)

I'd still be keen for a bit more information about the possible security problems with my existing includes.

If the files are accessible to the outside world and contain stuff such as passwords, you could have a problem. Even if they aren't directly linked publicly, you've now provided some interesting information for someone with a malicious bent to have a crack at your site (tip: use different nicknames when posting to online forums ;-).

Is the advice to have relative listings instead of absolute (so ../../include.txt instead of widgets.com/widget1/include.txt), or a server path, as in the path to the file's position on the server (so Username/WWWroot/widget1/include.txt)?

Judicious use of include_path and the like (you'll have to Google for more info; I don't use PHP all that much and have to look it up every time myself :-( ) will help you to write portable code along the lines of include('widget1/include.txt') while avoiding includes via HTTP (which will cause more problems than they solve; I'd only use them if there was no other option).

larryn




msg:892410
 9:21 pm on Jan 6, 2006 (gmt 0)

Sek,

Your PHP questions would be better answered in another forum, but for your consideration, this is what I think is the best and proper way to access local files for use in your PHP pages:

<?php
$DOCUMENT_ROOT = $HTTP_SERVER_VARS['DOCUMENT_ROOT'];
require_once( "$DOCUMENT_ROOT/head.inc");
?>

By using the document root you are always starting at your site's root on the server, making the paths local. Also, I think you need to reconsider your security concerns: unless there is a bug in your page or a back way into your site, the server should always process the PHP code out of the page before it is delivered.

Larry
