The top-referrer reports are not revealing anything useful, which leads me to believe the activity is caused by some sort of spider or robot that WebTrends and AWStats do not recognize.
Any suggestions for getting to the bottom of this?
Here is just a sample of what is in our logs:
208.179.xx.xx - - [05/Jul/2006:00:01:19 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
146.21.xx.xx - - [05/Jul/2006:00:02:30 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible;)"
80.178.xx.xx - - [05/Jul/2006:00:05:50 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
203.162.xx.xx - - [05/Jul/2006:00:06:09 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
124.106.xx.xx - - [05/Jul/2006:00:12:30 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
218.22.xx.xx - - [05/Jul/2006:00:12:48 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
[edited by: jatar_k at 8:58 am (utc) on July 13, 2006]
[edit reason] no specifics thanks [/edit]
The addresses you listed are all from outside the U.S.
More than likely they are scraping content to be reproduced on another website.
If an IP address never requests the referring pages that lead to the news content, it is going straight to the articles. Check the history of the offending IP addresses in your logs. The worst offenders will never have accessed the common pages normally used to link to the content pages.
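As a rough sketch of that log check, the script below scans combined-format log lines like the ones posted and reports IP addresses that only ever requested content pages with an empty referer and never touched any other page. The function name find_direct_only_ips, the "/news/" prefix (taken from the sample paths), and the min_hits parameter are my own assumptions for illustration, so adjust them to your site.

```python
import re
from collections import defaultdict

# Combined Log Format: ip ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def find_direct_only_ips(lines, content_prefix="/news/", min_hits=1):
    """Return {ip: hits} for IPs whose every request was a content page
    fetched with an empty referer (hypothetical helper, not a standard tool)."""
    content_hits = defaultdict(int)
    disqualified = set()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in combined format
        ip, path, referer = m["ip"], m["path"], m["referer"]
        if path.startswith(content_prefix) and referer in ("-", ""):
            content_hits[ip] += 1
        else:
            # Any other path, or any referer at all, looks like normal browsing.
            disqualified.add(ip)
    return {
        ip: hits
        for ip, hits in content_hits.items()
        if ip not in disqualified and hits >= min_hits
    }
```

To run it over a real log, pass an open file, for example find_direct_only_ips(open("access.log")), where access.log is a placeholder name. Raising min_hits helps separate persistent bots from one-off visitors arriving from a bookmark.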
One small trick is to only allow requests for the news articles that come from your RSS feed or your own site. A properly configured mod_rewrite rule with a referer check can block direct access to the content, shutting out most exploit-seeking bots and scraper sites while still letting ordinary visitors view the material.
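To make the referer rule concrete, here is the decision logic written out in Python. This is not mod_rewrite configuration, only a sketch of the test such a rule would apply. The host names in ALLOWED_HOSTS, the "/news/" prefix, and the function name allow_request are hypothetical. Keep in mind that the Referer header is supplied by the client and can be missing or forged, so this filters out lazy bots rather than determined ones.

```python
from urllib.parse import urlsplit

# Hypothetical values: replace with the hosts that serve your site and feed.
ALLOWED_HOSTS = {"www.example.com", "example.com"}
# Assumption based on the sample log paths in this thread.
PROTECTED_PREFIX = "/news/"


def allow_request(path, referer,
                  allowed_hosts=ALLOWED_HOSTS,
                  protected_prefix=PROTECTED_PREFIX):
    """Return True if the request should be served under the referer rule."""
    if not path.startswith(protected_prefix):
        return True  # index pages and feeds stay open to everyone
    if not referer or referer == "-":
        return False  # direct hit on an article with no referer
    host = urlsplit(referer).hostname  # None if the referer is not a URL
    return host is not None and host in allowed_hosts
```

With these placeholder settings, a request for /news/... carrying a referer of "-" is refused, while the same request arriving from a page on www.example.com is served.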
Most would say this is not a huge problem. Those are the folks who have never watched a group of rogue bots suck 30GB of data EACH in a 24-hour period.