Forum Moderators: DixonJones


How to determine the source of strange activity

         

cschults

9:42 pm on Jul 12, 2006 (gmt 0)

10+ Year Member



We use WebTrends, AWStats and Google Analytics. Our log-based analytics programs (WebTrends and AWStats) are reporting an unusually high number of requests (10,000+/month) for a story we published back in 2001, while Google Analytics is reporting the expected amount (fewer than 50/month).

The top referrers reports reveal nothing useful, which leads me to believe the activity comes from some sort of spider or robot that WebTrends and AWStats do not recognize as one.

Any suggestions for getting to the bottom of this?

Here is just a sample of what is in our logs:

208.179.xx.xx - - [05/Jul/2006:00:01:19 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
146.21.xx.xx - - [05/Jul/2006:00:02:30 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible;)"
80.178.xx.xx - - [05/Jul/2006:00:05:50 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
203.162.xx.xx - - [05/Jul/2006:00:06:09 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
124.106.xx.xx - - [05/Jul/2006:00:12:30 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
218.22.xx.xx - - [05/Jul/2006:00:12:48 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

[edited by: jatar_k at 8:58 am (utc) on July 13, 2006]
[edit reason] no specifics thanks [/edit]

oxbaker

3:08 am on Jul 13, 2006 (gmt 0)

10+ Year Member



Maybe it's some other site related to that particular article, perhaps with a spider that checks whether the article links they post on their site remain current. Are all the hits coming from related IPs?
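One rough way to answer the "related IPs" question is to tally the requests for the article by network prefix. The sketch below assumes the logs are in Apache combined format, as in the sample posted above. It uses the first two octets of the address as a crude stand-in for "same network", which is an assumption rather than a real ownership lookup. The function name `hits_by_prefix` is made up for illustration.

```python
import re
from collections import Counter

# Apache combined log format. The host field is matched loosely so that
# masked addresses such as 208.179.xx.xx still parse.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def hits_by_prefix(lines, path):
    """Count requests for `path`, keyed by the first two octets of the client address."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group("path") == path:
            prefix = ".".join(m.group("host").split(".")[:2])
            counts[prefix] += 1
    return counts

# Two of the lines quoted earlier in the thread, used as sample input.
ARTICLE = "/news/maindish/2001/08/30/right/"
SAMPLE = [
    '208.179.xx.xx - - [05/Jul/2006:00:01:19 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"',
    '146.21.xx.xx - - [05/Jul/2006:00:02:30 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0 (compatible;)"',
]

for prefix, count in hits_by_prefix(SAMPLE, ARTICLE).most_common():
    print(prefix, count)
```

If a month of logs shows thousands of hits spread thinly over many unrelated prefixes, a single link-checking spider becomes a less likely explanation.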

TXGodzilla

5:11 am on Jul 13, 2006 (gmt 0)

10+ Year Member



Copy & paste the IP address into the ARIN Whois page to determine the source of the traffic.
[arin.net...]

The addresses you listed are all from outside the U.S.
More than likely they are scraping content to be reproduced on another website.

If the same IP address does not access referring pages before reaching the news content, it is going straight to the article. Check the history of the offending IP addresses in your logs. The worst offenders will never have accessed the common pages that normally link to the content pages.
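The history check can be roughly automated. The sketch below assumes Apache combined-format logs and lists every client that requested the article but never requested any other path. The function name `direct_only_clients` and the sample addresses in the usage lines are hypothetical.

```python
import re
from collections import defaultdict

# Only the client address and the request path are needed for this check.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
)

def direct_only_clients(lines, article_path):
    """Return {host: hit_count} for clients that requested article_path and nothing else."""
    paths_by_host = defaultdict(set)
    article_hits = defaultdict(int)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        host, path = m.group("host"), m.group("path")
        paths_by_host[host].add(path)
        if path == article_path:
            article_hits[host] += 1
    return {
        host: count
        for host, count in article_hits.items()
        if paths_by_host[host] == {article_path}
    }

# Hypothetical sample: one client browses normally, the other hits only the article.
ARTICLE = "/news/maindish/2001/08/30/right/"
SAMPLE = [
    '192.0.2.10 - - [05/Jul/2006:00:01:00 -0700] "GET /news/ HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
    '192.0.2.10 - - [05/Jul/2006:00:01:05 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0"',
    '198.51.100.7 - - [05/Jul/2006:00:02:00 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0"',
    '198.51.100.7 - - [05/Jul/2006:00:09:00 -0700] "GET /news/maindish/2001/08/30/right/ HTTP/1.1" 200 27091 "-" "Mozilla/4.0"',
]

print(direct_only_clients(SAMPLE, ARTICLE))
```

A real browser normally also fetches images and stylesheets, so a client whose only requests are for the article itself is a reasonable candidate for a closer look, though not proof of scraping.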

One small trick is to allow requests for the news articles only when they come from your RSS feed or your own site. A properly configured mod_rewrite rule with a referrer check can block direct access to the content, which keeps out most exploit-seeking bots and scrapers while still letting ordinary visitors view the material.
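The decision such a referrer check makes can be sketched in Python before it is translated into a rewrite rule. This is not mod_rewrite syntax, only an illustration of the logic. The host names in `ALLOWED_HOSTS` and the function name `referer_allowed` are placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical list of hosts whose pages are allowed to link to the articles.
ALLOWED_HOSTS = {"www.example.com", "example.com"}

def referer_allowed(referer, allowed_hosts=ALLOWED_HOSTS):
    """Return True when the Referer header names one of the allowed hosts.

    An empty or "-" referer (a direct request, as in the log sample) is rejected.
    """
    if not referer or referer == "-":
        return False
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        return False  # malformed URL in the header
    return host is not None and host in allowed_hosts

print(referer_allowed("http://www.example.com/news/"))  # request arriving from our own site
print(referer_allowed("-"))                             # direct request with no referrer
```

Keep in mind that the Referer header is supplied by the client and is easy to forge, and that some legitimate visitors and feed readers send none, so a check like this filters casual bots rather than determined ones.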

Most would say this is not a huge problem. Those are the folks who have never watched a group of rogue bots suck 30GB of data EACH in a 24-hour period.