Is this common useragent a scraper bot

Forum Moderators: DixonJones

I am confused as always

snooprock

5:35 am on Aug 2, 2006 (gmt 0)

Ok folks, I am sorry, as I know this has been covered time and time again, but I was hoping someone could break this down for me. I am getting the following useragent, which hit every one of my pages about a second apart.

Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)

It was clearly a spider, but the problem is that a lot of my real traffic uses the same or a similar useragent. I am not very savvy with all this stuff, and the more I read the more confused I get. Can anyone kindly put this in layman's terms for me? This is a new site, and I am concerned these thieving scrapers are going to get my content indexed before I do, and thus I will be the one that gets penalized for dupe content. Any insight is really, really appreciated.

oxbaker

7:58 pm on Aug 2, 2006 (gmt 0)

Most likely it's a bot on a Windows system, using IE as the mechanism. But you're right, that is a very typical user agent for anyone running Windows with IE 6.

You can filter bots from real users by request rate (basically, if the same user hits 1000 pages in 5 minutes you can be damn sure it's a robot; no one can read that fast). But it's harder with these agents because they resemble normal "human" user agents.
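The rate check described above can be sketched roughly like this. It is only a minimal illustration, assuming you can pull (timestamp, client) pairs out of your access log and that "same user" means same IP address; the function name and the thresholds are made up for the example (the defaults mirror the 1000 hits in 5 minutes figure).

```python
from collections import defaultdict, deque

# Hypothetical defaults, mirroring the "1000 hits in 5 minutes" rule of thumb.
WINDOW_SECONDS = 5 * 60
MAX_HITS = 1000


def find_fast_clients(hits, window=WINDOW_SECONDS, max_hits=MAX_HITS):
    """Return the clients that made max_hits or more requests within `window` seconds.

    hits: iterable of (timestamp_seconds, client_id) pairs, sorted by timestamp.
    client_id is assumed to be an IP address, since the user agent alone
    is shared by many real visitors.
    """
    recent = defaultdict(deque)  # client_id -> timestamps still inside the window
    flagged = set()
    for ts, client in hits:
        q = recent[client]
        q.append(ts)
        # Drop hits that have fallen out of the sliding window.
        while q and ts - q[0] > window:
            q.popleft()
        if len(q) >= max_hits:
            flagged.add(client)
    return flagged
```

A visitor hitting a page every second, like the one in the first post, would trip a much lower threshold as well, e.g. `find_fast_clients(hits, window=60, max_hits=30)`. Where to set the numbers is a judgment call for your own traffic.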

snooprock

1:42 pm on Aug 3, 2006 (gmt 0)

Thanks for the reply mate. I wish there was a way to simply allow Google, Yahoo and MSN bots only while blocking all others via .htaccess. It just seems like we go about it backwards, having to block all the bots we don't want instead of just allowing Google, Yahoo and MSN exclusively while blocking all others. It just seems silly to me, but I don't know much about this stuff. Thanks again.

gregbo

8:53 pm on Aug 3, 2006 (gmt 0)

"Thanks for the reply mate. I wish there was a way to simply allow google, yahoo and msn bots only while blocking all others via htaccess."

Wouldn't you want to be included in as many engines as possible, in case they become very popular? In general, there should be no need to exclude crawlers, unless they are using up too much of your resources, or unscrupulously scraping your content.
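For anyone who still wants to try the allow-only-these-crawlers idea from earlier in the thread, here is a rough sketch of the matching logic. The token list and function name are assumptions for illustration, not an official list, and a user agent string is trivially faked, so this only recognizes clients that announce themselves as one of these crawlers.

```python
# Assumed substrings that the Google, Yahoo and MSN crawlers send in their
# User-Agent headers; illustrative only, check each engine's own docs.
ALLOWED_CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")


def is_allowed_crawler(user_agent):
    """Return True if the user agent claims to be one of the allowed crawlers."""
    ua = user_agent.lower()
    return any(token in ua for token in ALLOWED_CRAWLER_TOKENS)


# The agent from the first post is not recognized as a crawler, but it is
# also indistinguishable from a real IE 6 visitor by user agent alone.
print(is_allowed_crawler(
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; "
    ".NET CLR 1.0.3705; .NET CLR 1.1.4322)"
))
```

Note what this cannot do: anything that is not on the list might be a person or a bot pretending to be a browser, so blocking "all others" on user agent alone would block your real visitors too.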