Forum Moderators: DixonJones
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)
It was clearly a spider, but the problem is a lot of my real traffic uses the same or a similar user agent. I am not very savvy with all this stuff, and the more I read the more confused I get. Can anyone kindly put this in layman's terms for me? This is a new site and I am concerned these thieving scrapers are going to get my content indexed before I do, and thus I will be the one that gets penalized for dupe content. Any insight is really appreciated.
You can filter bots from real users by looking at request rate (basically, if the same user hits 1000 pages in 5 minutes, you can be damn sure it's a robot; no one can read that fast). But it's harder with these agents because they resemble normal "human" user agents.
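A minimal sketch of that rate-based idea, assuming you have access to per-request IPs and timestamps (e.g. from a log or inside your app). The thresholds and class name here are illustrative, not a standard:

```python
from collections import defaultdict, deque

class RateDetector:
    """Flag an IP as a likely bot if it makes more than max_hits
    requests within a sliding window of `window` seconds.
    Thresholds are illustrative (1000 hits / 5 minutes)."""

    def __init__(self, window=300, max_hits=1000):
        self.window = window
        self.max_hits = max_hits
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, timestamp):
        """Record one request; return True if the IP now looks like a bot."""
        q = self.hits[ip]
        q.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while q and timestamp - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_hits
```

This only catches high-volume scrapers; a polite scraper crawling slowly will slip under any rate threshold, which is why the user-agent question comes up at all.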
Thanks for the reply, mate. I wish there was a way to simply allow the Google, Yahoo and MSN bots only while blocking all others via .htaccess.
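For what it's worth, allowing *only* the big three by user agent would also lock out every human browser, so the practical .htaccess pattern is the reverse: deny known bad bots by user-agent string and allow everyone else. A sketch using mod_setenvif (the bot names below are examples, not a vetted list, and user agents are trivially spoofed):

```apache
# Illustrative .htaccess sketch (Apache with mod_setenvif).
# Tag requests from known scraper/downloader agents...
SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "WebZIP" bad_bot
SetEnvIfNoCase User-Agent "HTTrack" bad_bot

# ...then deny tagged requests, allow everything else.
<Limit GET POST>
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Limit>
```

Since the scraper in your log is faking a normal MSIE user agent, this won't catch it; UA filtering only stops bots honest enough to identify themselves.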
Wouldn't you want to be included in as many engines as possible, in case they become very popular? In general, there should be no need to exclude crawlers, unless they are using up too much of your resources, or unscrupulously scraping your content.