Forum Moderators: phranque
I realize there are many ways to spoof and still get my content, but I figure anyone smart enough to beat the more advanced techniques is smart enough to realize there are easier and safer ways to make money off the internet.
Can someone please give me a gameplan to fend off these losers?
Here is my latest user agent report:
User agent               Hits       Share
MSIE                     8472334    73.879%
Mozilla                  1650366    14.391%
Mozilla (compatible)     337238     2.941%
msnbot                   245095     2.137%
Wget                     187283     1.633%
Opera                    136936     1.194%
Teleport Pro             84723      0.739%
Googlebot                83936      0.732%
Java1.3.1                79265      0.691%
Mediapartners-Google     72397      0.631%
WebStripper              22580      0.197%
WebCopier v4.0           11125      0.097%
ia_archiver              10936      0.095%
InternetSeer.com         9075       0.079%
WebSnatcher              5514       0.048%
SiteSucker               4555       0.040%
Anarchie                 4017       0.035%
Pompos                   3570       0.031%
psbot                    3397       0.030%
FlashGet
Then you only have to worry about letting the legitimate bots through, which can be done by IP or by their UA.
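As a rough illustration of the "by IP or by UA" idea, here is a minimal Python sketch. The function names and both token lists are made up for this example (the tokens are simply taken from the report earlier in the thread), so trim them to your own judgement. The IP check assumes that genuine Googlebot addresses reverse-resolve to a host under googlebot.com or google.com and that the host resolves back to the same IP. Keep in mind that a UA match on its own is trivially spoofed.

```python
import socket

# Illustrative lists only: the names come from the user agent report in this thread.
BLOCKED_UA_TOKENS = ("wget", "teleport pro", "webstripper", "webcopier",
                     "websnatcher", "sitesucker")
ALLOWED_BOT_TOKENS = ("googlebot", "mediapartners-google", "msnbot")


def classify_user_agent(user_agent):
    """Return 'block', 'bot', or 'other' using case-insensitive substring matches."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return "block"
    if any(token in ua for token in ALLOWED_BOT_TOKENS):
        return "bot"
    return "other"


def is_verified_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an IP that claims to be Googlebot.

    The lookup functions are parameters so they can be swapped out for testing.
    """
    try:
        hostname = reverse_lookup(ip)[0]
        # Assumption: real Googlebot hosts live under these two domains.
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # The hostname must resolve back to the IP we started from.
        return ip in forward_lookup(hostname)[2]
    except OSError:
        # Any DNS failure is treated as "not verified".
        return False
```

A request whose UA classifies as "bot" but whose IP fails the DNS check is a good candidate for blocking, since it is claiming an identity it cannot back up.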
a comprehensive robots.txt that knocks off all the major scraper programs?
Try: [webmasterworld.com...] ;)
Bear in mind that many scraper programs ignore robots.txt completely, so you will still need to look at IP banning etc. as well as the above. Also, don't just copy the list - you need to make your own judgement about the bots listed.
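To make the two layers concrete, here is a small Python sketch: one helper writes robots.txt records for whichever agents you decide to exclude, and another checks a client IP against a ban list for the programs that ignore robots.txt anyway. The function names are invented for this example, and the networks shown in the usage lines are reserved documentation ranges, not real offenders.

```python
import ipaddress


def build_robots_txt(user_agents):
    """Return robots.txt text that disallows the whole site for each named agent."""
    records = [f"User-agent: {name}\nDisallow: /\n" for name in user_agents]
    # A blank line separates one record from the next.
    return "\n".join(records)


def is_banned(client_ip, banned_networks):
    """Return True if client_ip falls inside any network given in CIDR notation."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(net) for net in banned_networks)


# Hypothetical usage: agent names from the report above, placeholder IP ranges.
print(build_robots_txt(["WebStripper", "SiteSucker"]))
print(is_banned("192.0.2.44", ["192.0.2.0/24", "198.51.100.0/24"]))
```

In practice the ban itself would normally live in the web server or firewall configuration; the check here only shows the matching logic.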
Too slow ;)
Seeing all these scrapers in my logs with their stock IDs makes me think that these thieves are especially stupid, since by not spoofing right off the bat they are calling my attention to taking protective measures.
Maybe a no-go from the start will be enough to dissuade a percentage.
<meta content="ARCHIVE" name="ROBOTS">
<meta content="ALL,INDEX,FOLLOW" name="ROBOTS">
In both cases these meta tags are specifying the default for all robots: in other words, if you want the bots to index, follow and archive, you don't need to specify it. Those tags are therefore not required at all.
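If you want to check your own pages for tags like these, a quick Python sketch could flag a robots meta tag whose content only restates the defaults. The helper name is made up, and the set of "default" directives simply follows the point made above (index, follow and archive are what bots do anyway), so treat it as an assumption rather than a specification.

```python
# Directives that only restate default behaviour, per the explanation above.
DEFAULT_DIRECTIVES = {"all", "index", "follow", "archive"}


def is_redundant_robots_meta(content):
    """Return True if every directive in the content attribute is a default."""
    directives = {part.strip().lower() for part in content.split(",") if part.strip()}
    # An empty content attribute is not treated as redundant, just as empty.
    return bool(directives) and directives <= DEFAULT_DIRECTIVES


print(is_redundant_robots_meta("ARCHIVE"))           # True
print(is_redundant_robots_meta("ALL,INDEX,FOLLOW"))  # True
print(is_redundant_robots_meta("NOINDEX,FOLLOW"))    # False
```

Only restrictive values such as NOINDEX, NOFOLLOW or NOARCHIVE change anything, so those are the tags worth keeping.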