|Data reseller bot shortlist?|
I've been managing an htaccess blacklist of bots with varying degrees of success. Unfortunately, I tend to add a bot after it has done the deed, so to speak. Lists like the one on askapache helped start the list, but it has evolved over time. [askapache.com...]
I also manage a whitelist of bots in robots.txt, though most bad bots don't bother reading it, much less complying.
I've decided that I want to add a few reputable bots to my htaccess blacklist because I've caught them ignoring my robots.txt. Not only that but it dawned on me that these "reputable SEO data gathering resellers" are selling my stats to my competitors.
I'm ready to toss the entire range of SEO bots to the curb, but I only know of a few. For example, ezoom (seomoz), ahrefsbot and sitebot don't bring me traffic, but they get my stats sold on services I may not be using for this site.
Is there a complete list of known SEO data gathering site bots? Also, since they may not always declare themselves, is there a list of IP ranges for these types of data reseller SEO sites?
I love the services, and the companies behind them, but aggregating and selling my stats to my competitors? Pass.
"Third party" services are a broader category than SEO data gathering services.
You need to consider content filters, universities and more to complete the category (even the K-12 providers are third party and not always beneficial).
I'm not aware of existing lists; keeping them updated would be a lost cause.
Most of us block unfavorable companies by IP range along with a UA list of tools (downloaders, email address scrapers, etc) that anyone can use.
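For anyone who wants to test that kind of rule set outside htaccess, here is a rough Python sketch of the same idea: deny by IP range first, then by User-Agent token. The ranges are reserved documentation addresses used as placeholders, and the function name and token list are my own assumptions (the tokens are just the bot names mentioned earlier in this thread), not a vetted block list.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real offenders.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Hypothetical UA tokens, taken from the bot names mentioned in this thread.
BLOCKED_UA_TOKENS = ("ahrefsbot", "sitebot", "ezoom")


def is_blocked(ip: str, user_agent: str) -> bool:
    """Return True if the request matches a blocked IP range or UA token."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLOCKED_RANGES):
        return True
    # UA strings are matched case-insensitively as simple substrings.
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)
```

The IP check runs first because a tool can send any UA string it likes, while the range it connects from is harder to fake.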
For a true whitelist, you keep a list of allowed browsers and bots (verified by IP range and/or header info) and then block everything else. This takes a higher level of programming skill.
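A minimal sketch of that default-deny logic in Python, just to show the shape of it. Everything named here is an assumption for illustration: "examplebot" and its range are made up, the browser tokens and the two required headers are only one possible heuristic, and the headers are assumed to arrive as a plain dict with these exact key spellings.

```python
import ipaddress

# Hypothetical allowed bot: UA token -> IP ranges it must come from.
ALLOWED_BOTS = {
    "examplebot": [ipaddress.ip_network("192.0.2.0/24")],
}

# Hypothetical browser tokens; a real list would need more care.
BROWSER_TOKENS = ("firefox/", "chrome/", "safari/")


def is_allowed(ip: str, headers: dict) -> bool:
    """Allow verified bots and plausible browsers; deny everything else."""
    ua = headers.get("User-Agent", "").lower()
    addr = ipaddress.ip_address(ip)

    # A claimed bot is accepted only from its known ranges.
    for token, nets in ALLOWED_BOTS.items():
        if token in ua:
            return any(addr in net for net in nets)

    # A claimed browser must also send headers real browsers normally send.
    looks_like_browser = any(t in ua for t in BROWSER_TOKENS)
    has_browser_headers = "Accept" in headers and "Accept-Language" in headers
    return looks_like_browser and has_browser_headers
```

Note that a request claiming to be an allowed bot from the wrong range is rejected outright rather than falling through to the browser check.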
Difficult to post a list since not all are unfavorable to everyone.
The lists don't stay static. When the bots find themselves ineffective because of the number of entities blocking them, they move. The number of servers that want to host unsavory operators is not unlimited so they do sort of gravitate to certain servers that you'll find discussed here. If you add other methods to this it is more effective.
I use a trap I got here in WW Forums to catch robots that don't follow robots.txt instructions, and I use those catches to look up IP ranges to block. I get others from this most helpful forum, and I use my access logs to see who is misbehaving and needs to go away. If you are using a Mac, there is a handy app that gives you whois services. You will have a more useful list if you build it yourself, but even your own list needs to be updated once in a while.
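This is not the trap script from the forum, just a rough Python sketch of the log-checking half of the idea: scan access log lines for requests to a path that robots.txt disallows, and collect the IPs that went there anyway. The trap path `/bot-trap/` is hypothetical, and the regex assumes Apache's common/combined log layout.

```python
import re

# Hypothetical trap path; it would be listed under Disallow in robots.txt
# and linked somewhere only a crawler would follow.
TRAP_PREFIX = "/bot-trap/"

# Start of a common/combined log line: client IP, two fields, [time], "METHOD path
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)')


def trapped_ips(log_lines):
    """Return the set of client IPs that requested anything under the trap."""
    hits = set()
    for line in log_lines:
        m = LOG_RE.match(line)
        if m and m.group(2).startswith(TRAP_PREFIX):
            hits.add(m.group(1))
    return hits
```

Each IP that comes back is a candidate for a whois lookup to find the range it belongs to.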
I don't think there is a complete list of those, but what I started doing on a weekly basis several years back was to collect the IP addresses of the sites that the info is hosted on and add the IP ranges to a block list.
|When the bots find themselves ineffective because of the number of entities blocking them, they move. |
Reputable ones seem to stick with reputable hosting companies. Scrapers seem to move around but are easier to catch based on headers. Catch an IP scraping, then block that range and all the other ranges that belong to the hosting company. It is a never-ending task until one starts understanding how things work on the darker side.
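Once several ranges for the same host pile up, the block list gets easier to maintain if overlapping and adjacent ranges are merged. A small Python sketch of that housekeeping step, assuming the ranges were already looked up by hand (whois or similar) and are all IPv4; the function name is my own.

```python
import ipaddress


def merge_ranges(cidrs):
    """Collapse overlapping or adjacent CIDR ranges into the shortest list.

    Assumes every entry is the same IP version; mixing IPv4 and IPv6
    makes collapse_addresses raise TypeError.
    """
    # strict=False lets an entry like 192.0.2.17/24 be read as 192.0.2.0/24.
    nets = [ipaddress.ip_network(c, strict=False) for c in cidrs]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]
```

For example, two adjacent /25 ranges and a single address already covered by a /24 shrink to two entries, using documentation addresses as placeholders:

```python
merge_ranges(["192.0.2.0/25", "192.0.2.128/25",
              "198.51.100.0/24", "198.51.100.17/32"])
# returns ['192.0.2.0/24', '198.51.100.0/24']
```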
Get a few forums going out there, spread the word that comment spamming is good to go, set up some traps, collect all the data you can, learn, learn, learn, then start blocking IP ranges on your main sites.
Visit STOPFORUMSPAM, ProjectHoneyPot, ... and get into their APIs, lovely stuff...
Lot of work, but PAYS to discover :)