You are correct.
I think web-site owners have been getting too complacent about protecting their content against scrapers, whether or not those scrapers claim to be search engines of one kind or another.
Everyone and their mother (and sisters and 6-month-old baby brothers) today thinks they should create another data-collector site, competitor-spying site, link-tracker site, "monitor what is said about you" site, or something similar. Whatever they "claim" to be at this particular time is usually a lie on their web-site, if the claim can be found at all. The problem is that we never really know what they will do, now or in the future, with all that content and information.
Anonymous agents DO NOT get onto my servers.
SetEnvIf User-Agent "^$" bad_bot="NoAgentString~impersonator"
That line, combined with various types of blocks keyed on the bad_bot environment variable, kills off the blank agent strings.
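For anyone wondering what one of those blocks might look like, here is a minimal sketch, assuming Apache 2.4 with mod_setenvif and mod_authz_core loaded. The directory path is a placeholder of mine, not something from my actual config, and this is only one of several ways to act on the variable.

```apache
# Flag any request whose User-Agent header is empty.
SetEnvIf User-Agent "^$" bad_bot="NoAgentString~impersonator"

# Hypothetical document root; substitute your own path.
<Directory "/var/www/html">
    <RequireAll>
        # Let everyone in by default...
        Require all granted
        # ...except requests that were flagged with bad_bot.
        Require not env bad_bot
    </RequireAll>
</Directory>
```

On an older Apache 2.2 setup the equivalent would be along the lines of "Order Allow,Deny", "Allow from all", and "Deny from env=bad_bot". Either way the flagged request gets a 403 instead of your content.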
If you do not even want to tell me who you are, you do not get in.
Similarly, if you are too stupid to change the Agent-String from the default string in the public code library you used to create yet another scraper, you do not get in either.
All the "Jakarta Commons-HttpClient", "Zend_Http_Client", "RomeClient", and other anonymous bots get killed too. Anyone wanting to scrape without an accurate Agent String gets killed.
So do all the human impersonators (crawlers with only "human" type strings), when caught.
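As a rough sketch of how the default library strings named above could be flagged, assuming mod_setenvif and an existing block keyed on bad_bot: the label "DefaultLibraryAgent" is just an illustrative value I made up for this example, and the patterns are unanchored substring matches, so tighten them if they catch something you want to let in.

```apache
# Flag clients that still send the stock User-Agent of a public HTTP library.
# SetEnvIfNoCase matches case-insensitively; the value is a hypothetical label.
SetEnvIfNoCase User-Agent "Jakarta Commons-HttpClient" bad_bot=DefaultLibraryAgent
SetEnvIfNoCase User-Agent "Zend_Http_Client" bad_bot=DefaultLibraryAgent
SetEnvIfNoCase User-Agent "RomeClient" bad_bot=DefaultLibraryAgent
```

Setting a different value per rule makes it easier to see in the logs which rule caught a given request.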
If you want to steal other people's content, the least you can do is say who you are, and how letting you steal it will help the site owner.
The higher the percentage of site owners that block them, the less value their databases will have.
Missing information and missing sites dilute the value of those datasets and will cost them customers.