Recently someone posted a simple question that we haven't addressed lately:
WHAT'S ALL THIS STUFF CRAWLING MY SITE?
Here's a list I came up with a while back that probably needs updating.
- Intelligence gathering Spybots
- Copyright Compliance
- Branding Compliance
- Corporate Security Monitoring
- Media Monitoring (mp3, mpeg, etc.)
- Myriad of Safe-Site Monitoring solutions
- Government monitoring solutions
- Content Scrapers (pure theft)
- Data Aggregators
- Link Checkers
- Privacy Checkers
- Web Copiers/Downloaders
- Offline Web Browsers
- Many open-source crawlers, e.g. Nutch and Heritrix
Before anyone gets all paranoid: the government monitoring I mentioned is done by 3rd parties that mine data and sell their reports to various agencies, which is completely legal without a warrant because the reports already exist. <wink> <wink> <nod> <nod>
Anyone got anything else to add to the list that I'm overlooking?
To highlight the problem, I have a domain that is basically a honeypot for bots, set up just to see what would hit it and to log everything I could detect that wasn't human. That's pretty easy, as no humans visit that site.
Here's the latest update:
I have to link to the report as it's just too big to import into WebmasterWorld.
Note all the highlighted entries that look like browsers. Those would slip past most user agent blacklisting, which is why data center blocking is the only way to stop that nonsense.
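For anyone wondering what data center blocking looks like in practice, here's a rough sketch in Python. It's not my actual setup, just an illustration: the ranges in DATACENTER_RANGES are placeholder documentation blocks (not real hosting ranges), and the names is_datacenter_ip and should_block are made up for the example. The point is that the decision is made on the source IP, so a bot wearing a browser user agent still gets caught.

```python
import ipaddress

# Placeholder ranges for illustration only (reserved documentation blocks).
# A real list would hold the CIDR ranges of the hosting providers you block.
DATACENTER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_datacenter_ip(addr: str) -> bool:
    """Return True if addr falls inside any listed data center range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in DATACENTER_RANGES)


def should_block(addr: str, user_agent: str) -> bool:
    """Decide on the source IP alone.

    The user agent is deliberately ignored, since a bot can send
    any browser string it likes.
    """
    return is_datacenter_ip(addr)


if __name__ == "__main__":
    browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    # Looks like a browser, but comes from a listed range.
    print(should_block("192.0.2.15", browser_ua))
    # Same user agent from an address outside the listed ranges.
    print(should_block("203.0.113.9", browser_ua))
```

A user agent blacklist would treat both of those requests the same, since the string is identical. The IP check is what tells them apart.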
Hope this answers the question of what's crawling on your site and why.