I found this recent entry in my log file:
80.252.XXX.XX - - [26/Feb/2004:23:57:21 -0800] "GET /robots.txt HTTP/1.1" 403 - "-" "CCGCrawl/1.1-dev (CCGCrawl 1.1; example.com; email@example.com)"
It was blocked (403) because the IP falls within a range of addresses I had already blocked.
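For anyone who wants to pull entries like this out of their own logs, here is a minimal sketch in Python. It assumes the log uses the Apache "combined" format, which the entry above appears to follow; the names LOG_PATTERN, parse_entry and was_blocked are made up for illustration.

```python
import re

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_entry(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def was_blocked(entry):
    """True when the server answered the request with 403 Forbidden."""
    return entry["status"] == "403"


# The entry quoted above, with the IP still masked as posted.
line = ('80.252.XXX.XX - - [26/Feb/2004:23:57:21 -0800] '
        '"GET /robots.txt HTTP/1.1" 403 - "-" '
        '"CCGCrawl/1.1-dev (CCGCrawl 1.1; example.com; email@example.com)"')

entry = parse_entry(line)
print(entry["agent"], entry["request"], was_blocked(entry))
```

Filtering a whole log for a given user agent is then just a loop over the lines that checks entry["agent"].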
Investigating myworkbase.com revealed that this is custom crawler software. The site does not specify whether it grabs images (often copyrighted) or whether it obeys robots.txt or robots meta tags. I left them feedback suggesting they provide info on those topics on their site and ensure the crawler complies with them, and I also enquired whether it crawls at a reasonable speed.
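To illustrate what "obeys robots.txt" would mean in practice, here is a rough sketch using Python's standard urllib.robotparser. The robots.txt rules below are hypothetical (they are not from my site), and nothing here says how CCGCrawl actually behaves; it only shows the check a compliant crawler is expected to make before fetching a URL.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out CCGCrawl entirely,
# and keep every other bot away from the images directory.
ROBOTS_TXT = """\
User-agent: CCGCrawl
Disallow: /

User-agent: *
Disallow: /images/
"""


def build_parser(robots_text):
    """Load robots.txt rules from a string instead of fetching them over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant crawler asks this question before every request.
ccg_allowed = parser.can_fetch("CCGCrawl/1.1-dev", "/index.html")
other_images = parser.can_fetch("SomeOtherBot", "/images/logo.png")
other_page = parser.can_fetch("SomeOtherBot", "/index.html")

print(ccg_allowed, other_images, other_page)  # False False True
```

This covers robots.txt only. Robots meta tags and crawl speed are separate matters that the crawler has to handle itself.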
This is just a heads-up. If any versions of CCGCrawl are already in public use, they may not be compliant, but since it could not access my site because of the IP block in .htaccess, I have no way to confirm that. Possibly the software company will provide details, and may ensure compliance at least in later versions should this one fall short.
This crawler bot could be used by anyone for any purpose.
[edited by: engine at 12:41 pm (utc) on Mar. 1, 2004]
[edit reason] specifics removed [/edit]