As at 7 Jan 2011, the Internet Archive's "Archive-It" crawler was also crawling/collecting, intentionally ignoring robots.txt Disallow rules, under the UA name:
Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)
That bot visited my site from "ia360934.us.archive.org" (a reverse lookup that matches the Internet Archive's IP range), did request robots.txt, and was served a robots.txt that completely disallowed it[1]. Nevertheless, "archive.org_bot" went on to download other pages on my site.
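Incidentally, for anyone wanting that kind of reverse-lookup verification done for them, Apache can do it itself. A minimal sketch (this goes in httpd.conf, not .htaccess; note it adds DNS latency to every request, so many people prefer to spot-check by hand):

# "Double" does a reverse lookup on the client IP, then a forward
# lookup on the resulting name, and only records the hostname
# (%h in the logs) if the original IP appears in the forward
# results - i.e. spoofed reverse DNS never shows up in your logs.
HostnameLookups Double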
My research into why it may have disobeyed my robots.txt rules turned up the following:
"Starting January 2010, Archive-It is running a pilot program to test a new feature that allows our partners to crawl and archive areas of sites that are blocked by a site's robots.txt file."
[webarchive.jira.com...]
And on a related page they say the Archive-It crawler's name is "archive.org_bot", and:
"What should I do if a webpage I want to archive excludes Archive-It's crawler?
If you have asked a webmaster to unblock the Archive-It crawlers in the robots.txt file, but they will not do so, Archive-It is testing a new option (starting January 2010) that allows Archive-It partners to ignore the robots.txt block for specific sites. This feature will be made available to partners on a one-by one basis, so if you need to run crawls ignoring robots.txt, please get in touch with the Archive-It team and let us know what sites and why this is necessary. The feature will then be turned on for your account, and you will be able to specify which sites this should apply to."
[webarchive.jira.com...]
Probably needless to say around here: that bot will get a 403 Forbidden if it tries to visit anything on my site other than robots.txt in future, and I'm close to deciding to return 403 to the Internet Archive's entire IP range. (As some others have more or less remarked, I really can't comprehend how they think they can get away with what looks like blatant, specifically unauthorised copyright infringement, though I suppose they reckon most webmasters/authors/content providers can't afford to sue them.)
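In case it's useful to anyone doing the same, here's a minimal .htaccess sketch of that kind of block (Apache 2.2 syntax; the UA pattern comes from the string quoted above, and the hostname match is my assumption based on the "us.archive.org" reverse lookup, so adjust to taste):

# Flag any request whose UA contains "archive.org_bot"
SetEnvIfNoCase User-Agent "archive\.org_bot" bad_bot

# Deny flagged requests. "Deny from .archive.org" matches by a
# double-reverse DNS lookup on the client IP, so it also covers
# their IP range wherever reverse DNS resolves under archive.org.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from .archive.org

# Still serve robots.txt to everything, so the disallow rules
# stay visible to the bot.
<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>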
BTW, I don't think I've posted here before, or if I did it was way back around 2006 - mainly because until now I've never had anything to report about bad bot behaviour etc. that someone else hadn't already posted.
I'll also take the opportunity to say thanks *very much* to the numerous participants in this forum who've been so generous and helpful over the last five or so years with writing Apache rewrite rules and the like, providing blocking scripts, and identifying bad bots. I'm not naming anyone, because I'd most likely inadvertently leave out someone who's posted very helpful info.
[1] For clarity: my site's robots.txt has been generated on the fly by a script since about 2006, and it presents "allowed" rules only to (an unpublished list of) whitelisted crawlers/bots. All the others are served a robots.txt that tells them they're completely disallowed.
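For anyone who'd like the same effect without writing a full script, a rough mod_rewrite equivalent is to keep two static files and serve the "disallowed" one to any UA not on your whitelist. A sketch (the whitelist names and file name here are placeholders, not my actual unpublished list):

# .htaccess: send non-whitelisted UAs to the deny-all version
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot) [NC]
RewriteRule ^robots\.txt$ /robots-disallowed.txt [L]

# where /robots-disallowed.txt contains just:
# User-agent: *
# Disallow: /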