So archive.org_bot requests robots.txt, where it is disallowed (as is its partner in crime, ia_archiver), then continues to crawl through my site, including requests for CSS and image files. Oh, and I forgot to mention: my domain has been excluded (by me) from the Internet Archive since the beginning of time.
|Webmasters: User Agent archive.org_bot is used for our wide crawl of the web. It is designed to respect robots.txt and META robots tags. Read more about Robots.txt. We try to crawl at a pace slow enough to not disrupt normal web activity. You can learn more in the Wayback Machine FAQs. If you notice archive.org_bot behaving badly, please contact us at bot.archive.org |
Yeah right... so now blocked by IP ranges.
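For anyone wanting to confirm that their own robots.txt really does exclude these two agents before blaming the bot, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt text and the `is_allowed` helper are assumptions made up for illustration, not the actual file from the site discussed above, and the check uses the bare agent tokens rather than a full User-Agent header.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration):
# a full-site disallow for the two agents named in this thread.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver
Disallow: /
"""


def is_allowed(user_agent: str, path: str = "/") -> bool:
    """Return True if robots.txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Bare agent tokens are used here, not full User-Agent header strings.
    for agent in ("archive.org_bot", "ia_archiver", "SomeOtherBot"):
        print(agent, is_allowed(agent, "/images/logo.png"))
```

This only tells you what a compliant crawler should do with the file. It says nothing about whether a given bot actually honors it, which is the complaint here.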
This thing has been a PITA for webmasters since its inception. They've never honored robots.txt, despite what they claim.
Initially, there was a belief that it could be used to document copyright and resolve plagiarism issues; however, that doesn't hold water. There also used to be a disclaimer on their website that they sold data by the terabyte, although that practice was stopped some years back.
A decade ago, they used a handful of IP ranges, which I failed to make note of when adding them to my denies.
At one point, I thought it would be a sound administrative decision to grant them access.
Unfortunately, I couldn't locate all their IPs, either in my own websites' configuration or on their website, in order to remove the "deny access" label. Over multiple communications, they just couldn't understand, nor did they ever provide all the IP ranges they spider from so that I could remove the denies.
In the end, I'm glad it worked out that way.
There are three major widget websites whose traffic is leaps and bounds beyond that of all other widget sites. Until a couple of years ago, two of the majors had all their old pages on archive.org; however, once those two implemented denies, all that old material disappeared from public access, which is what the majors intended.
I block the IP ranges more incidentally than specifically, since they are part of larger web farms.
I also block by UA, which catches not only the "legit" ones but Chinese fakers as well.
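The combined approach (deny by IP range, and also by UA) can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the CIDR blocks are reserved documentation ranges standing in for whatever you actually deny (they are NOT Internet Archive ranges), and `should_deny` is a hypothetical helper, not part of any server. The two UA substrings are the agent names mentioned in this thread.

```python
import ipaddress

# Placeholder ranges (assumption): reserved documentation blocks, NOT real
# Internet Archive ranges. Substitute the ranges you actually deny.
DENIED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Agent names mentioned in this thread; matched case-insensitively anywhere
# in the User-Agent header, so spoofed copies of the UA are caught too.
DENIED_UA_SUBSTRINGS = ("archive.org_bot", "ia_archiver")


def should_deny(remote_ip: str, user_agent: str) -> bool:
    """Return True if the request matches a denied range or a denied UA."""
    addr = ipaddress.ip_address(remote_ip)
    if any(addr in net for net in DENIED_NETWORKS):
        return True
    ua = user_agent.lower()
    return any(token in ua for token in DENIED_UA_SUBSTRINGS)


if __name__ == "__main__":
    print(should_deny("192.0.2.10", "Mozilla/5.0"))
    print(should_deny("203.0.113.5", "Mozilla/5.0 (compatible; archive.org_bot)"))
    print(should_deny("203.0.113.5", "Mozilla/5.0"))
```

Checking the range first and the UA second mirrors the point above: the range block catches the hosts regardless of what they call themselves, and the UA check catches anyone borrowing the name from some other address.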
Guess I forgot to mention the IP range was legit, assigned to Internet Archive. At one time their bots obeyed robots.txt, no longer it seems. I did email an abuse report to them. So far no answer.
|Guess I forgot to mention the IP range was legit, assigned to Internet Archive. At one time their bots obeyed robots.txt, no longer it seems. I did email an abuse report to them. So far no answer. |
Unlikely you will get one.
Currently have a long running battle with them... but goodness gracious, it is hard work against deep pockets (check out their political support).
Considering their part in archiving complete books and/or complete libraries, using the same high-priced scanning machines that Google Books uses, it is difficult to imagine they have time to communicate with and appease a meager webmaster.
That's me :)