archive.org bot

bad bot

     
7:46 am on Dec 19, 2013 (gmt 0)

keyplyr (Senior Member from US)

joined:Sept 26, 2001
posts:5803
votes: 64




So archive.org_bot requests robots.txt, where it is disallowed (as is its partner in crime, ia_archiver), then continues to crawl through my site, including requests for CSS and image files. Oh, and I forgot to mention: my domain has been excluded (by me) from the Internet Archive since the beginning of time.
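
For anyone who wants to double-check that their own rules really do say "go away" before blaming the bot, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content below is hypothetical (the actual file is not shown in this thread), and the helper name `allowed` is just for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption: the real file is not shown
# in this thread). Both Internet Archive tokens are disallowed site-wide.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent: str, url: str = "/index.html") -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    # can_fetch() matches on the bot's name token, so pass the bare token
    # (e.g. "archive.org_bot") rather than a full browser-style UA string.
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("archive.org_bot", "ia_archiver", "SomeOtherBot"):
        print(agent, "allowed" if allowed(agent) else "disallowed")
```

This only tells you what a compliant crawler should do with the file. It cannot make a non-compliant one behave.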

Webmasters: User Agent archive.org_bot is used for our wide crawl of the web. It is designed to respect robots.txt and META robots tags. Read more about Robots.txt. We try to crawl at a pace slow enough to not disrupt normal web activity. You can learn more in the Wayback Machine FAQs. If you notice archive.org_bot behaving badly, please contact us at bot.archive.org


Yeah, right... so now they are blocked by IP range.
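
A rough sketch of what blocking by IP range looks like if you do it in application code rather than in the server config, using Python's standard `ipaddress` module. The CIDR blocks below are reserved documentation ranges used as placeholders; they are not the Internet Archive's ranges, which are not listed in this thread. The names `BLOCKED_RANGES` and `is_blocked` are made up for the example.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks.
# These are NOT the Internet Archive's real ranges (assumption: substitute
# whatever ranges you have verified from your own logs and whois lookups).
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "2001:db8::/32")
]


def is_blocked(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside any blocked range."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Malformed address: not matched here, leave it to other checks.
        return False
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixing both families in one list is safe.
    return any(address in network for network in BLOCKED_RANGES)


if __name__ == "__main__":
    for addr in ("192.0.2.77", "198.51.100.1", "2001:db8::1"):
        print(addr, "blocked" if is_blocked(addr) else "allowed")
```

A request handler would call `is_blocked()` on the client address and return a 403 on a match.
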
7:48 pm on Dec 19, 2013 (gmt 0)

wilderness (Senior Member)

joined:Nov 11, 2001
posts:5408
votes: 2


keyplyr,
This thing has been a PITA for webmasters since its inception. They've never honored robots.txt, despite what they claim.
Initially, there was a belief that it could be used to document copyright and resolve plagiarism issues; however, that doesn't hold water. There also used to be a disclaimer on their website that they sold data by the terabyte, although that practice was stopped some years back.

A decade ago, they used a handful of IP ranges, which I failed to note when adding them to my denies.
At one point, I thought it would be a sound administrative decision to grant them access.
Unfortunately, I couldn't locate all their IPs to remove the "deny access" label from my websites and on their website. Over multiple communications, they just couldn't understand, nor did they ever provide all the IP ranges they spider from so that I could remove the denies.

In the end, I'm glad it worked out that way.

There are three major widget websites that get traffic leaps and bounds beyond that of all other widget sites. Until a couple of years ago, two of the major widget sites had all their old pages on archive.org; however, once the two majors implemented denies, all that old stuff disappeared from public access, which is what the majors intended.
8:55 pm on Dec 19, 2013 (gmt 0)

dstiles (Senior Member from GB)

joined:May 14, 2008
posts:3091
votes: 2


I block the IP ranges more incidentally than specifically, since they are part of larger web farms.

I also block by UA, which catches not only the "legit" ones but Chinese fakers as well.
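
For illustration, a minimal sketch of a UA check in Python. The two tokens are the ones named in this thread; the function name `is_blocked_agent` and the sample User-Agent strings are invented for the example, not copied from real logs.

```python
import re
from typing import Optional

# Tokens named in this thread. A case-insensitive substring search catches
# the token wherever it appears in the header, which is why it also catches
# impostors that merely copy the name.
BLOCKED_UA = re.compile(r"archive\.org_bot|ia_archiver", re.IGNORECASE)


def is_blocked_agent(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    if not user_agent:
        # Missing or empty header: not matched by this rule.
        return False
    return BLOCKED_UA.search(user_agent) is not None


if __name__ == "__main__":
    # Made-up sample strings, for demonstration only.
    samples = [
        "Mozilla/5.0 (compatible; archive.org_bot)",
        "ia_archiver",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]
    for ua in samples:
        print(repr(ua), "blocked" if is_blocked_agent(ua) else "allowed")
```

Since the header is trivially spoofed, this only stops agents that announce themselves (or pretend to), so it works best alongside the IP blocks rather than instead of them.
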
9:06 pm on Dec 19, 2013 (gmt 0)

keyplyr (Senior Member from US)





Guess I forgot to mention the IP range was legit, assigned to Internet Archive. At one time their bots obeyed robots.txt, no longer it seems. I did email an abuse report to them. So far no answer.
12:54 am on Dec 20, 2013 (gmt 0)

tangor (Senior Member from US)

joined:Nov 29, 2005
posts:6137
votes: 280


Guess I forgot to mention the IP range was legit, assigned to Internet Archive. At one time their bots obeyed robots.txt, no longer it seems. I did email an abuse report to them. So far no answer.

Unlikely you will get one.

I currently have a long-running battle with them... but goodness gracious, it is hard work against deep pockets (check out their political support).
5:13 am on Dec 20, 2013 (gmt 0)

wilderness (Senior Member)



Considering their part in the archival of complete books and/or complete libraries, using the same high-priced scanning machines that Google Books uses?

It is difficult to imagine they have time to communicate and appease a meager webmaster.
7:20 am on Dec 20, 2013 (gmt 0)

keyplyr (Senior Member from US)





...meager webmaster


That's me :)