
Forum Moderators: Ocean10000 & incrediBILL


archive.org bot

bad bot

     

keyplyr

7:46 am on Dec 19, 2013 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month





So archive.org_bot requests robots.txt, where it is disallowed (as is its partner in crime, ia_archiver), then continues to crawl through my site, including requests for CSS and image files. Oh, and I forgot to mention: my domain has been excluded (by me) from the Internet Archive since the beginning of time.
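For anyone who wants to confirm that their own robots.txt really does disallow these two agents before blaming the bot, here is a minimal sketch using Python's standard `urllib.robotparser`. The rule set and the `example.com` URLs are assumptions made up for illustration, not the actual file from the site discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out both Internet Archive agents, allow everyone else.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""

def is_allowed(robots_txt, agent_token, url):
    """Return True if the rules permit agent_token to fetch url.

    agent_token should be the bare product token (e.g. "ia_archiver"),
    because the parser only looks at the part of the string before the first "/".
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)

if __name__ == "__main__":
    for agent in ("archive.org_bot", "ia_archiver", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "http://example.com/style.css"))
```

This only tells you what a compliant crawler should do with the file. It says nothing about whether the crawler actually complies, which is the complaint here.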

Webmasters: User Agent archive.org_bot is used for our wide crawl of the web. It is designed to respect robots.txt and META robots tags. Read more about Robots.txt. We try to crawl at a pace slow enough to not disrupt normal web activity. You can learn more in the Wayback Machine FAQs. If you notice archive.org_bot behaving badly, please contact us at bot.archive.org


Yeah, right... so now blocked by IP ranges.
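A rough sketch of what a block by IP range amounts to, written in Python with the standard `ipaddress` module. The ranges below are reserved documentation blocks (RFC 5737) used purely as placeholders. They are not the Internet Archive's ranges, and the names `BLOCKED_RANGES` and `is_blocked_ip` are made up for this example.

```python
import ipaddress

# Placeholder CIDR ranges (RFC 5737 documentation blocks), NOT real crawler ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked_ip(addr):
    """Return True if addr falls inside any of the blocked ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in network for network in BLOCKED_RANGES)

if __name__ == "__main__":
    print(is_blocked_ip("192.0.2.17"))   # inside the first placeholder range
    print(is_blocked_ip("203.0.113.5"))  # outside both placeholder ranges
```

In practice the same test usually lives in the server configuration or firewall rather than in application code. The logic is the same either way.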

wilderness

7:48 pm on Dec 19, 2013 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



keyplyr,
This thing has been a PITA for webmasters since its inception. They've never honored robots.txt, despite what they claim.
Initially, there was a belief that it could be used to document copyright and resolve plagiarism issues; however, that doesn't hold water. There also used to be a disclaimer on their website that they sold data by the terabyte, although that practice was stopped some years back.

A decade ago, they used a handful of IP ranges, which I failed to make a note of when adding them to my denies.
At one point, I thought it would be a sound administrative decision to grant them access.
Unfortunately, I couldn't locate all their IPs to remove their "deny access" label from my websites and on their website. Over multiple communications, they just couldn't understand, nor did they ever provide all the IP ranges they spider from so that I could remove the denies.

In the end, I'm glad it worked out that way.

There are three major widget websites that get traffic leaps and bounds beyond that of all other widget sites. Until a couple of years ago, two of the major widget sites had all their old pages on archive.org; however, once the two majors implemented denies, all that old stuff disappeared from public access, which is what the majors intended.

dstiles

8:55 pm on Dec 19, 2013 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I block the IP ranges more incidentally than specifically, since they are part of larger web farms.

I also block by UA, which catches not only the "legit" ones but Chinese fakers as well.
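As a sketch of the UA side, assuming a simple case-insensitive substring match on the two agent names mentioned in this thread (the function name and the sample User-Agent strings are illustrative, not taken from real logs):

```python
# Agent names from this thread, lowercased for case-insensitive matching.
BLOCKED_UA_TOKENS = ("archive.org_bot", "ia_archiver")

def is_blocked_ua(user_agent):
    """Return True if the User-Agent header contains a blocked token.

    The match does not depend on the source IP, so a request that merely
    claims to be one of these bots is caught as well.
    """
    if not user_agent:
        return False
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)

if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)",
        "ia_archiver",
        "Mozilla/5.0 (Windows NT 6.1; rv:26.0) Gecko/20100101 Firefox/26.0",
    ]
    for ua in samples:
        print(is_blocked_ua(ua), ua)
```

Because the check is on the header alone, it works whether the request comes from the real owner's range or from an impostor elsewhere, which is why it complements an IP block rather than replacing it.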

keyplyr

9:06 pm on Dec 19, 2013 (gmt 0)






Guess I forgot to mention the IP range was legit, assigned to Internet Archive. At one time their bots obeyed robots.txt, no longer it seems. I did email an abuse report to them. So far no answer.

tangor

12:54 am on Dec 20, 2013 (gmt 0)

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



Guess I forgot to mention the IP range was legit, assigned to Internet Archive. At one time their bots obeyed robots.txt, no longer it seems. I did email an abuse report to them. So far no answer.

Unlikely you will get one.

Currently have a long running battle with them... but goodness gracious, it is hard work against deep pockets (check out their political support).

wilderness

5:13 am on Dec 20, 2013 (gmt 0)




Considering they're part of the archiving of complete books and/or complete libraries, and are using the same high-priced scanning machines that Google Books is using?

It is difficult to imagine they have time to communicate and appease a meager webmaster.

keyplyr

7:20 am on Dec 20, 2013 (gmt 0)






...meager webmaster


That's me :)
 
