As at 7 Jan 2011, the Internet Archive's "Archive-It" crawler was also crawling/collecting, intentionally ignoring robots.txt Disallow rules, under the UA name:
Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)
That bot visited my site from "ia360934.us.archive.org" (a reverse lookup that matches the Internet Archive's IP range), did request robots.txt, and was served a robots.txt that completely disallowed it[1]. Nevertheless, "archive.org_bot" went on to download other pages on my site.
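Incidentally, for anyone wanting that kind of reverse-lookup verification done for them, Apache can do it itself. A minimal sketch (this goes in httpd.conf, not .htaccess; note it adds DNS latency to every request, so many people prefer to spot-check by hand):

# "Double" does a reverse lookup on the client IP, then a forward
# lookup on the resulting name, and only records the hostname
# (%h in the logs) if the original IP appears in the forward
# results - i.e. spoofed reverse DNS never shows up in your logs.
HostnameLookups Double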
My research into why it may have disobeyed my robots.txt rules turned up the following:
"Starting January 2010, Archive-It is running a pilot program to test a new feature that allows our partners to crawl and archive areas of sites that are blocked by a site's robots.txt file."
[webarchive.jira.com...]
And on a related page they say the Archive-It crawler's name is "archive.org_bot", and:
"What should I do if a webpage I want to archive excludes Archive-It's crawler?
If you have asked a webmaster to unblock the Archive-It crawlers in the robots.txt file, but they will not do so, Archive-It is testing a new option (starting January 2010) that allows Archive-It partners to ignore the robots.txt block for specific sites. This feature will be made available to partners on a one-by one basis, so if you need to run crawls ignoring robots.txt, please get in touch with the Archive-It team and let us know what sites and why this is necessary. The feature will then be turned on for your account, and you will be able to specify which sites this should apply to."
[webarchive.jira.com...]
Probably needless to say around here: that bot will get a 403 Forbidden if it tries to visit anything on my site other than robots.txt in future, and I'm close to deciding to return 403 to the Internet Archive's entire IP range. (As some others have more or less remarked, I really can't comprehend how they think they can get away with what looks like blatant, specifically unauthorised copyright infringement, though I suppose they reckon most webmasters/authors/content providers can't afford to sue them.)
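In case it's useful to anyone doing the same, here's a minimal .htaccess sketch of that kind of block (Apache 2.2 syntax; the UA pattern comes from the string quoted above, and the hostname match is my assumption based on the "us.archive.org" reverse lookup, so adjust to taste):

# Flag any request whose UA contains "archive.org_bot"
SetEnvIfNoCase User-Agent "archive\.org_bot" bad_bot

# Deny flagged requests. "Deny from .archive.org" matches by a
# double-reverse DNS lookup on the client IP, so it also covers
# their IP range wherever reverse DNS resolves under archive.org.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from .archive.org

# Still serve robots.txt to everything, so the disallow rules
# stay visible to the bot.
<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>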
BTW, I don't think I've posted here before, or if I did it was way back around 2006 - mainly because until now I've never had anything to report about bad bot behaviour etc. that someone else hadn't already posted.
I'll also take the opportunity to say thanks *very much* to the numerous participants in this forum who've been so generous and helpful over the last five or so years with writing Apache rewrite rules and the like, providing blocking scripts, and identifying bad bots. I'm not naming anyone, because I'd most likely inadvertently leave out someone who's posted very helpful info.
[1] For clarity: my site's robots.txt has been generated on the fly by a script since about 2006, and it presents "allowed" rules only to (an unpublished list of) whitelisted crawlers/bots. All the others are served a robots.txt that tells them they're completely disallowed.
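For anyone who'd like the same effect without writing a full script, a rough mod_rewrite equivalent is to keep two static files and serve the "disallowed" one to any UA not on your whitelist. A sketch (the whitelist names and file name here are placeholders, not my actual unpublished list):

# .htaccess: send non-whitelisted UAs to the deny-all version
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot) [NC]
RewriteRule ^robots\.txt$ /robots-disallowed.txt [L]

# where /robots-disallowed.txt contains just:
# User-agent: *
# Disallow: /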