


Internet Archive Named in Suit Over Archived Pages

Wayback Machine Thrust Into Modern Spotlight



7:44 pm on Jul 13, 2005 (gmt 0)

WebmasterWorld Senior Member digitalghost is a WebmasterWorld Top Contributor of All Time 10+ Year Member

The N.Y. Times is reporting that the Internet Archive is being sued.

The Internet Archive was created in 1996 as the institutional memory of the online world, storing snapshots of ever-changing Web sites and collecting other multimedia artifacts. Now the nonprofit archive is on the defensive in a legal case that represents a strange turn in the debate over copyrights in the digital age.

Original Article [nytimes.com]

Alternate Link [news.com.com]

This case could have a significant impact on caching and archiving practices. I'm sure Google will be watching the case closely, as they operate their own extensive caching system.

Ostensibly the case seems to hinge on robots.txt, but the real gist of this may be a decision on whether extensive caching is legal or not. Fair Use will be examined quite closely.

Last week Healthcare Advocates sued both the Harding Earley firm and the Internet Archive, saying the access to its old Web pages, stored in the Internet Archive's database, was unauthorized and illegal.

[edited by: Brett_Tabke at 8:30 pm (utc) on July 13, 2005]
[edit reason] fixed link - added quote [/edit]


11:10 pm on Jul 14, 2005 (gmt 0)

WebmasterWorld Administrator webwork is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

One copy on a server is not the same as it being displayed 1000 times a day. I would say that is 1000 copies. Especially since it is cached with IE and other browsers. So the one copy on the server is now 1000 copies on several computers.

Does that mean 1000s of people violated Time's copyright today by visiting Time's website and then leaving the website without clearing out their cache?

What if their child opens the browser and clicks into the browser's history whilst offline? Is that the moment of copyright violation?

I feel a case of bad law coming on. Maybe it's time to renew my print yellowpages ad? ;)

putting up a web page constitutes more than one copy.

"Putting up" a webpage? That's a new one. I'll have to think abou that. I'm not certain the file or web page is "put up" by Archive.org. They do deliver the file.

I somewhat doubt that Archive.org has made 1000s of copies of any one web page, one for each potential visitor, so Archive.org "has only made one copy". That's what the statute - on it's face - looks to say.

What I'd like to read here is the distinctions people draw between Napster type file sharing and Archive.org "file sharing". How (why?) is Archive.org's "file sharing" different or distinct from Napster's file sharing problem?

Isn't that, in substantial part, what many of you are hinting at? "Hey, Archive.org made and now they're distributing copies of copyrighted material. How can they do that if Napster can't?"

Looks like there are certain clear exceptions for archives, but what about the issue of "distribution"? Is an archive, to fit within copyright law, only to present the archive within the confines of the dusty shelves, even when the archive is of what was previously "distributed" via the WWW for public viewing?

Gotta take another look at what I first posted.


11:36 pm on Jul 14, 2005 (gmt 0)

WebmasterWorld Administrator webwork is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

Okay, to reply to myself:

§ 108. Limitations on exclusive rights: Reproduction by libraries and archives

(a) . . . it is not an infringement of copyright for a library or archives, . . . or to distribute such copy or phonorecord, under the conditions specified by this section, if -

(1) the reproduction or distribution is made without any purpose of direct or indirect commercial advantage;

Might that be a "gotcha" to those saying the problem lies in the distribution of "the copy" via the WWW?

If Archive.org passed a book, rapidly, from house to house by speedy courier, would the distribution to multiple parties of the same copy not pass statutory muster?

I sense the caching issue, since it's really not an intentional act but an automatic and innocent browser action, is a non-starter.

So, I'm going to rest on my debating laurels for a moment, in blissful ignorance, whilst someone blows holes in my splendid analysis. ;-P

Ghost, put down that analytical wit! Receptional, back away from that argument by analogy! No, not Baked_Jake quoting the law!


11:58 pm on Jul 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Interesting indeed, and that it finally got to a lawsuit, after all of these years!

How about this?:

The archive has only the one copy on their servers.
They are not distributing it at all.
They are, however, allowing others to copy it.

Discussions over the years have pored over topics like:

1) cache-as-copy, or is it legally/technically a "copy"?
2) if you put up an electronic copy of a file on a web server with the intention of allowing the public to access (download a new copy of) the file, are you not granting limited copy/store permission? And how far do those limits go?
3) How long can I wait before clearing my cache? How long does Google et al. have before their cache violates the intention of the publisher? And archivers?

How about this, too?:

Unless specifically indicated, there is no time limit on how long a digital copy can be retained by those for whom the electronic file was originally intended to be distributed. So ... are the public not the intended recipients of their copy of the file? And if so, how does the archive violate this, except perhaps by providing their copy for further copying beyond the original publisher's intent?

MP3s, MODs, HTMLs, whatever ...

IMHO, it is the USE of those files that gives the lawsuit its legs, not the simple fact that the copies exist. Building your business around other people's intellectual property is always a risky endeavour, even if the business is a non/not-for-profit.

<edit>Last note:

block access to any older versions already stored in the archive's database before a robots.txt file was put in place.

does not indicate that use of a robots.txt file will purge old content from the archive unless there was no robots.txt file in place when the bot hit the site for original archiving. If the site had a robots.txt file in place that allowed access to the public areas of the site, then the site owner decided to block robot access to their site completely using robots.txt, that's not a request for a purge ... it's a request to stop adding new content to the archive.

If there were no robots.txt file on the site, then one day the owner instituted one, that's a different story.</edit>


12:50 am on Jul 15, 2005 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

JD, HealthCare Advocates was complaining about the IA bot hitting their site, and ignoring robots.txt and gathering those pages for inclusion in the archive.

Then block the bot, block the IP. It's web 101, and the complaint is completely silly. I don't rely on robots.txt, as it's meaningless for the most part. For services I need, like Google and Yahoo, there will be rules for what they should and shouldn't index; all others get dropped into the master list of banned and blocked IPs and agent names.
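As a rough illustration of the "master list of banned and blocked IPs and agent names" idea, here is a minimal Python sketch. The network range, the agent list, and the function name `is_banned` are placeholders made up for the example (192.0.2.0/24 is a reserved documentation range), not anyone's real block list.

```python
import ipaddress

# Hypothetical deny lists. The range below is a reserved documentation
# network, used only as a stand-in for a real crawler's address block.
BANNED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
BANNED_AGENT_SUBSTRINGS = ["ia_archiver"]


def is_banned(remote_ip: str, user_agent: str) -> bool:
    """Return True if the request matches a banned network or agent name."""
    addr = ipaddress.ip_address(remote_ip)
    if any(addr in net for net in BANNED_NETWORKS):
        return True
    # Agent strings are matched case-insensitively as substrings.
    ua = user_agent.lower()
    return any(name in ua for name in BANNED_AGENT_SUBSTRINGS)
```

A server would call something like this before serving a request and refuse the ones that match. Unlike robots.txt, this does not depend on the bot choosing to cooperate, although a bot can still lie about its agent name.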


12:54 am on Jul 15, 2005 (gmt 0)

>> If Archive.org passed a book, rapidly, from house to house by speedy courier, would the distribution to multiple parties of the same copy not pass statutory muster?

Yep, that's the test IMO. Here instead, they have made millions of copies of the books, and are giving them away for free.


4:47 am on Jul 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

It's too complex for me, but I can say this. If I make a copy of a copyrighted MP3 and put it on the web, I think everyone agrees that it would be a violation of copyright no matter if I profit from it or not. If not, don't read on ;-))

Ergo, if someone takes a 'copy' of any other copyrighted material, and puts it on the web, using the same principle as above, it would violate copyright.

Don't get me wrong. Before the FL update, I had the timemachine banned from my site, after FL, I allowed it and still do to this day even though we have recovered from the FL update, so I don't have a problem with it.

It depends on if you call a served page a copy or not. And it's not up for us to decide. That's why we have judges.


3:00 pm on Jul 15, 2005 (gmt 0)

10+ Year Member

My question is this: if archive.org states on its site that you can remove your site by following a given procedure, and you follow that procedure and your site is not removed from the archive, is the archive liable to you for any damages this causes you? They do state that they will remove your site if you follow their robots.txt instructions. Does that statement not obligate them to remove your site, and make them liable for any damages non-removal may cause you?


3:29 pm on Jul 15, 2005 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

>> ...using the Wayback Machine, made hundreds of rapid-fire requests for the old versions of the Web site. In most cases, the robot.txt blocked the request. But in 92 instances, the suit states, it appears to have failed, allowing access to the archived pages. <<

I don't think they understand the internet at all. The robots.txt file does not *block* anything at all.


3:17 am on Jul 16, 2005 (gmt 0)

10+ Year Member

Archive.org gets their crawl from Alexa, about six months later. Brewster Kahle started Alexa, and sold out to Amazon. He retains some access and legal rights to Alexa by virtue of the sales agreement. Archive.org is a nonprofit spin-off from Alexa. You can see their Form 990 at Guidestar.org. It shows that there are significant overlaps between Kahle's foundation, Archive.org, and Alexa. Archive.org is technically a nonprofit, but not by much.

When you request a page from Archive.org, it makes a real-time check for an ia_archiver exclusion in robots.txt. If it doesn't find a robots.txt that includes an exclusion that is either a wildcard or specific to ia_archiver, it provides the file. If it times out after 20 seconds without connecting to the site, it also provides the file.
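The check described above can be sketched roughly as follows. This is only a reading of the behavior as described in this post, not Archive.org's actual code: the function names are hypothetical, the test is simplified to "is the site root disallowed for ia_archiver", and treating an HTTP error as "no robots.txt" is an assumption.

```python
import urllib.error
import urllib.request
from typing import Optional
from urllib.robotparser import RobotFileParser

LIVE_CHECK_TIMEOUT_SECONDS = 20  # the timeout figure mentioned in the post


def fetch_robots_txt(site: str) -> Optional[str]:
    """Fetch the live robots.txt; return None if the site cannot be reached."""
    url = f"http://{site}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=LIVE_CHECK_TIMEOUT_SECONDS) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError:
        return ""  # assumption: reachable site, but no usable robots.txt
    except OSError:
        return None  # covers URLError and timeouts: no connection made


def blocks_agent(robots_txt: str, agent: str = "ia_archiver") -> bool:
    """True if robots.txt disallows '/' for the agent, by name or by wildcard."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


def should_serve_archived_page(robots_txt: Optional[str]) -> bool:
    """Decide whether to show the stored page, given the live robots.txt."""
    if robots_txt is None:
        return True  # timed out or server down: the file is provided anyway
    return not blocks_agent(robots_txt)
```

Under this reading, `should_serve_archived_page(fetch_robots_txt(site))` would show the stored pages whenever the live site is unreachable or its robots.txt carries no matching exclusion, regardless of what the file said when the pages were crawled.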

Archive.org does not purge your pages when you put an exclusion in robots.txt. All it does is block access to your pages. If your server is down, or if you sell the domain and the new owner doesn't use robots.txt, all your old pages that Alexa crawled will pop up again in Archive.org.

You can send a letter or fax to Archive.org and get a permanent block on your domains that is not contingent on robots.txt. They do this because they know they're on shaky ground. Still, Alexa will crawl your site, and I'm not aware of any way to stop Alexa, except through an .htaccess block or route-table block of their crawler.

This is what I use on my Linux box to block the Alexa crawler:

/sbin/route add -net netmask reject


3:58 pm on Aug 30, 2006 (gmt 0)

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

Continued here:
