The Internet Archive was created in 1996 as the institutional memory of the online world, storing snapshots of ever-changing Web sites and collecting other multimedia artifacts. Now the nonprofit archive is on the defensive in a legal case that represents a strange turn in the debate over copyrights in the digital age.
Original Article [nytimes.com]
Alternate Link [news.com.com]
This case could have significant impact on caching and archiving protocol. I'm sure Google will be watching the case closely as they operate their own extensive caching system.
Ostensibly the case seems to hinge on robots.txt, but the real gist of this may be a decision on whether extensive caching is legal or not. Fair Use will be examined quite closely.
Last week Healthcare Advocates sued both the Harding Earley firm and the Internet Archive, saying the access to its old Web pages, stored in the Internet Archive's database, was unauthorized and illegal.
[edited by: Brett_Tabke at 8:30 pm (utc) on July 13, 2005]
[edit reason] fixed link - added quote [/edit]
joined:June 2, 2003
One copy on a server is not the same as it being displayed 1000 times a day. I would say that is 1000 copies. Especially since it is cached with IE and other browsers. So the one copy on the server is now 1000 copies on several computers.
Does that mean 1000s of people violated Time's copyright today by visiting Time's website and then leaving the website without clearing out their cache?
What if their child opens the browser and clicks into the browser's history whilst offline? Is that the moment of copyright violation?
I feel a case of bad law coming on. Maybe it's time to renew my print yellowpages ad? ;)
putting up a web page constitutes more than one copy.
"Putting up" a webpage? That's a new one. I'll have to think about that. I'm not certain the file or web page is "put up" by Archive.org. They do deliver the file.
I somewhat doubt that Archive.org has made 1000s of copies of any one web page, one for each potential visitor, so Archive.org "has only made one copy". That's what the statute - on its face - looks to say.
What I'd like to read here is the distinctions people draw between Napster type file sharing and Archive.org "file sharing". How (why?) is Archive.org's "file sharing" different or distinct from Napster's file sharing problem?
Isn't that, in substantial part, what many of you are hinting at? "Hey, Archive.org made and now they're distributing copies of copyrighted material. How can they do that if Napster can't?"
Looks like there are certain clear exceptions for archives, but what about the issue of "distribution"? Is an archive, to fit within copyright law, only to present the archive within the confines of the dusty shelves, even when the archive is of what was previously "distributed" via the WWW for public viewing?
Gotta take another look at what I first posted.
joined:June 2, 2003
§ 108. Limitations on exclusive rights: Reproduction by libraries and archives
(a) . . . it is not an infringement of copyright for a library or archives, . . . or to distribute such copy or phonorecord, under the conditions specified by this section, if --
(1) the reproduction or distribution is made without any purpose of direct or indirect commercial advantage;
Might that be a "gotcha" to those saying the problem lies in the distribution of "the copy" via the WWW?
If Archive.org passed a book, rapidly, from house to house by speedy courier, would the distribution to multiple parties of the same copy not pass statutory muster?
I sense the caching issue, since it's really not an intentional act but an automatic and innocent browser action, is a non-starter.
So, I'm going to rest on my debating laurels for a moment, in blissful ignorance, whilst someone blows holes in my splendid analysis. ;-P
Ghost, put down that analytical wit! Receptional, back away from that argument by analogy! No, not Baked_Jake quoting the law!
How about this?:
The archive has only the one copy on their servers.
They are not distributing it at all.
They are, however, allowing others to copy it.
Discussions over the years have pored over topics like:
1) cache-as-copy, or is it legally/technically a "copy"?
2) if you put up an electronic copy of a file on a web server with the intention of allowing the public to access (download a new copy of) the file, are you not granting limited copy/store permission? And how far do those limits go?
3) How long can I wait before clearing my cache? How long does Google et al. have before their cache violates the intention of the publisher? And archivers?
How about this, too?:
Unless specifically indicated, there is no time limit on how long a digital copy can be retained by those for whom the electronic file was originally intended to be distributed. So ... are the public not the intended recipients of their copy of the file? And if so, how does the archive violate this, except perhaps by providing their copy for further copying beyond the original publisher's intent?
MP3s, MODs, HTMLs, whatever ...
IMHO, it is the USE of those files that gives the lawsuit its legs, not the simple fact that the copies exist. Building your business around other people's intellectual property is always a risky endeavour, even if the business is a non/not-for-profit.
block access to any older versions already stored in the archive's database before a robots.txt file was put in place.
does not indicate that use of a robots.txt file will purge old content from the archive unless there was no robots.txt file in place when the bot hit the site for original archiving. If the site had a robots.txt file in place that allowed access to the public areas of the site, then the site owner decided to block robot access to their site completely using robots.txt, that's not a request for a purge ... it's a request to stop adding new content to the archive.
If there were no robots.txt file on the site, then one day the owner instituted one, that's a different story.
JD, HealthCare Advocates was complaining about the IA bot hitting their site, and ignoring robots.txt and gathering those pages for inclusion in the archive.
Then block the bot, block the IP, it's web 101 and completely silly as I don't rely on robots.txt as it's meaningless for the most part. For services I need like Google and Yahoo there will be rules for what they should and shouldn't index, all others get dropped into the master list of banned and blocked IPs and agent names.
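For anyone who wants to see the "master list" idea spelled out, here is a minimal sketch of that kind of check. It is not the poster's actual setup: the function name `is_blocked`, the two lists, and the address range (a reserved documentation range) are all hypothetical and only there for illustration.

```python
import ipaddress

# Hypothetical block lists. 192.0.2.0/26 is a reserved documentation range,
# used here as a stand-in for a real crawler's network.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/26")]
BLOCKED_AGENT_SUBSTRINGS = ["ia_archiver"]


def is_blocked(remote_ip: str, user_agent: str) -> bool:
    """Return True if the request matches a banned network or agent name."""
    ip = ipaddress.ip_address(remote_ip)
    if any(ip in net for net in BLOCKED_NETWORKS):
        return True
    # Agent names are trivially spoofed, so this is only a first filter.
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_SUBSTRINGS)


print(is_blocked("192.0.2.10", "Mozilla/5.0"))      # inside the banned /26
print(is_blocked("198.51.100.7", "Googlebot/2.1"))  # matches neither list
```

In practice the same two lists would more likely live in the web server or firewall config than in application code; the logic is the same either way.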
joined:Dec 29, 2003
Yep, that's the test IMO. Here instead, they have made millions of copies off the books, and are giving them away for free.
Ergo, if someone takes a 'copy' of any other copyrighted material, and puts it on the web, using the same principle as above, it would violate copyright.
Don't get me wrong. Before the FL update, I had the timemachine banned from my site, after FL, I allowed it and still do to this day even though we have recovered from the FL update, so I don't have a problem with it.
It depends on if you call a served page a copy or not. And it's not up for us to decide. That's why we have judges.
I don't think they understand the internet at all. The robots.txt file does not *block* anything at all.
When you request a page from Archive.org, it makes a real-time check for an ia_archiver exclusion in robots.txt. If it doesn't find a robots.txt that includes an exclusion that is either a wildcard or specific to ia_archiver, it provides the file. If it times out after 20 seconds without connecting to the site, it also provides the file.
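The decision described above can be sketched roughly as follows. This is only an illustration of the behaviour as stated in this post, not Archive.org's actual code: the function name `should_serve_archived_page` is made up, and passing `None` for the robots.txt text is my own stand-in for the "timed out / could not connect" case.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def should_serve_archived_page(robots_txt: Optional[str], path: str = "/") -> bool:
    """Decide whether an archived page would be shown.

    robots_txt is the text of the live site's robots.txt, or None if the
    site could not be reached in time (assumed convention for this sketch).
    """
    if robots_txt is None:
        # Per the post: no connection within the timeout means the page is served.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # False when a wildcard or ia_archiver-specific rule disallows the path.
    return parser.can_fetch("ia_archiver", path)


print(should_serve_archived_page("User-agent: ia_archiver\nDisallow: /\n"))
print(should_serve_archived_page(None))
```

Note that nothing in this logic deletes anything: the exclusion only changes the answer to "show it or not" at request time, which is the point made in the next paragraph.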
Archive.org does not purge your pages when you put an exclusion in robots.txt. All it does is block access to your pages. If your server is down, or if you sell the domain and the new owner doesn't use robots.txt, all your old pages that Alexa crawled will pop up again in Archive.org.
You can send a letter or fax to Archive.org and get a permanent block on your domains that is not contingent on robots.txt. They do this because they know they're on shaky ground. Still, Alexa will crawl your site, and I'm not aware of any way to stop Alexa, except through an .htaccess block or route-table block of their crawler.
This is what I use on my Linux box to block the Alexa crawler:
/sbin/route add -net 18.104.22.168 netmask 255.255.255.192 reject