Forum Moderators: goodroi


Caches, Copyrights and Publishing FOREVER

New thoughts

         

Clark

8:43 pm on Apr 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The concept of copyrights and caches and fair use have been discussed quite a bit here in the past. And as there are battles in court, it is likely to be discussed some more. One aspect I haven't really seen covered is whether publishing something on the Internet once, even by mistake, means you made it public forever.

Say Google and Webarchive and other archiving organizations saw your site on a particular date. Do they have the right of fair use to keep some version of it forever?

I put this thread in Robots.txt because perhaps there should be a protocol to say, anything on this site can be cached for X days. Destroy after that.
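One way such a protocol extension might look, purely as a sketch -- the `Cache-expires` directive below is hypothetical and not part of any real robots.txt standard:

```
User-agent: *
Disallow: /drafts/
# Hypothetical extension: caching is permitted, but cached
# copies must be destroyed after 30 days.
Cache-expires: 30d
```

Existing parsers would simply ignore an unknown line like this, which is one reason extensions to robots.txt are at least technically feasible.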

Rosalind

9:04 pm on Apr 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Say Google and Webarchive and other archiving organizations saw your site on a particular date. Do they have the right of fair use to keep some version of it forever?

They don't have the right to republish it even once without your permission. No-one does.

Of course you can block them from keeping a copy by specifying you don't want a cache kept:

<meta name="robots" content="noarchive">

The trouble with the robots.txt standard is that it's so inflexible. It might be a good idea to extend it, and personally I like the idea of having a cache expire after a specified time. But it will never happen, because the W3C like to keep it simple. Perhaps with good reason, because there are a lot of broken robots.txt files out there that bots will just choke on.

Lord Majestic

9:46 pm on Apr 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



They don't have the right to republish it even once without your permission. No-one does.

In many jurisdictions, book and newspaper publishers are legally obliged to provide copies to certain government entities (I think it's the Library of Congress in the USA). More importantly, anyone can visit a local library to read that content for free and photocopy it under fair use.

This is no different to what search engines do. The only difference is that content owners are not legally obliged to provide unrestricted access to that content - you can actually disallow it via robots.txt - so on balance, digital content owners online enjoy much greater freedoms than their counterparts in the real world.

Pfui

9:57 pm on Apr 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Interesting questions, thanks. But, um...

Actually, under U.S. federal law, the "Doctrine of Fair Use [copyright.gov]" is a possible defense to a claim of copyright infringement, not a "right" per se, and it includes specific elements which must be met to be successful. Whether or not the elements are satisfied by an alleged infringer is the crux of legal causes of action, and the substance of many a heated courtroom- and message board-based debate. (And the latter rarely address common law copyright, let alone international considerations.)

Thus your post poses multiple sticky wickets, including legally semantic ones:) Because as-is, your questions are kind of like asking --

"If you left your front door open by mistake, and robbers came in and took things (or heck, even took pictures of things), do they have the right to keep some of what they took? Or perhaps there should be a law saying they must they destroy things after X days?"

So if I may shift things away from 'laymen and Things Legal' and more directly to 'webmasters/site owners and robots.txt'...

1.) Problem? What problem?

Ideally protective and/or guiding protocols already exist -- use robots.txt to curtail unwanted robots, and page-based tags, e.g., NOARCHIVE, to control wanted ones.

Alas, there are numerous, oft-lamented, completely unsolved problems with robots.txt -- it's an arguably voluntary protocol; different engines have made up their own Allow/Disallow rules; it's increasingly ignored (if not used outright as a basis for abuse); and some file types cannot carry HTML-based tags at all.

2.) Solutions? Alternatives?

Beats me.

I'd like to see robots.txt 'elements' standardized, and checking for -- and heeding -- robots.txt a required, unable-to-be-overridden part of ALL programs, from browsers, extensions and toolbars to every robot, crawler, spider, link-checker and what-have-you. Then again, I'd like all of See's Candy to be calorie-free, too:)
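A sketch of what "checking for and heeding robots.txt" looks like in practice for any client program, using Python's standard-library parser (the bot name and URLs are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A site's robots.txt, parsed here from a literal for illustration;
# a real tool would fetch it from http://example.com/robots.txt first.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Any well-behaved program (crawler, link-checker, toolbar) would
# consult this before each request it makes.
print(parser.can_fetch("ExampleBot", "http://example.com/private/page.html"))  # False
print(parser.can_fetch("ExampleBot", "http://example.com/index.html"))         # True
```

The catch, of course, is that nothing forces a program to run this check -- which is exactly the enforcement gap being lamented above.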

So until some 'body' somewhere either requires adherence to robots.txt, or somebody programs an easy-to-use Web whitelist (or blacklist) program, Apache's mod_rewrite module is my friend. FWIW
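For what it's worth, the mod_rewrite approach mentioned above might look something like this -- the user-agent names are placeholders; you would match whatever misbehaving bots show up in your own logs:

```
# In .htaccess or the vhost config, with mod_rewrite enabled
RewriteEngine On

# Refuse known-bad user agents with a 403 Forbidden
RewriteCond %{HTTP_USER_AGENT} BadBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EvilScraper [NC]
RewriteRule .* - [F,L]
```

Unlike robots.txt, this is enforced by the server itself, so it does not depend on the bot's good manners -- though bots that forge their user-agent string will still slip through.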

Lord Majestic

10:54 pm on Apr 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"If you left your front door open by mistake

So you want to sue people, saying that you made content on your website public by "mistake", thus accusing them of stealing your property? Well, it would be an interesting case in court, for sure!

My post was about a similar situation in the real world. I am sure book and newspaper publishers don't like people photocopying or reading their articles for free in libraries. In fact, I believe (and correct me here if I am wrong) publishers in the USA have to provide copies of every book they print to the Library of Congress at their own expense, so sure as hell they don't like that - but this is done for a good reason: there are considerations that outweigh copyright.

The same thing should and will happen with online content - it's just a matter of time. So when you complain about archiver.org or whoever, say thanks that there are currently no laws that legally oblige you to submit your content and its changes to something like library-of-congress.gov - and take it easy :)

But I fully agree - further development of robots.txt is really necessary - this development should give better control to publishers over how their content is handled by automated programs like those that crawl for search engines.

Pfui

11:26 pm on Apr 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Lord Majestic, I'm sorry but I have trouble following your legal reasoning (in both of your posts), so I still hope this thread becomes more than yet another debate about who thinks copyright means what, and where, and when, and how, and why -- all topics already entertained (and/or endured:) by the "Content, Writing and Copyright [webmasterworld.com]" Forum on a near-daily basis.

That said, it's nice to see that re the following, you and I very much agree:

"Further development of robots.txt is really necessary - this development should give better control to publishers over how their content is handled by automated programs like those that crawl for search engines."

Amen.

Now the question is -- developed by whom? And for what benefit beyond the praise of a million small site owners? (Hmmm... Make a product, sell it even for a buck -- hey, that's a lot of praise:)

Clark

1:34 pm on Apr 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have a problem with some of the analogies used.

Google is not a library. The Library of Congress is a government body. Public Libraries are not for profit. Google, Yahoo etc are. Webarchive is a non-profit arm of the for profit Amazon.com. But it isn't government-run.

And although sometimes people consider opening a website as being analogous to being a publisher, it isn't really quite the same.

You may have only a logo on your site. Are you publishing? Well that same logo may be what's in front of your physical store. And being a store is definitely not being a publisher, although someone can conceivably take a picture of your store's logo and send it to the Library of Congress.

Lord Majestic

12:18 pm on Apr 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Now the question is -- developed by whom? And for what benefit beyond the praise of a million small site owners?

In theory, any big search engine would benefit from a better version of robots.txt, and so would small site owners - it would allow better control of crawl traffic, thus reducing waste.

Google is trying to achieve this via Sitemaps. However, even though it's supposedly an "open" standard, they chose to encourage people to submit data directly to Google, so it's not usable by other SEs - little surprise that overall Sitemaps did not succeed in a big way; at least it did not overtake robots.txt. Perhaps there are few webmasters who care about these details?
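For reference, a minimal Sitemaps file is just an XML list of URLs with optional change hints -- the URL and dates here are made up:

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2006-04-01</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
```

Note that, unlike robots.txt, this tells crawlers what to fetch rather than what to leave alone -- which is why it complements rather than replaces the robots exclusion protocol.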

Clark: time will tell, for now courts seem to accept that what search engines do is under fair use - at least insofar as text search is concerned.