Forum Moderators: open
I find it hypocritical that the NYT, after using DMOZ (and hence Google) to promote its commercial products FOR FREE, wants to hassle Google over its temporary caches. I say, remove all NYT pages from DMOZ and Google directories. Let them pay $299/year for "review" by a well-known directory for each of those pages.
If they don't want it cached, then put the tags on the page.
Problem solved. But that's too easy, and nobody can sue anybody over it.
People put stuff on the WORLD WIDE WEB and then get upset when someone actually finds it.
For crying out loud, have all the sane people left the planet?
[edited by: mrguy at 11:30 pm (utc) on July 9, 2003]
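For reference, the tags in question are the standard robots meta directives; a page that shouldn't show up in Google's cache would carry something like this in its head:

```html
<!-- tells all engines not to show an archived/cached copy -->
<meta name="robots" content="noarchive">
<!-- or target Googlebot specifically -->
<meta name="googlebot" content="noarchive">
```

Either line leaves the page indexed and listed in the SERPs; it only removes the "Cached" link.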
Actually there has been a lot of discussion in blog circles about the NYTimes losing out on Google traffic because of hiding their content, and about how blogs and other sites offer news-related content (flame away), and that the NYT is shooting itself in the foot, opening the door for independent journalists/news sites and *cough* bloggers to fill the gap. They didn't like the NYT hiding its content and then blaming the blog community for filling in the SERPs.
What's weird is that I thought ALL of the NYT stuff was behind a password anyway. I remember signing up for a username the week it became possible, and my cookies always let me in.
GG, maybe it's a situation where there's a 7-day free window that gets cached, and then the article goes pay-only but stays in the Google cache for a while? Anyone who DOESN'T have an NYT username/cookies know how NYT handles this for the outside world?
>>If they don't want it cached, then put the tags on the page.
Sorry, doesn't work for me. What if I came by and "cached" all your content? And then 50 more people came by and "cached" your content? Are you going to ask people to start loading up their pages with different "nocache" tags?
What if I decided to "cache" Google's SERPS and make them available on my site? No difference that I can see. In fact, what if people suddenly decided to "cache" all the content here at WebmasterWorld and slap it up on their sites? I think Brett and the people here would throw a fit.
There's a huge difference between indexing content and copying the content. Google is copying your content and allowing people to view it from their site. The snippet they provide is enough and falls under fair use. I shouldn't have to take an extra step to keep anyone from copying my content and making it accessible on their site. After all, it is my content, what in the hell is it doing on their site?
Also, I think it is wrong that sites with the noarchive tag do not get a fresh date; after all, the date reflects the content, not the cache.
Plus, the cache often shows outdated information, which is no longer what your site is about.
But every smart webmaster knows that when you put something on the web, it can get grabbed by a search engine, unless you use robots.txt or the nocache and noarchive meta tags.
I was a consultant for one of the biggest independent newspapers here last year, and we managed the site that way: a short summary of each article, tagged noarchive and nocache, to give readers a reason to subscribe. In August 2002 (before the makeover) they received about 3k visits/month from Google. In January 2003 it was 100k/month ;-)
GoogleGuy: the only thing bugging me (and good ammo for lawyers in this case) is why Google doesn't appear in the WayBackMachine anymore in 2003?
I don't know the legality of it at all, but I hope it's not stopped.
So it's okay if I jack all your content as long as I make sure it doesn't blend in with my site? I just need to add some disclaimer?
This is DigitalGhost's cache of ht*tp://www.yoursite.com/.
DigitalGhost's cache is the snapshot that he took of the page as he crawled the web.
If you'd like to link to the original site, please, use my cache link.
Not to mention that sites using absolute positioning look like merde in Google's cache.
I don't know how to make this any clearer, but I'll certainly give it a shot. It doesn't matter if the copyright information is left intact or not and referencing the original doesn't give anyone permission to reproduce the content. Permission must be sought after and received.
Allowing anyone to reproduce your content without your permission sets a dangerous precedent. If you allow Google to reproduce your content without your permission, how can you protect your copyright against others?
If there are people that conclude that merely keeping the copyright information intact and providing a link to the original frees the content "borrower" from copyright constraints then I'd like to know how they arrived at that conclusion. The courts certainly don't agree with that line of reasoning.
There is a huge difference between an ISP caching a website and Google's cache. One thing is plumbing, the other is REpublishing. I say plumbing because digital data is not like buying a book. You don't physically take ownership of the electrons when you view a website. You are always looking at a cache on your computer when you deal with digital content.
But with an ISP cache, you type xyz.com and you receive the same content you would if they didn't cache. Google's site does NOT look the same. At an ISP, if you try to buy a product, it isn't through a cache. If you use JavaScript in the Google cache, it won't even work...
As SE-aware people, we love the feature. I really enjoyed seeing a potential date's blog. It was fascinating. But the injured party in a case of copyright infringement doesn't always feel as good as the party enjoying the benefit, just as my would-be date would not have been happy that I found out how much she liked Tom Green.
And we have to be fair to the owner of the content. That's why I say the default should be caching turned off and you can put in a tag to request caching.
Just my opinion, and it's one I came to AFTER reading DH's post. I will personally lose out if the cache goes away but I think it's the right thing to do....
>>people can sometimes call up snapshots of archived stories at NYTimes.com and other registration-only sites.
If that's true, where the heck did the bot get a password to access the registration only section?
If a bot can read it, humans can read it without the bot.
The Google cache sort of bothers me as well. At least it gets updated pretty frequently but it is still copying my page. I don't see that the rare occasion when a site is down is worth having the cache. OTOH I won't block it. I just wonder if Google should rethink it.
>>Permission must be sought after and received.
Permission is granted or denied in the robots.txt and page meta tags right? I'm not a robots.txt attorney, but it seems to me that Google's database archive of their old stuff, of our old stuff, that was allowed via robots.txt, is forever harmless unless it can be proven that they violated the robots.txt.
And password-protecting the news is a stupid business model that's costing them more. I sort of feel sorry for the NYT SEO staff if there are any.
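For what it's worth, robots.txt and the meta tag aren't interchangeable: robots.txt keeps the bot away from the pages entirely, while noarchive only suppresses the cached copy. A hypothetical robots.txt for a registration-only archive (the /archive/ path is made up for illustration) might look like:

```
# keep Googlebot out of the paid archive altogether
User-agent: Googlebot
Disallow: /archive/

# robots.txt has no cache-only directive; for that,
# the noarchive meta tag has to go in each page's head
```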
I really hate the WayBackMachine. I don't want anyone ever to see my early attempts at HTML
It's a neat idea - kind of nostalgic to see what Yahoo looked like in 1996. But yes, my site sucked when I first wrote it, too.
Fortunately it's pretty hard to find anything on the WBM.
Only because most people don't know about it.
>>And password-protecting the news is a stupid business model that's costing them more.
I don't mind that they want to charge people for reading articles. But it seems futile. If I can't get free information from them, I'll just get the same info somewhere else.
>>is forever harmless unless
Yes, if you don't want it to be seen, then you shouldn't let the bots get to it.
But I also understand DigitalGhost's opinion. If Google is allowed to cache your website and make that cache available to the public, why shouldn't others be allowed to?
The noarchive meta tag is working the wrong way around. There should be a tag like archive="yes", so that every site with this tag in the head gets cached and all other sites don't. That would be clear permission from the owner that Google is allowed to cache the site.
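As a sketch of the difference, here is today's opt-out tag next to the opt-in tag being proposed (the archive value is hypothetical; no engine actually supports it):

```html
<!-- today (opt-out): a page WITHOUT this tag gets cached -->
<meta name="robots" content="noarchive">

<!-- proposed (opt-in, hypothetical): only pages WITH this tag would be cached -->
<meta name="robots" content="archive">
```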
By the way, what do you think about archive.org? Copyright violation?
If the cache shows "pay for" content, then that is cloaking, since Google can see it but a user can't. Anyone doing so is breaking Google's TOS and should be banned, not complaining about copyright infringement. Besides, in that case people have SPECIFICALLY taken an action to defraud Google, and then want to complain about copyright theft. SHEESH!
Secondly, if you make money on ads, then you are likely rotating them, and they still probably show up in the cache; again, no loss of income. NOTE: Imagine getting paid for a clickthrough from a Google AdSense link on a cached copy of your page. Is that even possible? Tripped out if it is!
Thirdly, why the fuss? This topic is dead and buried!
In any case, my personal hope is that this eventually ends with the idea that content posted in a public space, like an unrestricted website, is a lot like making a speech on a street corner and is public domain. I don't think it's particularly dangerous to other copyright issues either.
Isn't there a site republishing a lot of GoogleGuy's comments from this forum? Maybe we'll get to see if these copyright concerns hold water at all. I don't think they do, but it'd be interesting to see who owns GoogleGuy's public comments-- the forum operator or Google? Maybe I can put a chalkboard by the sidewalk for people to write poetry on and then defend my copyright on it.
>If Google's wrong then so is just about every ISP
First, what google does is NOT caching and does not meet the definition of caching anywhere on the net. Placing a branding ad at the top of 3+ billion pages is not caching. Google has perverted the use of the word "cache".
We have looked at this issue in depth. I have asked every net attorney I know about it. At least a dozen knowledgeable internet and tech attorneys have told me flat out that Google's cache would not hold up in court. They have no legal leg to stand on. The "safe harbor" ISP exception of the DMCA is not applicable to Google, since Google does not meet any definition of caching.
>What's wrong with meta "NOARCHIVE"?
Because it is opt-out. You can't opt out of illegal matters. It's like saying that if you don't have a sign in your front yard saying stealing is not OK, then anyone can help themselves to your stuff.
>has a robots.txt
It is not a web standard accepted by any standards body. It has never been used in court, nor is it even admissible as evidence. Again, it is opt-out.
> not only do they make abundantly clear the location
> of the original, they also leave the
> copyright information intact
Yes, they put a black and white Google branding ad at the top of the page, where the insinuation is that it is Google's content.
> cache issue has been discussed (here)
> time & again for at least three years.
Yep, the first time I saw Google in '98 I asked Page **** about the page jacking.
"Many of us copyright lawyers have been waiting for this issue to come up: Google is making copies of all the Web sites they index and they're not asking permission," said Fred Lohman, an attorney at the Electronic Frontier Foundation. "From a strict copyright standpoint, it violates copyright."
In other words, Google has had 5 years to explain how they feel the cache is legal and why it is really used - they never have.
The Google cache is what built Google. No cached pages - no Google.
There are lots of services that share your information without you opting in. For instance, a phone book will display your name, address, and phone number unless you opt out.
I like Google caching my sites: it saves on bandwidth and provides a copy of the site just in case the server goes down temporarily. If you don't want your information "cached", then either 1) take it off the internet, 2) use the noarchive tag, or 3) use robots.txt.
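Since several posts above boil down to "check the page for a noarchive tag", here is a minimal sketch, in Python, of how a crawler might make that decision. The parsing is deliberately simplified (attribute order is assumed, and a real crawler would use a proper HTML parser plus a robots.txt library):

```python
import re

def may_archive(html: str) -> bool:
    """Return False if the page carries a noarchive robots meta tag."""
    # Look for <meta name="robots" content="..."> or name="googlebot",
    # case-insensitively; assumes name comes before content.
    pattern = r'<meta\s+name=["\'](?:robots|googlebot)["\']\s+content=["\']([^"\']*)["\']'
    for match in re.finditer(pattern, html, re.IGNORECASE):
        directives = [d.strip().lower() for d in match.group(1).split(",")]
        if "noarchive" in directives:
            return False
    return True

page_ok = '<html><head><title>news</title></head><body>story</body></html>'
page_no = '<html><head><meta name="robots" content="index, noarchive"></head></html>'
print(may_archive(page_ok))   # True
print(may_archive(page_no))   # False
```

The point is just that the opt-out check is trivial for a crawler to honor; whether opt-out is the right default is the whole argument of this thread.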