Forum Moderators: open
Staying off the radar.
Search engines can store results in their "cache" for anywhere from a month to forever. As archiving improves, it will get harder to clean up what's been revealed. Leaks are rarely intentional: somebody at work might post a file on a server to download at home, a wrongly configured server might make too much of a hard drive searchable, or a Web site's password protection might be flimsy enough to be accessible to search engines.
I don't have a problem with caching being an opt-out service, but I do have a problem with the fact that the only way you can opt out is by adding/editing a meta tag on every page. For a dynamic site, this may not be an issue, but with static HTML it most certainly is. In addition, search engines now cache non-HTML files such as .doc files. The only way you can prevent such files from being cached is to exclude them entirely using the robots.txt file. HOWEVER, because of its pathetic specification, you are also required to place all such files in their own directory. THAT IS ABSOLUTELY BANG OUT OF ORDER.
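For reference, the per-page opt-out being complained about is a robots meta tag; a minimal sketch (Google honours noarchive - whether every other engine does is an assumption here):

```html
<!-- Keep the page indexed and listed in the SERPs,
     but ask the engine not to show a cached copy of it -->
<meta name="robots" content="noarchive">
```

For non-HTML files the only lever is robots.txt, which can exclude only by path prefix (e.g. `Disallow: /docs/`) - hence the complaint about having to herd such files into one directory.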
Kaled.
A lot of the argument here seems to centre around "cached" pages, on the assumption that disallowing full (with markup) page caching would somehow prevent the original indexed content from being reconstructed at a user's request.
That’s right - it is possible to recreate the textual content of a page (i.e. the words in the order they were used on the original page) from search engine indices, albeit without all the fancy HTML formatting (some might be retained and shown, though). This won't be as nice looking as a full cached page as we know it, but it can provide sufficient information to display to the end user, even for sites that opted out of "caching".
So I am asking - what’s next? Perhaps trying to argue that search engines should not try to display recreated content of the page from their indices as it is somehow wrong and against wishes of whoever made this information public in the first place?
Personally, I used cached pages only at times when the sites in question were down (sometimes permanently). I think allowing all public pages to be cached by default is a fair compromise between the free traffic that natural listings generate for said websites and the needs of search engine users to access said pages when they are no longer up.
The bottom line is that if your public document was indexed, then that’s it; the only way around it is not to have that document indexed in the first place, which in other words means not making it publicly accessible. A perfectly reasonable compromise in my view.
I guess we are going in circles, though that wasn't the original topic of this thread.
A provision in the Digital Millennium Copyright Act (DMCA) includes a safe harbor for Web caching. The safe harbor is narrowly defined to protect Internet service providers that cache Web pages to make them more readily accessible to subscribers. [...] Various copyright lawyers argue that the safe harbor may or may not protect Google if it were tested.
"We've evaluated this from a legal perspective, including copyright law, and have determined that Google's cached page service complies with the law," a Google spokesman said.
So in all the time it has been available, since 1997, no lawyer has "tested" the legality of the Google page cache. That may speak for itself.
also see:
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
Suppose a visitor to your site likes it, and decides to put a link on her site. But in addition to linking she stores a static version of your site on her server, and includes a second link: "Click here for my copy of the site."
How many would agree that anyone whose site is publicly accessible has no right to complain about this?
Click here for my copy of the site.
But it's not a copy of your site. Go to a Google cache page and try to navigate using it. Do the internal links go to the cached versions of the other pages? No - they all lead to your site. So, instead of giving just one link to your site, they are giving dozens of links to every page in your menu, header, footer...
What's the problem with that?
As you can't arbitrarily ban caching for spiders or bots (because any definition of what constitutes a bot is necessarily arbitrary), you have to ban caching for everybody - including in their browser cache. You also can't ban spiders from accessing your site, for the same reason - you'd have to ban everybody.
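For what it's worth, the blunt "ban everybody" instrument does exist at the HTTP level. A hypothetical Apache .htaccess fragment (this assumes mod_headers is enabled on your server):

```
# Forbid caching for all clients alike - browsers, proxies, spiders -
# at least for anything that honours standard HTTP caching headers
Header set Cache-Control "no-store, no-cache, must-revalidate"
Header set Pragma "no-cache"
Header set Expires "0"
```

Of course, this only binds well-behaved clients; a spider that ignores the headers can still keep a copy, which is rather the point being made above.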
"We've evaluated this from a legal perspective, including copyright law, and have determined that Google's cached page service complies with the law," a Google spokesman said.
TOSH... Google's cache service is patently outside the law. If they were caching music belonging to Madonna, etc., lawsuits would have proven this by now. However, because they haven't yet upset anyone with deep pockets, Google will continue their caching service - actually I quite like it, but it is not legal.
Incidentally, under no stretch of the imagination could Google be considered an ISP. Only an idiot lawyer with zero comprehension of the way the internet works could possibly think otherwise.
Kaled.
Google's cache service is patently outside the law.
No, it's clearly within the law. Google caches pages, not sites, and in accordance with standard caching rules as well as those of fair use.
Nobody is seriously disputing the legality of it. Google couldn't have done its IPO with a legal threat of that nature hanging over them. In fact, I don't believe they make the slightest mention of it in their IPO disclosure documents.
Let's not get into the politically-charged arena of music downloads, shall we?
Google couldn't have done its IPO with a legal threat of that nature hanging over them. In fact, I don't believe they make the slightest mention of it in their IPO disclosure documents.
From the prospectus:
We have also been notified by third parties that they believe features of certain of our products, including Google WebSearch, Google News and Google Image Search, violate their copyrights. Generally speaking, any time that we have a product or service that links to or hosts material in which others allege to own copyrights, we face the risk of being sued for copyright infringement or related claims. Because these products and services comprise the majority of our products and services, the risk of potential harm from such lawsuits is substantial.
No, it's clearly within the law. Google caches pages, not sites, and in accordance with standard caching rules as well as those of fair use.
So if I designed a system that copied other people's material and displayed it on my own site, that would be OK provided that I used the defence in court that "I only copy pages, not sites".
I don't know whether to laugh or cry!
If Google are so utterly confident that they are within the law with respect to caching, then answer me this question. Why does Google not display adverts at the top of the cached page?
The answer is simple : suddenly everyone would agree that caching was a breach of copyright.
Incidentally, what happens with respect to existing adverts on cached pages, pay-per-click, etc.? Are Google 100% certain that no-one loses any money?
Kaled.
Kaled - the point about caching pages, not sites, is that the main SERPs show one link to your site and one link to the cached image of that particular page. If you choose the link to the cache, you are presented with a copy of the page, with all the internal links rewritten to provide dozens of straight links to dozens of other pages on your site. Google is doing some great marketing for you via the cache as well as via the SERPs, for free. What's the problem again?
Why does Google not display adverts at the top of the cached page? The answer is simple: suddenly everyone would agree that caching was a breach of copyright.
Of course they would. But the situation would be different. They don't, so it isn't.
I thought we were having a discussion among adults and not a 5-year-old's "crybaby" conversation.
Sure had me fooled. Personally I think the whole debate is kinda silly, as Google isn't doing you any harm - any good business person knows there is a difference between a LOSS and a NON-GAIN. They're only profiting from your content by LINKING to your content, not by displaying it themselves. If you understood anything about how search engines work, you would realize that they have to cache your pages and documents in order for them to be properly indexed and come up in searches when people are trying to find you. Google is providing a valuable service to you by linking to you; if you want them to stop, they'll be glad to.
But anyway, to all of you people who oppose the way Google manages their cached pages: Would it make a difference if they didn't offer the "cached" link on their site? They would still of course HAVE your page cached, as it's necessary for the search engine to work properly, but what if they weren't showing that link to anyone? Would you care as much then?
If Google are so utterly confident that they are within the law with respect to caching, then answer me this question. Why does Google not display adverts at the top of the cached page?
Maybe because they realize that doing so would be unethical?
It might be illegal, too, of course, but that's a question that has yet to be addressed in court--just like the ad frames that About.com and a few other big sites display above linked third-party pages.
Why hasn't the issue of Google caching been addressed in court? Probably because there aren't any lawyers who are willing to take such a case on a contingency basis. After all:
- It's hard to see how caching is hurting anyone;
- Google isn't profiting from the cached page (on the contrary: it's spending money to cache the page);
- Web publishers can opt out of caching with the simple addition of a meta statement; and...
- Google could legitimately argue that caching is a benefit for the Web publisher, since it ensures that users will be able to see what a page is about even if the publisher's site is down (in which case the user may choose to return later).
The web is Allow All by default - by placing a document on a public server running a service on port 80 serving such documents you have given permission.
This is very nicely put. But permission to read is not the same as permission to promote or republish. It's like having a wild party on the beach with friends and finding the pictures on the front page the next day along with your address and phone number. The party was for those who were there, not for the man in the street.
Back to the web - no doubt a 'normal' static or dynamic website (HTML, PHP, etc.) is usually meant for the general public, and there is no harm in indexing it unless the owner says no. But for other types of documents this may be different, and I still think they should be left alone - unless the owner decides otherwise.
This way, non-savvy hobby webmasters don't have to wait until it's too late. We all know Google doesn't just follow links but uses the toolbar as well. Savvy webmasters who want their stuff indexed know what to do: make all docs searchable.
And some data simply shouldn't be put on the web in any form - that's also the publisher's responsibility.
Using a cache to spider the page is completely different in every way from using it to display the page! Why? Because the former is for internal use and is not being displayed as their own service.
Personally, I couldn't care less about this whole topic, just as long as they provide traffic to my web site, but it's not about how you or I feel - it's about whether they should be allowed to display it without my permission.
Google could legitimately argue that caching is a benefit for the Web publisher, since it ensures that users will be able to see what a page is about even if the publisher's site is down (in which case the user may choose to return later).
Let's say I created a site and it's copyrighted; they (Google or Yahoo!) then display the cached copy to their users without my permission - that is against copyright law!
No, it's fair use - as you have specifically permitted it. Is the title of the page copyrighted? The description? If so, then just displaying it in the SERPs would be illegal if it was not for fair use. If you don't want it used, why was it published? If you don't like the cache, why do you accept that the page title and description are used in the SERPs? Am I flouting your copyright if I keep a copy of your page in my browser cache or keep a bookmark using the page title?
Copyright law was designed to protect the economic interests of writers.
No, it was designed to protect the public interest: "To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries;"
Unfortunately, not many seem to remember this anymore.
As for those of you that are convinced that it is stealing, it is categorically NOT stealing. You cannot steal a copyright, you can only infringe it.
It is legal for Google to read everything posted on the web, just as it is legal for anyone to read copyrighted material stapled to a telephone pole.
Commercial use is only one factor considered in the Fair Use analysis.
Almost everything Google does falls well within allowable use of Fair Use. Displaying cached pages is just about the only thing that could possibly get them in trouble. But before you decide that it definitely does qualify as infringement, I suggest that you read Title 17, section 512. They WILL use this as a defense (who knows if it will work).
And before you decide that you are going to press the issue, have you made sure that you are going to be able to make enough from the suit to pay your lawyer? Most book and music publishers have; most web publishers have not.
You might also keep in mind that judges use Google a lot, and they are likely to view an attack on search engines as being against the public interest. That is not something that you want to have happen if you are involved in a lawsuit - ask any lawyer. It doesn't mean that you will lose, but it doesn't exactly help your chances any.
Some here argue that just because it is "a page" they cache, not an entire website, it constitutes fair use. I wonder about two things:
What is the most basic publishing unit of the web? I would argue that pages are the basic unit and that each page (with exceptions, of course) constitutes a complete work, much in the way that a short story in an anthology could constitute a complete work even though it is only part of a book.
Even if it is the site and not the page that constitutes the complete work, doesn't Google cache the whole site anyway, page by page?
If you wrote a 100-page book and I cut it up and republished every single page in my 1000-page book (mixing up the order), is that fair use?
And for the record, I am a googlephile and I also think absolutely anything on the web is fair game for downloading and deeplinking. But I still question whether serving the cache up for public consumption is legit.
Yowza, you keep comparing caching your content to selling your content. Where is a cached copy making someone money? Google does not run ads next to your cached content, and neither does any other archiver that I know of.
Imagine going to the newspaper stand at 4:00 in the morning to get the first delivery. If I stamped my company's logo and phone number on the top of each page of all the newspapers and put them back in the machine, would I be doing something legal? Are you saying that I wouldn't benefit from this form of advertisement on every page?
The owner of a newsstand is benefiting by holding the newspapers until they are bought.
No, they are already purchased. Search on First Sale Doctrine.
Now, if it is considered Fair Use (I'm not claiming that it is) for Google to show that cached page, putting a distinct header above it is almost certainly *required* to meet the analysis.
Look again at what is in that header. It is all about giving credit and explaining that it is a cached page. They do not even link back to their own home page.
.htaccess is supposed to be a "hidden" file. Using default setups in most FTP programs, you can't even see it listed in a directory you have explicit permission to use.
These are CONTROL MECHANISMS to be USED by engines...not to be treated as "public pages". They display not only the policies of the websites they reside on, but they expose to the world which directories the webmaster wants to protect, greatly reducing the amount of time required to get this info by "brute force" methods commonly favored by "script-kiddie" hackers.
These documents, particularly .htaccess, are available to the public ONLY through the convenience of the search engine and its mistaken caching/indexing of them. They are NOT "public" documents, and should not be available for "public" review in this manner.
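For what it's worth, the belt-and-braces fix is to have the server itself refuse to serve these control files, so it doesn't matter what a spider or browser requests. A sketch for Apache httpd (classic Order/Deny syntax; assumes an Apache server where such directives are permitted):

```
# Refuse to serve .htaccess, .htpasswd and friends over HTTP,
# regardless of who or what asks for them
<Files ~ "^\.ht">
    Order allow,deny
    Deny from all
</Files>
```

Many default Apache configurations already include a block like this; the exposures described above typically come from servers where it has been removed or never applied.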
I would be perfectly happy to lock certain doors in my house. But if a banner is placed on my front lawn telling everyone which rooms I have locked, it kind of cuts into my security precautions, doesn't it?
A web server is NOT inherently a "public" place. Search engines assume we don't mind them building their business by telling people what's on our pages.
In most cases, this is correct. In some, it's not. (How many of you are running the default installation of MS "Personal Web Server" at your houses? We may be able to find out by doing a simple search...)
Hmmm... I think I'll start a new business by copying (not "stealing" - I'll leave the paper versions intact) all of the listings in every phone book, publishing the results (searchable, of course), and selling advertising for my income. After all, what is more "public" than a telephone listing?
There is much more to intellectual property law than is being discussed, here.