Forum Moderators: Robert Charlton & goodroi
The reason I ask is because I am coming across more and more stuff related to the cache, and it concerns me. What really concerns me is that I'm able to surf around and see stuff that is supposedly removed and not available to the public, yet I can get to it via the cache. I know that's the purpose of the cache, but in this case the cache is a risk.
Are any of you using noarchive on your pages? I am and have been for almost two years now. All new sites going online have the noarchive directive in most instances. There might be a few pages we leave open for cache but not many.
Am I being too paranoid?
Are you aware of any unusual stuff taking place using Google's cache mechanism? Any scraping? Anything?
That's the extent of my current concern.
tedster, I'm surprised! I think we get too busy working with all the "other challenges" and overlook a few that may be potentially damaging in the long term.
I'd really like to discuss the concepts of Cache Surfing, Corporate Intelligence, Personal Snooping, etc. That cache "looks" like poison to me and I'm using noarchive again on almost everything. I feel safer for some reason and I need for my peers to tell me why I'm feeling that way. ;)
That cache does not serve my site visitors at all, none whatsoever. That would be my main reason for not allowing content to be cached: they don't click those cache links.
Now, if the above is the case, who then is using that cache and for what purpose? How was I able to gain access to some stuff that I should not have had access to through cache? This was stuff that should have been behind a login. Maybe a potential vulnerability within the site that was cached?
I have all sorts of questions about this because I believe that Google cache is poison for websites. It is being scraped, regurgitated, redirected, cloaked, you name it. So why would I as a website owner want to allow that to happen?
Also, what happens if you have a bad data push and don't realize it until it is too late? You surely don't want that being cached, do you? I need some help further understanding this. I'm traveling into the abyss, I know...
I've come across some information that shows me exactly what Google cache "may" be used for. Heck, there are tools all over the place to surf the cache. Something ain't right there. And, I think the best suggestion is to add this to all pages if you just want to "feel safe" from it all. ;)
<meta name="googlebot" content="noarchive">

The robots term of noarchive will produce the following effect: "Google maintains a cache of all the documents that we fetch, to permit our users to access the content that we indexed (in the event that the original host of the content is inaccessible, or the content has changed). If you do not wish us to archive a document from your site, you can place this tag in the head of the document, and Google will not provide an archive copy for the document."
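For anyone unsure where that tag actually lives, a minimal sketch of page placement (the title is a placeholder, and the second tag is the engine-neutral variant for those who want to opt out of every archive link, not just Google's):

```html
<html>
<head>
  <title>Example page</title>
  <!-- Tell Googlebot specifically not to keep a cached copy -->
  <meta name="googlebot" content="noarchive">
  <!-- Or address all compliant crawlers at once -->
  <meta name="robots" content="noarchive">
</head>
<body>
  ...
</body>
</html>
```

The tag must appear in the head of each page you want kept out of the cache; it has no effect anywhere else in the document.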
Would be good if people fully understood that if they aren't careful an entire 'uploads' folder could be vulnerable.
Frankly, when I see pages not cached my first thought is that they're cloaking.
I used to think that too! And then I started doing some Cache Surfing.
Also, for some reason, the sites where noarchive is present seem to have less "abusive" activity than those that don't have it. That is just "my personal observation".
Run, Forrest, run!
Google, you'll have to accept my apology but I don't think that cache is good for some websites, particularly ecommerce and/or affiliate type sites. Nope, there's definitely something going on back there and I don't want to backtrack it. So, my best option right now is to...
<meta name="googlebot" content="noarchive"> I understand there are other ways (via robots.txt) to do this but for the most part, the above serves everyone.
It's been 8 months and I haven't noticed a change in traffic or rankings.
[edited by: classifieds at 11:59 pm (utc) on July 25, 2008]
If you have a lot of static pages and don't have the time to edit them all, a nifty solution for Apache servers is to use the X-Robots-Tag directive in the server headers. Example:
<Files ~ "\.html$">
Header append X-Robots-Tag "noarchive"
</Files>
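Worth noting that the header approach also covers what the meta tag can't: non-HTML files like the PDFs and images sitting in that 'uploads' folder mentioned earlier. A sketch, with the directory path being an assumption for illustration:

```apache
# Requires mod_headers, same as the <Files> example above.
# Meta tags can't be added to PDFs or images, so cover the whole
# directory with the same response header instead:
<Directory "/var/www/example/uploads">
    Header set X-Robots-Tag "noarchive"
</Directory>
```

Crawlers that honor the meta tag honor the equivalent X-Robots-Tag header, so this is one line of config instead of editing files you can't edit.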
But...as a user...I use Google Cache frequently, particularly on blogs that scroll endlessly. Google Cache IS a method that people can use to access and interface with your site. And as far as I'm concerned, the more ways people can interact with and access my site, the better.
risks:
1- liability (especially if you allow user-added content). It is bad enough you have to take it off your site, but having to go futz with Google as well is twice the burden.
2- price lists.
3- unknown legalities.
4- To prevent scraper sites from ripping your site out of Google en masse. A Chinese set of scrapers took billions of pages out of MSN and Yahoo a couple of years ago via the "cached" pages. They then distributed those pages and killed rankings for a lot of people for several months. Remember the "redirect" issues Google had? A lot of that was tweaked out by people scraping Google's copy of your site.
5- why would you allow someone else to make money off your page without taking your cut!?
Yep...this is excellent:
<Files ~ "\.html$">
Header append X-Robots-Tag "noarchive"
</Files>
> One of the steps in their process was to
> compare the cache to the page
Their copy of your site is unstoppable and has nothing to do with the "cache" page they display on your site's listings. When they surf the internal cache for hand-check reasons, it is a pure copy of your site coming off their disks, exclusively within their network(s).
> why they changed
Let me be blunt - Google is probably too scared to talk about this in public. They never have in their 10-year history - I doubt they will now. The only talking they are doing about "cached" pages is in Court with Viacom.
I really don't care from the standpoint of Google good/bad/evil/demon/saint. I care from the standpoint of Webmasters and site owners. Few people - if anyone - are looking out for us on this issue. We are flying blind and our sites are being used without permission. Only by putting some nonstandard code on your site are you able to stop it. The burden shouldn't be on us to stop them - it should be on them to ask for permission to use it. Putting up code to stop it is like saying that if you don't have a sign in your front yard saying "no stealing", then it is ok to come and take your stuff.
Think about it this way: they don't let people "cache" their search results. They republish the entire internet and yet won't let anyone on the internet republish their site?
related
[webmasterworld.com...]
Plus I've nothing to hide, and on more than one occasion I've rescued my own mistakes from Google's cache.
So while some people may have a problem - some for interesting reasons ;) - I find more plusses than negs.
And if it gets up the noses of the civil liberties brigade, it gets my vote! :)
FWIW, I'm broadly a Google fan, with reservations, YMMV.
[edited by: Quadrille at 2:30 am (utc) on July 28, 2008]
I find it to be an invaluable tool, a poisonous snake that can disable enemies, while avoiding the mongoose guarding the mistress' house.
It has nothing to do with whether or not you have anything to hide but it has everything to do with stopping people from taking things they shouldn't without permission.
Brett pretty much covered it above, although I could add a few more items. It's been covered over and over and over, such as in these threads:
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
It will be addressed again @ PubCon this year so the rest of my answers will be there if you attend - hint hint ;)
[edited by: incrediBILL at 6:49 am (utc) on July 28, 2008]
I guess that this quick evaluation could be a huge downside to spammers... nevertheless, when the cache link is not available, I do not waste my time with such sites...
when the cache link is not available, I do not waste my time with such sites
Explain why clicking the link to the site is harder than clicking the link to the cache?
Even when I did allow Google to cache my site I used javascript to redirect to the live page so you wouldn't have seen the cache either way unless you had javascript disabled or NoScript enabled in FF.
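That redirect trick can be as simple as a small per-page script in the document head. A sketch, where the hostname and page URL are placeholders - each page would carry its own live address:

```html
<script type="text/javascript">
// If this copy isn't being served from our own host (e.g. it's
// being viewed through Google's cache), bounce the visitor to
// the live page instead.
// "www.example.com" and the path below are illustrative only.
if (window.location.hostname !== "www.example.com") {
  window.location.replace("http://www.example.com/this-page.html");
}
</script>
```

As noted above, anyone with JavaScript disabled (or NoScript running) still sees the cached copy, so this is a nudge back to the live site rather than real protection.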