Forum Moderators: Robert Charlton & goodroi
The reason I ask is because I am coming across more and more stuff related to cache and it concerns me. What really concerns me is that I'm able to surf around and see stuff that has supposedly been removed and isn't available to the public, yet I can get to it via the cache. I know, that is the purpose of the cache. But in this case, the cache is a risk.
Are any of you using noarchive on your pages? I am and have been for almost two years now. All new sites going online have the noarchive directive in most instances. There might be a few pages we leave open for cache but not many.
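For reference, the noarchive directive being discussed here is just a robots meta tag in each page's &lt;head&gt;. Both of these forms are honored; the second form applies only to Google:

```
<meta name="robots" content="noarchive">
<meta name="googlebot" content="noarchive">
```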
Am I being too paranoid?
Are you aware of any unusual stuff taking place using Google's cache mechanism? Any scraping? Anything?
I always thought cloaking too when I used to see no cache, but in many cases I think it's worthwhile to keep pages from being cached by the engines regardless of how they are delivered.
> I always thought cloaking too when I used to see no cache
I was in the same boat as everyone else. I sure wish I had seen the forest for the trees and done the noarchive thing years ago.
I know that the initial test site where we added it has made some improvements over time. Who knows if it was the addition of the noarchive directive? I do know that the 404s and other related errors for that site dropped considerably not long after adding it.
If you want to visit our sites, use the link from the SERPs like most others do. If our site is down, which it shouldn't be, then please do come back later. The information in the cache was probably out of date anyway.
And, if you knew of an exact phrase that appeared on the page you were looking for on that site, just search Google for it and you can choose any one of the "scraped" versions and click on our AdWords listing in their blended AdSense implementation. :(
> I always thought cloaking too when I used to see no cache
That's the one downside of dumping the cache; it's still possible that we'll be helping cloakers.
I suspect not - by now, Google almost certainly have other ways to find them. But while noarchive was once a big red flag, now it could mean all manner of things!
While theoretically google could score on the cache (eg replacing my Adsense with their own), that does not happen, does it?
I don't know. I do know that the scrapers take the content, regurgitate it and then place their AdSense on it. That's enough for me to use noarchive. And, I'm sure there are all sorts of other nifty little things happening in cache that we don't fully understand that someone will help us to understand before this year ends. ;)
> I suspect not - by now, Google almost certainly have other ways to find them.
Yeah, it's called coming in under an IP that hasn't been published. :)
> But while once, noarchive was a big red flag, now it could mean all manner of things!
I used to think so too. But, I don't think it was. We just happened to be somewhere at the time and the message was loud enough to make us think it was a "big red flag". Me thinks it was just one of the "signals" that Google use. There are many.
I really feel that noarchive is the way to go moving forward. At least it will be for us. In addition to that, we're getting ready to ban 75% of the planet anyway so I think we'll be covering most of our bases in regards to the "abuse" that takes place. ;)
> While theoretically google could score on the cache (eg replacing my Adsense with their own), that does not happen, does it?
OK. I'll repeat it...
You click cache, then click BACK to the results: more Google ads (100% of the revenue to Google only). Click cache and BACK again: more ads, just for Google.
They're retaining the visitor at that point, along with all of the revenue, since they aren't sharing it.
If you look closely everything the big SEs do is designed to retain the customer.
You provide news? They provide a news reader.
You provide images? They provide image search.
You provide products for sale? They provide a product search.
You provide web pages? They cache them and provide them all on demand.
Hell, not only that, now they provide free web pages and blogs for you so they have even more control over more content.
I could go on and on as it's a long and growing list.
It's all about visitor retention and the more services they provide, the fewer visitors we'll have eventually.
FYI, I think I mentioned it before, when I did allow cache way back when, I used frame buster code to redirect the visitor to my site and break out of the cache page, assuming javascript was enabled. Using that trick takes the visitor right out of the hands of the SE and delivers them direct to my site no matter which link they click, cache or otherwise.
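A minimal sketch of that frame-buster idea (the function name and "www.example.com" are placeholders, not the poster's actual code). The decision is factored into a small function: if the page is being served from any host other than our own, such as Google's cache, build the URL on our own site to send the visitor to.

```javascript
// Decide where (if anywhere) to redirect a visitor who is viewing this
// page from a foreign host such as a search engine's cache.
function cacheEscapeUrl(currentHostname, ownHostname, path) {
  if (currentHostname === ownHostname) {
    return null; // already on our own site, no redirect needed
  }
  // Served from elsewhere (cache, proxy, frame): send the visitor home.
  return "http://" + ownHostname + path;
}

// In a browser this would run on page load, roughly:
//   var target = cacheEscapeUrl(window.location.hostname,
//                               "www.example.com",
//                               "/some/page.html");
//   if (target) window.top.location.href = target; // break out of the cache page
```

As the post notes, this only works for visitors with javascript enabled; anyone (or any bot) with it disabled still sees the cached copy.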
However, the cache pages were still there to be viewed and scraped for anything/anyone with javascript disabled and the scrapers and spybots finally drove me to NOARCHIVE.
As a compromise for those users wanting the keyword-highlighting functionality that the cache provides, we scripted up some client-side code that mimics it while not being too "in your face", and it can be turned off. It has helped raise our global time on site, which was a nice bonus.
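A hypothetical sketch of that kind of client-side highlighter (function and CSS class names are invented here, not the poster's code): wrap each occurrence of a search term in a styled span, mimicking what the cache's keyword highlighting did.

```javascript
// Wrap each keyword found in the text in a <span> so CSS can tint it.
function highlightKeywords(text, keywords) {
  var result = text;
  for (var i = 0; i < keywords.length; i++) {
    // Escape regex metacharacters in the keyword, then match case-insensitively.
    var safe = keywords[i].replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    var re = new RegExp("(" + safe + ")", "gi");
    result = result.replace(re, '<span class="kw-highlight">$1</span>');
  }
  return result;
}
```

A production version would walk the DOM's text nodes rather than string-replacing over raw markup (so tags and earlier highlights aren't mangled), and would expose the on/off toggle the post describes.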
Recently my website was hacked, and when I typed my website name into Google, it carried several of my "hacked" pages in its cache, even after I had immediately wiped out my entire site and rebuilt it from scratch.
These pages carried code that, if clicked on, could potentially redirect people to the hacker's sites. (I'm not an expert on these things, but as far as I know, this was all it could do... not sure if it could download a trojan to computers or not.)
I thought about emailing Google to tell them about this... then decided to first add "no archive" to all my meta tags, plus 'disallows' in robots.txt... Google updated immediately and the cached "hacked" pages disappeared.
I never had a problem, or thought there 'could' be a problem with caching before that happened.
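For reference, the robots.txt side of that cleanup would look something like the sketch below ("/old-hacked-dir/" is a placeholder path, not from the post). One caveat worth knowing: a Disallow stops crawling entirely, which also means robots meta tags on those pages will never be re-read, so the noarchive meta tag on its own is often the better tool when you want cached copies refreshed or dropped.

```
# robots.txt at the site root -- a sketch with a placeholder path
User-agent: *
Disallow: /old-hacked-dir/
```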
ATTENTION: for everyone that doesn't trust sites that use NOARCHIVE and have no cache, or won't visit if they can't see the cached page keywords highlighted first, did you know that WebmasterWorld uses NOARCHIVE and you can't view the cache ANYWHERE?
Of course WebmasterWorld highlights the keywords on its pages when you click through, but that isn't the point now, is it?
Let me spell it out: THIS THREAD WILL NOT BE CACHED.
EOM
[edited by: incrediBILL at 5:41 am (utc) on July 31, 2008]
> You provide products for sale? They provide a product search.
I understand your examples, but I don't see product search as an applicable example. Froogle &amp; Google Base were never designed to compete with widget sites. It is a free referral service.
Anyone know how to block non HTML files from being archived, besides using robots.txt ?
You probably would want to use noindex instead of noarchive for log files. For Apache servers, you could use the following example that uses the X-Robots-Tag header directive.
<Files ~ "\.tar\.gz$">
Header append X-Robots-Tag "noindex"
</Files>
I also included the following examples that show how to apply X-Robots-Tag directives to multiple file types or file names:
<Files ~ "\.(gif|jpe?g|png)$">
Header append X-Robots-Tag "noindex"
</Files>
<Files ~ "(about|privacy|contact_us)\.html$">
Header append X-Robots-Tag "noindex,nofollow"
</Files>
Looking for info on SQL Injection attacks today I clicked on the cache and got an "accept cookie yes/no" request (via Firefox - I have it set to always ask about cookies). That implies the cache view itself is going to the web site for some reason, which of course would score a hit on the site. I assume it's some kind of "are you still active and virus-free" test. Of course, it may have happened previously and I never noticed. :)
On the other hand, I've just tested the cache on several sites I host and there was no cookie request for those (they're ASP sites, so cookies are unavoidable).
Worth mentioning: I'm not running any kind of Google toolbar or any pre-fetch, so those aren't what's hitting the site.
Another one I only saw recently is a server hack that injects a kind of cloaking script into a page. This is a kind of parasite hosting, designed so that all non-Googlebot requests get redirected to the expected page while Google gets the parasite page. It's a good idea to browse your own site as Googlebot once in a while.
Another possibility is badly configured shared hosting. A related issue is an unpatched vulnerability in CPanel/Vdeck or whatever webmaster interface is in place. You could call this a "bad neighbor" problem.
None of these are strictly speaking a "risk of the Google cache" however. Rather they are issues that the Google cache helps to illuminate.
> 4. 301/302 redirect hijacking (after all this time it STILL happens!)
> 2008-08-17 Brett_Tabke - I don't believe that. We have challenged and challenged anyone to show us an example - and no one has come up with one.
You know, when Brett states something like that, I want to listen. If what I'm seeing in cache is not some form of 301/302 hijack, then what is it? I need to know the proper nomenclature so when these things are found, I can properly label them moving forward. It won't be with my own properties as we are "noarchive" from this point forward. Have been for quite some time.
tedster, I see your list of issues and it just adds to the long list of potential risks associated with SE cache. Why the heck would someone want to replace your cached page with that of their own? One that looks like an MFA and has all the common footprints of one? I don't understand the complete logic there. What value lies in the cache for someone to take the initiative and hijack your cached page? What exactly are they doing with it?!?!?!
I've not been able to get direct answers to those questions for quite some time now. And since I can't, ain't no need for me to try and backtrack it any longer. I don't understand it, so I'm going to do what I can to block everything that "I think" is a potential risk. If I make a mistake and block something by accident, oh well, I'll find out soon enough I'm sure. :)
The goal is googlebot itself, getting THEIR links into the web graph instead of your page's links.
Okay, if that is the case with one example I've seen, I would think the intended goal would be to make Google think the site is cloaking. Since the cached version of the site is completely different, what does that do?
I'm still not 100% sure where the "benefit" is here from a monetary perspective for the cache controller. The way I see it and understand it, this is more of a sabotage effort to make something look like it really isn't.
Unless of course that whole "cache surfing" level is being used for something else, like arbitrage and other ad-metrics-related activities and the distortion thereof?