Forum Moderators: Robert Charlton & goodroi
The reason I ask is because I am coming across more and more stuff related to cache and it concerns me. What really concerns me is that I'm able to surf around and see stuff that has supposedly been removed and isn't available to the public, yet I can get to it via the cache. I know, that is the purpose of the cache. But in this case, the cache is a risk.
Are any of you using noarchive on your pages? I am and have been for almost two years now. All new sites going online have the noarchive directive in most instances. There might be a few pages we leave open for cache but not many.
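For reference, the noarchive directive being discussed here is just a robots meta tag in each page's &lt;head&gt;. Both of these forms are honored; the second form applies only to Google:

```
<meta name="robots" content="noarchive">
<meta name="googlebot" content="noarchive">
```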
Am I being too paranoid?
Are you aware of any unusual stuff taking place using Google's cache mechanism? Any scraping? Anything?
I always thought cloaking too when I used to see no cache, but in many cases I think it's worthwhile to keep pages from being cached by the engines regardless of how they are delivered.
> I always thought cloaking too when I used to see no cache
I was in the same boat as everyone else. I sure wish I had seen the forest for the trees and done the noarchive thing years ago.
I know that the initial test site where we added it has made some improvements over time. Who knows if it was the addition of the noarchive directive? I do know that the 404s and other related errors for that site dropped considerably not long after adding it.
If you want to visit our sites, use the link from the SERPs like most others do. If our site is down, which it shouldn't be, then please do come back later. The information in the cache was probably out of date anyway.
And, if you knew of an exact phrase that appeared on the page you were looking for on that site, just search Google for it and you can choose any one of the "scraped" versions and click on our AdWords listing in their blended AdSense implementation. :(
> I always thought cloaking too when I used to see no cache
That's the one downside of dumping the cache; it's still possible that we'll be helping cloakers.
I suspect not - by now, Google almost certainly have other ways to find them. But while noarchive was once a big red flag, now it could mean all manner of things!
While theoretically google could score on the cache (eg replacing my Adsense with their own), that does not happen, does it?
I don't know. I do know that the scrapers take the content, regurgitate it and then place their AdSense on it. That's enough for me to use noarchive. And, I'm sure there are all sorts of other nifty little things happening in cache that we don't fully understand that someone will help us to understand before this year ends. ;)
> I suspect not - by now, Google almost certainly have other ways to find them.
Yeah, it's called coming in under an IP that hasn't been published. :)
> But while once, noarchive was a big red flag, now it could mean all manner of things!
I used to think so too. But, I don't think it was. We just happened to be somewhere at the time and the message was loud enough to make us think it was a "big red flag". Me thinks it was just one of the "signals" that Google use. There are many.
I really feel that noarchive is the way to go moving forward. At least it will be for us. In addition to that, we're getting ready to ban 75% of the planet anyway so I think we'll be covering most of our bases in regards to the "abuse" that takes place. ;)
> While theoretically google could score on the cache (eg replacing my Adsense with their own), that does not happen, does it?
OK. I'll repeat it...
You click cache, then click BACK to the results: more Google ads (100% of the revenue to Google only). Click cache and BACK again: more ads, just for Google.
They're retaining the visitor at that point, along with all of the revenue, since they aren't sharing it.
If you look closely everything the big SEs do is designed to retain the customer.
You provide news? They provide a news reader.
You provide images? They provide image search.
You provide products for sale? They provide a product search.
You provide web pages? They cache them and provide them all on demand.
Hell, not only that, now they provide free web pages and blogs for you so they have even more control over more content.
I could go on and on as it's a long and growing list.
It's all about visitor retention and the more services they provide, the fewer visitors we'll have eventually.
FYI, I think I mentioned it before, when I did allow cache way back when, I used frame buster code to redirect the visitor to my site and break out of the cache page, assuming javascript was enabled. Using that trick takes the visitor right out of the hands of the SE and delivers them direct to my site no matter which link they click, cache or otherwise.
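A minimal sketch of that frame-buster idea (the function name and "www.example.com" are placeholders, not the poster's actual code). The decision is factored into a small function: if the page is being served from any host other than our own, such as Google's cache, build the URL on our own site to send the visitor to.

```javascript
// Decide where (if anywhere) to redirect a visitor who is viewing this
// page from a foreign host such as a search engine's cache.
function cacheEscapeUrl(currentHostname, ownHostname, path) {
  if (currentHostname === ownHostname) {
    return null; // already on our own site, no redirect needed
  }
  // Served from elsewhere (cache, proxy, frame): send the visitor home.
  return "http://" + ownHostname + path;
}

// In a browser this would run on page load, roughly:
//   var target = cacheEscapeUrl(window.location.hostname,
//                               "www.example.com",
//                               "/some/page.html");
//   if (target) window.top.location.href = target; // break out of the cache page
```

As the post notes, this only works for visitors with javascript enabled; anyone (or any bot) with it disabled still sees the cached copy.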
However, the cache pages were still there to be viewed and scraped for anything/anyone with javascript disabled and the scrapers and spybots finally drove me to NOARCHIVE.
As a compromise for those users wanting the keyword-highlighting functionality that the cache provides, we scripted up some client-side code that mimics it while not being too "in your face", and it can be turned off. It has helped raise our global time on site, which was a nice bonus.
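A hypothetical sketch of that kind of client-side highlighter (function and CSS class names are invented here, not the poster's code): wrap each occurrence of a search term in a styled span, mimicking what the cache's keyword highlighting did.

```javascript
// Wrap each keyword found in the text in a <span> so CSS can tint it.
function highlightKeywords(text, keywords) {
  var result = text;
  for (var i = 0; i < keywords.length; i++) {
    // Escape regex metacharacters in the keyword, then match case-insensitively.
    var safe = keywords[i].replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    var re = new RegExp("(" + safe + ")", "gi");
    result = result.replace(re, '<span class="kw-highlight">$1</span>');
  }
  return result;
}
```

A production version would walk the DOM's text nodes rather than string-replacing over raw markup (so tags and earlier highlights aren't mangled), and would expose the on/off toggle the post describes.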
Recently my website was hacked, and when I typed my website name into Google, it carried several of my "hacked" pages in its cache, even after I had immediately wiped out my entire site and rebuilt it from scratch.
These pages carried code that, if clicked on, could potentially redirect people to the hacker's sites. (I'm not an expert on these things, but as far as I know, this was all it could do... not sure if it could download a trojan to computers or not.)
I thought about emailing Google to tell them about this... then decided to first add "no archive" to all my meta tags, plus 'disallows' in robots.txt... Google updated immediately and the cached "hacked" pages disappeared.
I never had a problem, or thought there 'could' be a problem with caching before that happened.
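For reference, the robots.txt side of that cleanup would look something like the sketch below ("/old-hacked-dir/" is a placeholder path, not from the post). One caveat worth knowing: a Disallow stops crawling entirely, which also means robots meta tags on those pages will never be re-read, so the noarchive meta tag on its own is often the better tool when you want cached copies refreshed or dropped.

```
# robots.txt at the site root -- a sketch with a placeholder path
User-agent: *
Disallow: /old-hacked-dir/
```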
ATTENTION: for everyone that doesn't trust sites that use NOARCHIVE and have no cache, or won't visit if they can't see the cached page keywords highlighted first, did you know that WebmasterWorld uses NOARCHIVE and you can't view the cache ANYWHERE?
Of course WebmasterWorld highlights the keywords on its pages when you click through, but that isn't the point now, is it?
Let me spell it out: THIS THREAD WILL NOT BE CACHED.
EOM
[edited by: incrediBILL at 5:41 am (utc) on July 31, 2008]
> You provide products for sale? They provide a product search.
I understand your examples, but I don't see product search as an applicable example. Froogle &amp; Google Base were never designed to compete with widget sites. It is a free referral service.
Anyone know how to block non HTML files from being archived, besides using robots.txt ?
You probably would want to use noindex instead of noarchive for log files. For Apache servers, you could use the following example that uses the X-Robots-Tag header directive.
<Files ~ "\.tar\.gz$">
Header append X-Robots-Tag "noindex"
</Files>
I also included the following examples that show how to apply X-Robots-Tag directives to multiple file types or file names:
<Files ~ "\.(gif|jpe?g|png)$">
Header append X-Robots-Tag "noindex"
</Files>
<Files ~ "(about|privacy|contact_us)\.html$">
Header append X-Robots-Tag "noindex,nofollow"
</Files>
Looking for info on SQL Injection attacks today I clicked on the cache and got an "accept cookie yes/no" request (via Firefox - I have it set to always ask about cookies). That implies the cache view itself is going to the web site for some reason, which of course would score a hit on the site. I assume it's some kind of "are you still active and virus-free" test. Of course, it may have happened previously and I never noticed. :)
On the other hand, I've just tested the cache on several sites I host and there was no cookie request for those (they're ASP sites, so cookies are unavoidable).
Worth mentioning: I'm not running any kind of Google toolbar or any pre-fetch, so those aren't what's hitting the site.
Another one I only saw recently is a server hack that injects a kind of cloaking script into a page. This is a kind of parasite hosting, designed so that all non-Googlebot requests get redirected to the expected page while Google gets the parasite page. It's a good idea to browse your own site as Googlebot once in a while.
Another possibility is badly configured shared hosting. A related issue is an unpatched vulnerability in CPanel/Vdeck or whatever webmaster interface is in place. You could call this a "bad neighbor" problem.
None of these are strictly speaking a "risk of the Google cache" however. Rather they are issues that the Google cache helps to illuminate.
> 4. 301/302 redirect hijacking (after all this time it STILL happens!)
> 2008-08-17 Brett_Tabke - I don't believe that. We have challenged and challenged anyone to show us an example - and no one has come up with one.
You know, when Brett states something like that, I want to listen. If what I'm seeing in cache is not some form of 301/302 hijack, then what is it? I need to know the proper nomenclature so when these things are found, I can properly label them moving forward. It won't be with my own properties as we are "noarchive" from this point forward. Have been for quite some time.
tedster, I see your list of issues and it just adds to the long list of potential risks associated with SE cache. Why the heck would someone want to replace your cached page with that of their own? One that looks like an MFA and has all the common footprints of one? I don't understand the complete logic there. What value lies in the cache for someone to take the initiative and hijack your cached page? What exactly are they doing with it?!?!?!
I've not been able to get direct answers to those questions for quite some time now. And since I can't, ain't no need for me to try and backtrack it any longer. I don't understand it, so I'm going to do what I can to block everything that "I think" is a potential risk. If I make a mistake and block something by accident, oh well, I'll find out soon enough I'm sure. :)
The goal is googlebot itself, getting THEIR links into the web graph instead of your page's links.
Okay, if that is the case with one example I've seen, I would think the intended goal would be to make Google think the site is cloaking. Since the cached version of the site is completely different, what does that do?
I'm still not 100% sure where the "benefit" is here from a monetary perspective for the cache controller. The way I see it and understand it, this is more of a sabotage effort to make something look like it really isn't.
Unless of course that whole "cache surfing" level is being used for something else, like arbitrage and other ad-metrics-related activities and the distortion thereof?