Forum Moderators: Robert Charlton & goodroi
The reason I ask is because I am coming across more and more stuff related to the cache, and it concerns me. What really concerns me is that I'm able to surf around and see stuff that is supposedly removed and not available to the public, yet I can get to it via the cache. I know that's the purpose of the cache, but in this case the cache is a risk.
Are any of you using noarchive on your pages? I am and have been for almost two years now. All new sites going online have the noarchive directive in most instances. There might be a few pages we leave open for cache but not many.
Am I being too paranoid?
Are you aware of any unusual stuff taking place using Google's cache mechanism? Any scraping? Anything?
That's the extent of my current concern.
tedster, I'm surprised! I think we get too busy working with all the "other challenges" and overlook a few that may be potentially damaging in the long term.
I'd really like to discuss the concepts of Cache Surfing, Corporate Intelligence, Personal Snooping, etc. That cache "looks" like poison to me and I'm using noarchive again on almost everything. I feel safer for some reason and I need for my peers to tell me why I'm feeling that way. ;)
That cache does not serve my site visitors at all, none whatsoever. That would be my main reason for not allowing content to be cached: they don't click those cache links.
Now, if the above is the case, who then is using that cache and for what purpose? How was I able to gain access to some stuff that I should not have had access to through cache? This was stuff that should have been behind a login. Maybe a potential vulnerability within the site that was cached?
I have all sorts of questions about this because I believe that Google cache is poison for websites. It is being scraped, regurgitated, redirected, cloaked, you name it. So why would I as a website owner want to allow that to happen?
Also, what happens if you have a bad data push and don't realize it until it is too late? You surely don't want that being cached, do you? I need some help further understanding this. I'm traveling into the abyss, I know...
I've come across some information that shows me exactly what Google cache "may" be used for. Heck, there are tools all over the place to surf the cache. Something ain't right there. And, I think the best suggestion is to add this to all pages if you just want to "feel safe" from it all. ;)
<meta name="googlebot" content="noarchive">

The robots term of noarchive will produce the following effect: "Google maintains a cache of all the documents that we fetch, to permit our users to access the content that we indexed (in the event that the original host of the content is inaccessible, or the content has changed). If you do not wish us to archive a document from your site, you can place this tag in the head of the document, and Google will not provide an archive copy for the document."
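For anyone unsure where that tag actually lives, a minimal sketch of page placement (the title is a placeholder, and the second tag is the engine-neutral variant for those who want to opt out of every archive link, not just Google's):

```html
<html>
<head>
  <title>Example page</title>
  <!-- Tell Googlebot specifically not to keep a cached copy -->
  <meta name="googlebot" content="noarchive">
  <!-- Or address all compliant crawlers at once -->
  <meta name="robots" content="noarchive">
</head>
<body>
  ...
</body>
</html>
```

The tag must appear in the head of each page you want kept out of the cache; it has no effect anywhere else in the document.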
Would be good if people fully understood that if they aren't careful an entire 'uploads' folder could be vulnerable.
Frankly, when I see pages not cached my first thought is that they're cloaking.
I used to think that too! And then I started doing some Cache Surfing.
Also, for some reason, the sites where noarchive is present seem to have less "abusive" activity than those that don't have it. That is just "my personal observation".
Run, Forrest, run!
Google, you'll have to accept my apology but I don't think that cache is good for some websites, particularly ecommerce and/or affiliate type sites. Nope, there's definitely something going on back there and I don't want to backtrack it. So, my best option right now is to...
<meta name="googlebot" content="noarchive"> I understand there are other ways (via robots.txt) to do this but for the most part, the above serves everyone.
It's been 8 months and I haven't noticed a change in traffic or rankings.
[edited by: classifieds at 11:59 pm (utc) on July 25, 2008]
If you have a lot of static pages and don't have the time to edit them all, a nifty solution for Apache servers is to use the X-Robots-Tag directive in the server headers. Example:
<Files ~ "\.html$">
Header append X-Robots-Tag "noarchive"
</Files>
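Worth noting that the header approach also covers what the meta tag can't: non-HTML files like the PDFs and images sitting in that 'uploads' folder mentioned earlier. A sketch, with the directory path being an assumption for illustration:

```apache
# Requires mod_headers, same as the <Files> example above.
# Meta tags can't be added to PDFs or images, so cover the whole
# directory with the same response header instead:
<Directory "/var/www/example/uploads">
    Header set X-Robots-Tag "noarchive"
</Directory>
```

Crawlers that honor the meta tag honor the equivalent X-Robots-Tag header, so this is one line of config instead of editing files you can't edit.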
But...as a user...I use Google Cache frequently, particularly on blogs that scroll endlessly. Google Cache IS a method that people can use to access and interface with your site. And as far as I'm concerned, the more ways people can interact with and access my site, the better.
risks:
1- liability (especially if you allow user-added content). It is bad enough you have to take it off your site, but having to go futz with Google as well is twice the burden.
2- price lists.
3- unknown legalities.
4- To prevent scraper sites from ripping your site out of Google en masse. A Chinese set of scrapers took billions of pages out of MSN and Yahoo a couple of years ago via the "cached" pages. They then distributed those pages and killed rankings for a lot of people for several months. Remember the "redirect" issues Google had? A lot of that was tweaked out by people scraping Google's copy of your site.
5- why would you allow someone else to make money off your page without taking your cut!?
Yep...this is excellent:
<Files ~ "\.html$">
Header append X-Robots-Tag "noarchive"
</Files>
> One of the steps in their process was to
> compare the cache to the page
Their copy of your site is unstoppable and has nothing to do with the "cache" page they display on your site's listings. When they surf the internal cache for hand-check reasons, it is a pure copy of your site coming off their disks, exclusively within their network(s).
> why they changed
Let me be blunt - Google is probably too scared to talk about this in public. They never have in their 10-year history - I doubt they will now. The only talking they are doing about "cached" pages is in Court with Viacom.
I really don't care from the standpoint of Google good/bad/evil/demon/saint. I care from the standpoint of Webmasters and site owners. Few people - if anyone - are looking out for us on this issue. We are flying blind and our sites are being used without permission. Only by putting some nonstandard code on your site are you able to stop it. The burden shouldn't be on us to stop them - it should be on them to ask for permission to use it. Putting up code to stop it is like saying that if you don't have a sign in your front yard saying "no stealing", then it is ok to come and take your stuff.
Think about it this way: they don't let people "cache" their search results. They republish the entire internet and yet won't let anyone on the internet republish their site?
related
[webmasterworld.com...]
Plus I've nothing to hide, and on more than one occasion I've rescued my own mistakes from Google's cache.
So while some people may have a problem - some for interesting reasons ;) - I find more plusses than negs.
And if it gets up the noses of the civil liberties brigade, it gets my vote! :)
FWIW, I'm broadly a Google fan, with reservations, YMMV.
[edited by: Quadrille at 2:30 am (utc) on July 28, 2008]
I find it to be an invaluable tool, a poisonous snake that can disable enemies, while avoiding the mongoose guarding the mistress' house.
It has nothing to do with whether or not you have anything to hide but it has everything to do with stopping people from taking things they shouldn't without permission.
Brett pretty much covered it above, although I could add a few more items. It's been covered over and over and over, such as in these threads:
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
It will be addressed again @ PubCon this year so the rest of my answers will be there if you attend - hint hint ;)
[edited by: incrediBILL at 6:49 am (utc) on July 28, 2008]
I guess that this quick evaluation could be a huge downside to spammers... nevertheless, when the cache link is not available, I do not waste my time with such sites...
when the cache link is not available, I do not waste my time with such sites
Explain why clicking the link to the site is harder than clicking the link to the cache?
Even when I did allow Google to cache my site I used javascript to redirect to the live page so you wouldn't have seen the cache either way unless you had javascript disabled or NoScript enabled in FF.
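That redirect trick can be as simple as a small per-page script in the document head. A sketch, where the hostname and page URL are placeholders - each page would carry its own live address:

```html
<script type="text/javascript">
// If this copy isn't being served from our own host (e.g. it's
// being viewed through Google's cache), bounce the visitor to
// the live page instead.
// "www.example.com" and the path below are illustrative only.
if (window.location.hostname !== "www.example.com") {
  window.location.replace("http://www.example.com/this-page.html");
}
</script>
```

As noted above, anyone with JavaScript disabled (or NoScript running) still sees the cached copy, so this is a nudge back to the live site rather than real protection.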