I usually noarchive pages with sale prices on them - that's the extent of my current concern.
|That's the extent of my current concern. |
tedster, I'm surprised! I think we get too busy working with all the "other challenges" and overlook a few that may be potentially damaging in the long term.
I'd really like to discuss the concepts of Cache Surfing, Corporate Intelligence, Personal Snooping, etc. That cache "looks" like poison to me and I'm using noarchive again on almost everything. I feel safer for some reason and I need for my peers to tell me why I'm feeling that way. ;)
That cache does not serve my site visitor at all, none whatsoever. That would be my main reason for not allowing content to be cached, they don't click those cache links.
Now, if the above is the case, who then is using that cache and for what purpose? How was I able to gain access to some stuff that I should not have had access to through cache? This was stuff that should have been behind a login. Maybe a potential vulnerability within the site that was cached?
I have all sorts of questions about this because I believe that Google cache is poison for websites. It is being scraped, regurgitated, redirected, cloaked, you name it. So why would I as a website owner want to allow that to happen?
Also, what happens if you have a bad data push and don't realize it until it's too late? You surely don't want that being cached, do you? I need some help further understanding this. I'm traveling into the abyss, I know...
I too want to hear more on this as Brett suggested it as well.
[webmasterworld.com...] I can see some possible issues but need to further understand the negative side of cached pages, other than just bad data showing.
I'm not too certain we are going to see much "in-depth" discussion on the potential risks of Google cache, or any cache for that matter - this doesn't just apply to Google. That would be giving away a bit too much information.
I've come across some information that shows me exactly what Google cache "may" be used for. Heck, there are tools all over the place to surf the cache. Something ain't right there. And, I think the best suggestion is to add this to all pages if you just want to "feel safe" from it all. ;)
<meta name="googlebot" content="noarchive">
|The robots term of noarchive will produce the following effect; Google maintains a cache of all the documents that we fetch, to permit our users to access the content that we indexed (in the event that the original host of the content is inaccessible, or the content has changed). If you do not wish us to archive a document from your site, you can place this tag in the head of the document, and Google will not provide an archive copy for the document. |
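Per the quoted documentation, the tag belongs in the head of the document. A minimal illustration of the placement (the page content here is just a placeholder):

```html
<html>
  <head>
    <title>Example page</title>
    <!-- tells Googlebot not to offer a "Cached" copy of this page -->
    <meta name="googlebot" content="noarchive">
  </head>
  <body>Page content here</body>
</html>
```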
maybe Brett can have this brought up and discussed then at the pubcon
Funny this is posted today; we've just decided we're going to put 'noarchive' on one of the larger ecommerce sites that we built and maintain for a client - mainly because there's really no reason for any of it to be cached, and there's a lot of garbage urls out there that have accumulated over the years (orphaned pages, old tracking url pages, etc) that we'd like to prevent from getting into the serps from here on out. We picked the site carefully as it's an experiment; if it works out, we'll be spreading it across more.
This is interesting, I'm getting the Google We're Sorry...error message when I click on 'cached'. By reading the message I am not using an automated request and I don't have a virus or spyware. Anyone getting the same error message?
I discovered a huge vulnerability in a site recently - Google has cached 2000+ business documents (pdfs, docs etc) and they are all available via its cache: mechanism. I contacted the site owner saying their customers were at risk, but apparently the chances of someone finding these are quite low? (talking business plans, projections, franchise deals etc)
Would be good if people fully understood that if they aren't careful an entire 'uploads' folder could be vulnerable..
I find the cache very useful for a couple of important reasons, so I choose to let pages be cached. I'm not worried about scraping, as soon as a page ranks at another engine the site is scraped to death anyway.
Frankly, when I see pages not cached my first thought is that they're cloaking.
|Frankly, when I see pages not cached my first thought is that they're cloaking. |
I used to think that too! And then I started doing some Cache Surfing.
Also, for some reason, the sites where noarchive is present seem to have less "abusive" activity than those that don't have it. That is just "my personal observation".
Run Forest, run!
Google, you'll have to accept my apology but I don't think that cache is good for some websites, particularly ecommerce and/or affiliate type sites. Nope, there's definitely something going on back there and I don't want to backtrack it. So, my best option right now is to...
<meta name="googlebot" content="noarchive">
I understand there are other ways (via robots.txt) to do this but for the most part, the above serves everyone.
Is there any evidence to suggest that noarchive affects SERPS in any way? Presumably google still indexes pages in the same fashion even with the noarchive.
Is there any risk in using the noarchive?
I used noarchive/nocache up until I read about Google's use of humans to review sites. One of the steps in their process was to compare the cache to the page for verifying cloaking/no-cloaking.
It's been 8 months and I haven't noticed a change in traffic or rankings.
[edited by: classifieds at 11:59 pm (utc) on July 25, 2008]
ken_b - sites with premium/paid membership (e.g. not showing their paid content via cache: ) to search engines would be unfairly penalised were that the case.
Pages using noarchive should display the same rankings/results other than the cache: operator not being available.
I've been using noarchive on all of my sites ever since it was introduced and have never been penalized for it.
If you have a lot of static pages and don't have the time to edit them all, a nifty solution for Apache servers is to use the X-Robots-Tag directive in the server headers. Example:
<Files ~ "\.html$">
Header append X-Robots-Tag "noarchive"
</Files>
Key_Master that's a great tip, cheers for posting
You're welcome and I should add that bit of code goes in the .htaccess file.
And you can apply the X-Robots-Tag directive to PHP, CF, or any other scripting language capable of modifying the server headers.
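The same header can be emitted from application code rather than .htaccess. A minimal, hypothetical sketch as a Python WSGI app is below; in PHP the equivalent would be a `header('X-Robots-Tag: noarchive');` call before any output. This is just an illustration, not anyone's production setup:

```python
def app(environ, start_response):
    """Minimal WSGI app that marks every response noarchive."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # Ask search engines not to keep a cached copy of this response
        ("X-Robots-Tag", "noarchive"),
    ]
    start_response("200 OK", headers)
    return [b"<html><body>Page content here</body></html>"]
```

Because the header is set per-response, dynamic sites can apply it selectively - for example, only on pages showing sale prices or member content.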
[edited by: Key_Master at 12:15 am (utc) on July 26, 2008]
I like the fact that Google can/will still serve up my pages even if my server goes down. That reason alone is enough for me to keep allowing G to cache my pages. They are providing a valuable service to us webmasters.
[please pass me my google brown nosing award now... ;-)]
I have already had trouble with pages scraping USENET feeds and pages still being available even after I delete them.
I can understand why a pure ecommerce site might find Google Cache problematic at times...as it might show goods that are out-of-stock or show services no longer offered.
But...as a user...I use Google Cache frequently, particularly on blogs that scroll endlessly. Google Cache IS a method that people can use to access and interface with your site. And as far as I'm concerned, the more ways people can interact with and access my site, the better.
I also find it useful for gaining access to sites that are blocked on certain networks (shhh!) I only want to see the text anyway.
I would use this instead: <meta name="robots" content="noarchive">
1- liability (especially if you allow user added content). It is bad enough you have to take it off your site, but to have to go futz with google as well is twice the burden.
2- price lists.
3- unknown legalities.
4- To prevent scraper sites from ripping your site out of Google en masse. A Chinese set of scrapers took billions of pages out of MSN and Yahoo a couple years ago via the "cached" pages. They then distributed those pages and killed rankings for a lot of people for several months. Remember the "Redirect" issues Google had? A lot of that was traced back to people scraping Google's copy of your site.
5- why would you allow someone else to make money off your page without taking your cut!?
Yep...this is excellent:
<Files ~ "\.html$">
Header append X-Robots-Tag "noarchive"
</Files>
> One of the steps in their process was to
> compare the cache to the page
Their copy of your site is unstoppable and has nothing to do with the "cache" moniker page they display on your sites listings. When they surf the internal cache for hand check reasons, it is a pure copy of your site coming off their disks, exclusively within their network(s).
> why they changed
Let me be blunt - Google is probably too scared to talk about this in public. They never have in their 10 year history - I doubt they will now. The only talking they are doing about "cached" pages is in Court with Viacom.
I really don't care from the standpoint of Google good/bad/evil/demon/saint. I care from the standpoint of webmasters and site owners. Few people - if anyone - are looking out for us on this issue. We are flying blind and our sites are being used without permission. Only by putting some nonstandard code on your site are you able to stop it. The burden shouldn't be on us to stop them - it should be on them to ask for permission to use it. Putting up code to stop it is like saying that if you don't have a sign in your front yard saying "no stealing", then it is ok to come and take your stuff.
Think about it this way: they don't let people "cache" their search results. They republish the entire internet and yet won't let anyone on the internet republish their site?
I find the cache most useful for seeing what Google is doing; it says much more about that than what I'm doing!
Plus I've nothing to hide, and on more than one occasion I've rescued my own mistakes from Google's cache.
So while some people may have a problem - some for interesting reasons ;) - I find more plusses than negs.
And if gets up the noses of the civil liberties brigade, it gets my vote! :)
FWIW, I'm broadly a Google fan, with reservations, YMMV.
[edited by: Quadrille at 2:30 am (utc) on July 28, 2008]
Correct me if I am wrong, but that htaccess header command will require "mod_headers" to be enabled. It's not that common - in fact I don't think most cPanel servers have it by default.
Knowing what it can do for you, tells you what it can do to you. When you learn to use it, after understanding that, it is quite impossible to think of the web without it.
I find it to be an invaluable tool, a poisonous snake that can disable enemies, while avoiding the mongoose guarding the mistress' house.
Come on P1R, have you not read my many posts on WebmasterWorld ranting about the evils of cache?
It has nothing to do with whether or not you have anything to hide but it has everything to do with stopping people from taking things they shouldn't without permission.
Brett pretty much covered it above although I could add a few more items, it's been covered over and over and over such as these reasons:
It will be addressed again @ PubCon this year so the rest of my answers will be there if you attend - hint hint ;)
[edited by: incrediBILL at 6:49 am (utc) on July 28, 2008]
For me the cache is quite useful to easily see the keywords I'm searching for in the text and quickly evaluate whether it is worth taking some time to read the page...
I guess that this quick evaluation could be a huge downside for spammers... nevertheless, when the cache link is not available, I do not lose my time with such sites...
My personal rule of thumb is noarchive for any page that is liable to be updated at least once a week.
|when the cache link is not available, I do not lose my time with such sites |
Explain why clicking the link to the site is harder than clicking the link to the cache?