Forum Moderators: Robert Charlton & goodroi
The reason I ask is that I keep coming across more and more cache-related stuff, and it concerns me. What really concerns me is that I'm able to surf around and see content that has supposedly been removed and is no longer available to the public, yet I can still get to it via the cache. I know that's the purpose of the cache, but in this case, the cache is a risk.
Are any of you using noarchive on your pages? I am and have been for almost two years now. All new sites going online have the noarchive directive in most instances. There might be a few pages we leave open for cache but not many.
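For anyone reading along who hasn't used it: noarchive is just a robots meta tag in the page head. A minimal example (the googlebot variant targets Google alone; both are standard directives):

```html
<!-- Ask all compliant engines to index the page but not publish a cached copy -->
<meta name="robots" content="noarchive">

<!-- Or target only Google's crawler -->
<meta name="googlebot" content="noarchive">
```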
Am I being too paranoid?
Are you aware of any unusual stuff taking place using Google's cache mechanism? Any scraping? Anything?
That's not the point. The point is that if I click the link to your site, I can't see the keywords I'm searching for highlighted with a colored background... so it's impossible to quickly evaluate my keywords in context and see if it's worth reading more, unless that's precisely your purpose.
The point is that if I click the link to your site, I can't see the keywords I'm searching for highlighted with a colored background
So you never go to the actual website and just read cache?
Well, I'd get over that concept as more and more sites are opting out of cache.
For starters, it goes way beyond fair-use copyright, as Google republishes entire websites. People are finally catching on that they're losing traffic to Google itself from people doing exactly what you're doing, which is another reason we're adding NOARCHIVE to our sites.
Google republishes entire websites. People are finally catching on that they're losing traffic to Google itself from people doing exactly what you're doing, which is another reason we're adding NOARCHIVE to our sites.
The NoArchive Initiative
I'll be back later today with responses to much of what has been posted. I have a lot of reading to do before I opine...
Thanks, WebmasterWorld, for making this Front Page! :)
A Chinese set of scrapers took billions of pages out of MSN and Yahoo a couple of years ago via the "cached" pages.
How do we know they scraped via the cache, and how is it that the cache increases the scraper risk?
Non technical explanation if possible! :)
How do we know they scraped via the cache
No footprints in your logfiles.
...and how is it that the cache increases the scraper risk?
You can detect and block scrapers running directly against your site, but you can neither detect nor block them when they crawl cached pages, because they're crawling Google's servers, not yours.
How do we know they scraped via the cache, and how is it that the cache increases the scraper risk?
The cache doesn't increase the risk; it just provides alternative locations for accessing your content once you've blocked scrapers from your own site. If you don't use NOARCHIVE to stop search engines from republishing your entire site, and block ARCHIVE.ORG from crawling and indexing it into the Internet Archive, you will never be able to stop the AdSense-funded underbelly of the internet from taking your content and using it to make money, often at your own expense.
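For the ARCHIVE.ORG part, blocking is normally done in robots.txt; ia_archiver is the user-agent the Internet Archive's crawler has historically honored:

```
# robots.txt at the site root
User-agent: ia_archiver
Disallow: /
```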
Most people couldn't tell where the scraping occurs but I cloak small data fragments into my web pages that identify the source of the page request for tracking purposes. That way I know, assuming the tracking bug survives, whether they scraped it off my site or elsewhere.
Also, some scrapers and spybots attempt to access graphics on the page, so if you run any type of graphics server, such as an ad banner server or an embedded traffic tracker, they sometimes download these and you can see that the origin of the page access is Google's IP instead of your server.
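For anyone who wants to try the tracking-bug idea, here's a minimal sketch of one way to do it. This is an illustration only, not the actual scheme described above: the secret, the token format, and the function name are all made up for the example.

```python
import hashlib
import time

# Hypothetical secret key; keep it private so tokens can't be forged or stripped.
SECRET = "replace-with-a-private-key"

def tracking_bug(client_ip: str) -> str:
    """Return an innocuous-looking HTML comment that fingerprints this request.

    If the fragment later turns up in a scraped copy of the page, the token
    tells you which request originally served it, and therefore whether the
    scrape came from your own server or from a cached copy elsewhere.
    """
    stamp = str(int(time.time()))
    digest = hashlib.sha256(f"{SECRET}:{client_ip}:{stamp}".encode()).hexdigest()
    return f"<!-- build:{digest[:16]}.{stamp} -->"
```

You'd emit the returned comment somewhere in each served page and log the token alongside the request details, then search scraped copies for the `build:` marker.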
Next question: I can see several reasons why SEs might want to maintain a cached copy for their own convenience; what I've never understood is why this is published at all.
Has any SE ever offered a reason, or is it just some tradition from the early days that has never been reversed?
Or is it a sop to civil liberties "here, matey, you can see what we're caching, now stop moaning"?
what I've never understood is why this is published at all
Imagine if their own ads were all over those pages. Maybe that's their long-term end game: once people finally get used to the idea that they have no control over their own content, or someone eventually files suit about fair use and loses, it's game over.
Why publish it? That's obvious: to keep the visitor on the search engine and not on YOUR site.
Read some of the earlier posts in this thread and you'll see comments about people reading the cache page to see where the keywords are, yada yada. So, in effect, the SE has already locked people into its site, with more opportunity to push ads in their faces.
NOARCHIVE is just a small way to take back control from the SEs and avoid the situation getting worse as it certainly won't get any better.
Now if we could just get large shared hosts to put "Header append X-Robots-Tag "noarchive"" as part of their standard server configuration the SEs would lose a significant amount of cached pages in just a few days.
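For Apache, that's one line in the server config or a .htaccess file, assuming mod_headers is enabled (on Debian/Ubuntu, a2enmod headers):

```apache
# Send "X-Robots-Tag: noarchive" on every response,
# the HTTP-header equivalent of the noarchive meta tag
Header append X-Robots-Tag "noarchive"
```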
But they had published caches before ads figured so much ... and despite the number of 'insiders' who view and use the cache, do searchers do so in the wider world?
Most people I know wouldn't know what a cache was, and have probably never clicked on one in their life.
Don't get me wrong; I've read the earlier posts very carefully, and it's certainly made me think about my position, but most of the issues raised just didn't apply back in the day.
Now if we could just get large shared hosts to put "Header append X-Robots-Tag "noarchive"" as part of their standard server configuration the SEs would lose a significant amount of cached pages in just a few days.
Certainly a choice that should be offered.
Isn't Google losing money because of the cache? When I see my cached pages with AdSense, it looks like they aren't displaying the correct ads, so Google must be losing money too.
Think branding and customer retention.
The cache page is just a gimmick that holds the customer on the site.
Even if AdSense isn't working well in the cache, that's bad for you, not Google. When the customer hits BACK, they see, and possibly click on, the Google ads, and Google keeps 100% of the earnings from its own page without sharing them with your site.
So they get 100% of the ad money and hold the customer on their site longer instead of releasing the customer to your site.
Get the picture?
Anyone using cache is win-win for Google and lose-lose for webmasters.
Just say NOARCHIVE and take back your customers.
As you might guess, that's not the rule. If I'm convinced the content is worth reading, of course I follow the links on that site, read more, bookmark it, and talk about it on the forums I visit. I think you understand it would be a big hassle to go back to Google to get the other pages' keywords highlighted...
I admit that if your site doesn't have content worth visiting, it is better to use NOARCHIVE; I would also benefit by not landing on your site. Please do.
Competitors *do* use the cached copy sometimes to try to derive how pages are ranking for particular keywords. Viewing the cache can show whether keywords are found in links pointing to the page, for instance.
If you wanted to keep some of your search optimization work just slightly more obscured, keep the pages from getting cached. It doesn't always mean you're cloaking.
Overall, I think to cache or not to cache decisions should be made on a case by case basis.
I think the discussion should be mostly towards unwanted intrusion from competition and scrapers.
I have made it a point, when I visit someone, to look at what toolbars they have installed, and I can say that over 95% (maybe higher) don't have a toolbar installed at all, and of those who do, 99.9999% will never go into the options and enable this feature.
So I feel that if you don't want your page cached, it's mainly to keep unwanted competition from viewing your work highlighted by the terms they search, and/or to keep scrapers from getting it without your knowledge.
Using the cache to keep a normal user on the SE is, in my opinion, not a really valid point, but blocking bots and unwanted eyes is a very valid one.
Mind you, I'd guess that Google has identified other markers by now.
I'd agree that for most of us, security is the issue, rather than civil liberties or 'sticking it to Google'
... the fact that [plaintiff] Field, who had admitted he knew how to disable the caching feature, had not done so created "an implied licence in favour of Google", said Judge Jones.
I also think cache blocking is great for controversial/free speech sites, because opposing counsel won't necessarily find questionable content. I think I read somewhere a while back that Slashdot users use the cache more than anyone. It's a geek thing.
p/g
I'd agree that for most of us, security is the issue, rather than civil liberties or 'sticking it to Google'
For me it's also a matter of copyright issues.
Most people use the DMCA to stop others from stealing their content yet allow Google to freely republish their entire website verbatim as individual cache pages without repercussion.
Since when did protecting your copyright become OPT-OUT, where you have to use NOARCHIVE to stop cache pages, instead of opt-in?
Using the cache to keep a normal user on the SE is in my option not a real valid point
If it wasn't a valid point, then why do all the SEs have a cache page?
If it had no value, they wouldn't all be doing it, because the masses of people using the SEs aren't SEOs or webmasters examining pages for keywords and signs of cloaking.
I'm pretty sure it is useful for customer retention for people like paulmizen who want the keywords on the page highlighted.
If my website doesn't provide that feature and Google does, which page do you think someone might find more useful?
Companies don't invest a lot of money in features that have no ROI.
Up until now I've not excluded any pages from the cache, but it's a good time to experiment. I use the cache a lot, and I've noticed that it has become slightly more common to find sites using the noarchive directive. So the whole idea that these sites must be up to no good is out of the window, it might have been an indicator of sneakiness in 2005 but not any more.
Like potentialgeek I use hotlink protection, so my cached version doesn't look great either, but this only affects real human users. Scrapers will usually be sending a blank referrer anyway.
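For context, typical referrer-based hotlink protection in Apache looks something like this (example.com is a placeholder domain). Note the first condition: requests with a blank referrer pass straight through, which is exactly why this doesn't stop most scrapers:

```apache
RewriteEngine On
# Block image requests whose referrer is present but not one of our own pages
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(gif|jpe?g|png)$ - [F]
```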
For me it's also a matter of copyright issues.
Legally, of course, there is no copyright issue, as Google won when a dispute came to court.
I remember being stunned when the court accepted Google's statement that caches were generally held "for a couple of weeks", though my memory is not 100% on that quote.
While routine cache publication of whole pages is blindingly obviously in breach of national and international law, I don't see anyone rushing to establish this in court.
I don't see anyone rushing to establish this in court.
That's because I don't think there is a lawyer out there who would understand the ramifications of what "may" be happening with cache.
That's fine. Since we can't fix the source, we can always prevent the source from displaying cache. That's the first step. Now, if we can get at least 10% of the web to noarchive, that would have to put a rather large dent in revenue, yes?
I'll be back with a bunch more. My schedule is a bit whacked right now, but I'm chomping at the bit to get verbose on this topic!
The last time I checked, many many more users turned on the PageRank display than there are site owners. The PageRank display is actually a popular feature, as it turns out.