
What are the potential risks of Google Cache?

     
3:45 pm on Jul 24, 2008 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 27, 2001
posts: 12172
votes: 61


Can someone tell me if the Google cache may present some potential risks and/or vulnerabilities?

The reason I ask is that I keep coming across more and more cache-related issues, and it concerns me. What really concerns me is that I'm able to surf around and see content that has supposedly been removed and is no longer available to the public, yet I can still get to it via the cache. I know that's the purpose of the cache, but in this case the cache is a risk.

Are any of you using noarchive on your pages? I am, and have been for almost two years now. In most instances, new sites going online get the noarchive directive. There might be a few pages we leave open for caching, but not many.
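For anyone who hasn't used the directive, this is roughly what it looks like; a minimal per-page sketch (the googlebot-specific variant is shown as an alternative):

    <!-- ask all engines not to show a cached copy of this page -->
    <meta name="robots" content="noarchive">

    <!-- or opt out of Google's cache only -->
    <meta name="googlebot" content="noarchive">

Either line goes in the <head> of each page you want kept out of the cache.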

Am I being too paranoid?

Are you aware of any unusual stuff taking place using Google's cache mechanism? Any scraping? Anything?

10:36 am on July 28, 2008 (gmt 0)

New User

10+ Year Member

joined:Dec 6, 2005
posts:4
votes: 0


> Explain why clicking the link to the site is harder than clicking the link to the cache?

That's not the point. The point is that if I click the link to your site, it is impossible to see the keywords I'm searching for highlighted with a colored background... so I can't quickly evaluate my keywords in context and decide whether it is worth reading more, unless that's precisely your purpose.

7:06 pm on July 28, 2008 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


The point is that if I click the link to your site, it is impossible to see the keywords I'm searching for highlighted with a colored background

So you never go to the actual website and just read cache?

Well, I'd get over that concept as more and more sites are opting out of cache.

For starters, it goes way beyond FAIR USE under copyright, as Google republishes entire websites. People are finally catching on that they're losing traffic to Google itself from visitors doing exactly what you're doing, which is another reason we're adding NOARCHIVE to our sites.

7:36 pm on July 28, 2008 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 27, 2001
posts: 12172
votes: 61


Google republishes entire websites. People are finally catching on that they're losing traffic to Google itself from visitors doing exactly what you're doing, which is another reason we're adding NOARCHIVE to our sites.

The NoArchive Initiative

I'll be back later today with responses to much of what has been posted. I have a lot of reading to do before I opine...

Thanks, WebmasterWorld, for making this Front Page! :)

9:50 pm on July 28, 2008 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member ken_b is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Oct 5, 2001
posts:5892
votes: 120


OK, so if a page has already been cached and you later install the noarchive, will Google stop serving up the old cached version of the page?
9:52 pm on July 28, 2008 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


will Google stop serving up the old cached version of the page

Yes, once it's reindexed they'll stop serving up the old cached copy.

However, if you have a really big and deep site, the complete crawl can take weeks.

9:59 pm on July 28, 2008 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member ken_b is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Oct 5, 2001
posts:5892
votes: 120


Thanks incrediBILL.
11:04 pm on July 28, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member quadrille is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 22, 2002
posts:3455
votes: 0


This is a very interesting discussion, and a good thread, but I'm still not quite clear about some of the 'risks'. For example:

A Chinese set of scrapers took billions of pages out of MSN and Yahoo a couple of years ago via the "cached" pages.

How do we know they scraped via the cache, and how is it that the cache increases the scraper risk?

Non-technical explanation if possible! :)

11:10 pm on July 28, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member trillianjedi is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Apr 15, 2003
posts:7256
votes: 3


How do we know they scraped via the cache

No footprints in your logfiles.

...and how is it that the cache increases the scraper risk?

You can detect and block scrapers running directly against your site; you cannot detect or block them when they crawl cached pages, because they're hitting Google's servers, not yours.

11:20 pm on July 28, 2008 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


How do we know they scraped via the cache, and how is it that the cache increases the scraper risk?

The cache doesn't increase the risk as such; it just gives scrapers alternative locations for accessing your content once you've blocked them from your web site. If you don't use NOARCHIVE to stop search engines from republishing your entire site, and don't block ARCHIVE.ORG from crawling and indexing your site into the Internet Archive, you will never be able to stop the AdSense-funded underbelly of the internet from taking your content and using it to make money, often at your expense.
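For the ARCHIVE.ORG side of that, the usual approach is a robots.txt rule; a minimal sketch, using the ia_archiver user-agent the Internet Archive's crawler identified itself with:

    # robots.txt - keep the Internet Archive's crawler out
    User-agent: ia_archiver
    Disallow: /

At the time, archive.org also retroactively suppressed already-archived copies of sites that excluded ia_archiver this way.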

Most people couldn't tell where the scraping occurred, but I cloak small data fragments into my web pages that identify the source of the page request for tracking purposes. That way I know, assuming the tracking bug survives, whether they scraped it off my site or somewhere else.
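Purely as a hypothetical illustration of the idea (not the actual implementation), such a fragment might be nothing more than a small comment the server writes into each page as it generates it:

    <!-- hypothetical tracking fragment, written per request by the server;
         if it survives in a scraped copy, the recorded values show whether
         the page was lifted from the live site or from a cached copy -->
    <!-- origin-host: www.example.com  request-id: a1b2c3 -->

The host name and request id here are placeholders; the point is simply that the values are tied to the original request.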

Also, some scrapers and spybots attempt to access the graphics on a page. If you run any kind of graphics server, such as an ad banner server or an embedded traffic tracker, they sometimes download these, and you can see that the origin of the page access is Google's IP instead of your own server.

11:21 pm on July 28, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member quadrille is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 22, 2002
posts:3455
votes: 0


That makes sense, thanks.

Next question: I can see several reasons why SEs might want to maintain a cached copy for their own convenience; what I've never understood is why this is published at all.

Has any SE ever offered a reason, or is it just some tradition from the early days that has never been reversed?

Or is it a sop to civil liberties: "here, matey, you can see what we're caching, now stop moaning"?

12:14 am on July 29, 2008 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


what I've never understood is why this is published at all

Imagine if their ads were all over those cached pages; maybe that's the long-term end game, once people finally get used to the idea that they have no control over their own content, or once someone eventually files suit over fair use and loses. Then it's game over.

Why they publish it is obvious: keep the visitor on the search engine and not on YOUR site.

Read some of the posts earlier in this thread and you'll see comments about people reading the cache page to see where the keywords are, yada yada. So in effect the SE has already succeeded in locking people into its site, with more opportunity to push ads in their face.

NOARCHIVE is just a small way to take back control from the SEs and keep the situation from getting worse, because it certainly won't get any better.

Now if we could just get the large shared hosts to put Header append X-Robots-Tag "noarchive" into their standard server configuration, the SEs would lose a significant number of cached pages in just a few days.
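In Apache terms that's a one-line addition to the server config or a .htaccess file; a minimal sketch, assuming mod_headers is loaded:

    # send the noarchive hint on every response (requires mod_headers)
    Header append X-Robots-Tag "noarchive"

    # or limit it to HTML and PHP pages only
    <FilesMatch "\.(html?|php)$">
        Header append X-Robots-Tag "noarchive"
    </FilesMatch>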

12:41 am on July 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member quadrille is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 22, 2002
posts:3455
votes: 0


Hmmm.

But they were publishing caches before ads figured so prominently ... and despite the number of 'insiders' who view and use the cache, do ordinary searchers in the wider world actually do so?

Most people I know wouldn't know what a cache was, and have probably never clicked on one in their life.

Don't get me wrong; I've read the earlier posts very carefully, and it's certainly made me think about my position, but most of the issues raised just didn't apply back in the day.

Now if we could just get the large shared hosts to put Header append X-Robots-Tag "noarchive" into their standard server configuration, the SEs would lose a significant number of cached pages in just a few days.

Certainly a choice that should be offered.

12:52 am on July 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 29, 2006
posts:1378
votes: 18


I believe the original idea was to keep the page available if the server was down.

This would make sense from the search engine's point of view as it would benefit their users.

I don't think the potential abuses (or legal issues) were considered back then.

...

7:22 am on July 29, 2008 (gmt 0)

Preferred Member

10+ Year Member

joined:Feb 10, 2006
posts:627
votes: 0


Most people couldn't tell where the scraping occurred, but I cloak small data fragments into my web pages that identify the source of the page request for tracking purposes.

This could be done with pictures as well (just as a side note to this discussion...)

7:53 am on July 29, 2008 (gmt 0)

Junior Member

10+ Year Member

joined:July 26, 2003
posts:87
votes: 0


Isn't Google losing money because of the cache? When I see my cached pages with AdSense, it looks like they aren't displaying the correct ads, so Google must be losing money too.
9:03 am on July 29, 2008 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


Isn't Google losing money because of the cache? When I see my cached pages with AdSense, it looks like they aren't displaying the correct ads, so Google must be losing money too.

Think branding and customer retention.

The cache page is just a gimmick that holds the customer on their site.

Even if AdSense isn't working well in the cache, that's bad for you, not for Google. When the customer hits BACK, they see and possibly click on the Google ads, and Google keeps 100% of the earnings on its own page without sharing them with your site.

So they get 100% of the ad money and hold the customer on their site longer instead of releasing the customer to your site.

Get the picture?

Anyone using the cache is a win-win for Google and a lose-lose for webmasters.

Just say NOARCHIVE and take back your customers.

9:42 am on July 29, 2008 (gmt 0)

New User

10+ Year Member

joined:Dec 6, 2005
posts:4
votes: 0


> So you never go to the actual website and just read cache?

As you might guess, that's not the rule. If I am convinced that the content is worth reading, of course I follow the links on that site, read more, bookmark it and talk about it on the fora I visit. I think you understand that it would be a big hassle to go back to Google to highlight the other pages' keywords...

I admit that if your site does not have content worth visiting, it is better to use NOARCHIVE; I would also benefit by not landing on your site. Please do.

3:16 pm on July 29, 2008 (gmt 0)

Full Member

10+ Year Member

joined:Apr 12, 2006
posts:290
votes: 0


I think Marcia is right to a degree - I also tend to wonder if a site is hiding something like cloaking when I can't see the cached copies.

Competitors *do* use the cached copy sometimes to try to derive how pages are ranking for particular keywords. Viewing the cache can show whether keywords are found in links pointing to the page, for instance.

If you wanted to keep some of your search optimization work just slightly more obscured, keep the pages from getting cached. It doesn't always mean you're cloaking.

Overall, I think the decision to cache or not to cache should be made on a case-by-case basis.

3:39 pm on July 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member bwnbwn is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Oct 25, 2005
posts:3594
votes: 48


I would say that the vast majority of internet users don't have a clue what the cache is or, for that matter, have it enabled on the Google toolbar. Most if not all computers now come with Google, Yahoo or MSN search built into the browser, so most of the population doesn't even know what a cache is.

I think the discussion should focus mostly on unwanted intrusion from competitors and scrapers.

I have made it a point, when I visit someone, to check what toolbars they have installed, and I can say that over 95% don't have a toolbar installed, maybe higher, and of those who do, 99.9999% will never go in and enable this through the options.

So I feel that if you don't want your pages cached, it is mainly to keep unwanted competitors from viewing your work highlighted for the terms they search, and/or to keep scrapers from getting it without your knowledge.

Using the cache to keep a normal user on the SE is, in my opinion, not a really valid point, but blocking bots and unwanted eyes is a very valid one.

4:05 pm on July 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member quadrille is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 22, 2002
posts:3455
votes: 0


There's little doubt that a 'mass withdrawal' from the cache would make it easier for cloakers to hide - more camouflage - and this would weaken the 'cloak flag' that the cache used to be.

Mind you, I'd guess that Google has identified other markers by now.

I'd agree that for most of us, security is the issue, rather than civil liberties or 'sticking it to Google'.

4:31 pm on July 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 26, 2006
posts:1397
votes: 0


I use image blocking (hotlink protection) on one site, so my cache doesn't look so hot. I only block the Google cache on a site with a book (copyright issues). An author who sued Google over caching was dissed by the judge when he ruled in G's favor:

... the fact that [plaintiff] Field, who had admitted he knew how to disable the caching feature, had not done so created "an implied licence in favour of Google", said Judge Jones.

I also think cache blocking is great for controversial/free-speech sites because opposing counsel won't necessarily find questionable content. I think I read somewhere a while back that Slashdot users use the cache more than anyone. It's a geek thing.

p/g

4:35 pm on July 29, 2008 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


I'd agree that for most of us, security is the issue, rather than civil liberties or 'sticking it to Google'

For me it's also a matter of copyright issues.

Most people use the DMCA to stop others from stealing their content, yet allow Google to freely republish their entire website verbatim as individual cache pages without repercussion.

Since when did protecting your copyright become OPT-OUT, where you have to add NOARCHIVE to stop cache pages, instead of opt-in?

Using the cache to keep a normal user on the SE is, in my opinion, not a really valid point

If it wasn't a valid point, then why do all the SEs have a cache page?

If it had no value, they wouldn't all be doing it, because the masses of people using the SEs aren't SEOs or webmasters examining pages for keywords and signs of cloaking.

I'm pretty sure it is useful for customer retention for people like paulmizen who want the keywords on the page highlighted.

If my website doesn't provide that feature and Google does, which page do you think someone might find more useful?

Companies don't invest a lot of money in features that have no ROI.

4:58 pm on July 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 16, 2003
posts:992
votes: 0


I just noticed that Google has a cache of one of my cached pages on IncyWincy. They should fix this.

Up until now I've not excluded any pages from the cache, but it's a good time to experiment. I use the cache a lot, and I've noticed that it has become slightly more common to find sites using the noarchive directive. So the whole idea that these sites must be up to no good is out of the window; it might have been an indicator of sneakiness in 2005, but not any more.

Like potentialgeek I use hotlink protection, so my cached version doesn't look great either, but this only affects real human users. Scrapers will usually be sending a blank referrer anyway.
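For anyone curious, that style of hotlink protection is typically a few mod_rewrite lines in .htaccess; a minimal sketch with example.com standing in for the real domain - and the blank-referrer condition is exactly why it doesn't catch most scrapers:

    RewriteEngine On
    # let requests with an empty referrer through (most scrapers, privacy tools)
    RewriteCond %{HTTP_REFERER} !^$
    # let requests referred from our own pages through
    RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
    # anything else asking for an image gets a 403
    RewriteRule \.(gif|jpe?g|png)$ - [F,NC]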

5:00 pm on July 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member quadrille is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 22, 2002
posts:3455
votes: 0


For me it's also a matter of copyright issues.

Legally, of course, there is no copyright issue, as Google won when a dispute came to court.

I remember being stunned when the court accepted Google's statement that caches were generally held "for a couple of weeks" - though my memory is not 100% on that quote.

While routine cache publication of whole pages is blindingly obviously in breach of national and international law, I don't see anyone rushing to establish this in court.

5:07 pm on July 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 16, 2003
posts:992
votes: 0


Blake Field's case was held in a US court. It's not definitive for the whole world.
5:45 pm on July 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member quadrille is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 22, 2002
posts:3455
votes: 0


Sure, and it never reached the Supreme Court in the US.

But it has served to deter challengers, both within and outside the US.

Google are noted for the size of their pockets.

5:47 pm on July 29, 2008 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 27, 2001
posts: 12172
votes: 61


I don't see anyone rushing to establish this in court.

That's because I don't think there is a lawyer out there who would understand the ramifications of what "may" be happening with the cache.

That's fine. Since we can't fix the source, we can at least prevent the source from displaying the cache. That's the first step. Now, if we can get at least 10% of the web to go noarchive, that would have to put a rather large dent in revenue, yes?

I'll be back with a bunch more. My schedule is a bit whacked right now, but I'm chomping at the bit to get verbose on this topic!

[edited by: pageoneresults at 6:21 pm (utc) on July 29, 2008]

5:52 pm on July 29, 2008 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


Legally, of course, there is no copyright issue, as Google won when a dispute came to court.

Just because they won doesn't mean it's not still a legal issue; it only means they got a favorable ruling in a lower court ;)

However, it's much cheaper to install NOARCHIVE than to hire a legal SWAT team.

6:19 pm on July 29, 2008 (gmt 0)

Junior Member

10+ Year Member

joined:July 26, 2003
posts:87
votes: 0


Well, I have to say I had a link disaster on my site, and people who know my site are using the cache to get the links, so it is useful even if it keeps the visitor at the SE site. I only wish they would fix AdSense so the ads display based on the content of the original page and give us credit.
7:48 pm on July 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member bwnbwn is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Oct 25, 2005
posts:3594
votes: 48


Well, it looks like I need to do the "Back Up Jack". I took this from Matt's blog, and if that many users are turning on the PageRank display, then they may well be using the cache too, so I must be mistaken about my numbers and just live in a hillbilly town.

The last time I checked, many many more users turned on the PageRank display than there are site owners. The PageRank display is actually a popular feature, as it turns out.