Search engines automatically cache your pages, and the Internet Archive, also known as the Wayback Machine, comes along and makes a permanent copy of your site for "posterity". The problem starts when you realize you may have content on your web site that could result in legal issues. You may act quickly to resolve those issues, yet the problems remain without your knowledge because you didn't act as quickly as the robots crawling your site.
Unfortunately, legal beagles love that your site was saved for "posterity" when gearing up to file a lawsuit. Although you've already done the right thing by cleaning potentially harmful things off your site, the tireless automatons crawling the internet have made sure there's plenty of evidence, and the next thing you know, you're about to get hung out to dry.
If you think the lawyers aren't technically savvy, think again:
Browsing a party's Web site will only show the information that the Web site owner currently wants visitors to see. Sometimes, the most valuable information about an opposing party is the information that has been changed or removed. Fortunately, there are ways to see older versions of Web pages. Pages that were changed recently can be viewed through Google's cache feature. Pages that were changed months or years ago may be available through the Internet Archive, also known as the Wayback Machine.
[law.com...]
Not only can they find your content, they do it under cloak without your knowing about it!
Viewing these older versions of Web pages avoids the privacy risks discussed above: The copied pages are not on the company's Web site, so the company has no record of the researcher's activities.
You can forget your rights, just throw them out the window, because the history of your website is already busy squealing on you without your knowledge or permission.
HOW DO YOU PROTECT YOUR SITE FROM HISTORICAL SNOOPING?
Obviously the simplest way is to keep your nose clean so nobody has a reason to be snooping in the first place.
However, this is the internet and you have to OPT-OUT of things to protect your rights.
Here are a few preventive ways to stop your website from being archived and used as a snitch:
USE NOARCHIVE
Make sure you include the NOARCHIVE meta tag in each web page so that none of the major search engines shows a cached copy.
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
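A quick way to confirm the tag actually made it into your templates is to parse a page and look for it. This is only a minimal sketch using Python's standard library; the names `RobotsMetaParser` and `has_noarchive` are made up for illustration, and it assumes you already have the page's HTML as a string.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # CONTENT may hold several comma-separated directives.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def has_noarchive(html):
    """Return True if the page carries a robots meta tag with NOARCHIVE."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


if __name__ == "__main__":
    print(has_noarchive('<META NAME="ROBOTS" CONTENT="NOARCHIVE">'))
```

Run it over a sample of your pages rather than just the home page, since a template that misses the tag is the usual way a page slips through.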
USE ROBOTS.TXT
Block all of the archive site spiders, such as the one used by the Internet Archive, in your site's robots.txt file with an entry as follows:
User-agent: ia_archiver
Disallow: /
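If you want to sanity-check the entry before uploading it, Python's standard `urllib.robotparser` can tell you how a well-behaved crawler would read it. This is just a sketch: the helper name `is_blocked` and the example.com URL are placeholders, and it says nothing about crawlers that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The same two lines as the robots.txt entry above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt, user_agent, url):
    """Return True if a compliant crawler with this user agent must skip url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/page.html"  # placeholder URL
    print("ia_archiver blocked:", is_blocked(ROBOTS_TXT, "ia_archiver", page))
    print("Googlebot blocked:", is_blocked(ROBOTS_TXT, "Googlebot", page))
```

The second check is there to make sure the entry only shuts out the archive spider and doesn't accidentally lock out the search engines you still want.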
The Heritrix software [crawler.archive.org] used by the Internet Archive is Open Source, which means there may be more archives out there, possibly running modified versions of Heritrix that ignore robots.txt and cloak their access to your site.
HELP FOR HOSTED BLOGGER ISSUES
If you're running a blog hosted on a 3rd party service like Blogger or WordPress, your options may be limited to embedding NOARCHIVE, which the Internet Archive ignores, meaning anyone running stock Heritrix code would also ignore it by default.
The only way you can exclude your site, according to their site [archive.org], is to contact them directly. Obviously not enough businesses and sites in general are aware of the perils posed by the Internet Archive, or the Archive would honor the NOARCHIVE tag for sites with limited access and no robots.txt just to avoid a flood of emails.
OTHER POTENTIAL RISKS
Snap.com has taken screen shots of every web page, then Ask started taking limited screenshots, as have some new completely graphical search engines like SearchMe. Some screen shots have minimal resolution and are too tiny to read, but others, like those from Snap and SearchMe, are big enough to read, and these too can be called evidence in a lawsuit. Even the tiniest thumbnail can still show a licensed trademark being used without permission.
Some of the social bookmarking sites, such as Kaboodle, Jeteye and Eurekster, allow large chunks of content to be copied, some using tools like Heritrix (see above) to make small archive copies of specific content.
SUMMARY
Obviously there's no way you can completely stop anyone from making copies of your site, but it may pay to be diligent in keeping many of these technologies that provide any form of archive off your site.
This is just another form of insurance that could, in the end, save your business, your house, your car, your family...
Considering how I use the wayback machine - competitive research generally - copyright or other legal issues aside, I just don't need my competition snooping on the history of my site.
>>lol. the more I think about it
I don't get it either. Anybody that has really run a blog, forum, or been threatened/sued has to think of this on almost every sketchy post?! Am I insured with no cache/no wayback? Ah yes, no lawyer calls for now; life is good, return to beach activities...
If it was just this one thread, I'd join you
did it ever occur to you that the moderator of the Spider ID and Cloaking forums [webmasterworld.com] might be involved in most such discussions on WebmasterWorld?
I'm sure it would be possible for an attorney, or their researcher, to find something that might be potentially actionable on my site by using the internet archive. That's a calculated risk, which is part and parcel of running an internet business.
Do I really care if my competition examines the history of my site? Not really. If they care to dissect and attempt to duplicate what I was doing 3 years, 1 year or even 3 months ago, good for them. I try to envision where I need to be 3 months, 1 year or 3 years from now.
I respect the opinions of everyone who has posted, I've learned some new things, but in a way this is reminding me of why I can't read the Google Search News threads too much. It makes ME paranoid.
Do I really care if my competition examines the history of my site? Not really. If they care to dissect and attempt to duplicate what I was doing 3 years, 1 year or even 3 months ago, good for them. I try to envision where I need to be 3 months, 1 year or 3 years from now.
Actually, it tells your competition what path you took to get where you are today.
They can see when you try various things, rollback, etc. so why let them avoid any mistakes you made along the way?
FWIW, I too used the Archive once upon a time in an infringement case, but it really doesn't matter because all the DMCA requires to get the ball rolling is your affidavit claiming the content is yours. Since disabling the archives I've had no trouble getting stolen content removed, because staking a legal claim and signing your name to it carries quite a bit of weight until the other side makes a counter claim, which thieves rarely do.
My path is varied, if they choose that path, there are many nuances that surely won't show up on an Archive view. They're welcome to play chase.
Archive use works well when not dealing with SE DMCA. Industry specific hosting sites can understand and deal with infringers quickly when presented with independent third party "evidence".
We all have our different takes on what is critical, and what isn't, which is why I referenced the calculated risk we all take.
I appreciate you bringing the topic up, whether I agree or disagree, it's a valid concern that should be discussed.
<BASE HREF="http://example.com">
and
<IMG SRC="/images/myimage.jpg">
i.e. the archive will not save and display your image, and people "snooping" will not be able to see it (but the source code for it will still be there)
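One way to read this tip is in terms of how a root-relative SRC gets resolved. The sketch below uses Python's standard `urljoin` to mimic that resolution; the archived page URL and its timestamp are made up for illustration, and it is only an assumption that the archive leaves the BASE tag and the SRC untouched (it may well rewrite them).

```python
from urllib.parse import urljoin

IMG_SRC = "/images/myimage.jpg"
BASE_HREF = "http://example.com"
# Hypothetical address of an archived copy of the page (timestamp invented).
ARCHIVED_PAGE = "https://web.archive.org/web/20080101000000/http://example.com/page.html"


def resolve(base, src):
    """Resolve src the way a browser would against the given base URL."""
    return urljoin(base, src)


if __name__ == "__main__":
    # With the BASE tag, the image request goes to the live site.
    print(resolve(BASE_HREF, IMG_SRC))
    # Without it, the root-relative path lands on the archive's own host.
    print(resolve(ARCHIVED_PAGE, IMG_SRC))
```

Under that assumption, with the BASE tag in place the image is requested from your live server rather than from an archived copy, so once you take the file down it no longer shows up on the archived page either, even though the IMG tag is still in the source.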
It has been hinted (I have no idea how realistically), that wholesale deprivation of Google cache will make Google's life difficult; the logical next step for Google would then be to treat 'cache opt out' as 'Google opt out'. That would be fun :)
Of course a much better SE future would have webmasters opting in - just like the early days - but hopefully with options, such as cache / no cache / no images / sitemap / whatever / ...
Such a system would make it much easier for the SEs to use positive encouragement to webmasters, rather than the current negative graded penalty system, and webmasters would know exactly where they stand. So a future SEO might be saying "I can get you much higher than you are, but you've opted out of cache and you don't have a sitemap, so you'll not make the front page", etc.
Is this really helpful?
It has been hinted (I have no idea how realistically), that wholesale deprivation of Google cache will make Google's life difficult; the logical next step for Google would then be to treat 'cache opt out' as 'Google opt out'. That would be fun.
No it wouldn't. That would be a disaster for the SEs and those who have been following protocol.
There are very logical reasons why many would not want their pages cached. Product pages are primary candidates. Pages where content changes frequently, etc. We just haven't really given it much thought over the years as we were in the mindset of getting "everything" indexed, cached, whatever may assist in maintaining visibility within the SEs.
These past couple of years have been a real eye opener for me as I've ventured into different black holes and learned a little in the process. Enough for me to decide that it was time to start blocking the indexing and/or caching of content.
There are too many topics about "this" and "that" with the words "cache" and "archive" mentioned for me not to take notice. And, instead of trying to backtrack and understand what may be going on, I'm just going to opt out for now since the option is available.
Until I see in writing, from the search engines, that they "guarantee" there is no funny business going on with cache and/or Internet archives, I'm taking a proactive stance.
Bad data push?
Oh-oh, there go 100,000 pages indexed and into the cache with a 30% discount on our entire product line.
Author gone postal?
Oh-oh, there goes an entire topic indexed and cached for all to see.
Glitch in user profiles?
Oh-oh, how the hell did that happen? When did it happen? Lawsuit incoming...
The list goes on and on. I'm concerned for the integrity of my clients' data, I really am. Not because there is anything fishy going on there, but because of what may happen to it "after the fact".
deprivation of Google cache will make Google's life difficult
How so?
Showing cache pages doesn't change how the search works.
The only thing it might make difficult for Google or any other SE is to retain OUR customers longer. Search engines were always supposed to help visitors find us faster, not serve up our content to those visitors. That's when the SE crosses the line between helping a website and competing with the website for its own customer. The longer the SE holds the customer, the longer it has a chance at the PPC revenue.
You think that's fair that they use your site in that manner with your own cache?
Maybe you don't make your money online and it doesn't matter to you that the biggest fish in the pond is trying to become the actual pond itself, but it matters to me!
So a future SEO might be saying "I can get you much higher than you are, but you've opted out of cache
If it actually comes to that, you know we're literally being blackmailed into turning over control of our copyright, because the cache is always there; the only question is whether we allow it to be displayed or not.
Since you brought it up, Google's description of cache even implies that it needs the cache to judge the relevance of the page, almost suggesting that if you disable cache it can't tell the relevance, which isn't true at all.
[Google.com...]
Google takes a snapshot of each page examined as it crawls the web and caches these as a back-up in case the original page is unavailable. If you click on the "Cached" link, you will see the web page as it looked when we indexed it. The cached content is the content Google uses to judge whether this page is a relevant match for your query.
I shut off all my cache 3 years ago and it never hurt my site whatsoever.
So did WebmasterWorld and it still dominates the SERPs.
My question is: what time frame would be reasonable to expect for Google's cache (and others) to clear?
I have nothing to hide, but the site has a different structure than previous versions and redirecting all those changes would be a PITA.
The cached content is the content Google uses to judge whether this page is a relevant match for your query.
Google can still use a cache of our pages for its own purposes, just not display it publicly if we opt out. The phrasing may suggest otherwise, but like you said, I really doubt it has any effect on ranking.
While I was tempted to block archiving and caching of my sites, I was lazy and not really convinced I should.
Then...
One of my sites got hacked and a lot of scraped content was put on certain parts of my site (directories and thousands of pages were created).
I was quick to change passwords and erase the bad content, but when I checked Google, they had just as quickly picked up on the scraped content and it is now in their archive.
God knows when they will update it, and I don't have the time to waste trying to get them to take it down... I'll just let it happen naturally.
But man, I wish I hadn't sat on that fence. Thank you to those "campaigning" against the cache... I wish I had listened earlier.