Forum Moderators: Robert Charlton & goodroi
Consider three cases:
1) You began on the web in '98 with sites at several of the free website providers. Archive.org still has numerous complete copies of each of these near-identical sites, long after you shut them down completely and removed all content. Say ten sites with the same content.
You moved to your own proper domain in 2000, and posted much of the text of those early pages there. (Hopefully now with far less embarrassing HTML.)
Does G count the "duplicate copies" still in archive.org against your current site? Since those copies appeared on the web earlier, the old sites may be viewed as the more legitimate source, and your newer domain as the copy.
2) A different case: There's never been an older domain with your content, just the archive.org copies of your current domain down the years. Does G count the archive.org copies against your domain? Sounds a silly question, but I'm not convinced G et al. are incapable of making such a blunder.
3) You had text stolen by infringers, you noticed after a few months, and DMCA'd their hosts, who removed the infringing text. Archive.org still has numerous copies of the infringers' sites with your text on them. Does G ignore them, or accumulate them to eventually trip a duplicate content filter?
Should we go back through our huge list of takedown notices, dig out every single copy still in archive.org, and serve archive.org with a DMCA for each one?
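For what it's worth, the digging-out part could probably be scripted rather than done by hand. Below is a rough sketch only, assuming the Wayback Machine's CDX search endpoint and its JSON output (first row is a header, later rows are captures) behave the way I understand them to. The function names are made up for illustration, and example.com stands in for an old or infringing domain.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Wayback Machine CDX search endpoint.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(url_pattern):
    """Build a CDX query URL asking for timestamp and original URL as JSON."""
    params = {"url": url_pattern, "output": "json", "fl": "timestamp,original"}
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(rows):
    """Turn CDX JSON rows (header row first) into Wayback snapshot URLs."""
    if not rows:
        return []
    header, *captures = rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return ["http://web.archive.org/web/%s/%s" % (row[ts], row[orig])
            for row in captures]


def list_snapshots(url_pattern):
    """Fetch and list archived copies for a URL pattern (needs network access)."""
    with urlopen(build_cdx_query(url_pattern)) as resp:
        body = resp.read().decode("utf-8")
    return snapshot_urls(json.loads(body) if body.strip() else [])
```

Calling list_snapshots("example.com/*") for each domain on the takedown list would give one list of archived copies per domain, which could then be checked by hand before deciding whether any notice is worth sending.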
Any definite word from G on this?
Here are two parts of the picture that I know of:
1. Last fall, a significant spam vector was launched using some redirect pages that had been archived in the Wayback Machine. This lasted only a very short time, thanks to collaboration between archive.org and Google.
2. Anna Lynn Patterson, the author of Google's six phrase-based indexing patents, came to Google from archive.org. She is extremely well informed, and technically adept with large data sets. I feel certain she was a valuable liaison between the two. She has now moved on from Google and is a major part of the team working on an alternative search engine, cuill.com.
Having done everything I can think of to remove an "unknown G factor" punishing our site, I wonder if old dead copies in IA are to blame?
They are sitting there in their dozens in IA.
Then there are the infringements we long since DMCA'd off the guilty sites, which are still crowing loudly in IA and are doubtless being crawled diligently by Gbot.
[edited by: tedster at 9:37 am (utc) on July 12, 2008]