Forum Moderators: Robert Charlton & goodroi

How does G treat Archive.org - old sites and past copyright infringements?

         

Angonasec

5:54 am on Jul 12, 2008 (gmt 0)



It occurs to me that the SEs must treat archive.org with very special care.

Consider three cases:

1) You began on the web in '98 with sites at several of the free website providers. Archive.org still has numerous complete copies of each of these near-identical sites, long after you shut them down and removed all content. Say, ten sites with the same content.

You moved to your own proper domain in 2000, and reposted much of the text of those early pages there. (Hopefully now with far less embarrassing HTML.)

Does G count the "duplicate copies" still in archive.org against your current site? Since those copies appeared on the web earlier, they may be viewed as the more legitimate source, and your newer domain as the copy.

2) A different case: There's never been an older domain with your content, just the archive.org copies of your current domain down the years. Does G count the archive.org copies against your domain? It sounds like a silly question, but I'm not convinced G et al. are incapable of making such a blunder.

3) You had text stolen by infringers, you noticed after a few months, and DMCA'd their hosts, who removed the infringing text. Archive.org still has numerous copies of the infringers' sites with your text on them. Does G ignore them, or accumulate them to eventually trip a duplicate content filter?

Should we go back through our huge list of takedown notices, dig out every single copy still in archive.org, and serve archive.org with a DMCA for each one?

Any definite word from G on this?

tedster

6:41 am on Jul 12, 2008 (gmt 0)




We can be sure that Google is very sensitive about the archive.org issue. One bit of evidence is that we don't hear anything in this forum about crossed wires involving the Wayback Machine.

Here are two parts of the picture that I know of:

1. Last fall, a significant spam vector was launched using some redirect pages that had been archived in the Wayback Machine. This lasted only a very short time, thanks to collaboration between archive.org and Google.

2. Anna Lynn Patterson, the author of Google's six phrase-based indexing patents, came to Google from archive.org. She is extremely well informed, and technically adept with large data sets. I feel certain she was a valuable liaison. She has now moved on from Google and is a major part of the team working on an alternative search engine, cuill.com.

Angonasec

7:37 am on Jul 12, 2008 (gmt 0)



Thanks Ted, I didn't mean to tread on any toes, but it'd be comforting to have definitive answers from G in addition to your useful note.

Having done everything I can think of to remove an "unknown G factor" punishing our site, I wonder if old dead copies in IA are to blame?

They are sitting there in their dozens in IA.

Then there are the infringements we long since DMCA'd off the guilty sites, which are still crowing loudly in IA, and doubtless being crawled diligently by Gbot.

[edited by: tedster at 9:37 am (utc) on July 12, 2008]