Google crawls and indexes Archive.org

Forum Moderators: Robert Charlton & goodroi

Angonasec

2:53 pm on Jan 13, 2009 (gmt 0)

site:archive.org today gives 9.75 million pages in G.

How does your duplicate content on those pages affect your site's G ranking?

We don't know; we just hope G doesn't ever get it wrong.

I've used G alerts since they began, but I've never seen an archive.org url cited in an alert... until today.

It's not an exact match of the phrase, just parts of it. Nevertheless, the point is that it is an [archive.org...] URL, and so it must be incorporated in the Google index at some level.

Deliberate or a slip-up?

Another very good reason to get your sites pulled from archive.org as we have done.

20+ sites, but it wasn't too laborious a process.
Completed in 3-4 days. Gone from archive.org, Wayback, and the dreaded Alexa :)

Receptional Andy

4:23 pm on Jan 13, 2009 (gmt 0)

Interesting that you had something show in Google alerts, but note that web.archive.org (where all the web pages are stored) is excluded from crawling via its robots.txt file:

[web.archive.org...]

The URLs can still get URL-only listings if people link to them, but I've not seen anything else.
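To illustrate the robots exclusion point, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `ASSUMED_WAYBACK_RULES` are an assumption made up for the example (a blanket disallow), not a copy of the real web.archive.org robots.txt, and the archived URL is hypothetical. It only shows how a compliant crawler decides not to fetch a page; it says nothing about URL-only listings, which come from links rather than from crawling.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: a blanket disallow, invented for illustration.
# The real robots.txt on web.archive.org may look different.
ASSUMED_WAYBACK_RULES = """\
User-agent: *
Disallow: /
"""


def can_crawl(rules: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if a compliant crawler may fetch `url` under `rules`."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines,
    # so no network request is made here.
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical archived copy of a page.
    archived = "http://web.archive.org/web/2008/http://example.com/"
    print(can_crawl(ASSUMED_WAYBACK_RULES, archived))
    # With no rules at all, nothing is excluded.
    print(can_crawl("", archived))
```

A blocked page cannot be fetched, so the engine has no title or snippet of its own to show, which fits the URL-only listings described above.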

Was it a URL-only listing you had in Google alerts? (Note that sometimes such listings have a title of the link text pointing to them.)

The only other possibility is if archive.org are accidentally exposing their listings.

outland88

6:27 pm on Jan 13, 2009 (gmt 0)

As I reported in another thread, Google is now allowing all sorts of duplicate content from various search engines and domain lookup services. Quite a bit is now competitive with your natural results or will be soon.

rustybrick

1:13 pm on Jan 14, 2009 (gmt 0)

[google.com...] returns only two results for me. The main www.archive.org is different, no? Is web.archive.org where the dup issues may come from?

Angonasec

2:18 am on Jan 15, 2009 (gmt 0)

The Google Alert displays a heading, the searched-for terms, and an archive.org URL like this:

www.archive.org/stream/<rest of url>

The G SERP shows a normal listing: title, URL and a snippet.

The target page in www.archive.org is cached in G, and the cache shows the G Alert terms.

The G cache url is formatted like this:

[209.85.175.132...]<rest of url>

It looks to me like trouble at Google, and for everyone still unfortunate enough to be in archive.org.

tedster

3:00 am on Jan 15, 2009 (gmt 0)

And as Receptional_Andy already posted, all the archived versions of websites are served from web.archive.org - NOT www.archive.org.

There is no problem here.
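The host distinction can be sketched in a few lines of Python. This assumes that a `site:archive.org` query matches the bare domain and any subdomain, which is how the operator is generally understood to behave; the function names and the example paths are made up for illustration.

```python
from urllib.parse import urlparse


def matches_site_query(url: str, domain: str = "archive.org") -> bool:
    """Approximate a site: query; ASSUMES it matches the domain and its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def is_wayback_copy(url: str) -> bool:
    """True only for the host that serves archived copies of websites."""
    return urlparse(url).hostname == "web.archive.org"


if __name__ == "__main__":
    # Hypothetical URLs; only the hosts matter here.
    for url in (
        "http://www.archive.org/stream/example",
        "http://web.archive.org/web/2008/http://example.com/",
    ):
        print(url, matches_site_query(url), is_wayback_copy(url))
```

Under that assumption, both hosts count toward a site:archive.org total, but only the web.archive.org URLs are archived copies of other people's pages.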

Angonasec

1:39 am on Jan 16, 2009 (gmt 0)

I understand, tedster, but there are indeed 9.75 million pages in G for the term site:archive.org, >>all copies of our content<<, and the Alert shows they are in the mix. That is the point.

tedster

3:36 am on Jan 16, 2009 (gmt 0)

And there is just one result for site:web.archive.org/web/ - that's where all the Wayback machine copies are served. And even that one page is a url-only result, thanks to the robots.txt file.

Did the alert you received point to a copy of one of your web pages?