Google Links to Archive.org

Forum Moderators: Robert Charlton & goodroi

Brett_Tabke

12:26 pm on Sep 12, 2024 (gmt 0)

Google fighting back against a web built in its own image, by linking to the web of the past?

Or

Because it is nearly impossible to remove content from Archive.org, seems like calling a scraper site an Archive is legitimate?

Google Search results will now directly link to The Internet Archive to add historical context for the links in your results.

Google Search makes it easy to find information, but occasionally you need historical context for a page that may have been recently updated. That was previously possible to a certain extent through cached pages in Search, but that functionality was removed earlier this year.
[9to5google.com...]

engine

3:28 pm on Sep 12, 2024 (gmt 0)

This is going to be interesting. Only research will prove how good or bad this may be.

brotherhood of LAN

3:58 pm on Sep 12, 2024 (gmt 0)

>Because it is nearly impossible to remove content from Archive.org

It isn't. You block their UA and then you have no access to the historical archive. A site could have been going for 30 years and the new owner of the domain/hosting can basically block historical access.

lucy24

4:02 pm on Sep 12, 2024 (gmt 0)

Ooh, I like this. (But why does autocorrect not know that scraping and scrapping are different words with, in this case, nearly opposite meaning? Makes me think of the classmates in high-school drafting who equipped their home designs with dinning rooms, suggesting that they had an awful lot of younger siblings.)

Somewhere on my site I link to an article that now exists only on the Wayback Machine. The author was then a grad student in {relevant field} and went on to become a {superb job title} at Google.

Brett_Tabke

4:10 pm on Sep 12, 2024 (gmt 0)

> It isn't. You block their UA and then you have no access to the historical archive

Have done so for 25 years, and we are still in there at multiple sites. Even blocked their IP.

londrum

5:59 pm on Sep 12, 2024 (gmt 0)

just discovered that my site has pages listed on the archive, even though their crawler has been blocked in robots.txt forever, with:
User-agent: ia_archiver
Disallow: /

had a little nose on the internet and it seems lots of people are complaining that they're ignoring robots

i'm not sure that ia_archiver is even their proper crawler anymore, but i can't find any updated info on their site
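
For anyone who wants to confirm what a rule like the one above actually tells a crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The URL `https://example.com/page.html` and the agent name `SomeOtherBot` are placeholders, not anything from this thread. It only shows how a compliant parser reads the rules; it cannot tell you whether a given crawler honors them, or whether `ia_archiver` is still the right token.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the post above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules disallow `url` for `user_agent`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com and SomeOtherBot are hypothetical placeholders.
    for agent in ("ia_archiver", "SomeOtherBot"):
        blocked = is_blocked(ROBOTS_TXT, agent, "https://example.com/page.html")
        print(f"{agent}: {'blocked' if blocked else 'allowed'}")
```

Because the rule names a single user agent, any crawler identifying itself under a different token is not covered by it, which may be part of what people are running into.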

aristotle

6:06 pm on Sep 12, 2024 (gmt 0)

A lot of really valuable information that's no longer live on the web is still preserved at archive.org. I link to at least a dozen old articles there because the originals are gone.

Brett_Tabke

8:16 pm on Sep 12, 2024 (gmt 0)

> still preserved at archive.org

Completely irrelevant. The right of ownership is with the copyright holder, not scraper sites like archive.org.

tangor

6:52 am on Sep 13, 2024 (gmt 0)

A LOT of the stuff at archive.org is OUTDATED or even INACCURATE. Wonder how that is going to work out with all this AI scraping, er, building LLMs.

aristotle

1:26 pm on Sep 13, 2024 (gmt 0)

You don't have to go to archive.org to find scraped content, or outdated content, or lies and misinformation for that matter. It's a basic characteristic of the web.

lucy24

4:57 pm on Sep 13, 2024 (gmt 0)

Similarly, libraries should be required to purge all earlier editions and outdated reference works, since their content has no value to anyone under any circumstances.