Internet Archive Has 44 petabytes Worth of Data

engine

5:36 pm on Oct 8, 2018 (gmt 0)

We recently discussed how the Internet Archive helped Wikipedia with a few million broken links. [webmasterworld.com]

Did you know that the Internet Archive has 44 petabytes worth of data and adds four petabytes each year?

You can hear the presentation by Mark Graham of the Wayback Machine here on SoundCloud.

[soundcloud.com...]

Leosghost

8:28 pm on Oct 8, 2018 (gmt 0)

Did you know that the Internet Archive has 44 petabytes worth of your websites' data and adds four petabytes each year?

FTFY engine :)

justpassing

8:53 pm on Oct 8, 2018 (gmt 0)

I don't know if I can share the link, but I once read about what happened to a webmaster who closed one of his sites and, after a while, released the domain name (something you should never do). A few weeks after the expiration, he found out that someone had acquired the domain name (this is not surprising), downloaded the original site from the Internet Archive, and put it back online without even modifying the contact information. I bet this happens often, and scrapers certainly love the Internet Archive...

engine

9:34 pm on Oct 8, 2018 (gmt 0)

Webmasters can block the archive if they choose.

justpassing

9:42 pm on Oct 8, 2018 (gmt 0)

Webmasters can block the archive if they choose.

This is not that easy. In my experience, their crawler respects neither the robots.txt directives nor the noindex tag. Once, I had to write to them and explain the problems with their archiving my sites when they shouldn't have, and they more or less replied that they didn't know what the issue could be (note that they answered, which is still a point in their favor).

PS: Personally, until recently, I thought that the noindex tag was enough for all legitimate crawlers, including the Internet Archive, but I learned that it is not.
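
For anyone who wants to check what their robots.txt actually asks of a crawler, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The user-agent token `ia_archiver`, the sample rules, and the `allowed` helper are assumptions made for illustration; check your own logs for the agent string that really visits your site. Note that this only shows what a compliant crawler should do. It does nothing to stop a crawler that ignores the file.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt. The "ia_archiver" token is an assumption for
# illustration; verify the real agent string in your server logs.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: it parses the rules from a string, so no
    network access is needed.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "SomeOtherBot"):
        for url in ("https://example.com/page.html",
                    "https://example.com/private/notes.html"):
            print(agent, url, allowed(agent, url))
```

A rule like this is only a request. Whether it is honored is entirely up to the crawler.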

lucy24

10:50 pm on Oct 8, 2018 (gmt 0)

someone had acquired the domain name . . . and downloaded the original site from the Internet Archive and put it back online

I am filled with admiration.

their crawler is respecting neither the robots.txt directives nor the noindex tag

“Block” doesn’t mean put up a sign that says No Admittance, or Employees Only. It means deadbolt the door.
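
One way to picture the "deadbolt" is to refuse the request at the server instead of asking politely in robots.txt. The sketch below is a small WSGI wrapper in Python that answers 403 to any request whose User-Agent contains a blocked token. The tokens `archive.org_bot` and `ia_archiver`, and the names `is_blocked`, `block_archivers`, and `demo_app`, are assumptions for illustration, not a confirmed list of the Archive's crawlers. The same idea is usually expressed in the web server's own configuration.

```python
# Substrings assumed to identify archive crawlers (illustrative only;
# confirm against the agent strings in your own access logs).
BLOCKED_AGENT_TOKENS = ("archive.org_bot", "ia_archiver")


def is_blocked(user_agent):
    """Return True if the User-Agent contains a blocked token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def block_archivers(app):
    """Wrap a WSGI app so blocked agents get 403 instead of content."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real site; always serves a short page."""
    body = b"Hello"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    # Serve the wrapped demo app locally on port 8000.
    make_server("127.0.0.1", 8000, block_archivers(demo_app)).serve_forever()
```

A user-agent check only stops crawlers that identify themselves honestly, so treat it as one lock on the door and not a guarantee.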