Internet Archive Has 44 petabytes Worth of Data

         

engine

5:36 pm on Oct 8, 2018 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



We recently discussed how the Internet Archive helped Wikipedia with a few million broken links. [webmasterworld.com]

Did you know that the Internet Archive has 44 petabytes' worth of data and adds four petabytes each year?

You can hear the presentation by Mark Graham of the Wayback Machine on SoundCloud.

[soundcloud.com...]

Leosghost

8:28 pm on Oct 8, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Did you know that the Internet Archive has 44 petabytes' worth of your websites' data and adds four petabytes each year?

FTFY engine :)

justpassing

8:53 pm on Oct 8, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



I don't know if I can share the link, but I once read about a webmaster who closed one of his sites and, after a while, released the domain name (something you should never do). A few weeks after it expired, he found that someone had acquired the domain name (this is not surprising), downloaded the original site from the Internet Archive, and put it back online without even changing the contact information. I bet this happens often, and scrapers certainly love the Internet Archive ...

engine

9:34 pm on Oct 8, 2018 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Webmasters can block the archive if they choose.

justpassing

9:42 pm on Oct 8, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Webmasters can block the archive if they choose.

This is not that easy. From my experience, their crawler is respecting neither the robots.txt directives nor the noindex tag. I once had to write to them about the fact that they were archiving my sites when they shouldn't have been, and they more or less explained that they didn't know what the issue could be (to their credit, they did answer).

PS: Until recently, I thought the noindex tag was enough for all legitimate crawlers, including the Internet Archive, but I learned that it is not.

lucy24

10:50 pm on Oct 8, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



someone had acquired the domain name . . . and downloaded the original site from the Internet Archive and put it back online
I am filled with admiration.

their crawler is respecting neither the robots.txt directives nor the noindex tag
“Block” doesn’t mean put up a sign that says No Admittance, or Employees Only. It means deadbolt the door.
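
To make the "deadbolt" idea concrete: robots.txt and noindex are requests that a crawler may or may not honor, whereas a server-side refusal is enforced before any content is sent. Below is a minimal sketch in Python of a WSGI middleware that answers 403 by User-Agent. The tokens in BLOCKED_AGENT_TOKENS and the names is_blocked and block_crawlers are assumptions for illustration only, not something stated in this thread; check your own access logs for the strings the crawler actually sends.

```python
# Assumed, illustrative User-Agent substrings; confirm them against your own logs.
BLOCKED_AGENT_TOKENS = ("archive.org_bot", "ia_archiver")


def is_blocked(user_agent):
    """Return True if the User-Agent contains any blocked token (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def block_crawlers(app):
    """Wrap a WSGI app so blocked agents get a 403 and never reach the app."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            body = b"Forbidden"
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain"),
                 ("Content-Length", str(len(body)))],
            )
            return [body]
        # Everyone else is passed through to the wrapped application unchanged.
        return app(environ, start_response)
    return middleware
```

A User-Agent header can be spoofed, so this only stops crawlers that identify themselves honestly; the same rule is often written in the web server's own configuration or applied by IP range instead.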