

Internet Archive Has 44 petabytes Worth of Data

     
5:36 pm on Oct 8, 2018 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 9, 2000
posts:25730
votes: 822


We recently discussed how the Internet Archive helped Wikipedia with a few million broken links. [webmasterworld.com]

Did you know that the Internet Archive has 44 petabytes' worth of data and adds four petabytes each year?

You can hear the presentation by Mark Graham of the Wayback Machine here on SoundCloud.

[soundcloud.com...]
8:28 pm on Oct 8, 2018 (gmt 0)

Senior Member from FR 

WebmasterWorld Senior Member leosghost is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Feb 15, 2004
posts:7139
votes: 410


Did you know that the Internet Archive has 44 petabytes' worth of your websites' data and adds four petabytes each year?

FTFY engine :)
8:53 pm on Oct 8, 2018 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:Sept 13, 2018
posts:355
votes: 68


I don't know if I can share the link, but I once read about what happened to a webmaster who closed one of his sites and, after a while, released the domain name (something you should never do). A few weeks after it expired, he found out that someone had acquired the domain name (this is not surprising), downloaded the original site from the Internet Archive, and put it back online without even modifying the contact information. I bet this happens often, and scrapers certainly love the Internet Archive...
9:34 pm on Oct 8, 2018 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 9, 2000
posts:25730
votes: 822


Webmasters can block the archive if they choose.
9:42 pm on Oct 8, 2018 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:Sept 13, 2018
posts:355
votes: 68


Webmasters can block the archive if they choose.

This is not that easy. In my experience, their crawler is respecting neither the robots.txt directives nor the noindex tag. I once had to write to them to point out that they were archiving my sites when they shouldn't have been, and they more or less explained that they didn't know what the issue could be (to their credit, they did answer).

PS: Personally, until recently I thought the noindex tag was enough for all legitimate crawlers, including the Internet Archive, but I learned that it is not...
10:50 pm on Oct 8, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15314
votes: 708


someone had acquired the domain name . . . and downloaded the original site from the Internet Archive and put it back online
I am filled with admiration.

their crawler is respecting neither the robots.txt directives nor the noindex tag
“Block” doesn’t mean put up a sign that says No Admittance, or Employees Only. It means deadbolt the door.
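
For anyone wondering what "deadbolt the door" might look like in practice, here is a minimal sketch in Python: a WSGI middleware that refuses requests by User-Agent instead of relying on robots.txt or noindex. The tokens "archive.org_bot" and "ia_archiver" are assumptions about how the archive's crawlers identify themselves, and the names BLOCKED_UA_TOKENS, is_blocked and BlockArchiveCrawler are made up for illustration. Check your own access logs before trusting any of it.

```python
# Assumed User-Agent substrings; verify them against your own access logs.
BLOCKED_UA_TOKENS = ("archive.org_bot", "ia_archiver")


def is_blocked(user_agent):
    """Return True if the User-Agent header contains a blocked token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)


class BlockArchiveCrawler:
    """WSGI middleware (hypothetical name) that answers 403 to blocked agents."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else is passed through to the wrapped application.
        return self.app(environ, start_response)
```

Wrap an existing WSGI application with BlockArchiveCrawler(app) to try it. A User-Agent string is trivially spoofed, so this only stops crawlers that identify themselves honestly; the same check can usually be done in the web server configuration instead.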