Google Asked To Hand Over Database Archives

New archive wants Google Database

     
1:19 pm on Aug 12, 2004 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38047
votes: 11


[siliconvalley.com...]

In that spirit, he also has asked Google to furnish him with a copy of its database, say with a six-month delay so Google's competitiveness doesn't suffer.

Google has yet to grant his request. But Kahle hopes the company will come around, especially in light of its claim that it wants to have a positive impact on the world. A Google spokeswoman declined to comment.

5:10 pm on Aug 12, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member rfgdxm1 is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 12, 2002
posts:4479
votes: 0


>What about the WayBackMachine? Not doing a good enough job?

Kahle is the *founder* of the WaybackMachine, which is located at archive.org. They are one and the same.

5:40 pm on Aug 12, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 16, 2003
posts:1298
votes: 0


ahh, w/o the archive, how else could we look back at stuff like this?

[web.archive.org...]

6:58 pm on Aug 12, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


Correct me if I'm wrong, but isn't the web already an open archive?

It's open, but it's not an archive. It is unthinkable for an archive to contain only the latest version of a book (rather than all revisions, no matter how minor); it is also unthinkable for an archive to delete stuff when it has "gone out of print".

The web is just a current snapshot, whereas an archive (a good one!) is a full-featured movie.

7:14 pm on Aug 12, 2004 (gmt 0)

Full Member

10+ Year Member

joined:Jan 13, 2004
posts:208
votes: 0


One problem with archive.org is if the bot is *ever* blocked by robots.txt, ALL previously stored pages are deleted.

Not deleted, but made "unavailable." A real-time check is made to see if robots.txt exists before previously stored pages are shown. That doesn't mean they are deleted -- far from it. They're merely waiting until such time that the information can be useful once again.

I had a domain blocked, and then I sold the domain name and moved everything to a new domain. The new owners don't have a robots.txt. Suddenly all the stuff I thought I had been blocking came alive and available on archive.org -- six years' worth. I have no control over this because I no longer own the domain.
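For anyone who wants to see why that happens, here is a minimal sketch of the kind of real-time check described above. It is only a guess at the logic, not archive.org's actual code: the function name `snapshots_visible` and the example URL are made up for illustration, and the only real API used is Python's standard `urllib.robotparser`. The point is that the decision depends solely on the robots.txt being served today, so a missing file makes old copies visible again.

```python
from urllib.robotparser import RobotFileParser

# User-agent name taken from the robots.txt quoted elsewhere in this thread.
ARCHIVE_AGENT = "ia_archiver"


def snapshots_visible(current_robots_txt: str, url: str) -> bool:
    """Return True if stored copies of `url` would be shown.

    Hypothetical gate: only the robots.txt served *right now* is consulted,
    so an empty or missing file means nothing is blocked.
    """
    parser = RobotFileParser()
    parser.parse(current_robots_txt.splitlines())
    return parser.can_fetch(ARCHIVE_AGENT, url)


# While the original owner blocks the archive crawler, old pages stay hidden.
blocked = "User-agent: ia_archiver\nDisallow: /\n"
print(snapshots_visible(blocked, "http://example.com/old-page.html"))  # False

# The new owner serves no robots.txt, so the same old pages become visible.
no_robots = ""
print(snapshots_visible(no_robots, "http://example.com/old-page.html"))  # True
```

Under this assumption, nothing about the history of the site matters, only whoever controls robots.txt at the moment of the lookup.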

Archive.org is a slimy operation in my opinion. No respect for the rights of webmasters.

7:30 pm on Aug 12, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


Not deleted, but made "unavailable."

Some are made "unavailable"; some are lost, however - 404s abound for all sorts of reasons. Whatever the reason, it ain't an archive if some items are "unavailable" - excusable for old-fashioned archives, where old rare books could be subject to restoration procedures, but not acceptable for a digital archive that can generate copies at near-zero cost.

Suddenly all the stuff I thought I had been blocking came alive and available on archive.org -- six years' worth

Sorry to hear that, but it is naive to "protect" content using robots.txt. If it was meant for registered (paying) users, then you should have password-protected it.

The way I see it, the current situation is a balanced compromise: pages are either free and can make it into search engines, or paid and kept out of them.

7:39 pm on Aug 12, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member rfgdxm1 is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 12, 2002
posts:4479
votes: 0


In that case, I stand corrected Scarecrow. I will call your attention then to:

[archive.org...]
[archive.org...]

You are going to have to read up on the DMCA, and how to do a proper removal request. If you created that content, and didn't sell the rights to that content when you sold the domain, then you have the copyright on it.

7:48 pm on Aug 12, 2004 (gmt 0)

Full Member

10+ Year Member

joined:Jan 13, 2004
posts:208
votes: 0


Crawling the web means never having to say you're sorry. I can show you pages of mine at archive.org, where if you click "show source" you can see that there's a NOINDEX, NOFOLLOW meta in the headers.

Class action, anyone?

Actually, I think the ROBOTS, NOARCHIVE meta is working. I use that on everything now. Of course, before Yahoo kicked in this year, it was always GOOGLEBOT, NOARCHIVE - which did not work at archive.org.

I guess technically, archive.org isn't "indexing."
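To make the difference between those tags concrete, here is a rough sketch of how a crawler could read them. This is an illustration under assumptions, not a description of what archive.org or any other robot actually does: `RobotsMetaParser` and `asks_not_to_archive` are hypothetical names, and whether a given crawler honors the tag at all is entirely up to that crawler. It uses only Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name=... content=...> tags, keyed by lowercased name."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        name = attr.get("name", "").strip().lower()
        if not name:
            return
        # Split "NOINDEX, NOFOLLOW" into a set of lowercase tokens.
        tokens = {t.strip().lower() for t in attr.get("content", "").split(",") if t.strip()}
        self.directives.setdefault(name, set()).update(tokens)


def asks_not_to_archive(html: str, crawler: str) -> bool:
    """True if the page carries NOARCHIVE addressed to all robots or to `crawler`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    for name in ("robots", crawler.lower()):
        if "noarchive" in parser.directives.get(name, set()):
            return True
    return False


noindex_only = '<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">'
googlebot_only = '<meta name="GOOGLEBOT" content="NOARCHIVE">'
all_robots = '<meta name="ROBOTS" content="NOARCHIVE">'

print(asks_not_to_archive(noindex_only, "ia_archiver"))    # False
print(asks_not_to_archive(googlebot_only, "ia_archiver"))  # False
print(asks_not_to_archive(all_robots, "ia_archiver"))      # True
```

Read this way, NOINDEX, NOFOLLOW says nothing about archiving, and a GOOGLEBOT tag is addressed to one crawler only, which would be consistent with what is described above.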

8:00 pm on Aug 12, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member rfgdxm1 is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 12, 2002
posts:4479
votes: 0


Hmm...

[robotstxt.org...]

Highlighted in red:

"Note: Currently only few robots support this tag!"

[robotstxt.org...]

No mention of the meta tag there in the official RFC.

It doesn't look like the meta tags are officially recognized. Only robots.txt is. And, as you point out, archiving and indexing are not necessarily the same. There may be people who actually want archive.org to archive their site but don't want it to appear in search engines. That would make sense for a site that expected almost all of its traffic to come from links on other sites, and not through SEs.

8:20 pm on Aug 12, 2004 (gmt 0)

Full Member

10+ Year Member

joined:Jan 13, 2004
posts:208
votes: 0


This is getting humorous. You can actually call up old copies of your robots.txt at archive.org. I have many dozens listed and they clearly show:

User-agent: ia_archiver
Disallow: /

Time to dash off a fax to millionaire whiz-kid Brewster Kahle and ask him to take out all six years' worth. Their own evidence gives me a prima facie case, which I wouldn't be able to prove otherwise!

8:56 pm on Aug 12, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


Um, didn't Amazon buy webarchive.org? Why should Google give Amazon their database?