Google Asked To Hand Over Database Archives

New archive wants Google Database

     
1:19 pm on Aug 12, 2004 (gmt 0)

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



[siliconvalley.com...]

In that spirit, he also has asked Google to furnish him with a copy of its database, say with a six-month delay so Google's competitiveness doesn't suffer.

Google has yet to grant his request. But Kahle hopes the company will come around, especially in light of its claim that it wants to have a positive impact on the world. A Google spokeswoman declined to comment.

5:10 pm on Aug 12, 2004 (gmt 0)

WebmasterWorld Senior Member rfgdxm1 is a WebmasterWorld Top Contributor of All Time 10+ Year Member



>What about the WayBackMachine? Not doing a good enough job?

Kahle is the *founder* of the WaybackMachine, which is located at archive.org. They are one and the same.

5:40 pm on Aug 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ahh, w/o the archive, how else could we look back at stuff like this?

[web.archive.org...]

6:58 pm on Aug 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Correct me if I'm wrong, but isn't the web already an open archive?

It's open, but it's not an archive. It is unthinkable for an archive to contain only the latest version of a book (rather than all revisions, no matter how minor), and it is also unthinkable for an archive to delete stuff when it has "gone out of print".

The web is just a current snapshot, whereas an archive (a good one!) is a full-featured movie.

7:14 pm on Aug 12, 2004 (gmt 0)

10+ Year Member



One problem with archive.org is if the bot is *ever* blocked by robots.txt, ALL previously stored pages are deleted.

Not deleted, but made "unavailable." A real-time check is made to see if robots.txt exists before previously stored pages are shown. That doesn't mean they are deleted -- far from it. They're merely waiting until such time that the information can be useful once again.
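To make the mechanism concrete, here is a rough Python sketch of that kind of display-time check. It is only my guess at the logic, not archive.org's actual code: the function name `snapshots_visible` and the example URL are made up for illustration, and the only real API used is the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def snapshots_visible(current_robots_txt, url, agent="ia_archiver"):
    """Return True if stored copies of `url` would be shown right now.

    `current_robots_txt` is the text of the site's robots.txt as fetched
    today, or None if the site has no robots.txt. Hypothetical helper:
    nothing is deleted either way, the answer only gates display.
    """
    if current_robots_txt is None:
        # No robots.txt today means nothing blocks the old copies.
        return True
    parser = RobotFileParser()
    parser.parse(current_robots_txt.splitlines())
    return parser.can_fetch(agent, url)


blocking = "User-agent: ia_archiver\nDisallow: /\n"

# While the block is in place, the stored pages stay hidden.
print(snapshots_visible(blocking, "http://example.com/page.html"))

# Once the robots.txt disappears, the same pages are shown again.
print(snapshots_visible(None, "http://example.com/page.html"))
```

The point of the sketch is that the decision depends only on the robots.txt that exists today, not on the one that existed when the pages were crawled.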

I had a domain blocked and then I sold the domain name and moved everything to a new domain. The new owners don't have a robots.txt. Suddenly all the stuff I thought I had been blocking came alive and available on archive.org -- six years worth. I have no control over this because I no longer own the domain.

Archive.org is a slimy operation in my opinion. No respect for the rights of webmasters.

7:30 pm on Aug 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not deleted, but made "unavailable."

Some are made "unavailable", but some are lost: 404s abound for all sorts of reasons. Whatever the reason, it ain't an archive if some items are "unavailable". That is excusable for old-fashioned archives, where old rare books could be subject to restoration procedures, but not acceptable for a digital archive that can generate copies at near-zero cost.

Suddenly all the stuff I thought I had been blocking came alive and available on archive.org -- six years worth

Sorry to hear that, but it is naive to "protect" content using robots.txt. If it was meant for registered (paying) users, then you should have password-protected it.

The way I see the current situation, there is a balanced compromise: pages are either free and can make it into search engines, or paid and kept out of them.

7:39 pm on Aug 12, 2004 (gmt 0)

WebmasterWorld Senior Member rfgdxm1 is a WebmasterWorld Top Contributor of All Time 10+ Year Member



In that case, I stand corrected Scarecrow. I will call your attention then to:

[archive.org...]
[archive.org...]

You are going to have to read up on the DMCA, and how to do a proper removal request. If you created that content, and didn't sell the rights to that content when you sold the domain, then you have the copyright on it.

7:48 pm on Aug 12, 2004 (gmt 0)

10+ Year Member



Crawling the web means never having to say you're sorry. I can show you pages of mine at archive.org, where if you click "show source" you can see that there's a NOINDEX, NOFOLLOW meta in the headers.

Class action, anyone?

Actually, I think the ROBOTS, NOARCHIVE meta is working. I use that on everything now. Of course, before Yahoo kicked in this year, it was always GOOGLEBOT, NOARCHIVE - which did not work at archive.org.

I guess technically, archive.org isn't "indexing."
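For anyone who wants to check what their own pages actually declare, here is a small Python sketch that pulls the robots meta directives out of a page. It assumes a crawler reads both the generic ROBOTS meta and one addressed to its own name; the class and function names are hypothetical, treating `ia_archiver` as a meta name is my assumption, and whether any given crawler honors these tags is entirely up to that crawler.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and agent-specific metas."""

    def __init__(self, agent):
        super().__init__()
        # A crawler is assumed to read the generic tag plus its own name.
        self.names = {"robots", agent.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() not in self.names:
            return
        content = attr.get("content") or ""
        # Directives are comma separated and case-insensitive.
        self.directives.update(
            token.strip().lower() for token in content.split(",") if token.strip()
        )


def robots_directives(html, agent):
    """Return the set of directives that apply to `agent` on this page."""
    parser = RobotsMetaParser(agent)
    parser.feed(html)
    return parser.directives


page = '<head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head>'
found = robots_directives(page, "ia_archiver")
print("noindex" in found, "noarchive" in found)
```

NOINDEX and NOARCHIVE come out as separate directives, and a tag addressed only to GOOGLEBOT is never seen by a parser looking for a different name, which fits what was described above.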

8:00 pm on Aug 12, 2004 (gmt 0)

WebmasterWorld Senior Member rfgdxm1 is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Hmm...

[robotstxt.org...]

Highlighted in red:

"Note: Currently only few robots support this tag!"

[robotstxt.org...]

No mention of the meta tag there in the official RFC.

It doesn't look like the meta tags are officially recognized. Only robots.txt. And, as you point out, archiving and indexing are not necessarily the same. There may be people who actually want archive.org to archive their site but not have it appear in search engines. That would make sense for a site that expected almost all of its traffic to come from links on other sites, and not through SEs.

8:20 pm on Aug 12, 2004 (gmt 0)

10+ Year Member



This is getting humorous. You can actually call up old copies of your robots.txt at archive.org. I have many dozens listed and they clearly show:

User-agent: ia_archiver
Disallow: /

Time to dash off a fax to millionaire whiz-kid Brewster Kahle and ask him to take out all six years worth. Their own evidence gives me a prima facie case, which I wouldn't be able to prove otherwise!

8:56 pm on Aug 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Um, didn't Amazon buy webarchive.org? Why should Google give Amazon their database?
This 67 message thread spans 7 pages.
 
