
Google News Archive Forum

This 67 message thread spans 3 pages.
Google Asked To Hand Over Database Archives
New archive wants Google Database
Brett_Tabke




msg:151877
 1:19 pm on Aug 12, 2004 (gmt 0)

[siliconvalley.com...]

In that spirit, he also has asked Google to furnish him with a copy of its database, say with a six-month delay so Google's competitiveness doesn't suffer.

Google has yet to grant his request. But Kahle hopes the company will come around, especially in light of its claim that it wants to have a positive impact on the world. A Google spokeswoman declined to comment.


 

Lord Majestic




msg:151878
 1:27 pm on Aug 12, 2004 (gmt 0)

Might be a bit hard for a soon-to-be public company to give away what are essentially their main assets.

The Web is growing at about 20 terabytes of compressed data a month, which is manageable, Kahle said.

20TB compressed (100TB uncompressed) is about 2 bln pages (50% of Google's index) @ 50kb per page. Sounds a bit excessive to me...
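The arithmetic in the post can be checked quickly; a minimal sketch, assuming the figures quoted above (20 TB/month compressed, roughly 5:1 text compression, 50 KB per page, decimal units throughout):

```python
# Sanity check of the figures above (all values are the post's assumptions,
# not measured data: 20 TB/month compressed, ~5:1 compression, 50 KB/page).
compressed_tb_per_month = 20
compression_ratio = 5            # assumed: text compresses roughly 5:1
avg_page_kb = 50

uncompressed_tb = compressed_tb_per_month * compression_ratio   # 100 TB
pages = uncompressed_tb * 1e9 / avg_page_kb                     # 1 TB = 1e9 KB
print(f"{uncompressed_tb} TB uncompressed ~= {pages / 1e9:.0f} bln pages")
# prints: 100 TB uncompressed ~= 2 bln pages
```

So the post's "2 bln pages" follows directly from its own assumptions; whether 50 KB/page was realistic in 2004 is a separate question.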

[edited by: Lord_Majestic at 1:41 pm (utc) on Aug. 12, 2004]

Chris_D




msg:151879
 1:28 pm on Aug 12, 2004 (gmt 0)

If it looks like there's commercial motivation, and it smells like there's commercial motivation... and there are ways of commercialising the data - then maybe it's not really about saving the world...

rfgdxm1




msg:151880
 1:41 pm on Aug 12, 2004 (gmt 0)

Exactly what is he asking from Google? Copies of their old cache of web pages? If so, sounds like a noble goal.

IITian




msg:151881
 1:42 pm on Aug 12, 2004 (gmt 0)

He learned a tough lesson when search engine Infoseek initially agreed to give him a copy of its database. When Infoseek went bankrupt, though, the lawyers didn't follow through. So Kahle is adamant that Google should act soon.

Is he trying to tell us something? ;)

What about the tracking data G is happily storing away? Who gets it?

justgowithit




msg:151882
 1:44 pm on Aug 12, 2004 (gmt 0)

He learned a tough lesson when search engine Infoseek initially agreed to give him a copy of its database. When Infoseek went bankrupt, though, the lawyers didn't follow through. So Kahle is adamant that Google should act soon.

Implying that Google will go bankrupt in the near future...? Sorry buddy, but I think Google has bigger worries at the moment than handing over the fruit of their labors since day one. It sounds like his motives are true, but who knows.

ogletree




msg:151883
 2:10 pm on Aug 12, 2004 (gmt 0)

Their database is not what's valuable - it's their algo and their ability to dispense so many SERPs at once, fast. Anybody can write a spider and collect the internet. I don't know what his problem is. All he needs is a good spider. They do have a lot of old data - GG has said they never throw anything away - so I don't know why that would be a problem. It might even help them out: if he stores it, they don't have to spend the money to store it.

Badger37




msg:151884
 2:12 pm on Aug 12, 2004 (gmt 0)

These days the first half wouldn't really be worth having as it's only spammy directories and not 'real' sites! ;)

Lord Majestic




msg:151885
 2:17 pm on Aug 12, 2004 (gmt 0)

All he needs is a good spider.

And lots of bandwidth: 4 bln pages @ 50k each is enough to load up a big, fat, expensive ~588Mbit pipe for a month under ideal conditions. It's anything but cheap.
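The pipe-size figure can be reproduced; a minimal sketch, assuming the post's numbers (4 bln pages at 50 KB each, fetched non-stop over a 30-day month, decimal units). The exact answer depends on how many days you call a "month", which presumably accounts for the post's slightly lower ~588 Mbit figure:

```python
# Rough bandwidth estimate behind the "~588Mbit pipe" claim above.
# Assumptions (from the post, not measured): 4 bln pages, 50 KB each,
# downloaded at a constant rate over a 30-day month.
pages = 4e9
page_kb = 50
total_bits = pages * page_kb * 1000 * 8        # decimal KB -> bits

seconds_per_month = 30 * 24 * 3600
mbit_per_s = total_bits / seconds_per_month / 1e6
print(f"~{mbit_per_s:.0f} Mbit/s sustained")   # ~617 Mbit/s for 30 days
```

Either way, the order of magnitude stands: a full-web crawl in a month needs a sustained multi-hundred-megabit link, before retries and overhead.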

It might help them out if he stores it then they don't have to spend the money to store it.

No it won't, because no sane multibillion-dollar business would count on external storage like that - they would still have to pay good money to store all of this themselves anyway.

I don't think Google would get extra brownie points for doing it, either, as (and I might be wrong here, just like in everything else) in my humble view most people just care about being able to find a page on Google and possibly access it from the cache.

[edited by: Lord_Majestic at 2:22 pm (utc) on Aug. 12, 2004]

trillianjedi




msg:151886
 2:22 pm on Aug 12, 2004 (gmt 0)

I'm asking the same question as rfgdxm1.

What exactly is it that he's after?

TJ

john_k




msg:151887
 2:37 pm on Aug 12, 2004 (gmt 0)

If it looks like there's commercial motivation, and it smells like there's commercial motivation... and there are ways of commercialising the data - then maybe it's not really about saving the world...

I agree completely. At the very least, he will want this venture to raise enough money to pay for his own full-time job. Then there are a few administrators and IT folks that will need to be paid. So usage fees and/or government grants will follow.

If Google does, for some unfathomable reason, decide to grant his wish, then they should do so with some very tight strings attached. My suggestion would be an annual license fee of a few hundred million dollars. As long as the project remains non-commercialized and doesn't threaten Google's own commercial interests, the fee is waived. When the project starts to get commercial aspirations, which it will, then G stops waiving the fee. One other thought is that the fees are "suspended" instead of waived, with the provision that, should he ever sue Google to try and cut those strings, the past fees become due immediately, in full, with interest.

Marcia




msg:151888
 2:47 pm on Aug 12, 2004 (gmt 0)

That guy is smoking socks.

encyclo




msg:151889
 2:49 pm on Aug 12, 2004 (gmt 0)

What exactly is it that he's after?

It's a publicity stunt - much the same as when they offered the film Fahrenheit 9/11 for download for a few hours, after Michael Moore said something about encouraging the free distribution of the film (they pulled it because it is still copyrighted, whatever the director says).

Archive.org is trying to place itself as the official internet library - which they simply are not. I banned their spider eons ago with no regrets.

That's not to say that a web library is a bad idea, but archive.org's incomplete, slow and awkward implementation is not the right way forward.

john_k




msg:151890
 2:55 pm on Aug 12, 2004 (gmt 0)

Brewster's goal: Store everything. ``It is possible,'' he proclaimed last week at a conference at IBM's Almaden Research Center in San Jose. ``It could be one of the greatest achievements of all time.''

A lot of things are possible. That doesn't mean it is a good idea to do them, or that you have an "entitlement" to other people's property simply because you thought of it.

If the project is important enough to do, then it is also too important to rely upon an outside commercial interest as its sole (or primary) data source. As others have said, they should build their own spider. Even if it takes a few years to get it right, what do they lose? 10 years of the early web. In the grand scheme of things, I would say that isn't too big of a deal. One hundred years from now, if this is the ONLY surviving archive, and it is missing the first ten years, it will still fall into the "that's unfortunate" category. But the world will still be spinning, and civilization won't be in worse shape for not having the data.

Build your own spider to collect your own data.

kaled




msg:151891
 3:08 pm on Aug 12, 2004 (gmt 0)

So you believe that Google should SELL (for a profit) their cached copy of the internet - surely no one out there is going to say that is "fair use" with respect to copyright.

You could argue that, technical problems aside, all search engines should be under an obligation to sell their spidered data AT COST (i.e. no profit). After all, the data is not theirs; it belongs to the webmasters. That is not to say that indexing tables etc. should be included, but raw cached copies of web pages should be available to purchase for the cost of the storage media plus an allowance for duplication.

Kaled.

ogletree




msg:151892
 3:17 pm on Aug 12, 2004 (gmt 0)

I would like to get a hold of WT db.

Lord Majestic




msg:151893
 3:17 pm on Aug 12, 2004 (gmt 0)

That is not to say that indexing tables etc. should be included, but raw cached copies of web pages should be available to purchase for the cost of the storage media plus an allowance for duplication.

Due to the way Google indexes data, it is possible to reconstruct meaningful text from the index itself - if you ban cached copies, search engines will just serve reconstructions from the index; even better, that is cheaper for them, since the index has to be stored anyway.

You also fail to appreciate that it costs a lot in bandwidth to get all these pages - even 1 GB of data has a non-zero cost of acquisition.

rfgdxm1




msg:151894
 3:17 pm on Aug 12, 2004 (gmt 0)

>Even if it takes a few years to get it right, what do they lose? 10 years of the early web. In the grand scheme of things, I would say that isn't too big of a deal. One hundred years from now, if this is the ONLY surviving archive, and it is missing the first ten years, it will still fall into the "that's unfortunate" category. But the world will still be spinning, and civilization won't be in worse shape for not having the data.

[mediahistory.umn.edu...]

:(

john_k




msg:151895
 3:19 pm on Aug 12, 2004 (gmt 0)

So you believe that Google should SELL (for a profit) their cached copy of the internet - surely no one out there is going to say that is "fair use" with respect to copyright.

You are right.

I guess I was groping for a way to politely tell him "no way, we know what you are up to, and this is a moronic request." (Of course I have no way of knowing if that is what folks at Google think of his request - it's just what I think of it.)

Your point of "fair use" restrictions would seem to be enough. If Google were to hand over the data (even for free), they would very likely be sued by multiple publishers over copyright issues.

PCInk




msg:151896
 3:22 pm on Aug 12, 2004 (gmt 0)

Giving or selling (whichever - it does not matter) this data to a third party in any form would be a breach of copyright. Any webmaster could sue Google for passing on their (the webmasters') data. The webmaster has allowed Google to spider their site (for a two-way benefit: they make money off searches and I get visitors from them), and now Google would be giving or selling that copyrighted data to another party without each and every webmaster's permission.

This would be similar to a record company allowing me to buy one of their CDs (two-way benefit - they make money, I enjoy listening) and then I copy it and give it away, or copy it and sell it. Either way, it is illegal.

creative craig




msg:151897
 3:26 pm on Aug 12, 2004 (gmt 0)

Good point PCInk. Would I have to opt in to say I want my site distributed? If so, how do I opt out?

john_k




msg:151898
 3:34 pm on Aug 12, 2004 (gmt 0)

rfgdxm1 - I agree that the destruction of the library in Alexandria was tragic. But we are talking here about something quite different. Every web page is not a great scholarly work. Not even most of them. Taken as a whole, the Internet IS a significant (understatement!) body of knowledge. But Kahle can create his own spider. He can cache much of the web. He can start today. He could have started last year, or whenever he first thought of it. And he isn't the first person to think of it. And nobody is seeking to destroy his library.

Ideally, he could start by spidering the sites indexed in DMOZ. That index is purposely managed by breathing people to offer a diverse selection of websites. It would seem to provide a much sounder foundation upon which he can build his archive of everything. After he has tackled DMOZ he might have a little more credibility to make such a request of Google.

If Google is destroyed by an invading army or by Chapter 11, the rest of the Internet will still be there waiting for Kahle.

Lord Majestic




msg:151899
 3:37 pm on Aug 12, 2004 (gmt 0)

If so how do I opt out?

It might be impossible to achieve that just by wishing for it, much as it appears to be impossible to control P2P distribution of copyrighted music - short of some kind of clever logic that serves different pages depending on what the user wants. That can't be indexed easily, just as you can't index all the searches done on Google, not least because you don't know all the searches.

If today search engines were limited in what they can do with the data, beyond directly hosting it (not just static HTML!) to compete with the original sites, then tomorrow it would not be possible to build and sell a database of web servers in use, like NetCraft's.

It's not the fact that the data is sold, it's how the data is going to be used - if the guy analyses it and builds a new search engine, then it's as much fair use as it currently is.

[edited by: Lord_Majestic at 4:05 pm (utc) on Aug. 12, 2004]

kaled




msg:151900
 4:03 pm on Aug 12, 2004 (gmt 0)

I have a feeling in the back of my mind (but I could be mistaken) that book publishers in the UK are obliged to supply a copy of every book they publish to the British library.

It is not practical to require every webmaster to submit their site to a central database, but I don't believe many webmasters would object to a non-profit agency archiving their sites for use by historians in the future PROVIDED that bandwidth issues do not cause problems for the sites in question.

However, if a company were to try to exploit that data in some way for profit, the argument becomes a whole lot more complicated. Perhaps Kahle should be petitioning the Government to extend the remit of the national library rather than attempting this venture himself.

Kaled.

creative craig




msg:151901
 4:24 pm on Aug 12, 2004 (gmt 0)

It's not the fact that the data is sold, it's how the data is going to be used - if the guy analyses it and builds a new search engine, then it's as much fair use as it currently is.

I would deem it to be unfair if Google were not to ask the permission of webmasters before distributing their sites.

They give an option not to be indexed or cached with the use of Meta tags and robots.txt which you can use if needed...

<meta name="Google" content="nodistribute, noresale">

the_nerd




msg:151902
 4:31 pm on Aug 12, 2004 (gmt 0)

ogletree,

Their database is not what's valuable - it's their algo and their ability to dispense so many SERPs at once, fast. Anybody can write a spider and collect the internet. I don't know what his problem is

But not every spider is allowed to spider as many pages as Google's. I'd probably block him, because I see no commercial benefit in having somebody sucking in my pages.

nerd

ogletree




msg:151903
 4:33 pm on Aug 12, 2004 (gmt 0)

I did not think of that. I hope he gets it. I think what he has now is a very nice resource.

Lord Majestic




msg:151904
 4:33 pm on Aug 12, 2004 (gmt 0)

I'd probably block him because I see no commercial benefit to having somebody sucking in my pages.

And what if whoever builds that spider creates the next Google? You would be at a great advantage for having been spidered! Hell, you don't want one dominant force in the market - monopolies never do good in the long run! By giving other startup search engines a chance, you help keep Google in check.

I understand when a bad spider doesn't obey robots.txt or opens too many connections, thus DoSing the site - but hey, what have you really got to lose? The chances are small, but so are your costs of letting some robot get your pages. It's the Venture Capitalist strategy - lots of junk, but one pearl justifies it all.

rfgdxm1




msg:151905
 4:44 pm on Aug 12, 2004 (gmt 0)

>I did not think of that. I hope he gets it. I think what he has now is a very nice resource.

I can see another advantage of archive.org having Google's data. One problem with archive.org is if the bot is *ever* blocked by robots.txt, ALL previously stored pages are deleted. No problem if this is what the creator of those pages wants. However, this means if I own widgets.org and build a large, informative site there, if 10 years later I die and someone else buys the expired domain name and blocks the archive.org bot, all my own content vanishes.

However, Google currently is keeping a database of expired domains so SEO types can't buy them and use the PR from old links to their advantage. If archive.org has this database of expired domain names, they could use that to delete only the new content if someone buys an expired domain name. The old archived pages created by the former domain owner will still remain.

StupidScript




msg:151906
 5:06 pm on Aug 12, 2004 (gmt 0)

Correct me if I'm wrong, but isn't the web already an open archive?

What I mean is that it's all electronic information accessible from any capable computer in the world at any time. Unless the page content is useless, SOMEbody generally snags a copy for their own use on another server, and usually this activity is multiplied many times over for good content.

Is there some problem with works being "lost forever" on the web? Would this problem be made significantly less by making a new archive of all electronic data and storing it within a single storage system controlled by a single business entity?

What about the WayBackMachine? Not doing a good enough job?

I can understand a desire to digitize currently un-digitized content, like books, paintings, sculptures and such. But re-copying electronic documents onto yet another storage device seems to be the definition of redundancy. If the web exploded tomorrow, would his copies survive to be explored by future generations where the rest of the copies would not?

$ound$ like a $cheme, to me.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved