
Google adds major libraries to its database

hot on the heels of the Google Scholar announcement


Robert Charlton

9:05 am on Dec 14, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Google adds major libraries to its database

NY Times article [nytimes.com]

Google, the operator of the world's most popular Internet search service, plans to announce an agreement Tuesday with some of the nation's leading research libraries and Oxford University to begin converting their holdings into digital files that would be freely searchable over the Web.

It may be only a step on a long road toward the long-predicted global virtual library. But the collaboration of Google and research institutions that also include Harvard, the University of Michigan, Stanford and the New York Public Library is a major stride in an ambitious Internet effort by various parties. The goal is to expand the Web beyond its current valuable, if eclectic, body of material and create a digital card catalog and searchable library for the world's books, scholarly papers and special collections.

Particularly after the Google Scholar announcement, I think it's another significant step forward.

[edited by: vitaplease at 9:12 am (utc) on Dec. 14, 2004]
[edit reason] link to original article [/edit]

Mr Bo Jangles

10:25 pm on Dec 14, 2004 (gmt 0)

10+ Year Member



Gee I find this fascinating - the enormity of it all:
from the Christian Science Monitor "132 miles of books" in one library alone!...

And it will be years before the project is complete. At Michigan, for example, the library stacks contain some 132 miles of books. Google hopes to get the digitization job at UM done in six years, according to John Wilkin, Michigan associate university librarian. "We feel this is part of the mission of a great public university - reaching out to the public with the resources that we have," he says.

Google has a self-interest in the project too, of course. These new library holdings could be a signature resource for a search firm facing increasing competition from rivals Yahoo and Microsoft's MSN.

Google executives declined to comment on exactly how they would go about transferring printed pages into the digital realm. But librarians say that the job surely won't be easy. Many institutions have been trying to do just that for years and have proceeded at the pace of a seventh grader plowing through Tolstoy's "War and Peace."

The University of Virginia, for instance, used foundation funding to digitize 800 volumes of early American fiction. It took them 10 years. "These are waters that libraries have been trying to navigate for years," says Mr. Gibson of Virginia.

Clark

10:26 pm on Dec 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Amazon apparently sent copies of the books to India and used cheap labor to digitize them.

BReflection

10:54 pm on Dec 14, 2004 (gmt 0)

10+ Year Member



Amazon apparently sent copies of the books to India and used cheap labor to digitize them.

Just curious, but do you have a citation for that?

Clark

11:39 pm on Dec 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't. I read a long article about Amazon's book-scanning campaign a while back and remember that it was outsourced to India.

skipfactor

11:57 pm on Dec 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>This is a sizable commitment of time and effort

Nice 2-minute piece on the CBS Evening News. They report 10 years, $150 million, and ended with something like, "the largest change in the library system for the next 500 years...".

ControlEngineer

1:31 am on Dec 15, 2004 (gmt 0)

10+ Year Member



How much do you want to bet that, in less than a couple of years, these library results will take up the first 1-3 SERPs for every major keyphrase.

Who owns the copyright of those books? the universities? Google? I think not. There must be millions of writers who are not happy about publishing their work for free without asking them.

All of the books that Google is scanning will be those in the public domain. In other words, old.

Very much, perhaps most, of the information in the library (and on the web) that people need and use will not be included in the Google plan. Literature, art, and historical works will be, and Google could be providing significant competition for sites in those fields. However, for most sites this plan will not have any effect.

In general, I think that it is a good thing.

7_Driver

3:34 am on Dec 15, 2004 (gmt 0)

10+ Year Member



This has to be a good thing for mankind. More content in the Google index will make things a little tougher for SEOs - but in the great scheme of things I guess that's a price worth paying :-)

One aspect that I haven't seen commented on yet is this: Google are creating a backup of some of the world's great libraries. Afterwards, a fire at one of these libraries (while still a disaster) would be a much smaller loss to mankind.

With that in mind, it's surprising that some of the libraries have only agreed to a relatively small number of books being digitised initially.

Hopefully that's just a toe-dipping exercise, and success will encourage fuller participation.

Oh yes - and another big "Thank you" to Google. Nice to see the private sector doing something you might think governments should have done years ago...

httpwebwitch

4:15 am on Dec 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



we who benefit from web-based advertising might want to get some books into those libraries, pronto. :)

Personally, I'm thrilled that I'll be able to study from those libraries without leaving home! This is amazing news. But I wonder, is G going to give us a few months for free, then begin charging a per-page fee?

And how will they handle the sticky issue of copyright law?

europeforvisitors

5:51 am on Dec 15, 2004 (gmt 0)



How much do you want to bet that, in less than a couple of years, these library results will take up the first 1-3 SERPs for every major keyphrase.

Somehow I don't think the works of Chaucer, early American fiction, or scientific treatises from the Victorian era will clutter up the SERPs for keyphrases like "debt consolidation," "Viagra," or "London hotels." :-)

dmedia

7:59 am on Dec 15, 2004 (gmt 0)

10+ Year Member



Here's another interesting "possibility"

Remember micropayments? - Neither do I :)

Imagining now .. a book excerpt appears in G-SERPS .. but it's not a public domain item.

G provides a "click to continue reading" link that's tied into your "G credit account" .. G deducts a micropayment for pageview (whatever) .. splits the booty with the publisher/copyright holder.

The promise of micropayments finally realized .. facilitated by "Google-Pay.com" .. (all right, now I'm feeling woozy)

Hey GoogleGuy .. don't sell those options just yet ;)

tomkee

8:18 am on Dec 15, 2004 (gmt 0)

10+ Year Member



For search of scanned books to work, they must use OCR software. Any idea as to how good the state of the art in OCR is? How reliably does it convert information into text? Especially in the case of unusual fonts and busy backgrounds, mathematical symbols, unusual words....

vitaplease

8:23 am on Dec 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



google in the year 2009

[webmasterworld.com...]

[edited by: vitaplease at 8:25 am (utc) on Dec. 15, 2004]

grandpa

8:23 am on Dec 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Who owns the copyright of those books? the universities? Google? I think not. There must be millions of writers who are not happy about publishing their work for free without asking them.

All of the books that Google is scanning will be those in the public domain. In other words, old.

In the article I read, there was talk of revenue sharing with those copyright owners who would allow their books into the system. I'd speculate that many writers will be contacted. For those books already in the public domain, it's only a matter of how much Google is willing to pay to add them. You can be certain there is a financial interest in this venture... philanthropic as it may sound. Nonetheless, my heart throbs wildly at the prospect of perusing this valuable resource. By the way, Google: nice work with the Catalogs.

Powdork

8:51 am on Dec 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



When you search google scholar beta (even for debt consolidation) there are no adwords. When it comes out of beta I am reasonably certain there will be.

Imagine every page of every book with a column of Ads by Greeeeeedy next to it.
Maybe I'm cynical, but I just don't buy the 'Do no evil' mantra anymore. Everything Google does has $ attached. More so even than M$.

And Googleguy, you (or the original Googleguy) used to be helpful. You were a liaison between the 'Google that did no evil' and the oft-at-odds webmaster community, and you did it superbly well. Nowadays you show up to plug new features and point links to Googleblog like this is the Leno show. Wasssup?

hunderdown

3:21 pm on Dec 15, 2004 (gmt 0)



I have a few points to make about the scanning and digitization process, based on experience: I worked for a small company a couple of years ago creating ebook versions of print books. The company ran out of money but the experience was interesting.

I am VERY skeptical that they will be able to convert the books as quickly as they project. I can see a couple of bottlenecks.

The company I worked for sent books to India, among other places, to be scanned. They were sliced up and run through some kind of feeder. Google will not be able to do this--the libraries want non-destructive scanning, and the Times article notes that they plan to do the scanning near or in the libraries, possibly with pages being turned by hand.

Then you have to do the OCR. To answer tomkee's question: OCR programs are phenomenally accurate with standard fonts on clean white paper. They will not be as accurate with old books. How do you fix the error-laden files? Some automation is possible by running them through a spell-checker, but if you want reasonably clean files, you need to have them proofread by a human being.
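The two-stage cleanup described above (an automated correction pass, then human proofreading for whatever remains) can be sketched in a few lines. This is a hypothetical illustration, not Google's actual pipeline: the dictionary and the OCR confusion pairs are made-up examples of the kind of substitutions a real system would use.

```python
# Dictionary-assisted OCR cleanup sketch (hypothetical, for illustration).
# A tiny lexicon stands in for a full dictionary.
DICTIONARY = {"modern", "library", "scanning", "volume", "of", "books"}

# Common OCR misreads: (characters as scanned, characters likely intended).
# E.g. "rn" is often scanned as "m", "l" as the digit "1".
CONFUSIONS = [("m", "rn"), ("1", "l"), ("0", "o")]

def candidates(token, wrong, right):
    """Yield every variant of `token` with one occurrence of `wrong` replaced."""
    start = 0
    while True:
        i = token.find(wrong, start)
        if i == -1:
            return
        yield token[:i] + right + token[i + len(wrong):]
        start = i + 1

def clean_token(token):
    """Return the token if it's a known word, else the first known fix."""
    if token in DICTIONARY:
        return token
    for wrong, right in CONFUSIONS:
        for fix in candidates(token, wrong, right):
            if fix in DICTIONARY:
                return fix
    return token  # still unknown: flag for a human proofreader

def clean_text(text):
    return " ".join(clean_token(t) for t in text.lower().split())

print(clean_text("modem 1ibrary scanning"))  # prints "modern library scanning"
```

In practice the dictionary would be a full lexicon and the confusion list would come from measured OCR error statistics; anything the automated pass can't resolve is exactly what gets queued for the human proofreader.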

It's a great project, but it will probably take two or three times as long to complete as they expect--or they will have to devote additional resources to it in a year or two when they see that the process is slow.

BReflection

4:46 pm on Dec 15, 2004 (gmt 0)

10+ Year Member



It's a great project, but it will probably take two or three times as long to complete as they expect--or they will have to devote additional resources to it in a year or two when they see that the process is slow.

Well, if you have to choose between the opinion of someone with billions of dollars and the opinion of someone on an internet message board, you take the former. Anyway, haven't you seen their OCR success with Google Catalogs? There are WAY more varieties of fonts/type in those catalogs than they are going to find in these libraries.

turtle1776

5:35 pm on Dec 15, 2004 (gmt 0)

10+ Year Member



I think this particular initiative is overhyped because, as was mentioned earlier, they will only have access to older works that are out of copyright. If Google started showing 100-year-old materials at the top of their SERPs on a regular basis, no one would use Google anymore.

However, I think this is just the tip of what's to come. Google News is more instructive. I think book and magazine publishers will start cutting deals with Google (at the publisher level, not the individual author level as was mentioned earlier) to include their current materials in Google, with Google paying them a share of their Adwords $, and in the case of magazines, providing subscription links. Some works might be walled off as part of a paid service, like Questia, and perhaps called something like Google Premium.

This is when things will get interesting -- when current information is available on Google, just like it is already at Yahoo. That is when web sites will need to fear the decline of free traffic from Google. But 100 year old books don't pose much of a threat.

europeforvisitors

6:07 pm on Dec 15, 2004 (gmt 0)



I think book and magazine publishers will start cutting deals with Google (at the publisher level, not the individual author level as was mentioned earlier) to include their current materials in Google, with Google paying them a share of their Adwords $, and in the case of magazines, providing subscription links. Some works might be walled off as part of a paid service, like Questia, and perhaps called something like Google Premium.

Sounds like Compuserve's database archives, circa 1989. :-)

AaronL

7:59 pm on Dec 15, 2004 (gmt 0)

10+ Year Member



Clark: Amazon apparently sent copies of the books to India and used cheap labor to digitize.

BReflection: do you have a citation for that?

Here is the Wired article [wired.com] on Amazon's digitization initiative. Amazon released this just about a year ago.

whoisgregg

8:00 pm on Dec 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This project has nothing to do with the Web SERPs that we concern ourselves with. This is the logical expansion of the Google print program on a scale that makes it one of the most significant steps toward the long term preservation of all recorded human knowledge.

Trying to draw connections with how this will affect Google's ability to search the web only reveals the dramatic difference between the narrow scope of the tunnel-visioned webmasters out there and the grand scope of Google's mission. Folks here scoffed at "organize all the world's information" as inflated corporate-speak fluff; now we know that Google actually has that intention.

I am excited that this project will happen in my lifetime. Where do I go to volunteer on the weekends to scan books?

As far as OCR concerns go, if you haven't checked out Google Catalogs then you don't know how far they've come with that technology. Old books will be a breeze -- and the original image of the actual printing of the book can (and I predict will) be preserved.

hunderdown

9:20 pm on Dec 15, 2004 (gmt 0)



Thanks, breflection. Good to understand where you're coming from. Not swayed by personal experience, but definitely swayed by a big company's press releases. I guess we should all stop posting here and just let the moderators post news clippings and press releases.

Yes, I know Google Catalogs. Nice work. Much smaller in scope, and they didn't have to worry about non-destructive scanning....

Just to be clear: I'm not saying they can't do it. I'm just saying I think it will take longer than they are projecting. If personal experience doesn't mean anything to you, then go do a little research into any of the other digitization projects out there, completed or ongoing. How many of them were completed on schedule or, if not complete, are currently on schedule?

whoisgregg

10:02 pm on Dec 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm not saying they can't do it. I'm just saying I think it will take longer than they are projecting.

The only projections so far are from third party analysts and reporters, from what I can tell after reading the press release [google.com], the blog entry [google.com], and an article or two. Since Google hasn't made any claims about how long they think it will take, I'd guess they know that it could range from a few years to a decade or more and don't want to make difficult predictions at this point.

Google director of project management Susan Wojcicki declined to say how much the project would cost and how long it would take.

[news.com.au...]

whoisgregg

10:04 pm on Dec 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



non-destructive scanning....

An excellent point; this probably destroys my hope that they could ever use "off-the-street" volunteers. :(

Scarecrow

4:05 am on Dec 16, 2004 (gmt 0)

10+ Year Member



Just because all the reporting during the first two days of this announcement has a "Gee Whiz" quality to it, do not assume that Google has scored some kind of coup.

The American Library Association, founded in 1876, has been fighting the Justice Department over the Patriot Act. It is not an insignificant organization. There is a long tradition among professional librarians of respecting intellectual freedom and freedom of political thought. I think we can expect the ALA, once they study the issue of Google's tracking using a single cookie across all their services, which expires in 2038 and has a unique ID in it, to have something to say on how libraries should approach offers from Google and similar engines.

There is no equivalent history of public-sphere responsiveness from Google, and this is where Google falls on its face. What if Google is served a subpoena by the FBI ten years from now, because the FBI wants a list of IP addresses of those who have been reading Karl Marx, along with the chapter and verse accessed by the person, the date/time stamp, all search terms used by that person within the last 30 days, and all their Gmail correspondence that the person thought they had erased? This is state-of-mind evidence sufficient to prove intent to a jury. It's not nearly as vague as records of titles borrowed from a library.

I know what most librarians would say. They'd say that since the Patriot Act of 2001, the ALA has recommended that they not keep borrowing records any longer than necessary. It's not a crime to not keep the records, but it is a felony to lie to the FBI. If you have the records, you have to hand them over or be guilty of obstruction of justice.

Look to the ALA to recommend that all contracts between libraries and search engines be written so as to guarantee the anonymity of those who access the material digitized from that library. Otherwise, the ALA might recommend withholding all books and documents with political content or that have political relevance. That would include some of the best political and anarchist philosophy of the 19th century, by the way, which is by now all in the public domain.

whoisgregg

5:44 pm on Dec 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What if Google is served a subpoena by the FBI ten years from now, because the FBI wants a list of IP addresses of those who have been reading Karl Marx

FUD. What if the FBI served subpoenas today to discover who'd been searching for Karl Marx on all the search engines? Your concerns aren't relevant to this discussion because the scenario in which such a thing would occur is not connected with the particular format of media indexed by Google or any search engine.

If it would be a problem with searching books in ten years, it'd already be a problem with searching web sites today. Political and anarchist sites abound, adding earlier print works to the mix won't unbalance the universe or destroy privacy.

The ALA should rightly see this as an opportunity to make all published works more freely available to more people. If the most important concern were protecting readers from an imagined future fascist government, then the ALA would recommend burning all books that would be dangerous to have on anyone's "reading list."

Scarecrow

7:43 pm on Dec 16, 2004 (gmt 0)

10+ Year Member



1) It is a problem today with web searching, and it could very likely become a growing problem in coming years. Read up on what's been happening over the last three years with government surveillance.

2) There is no way to address the problem today except to post on SEO forums and get denounced by pro-Google webmasters with a private interest in promoting ecommerce.

3) With Google approaching libraries, there is, perhaps for the first time, a new political factor involved. The ALA is organized and has influence within the library profession. Google has never, ever had to deal with a powerful organization that has a history of social concern and responsibility.

4) So why not get the ALA involved at this point? You don't make a problem go away by making it many times larger!

creepychris

5:48 am on Dec 17, 2004 (gmt 0)

10+ Year Member



Not entirely new. Project Gutenberg already has over 13,000 public domain books on the web. A great site, but I have seen it used time and time again for shady SEO: public domain works become the 'content' for a quick 20,000-page site.

whoisgregg

8:31 pm on Dec 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



2) There is no way to address the problem today except to post on SEO forums and get denounced by pro-Google webmasters with a private interest in promoting ecommerce.

Even responding would be going too far into politics for me to stay comfortably within the bounds of the TOS here. Have a nice day. :)

Scarecrow

1:10 pm on Dec 19, 2004 (gmt 0)

10+ Year Member



Either the NYT screwed up a page one story, or Google isn't doing the math right.

The NYT says, "At Stanford, Google hopes to be able to scan 50,000 pages a day within the month, eventually doubling that rate, according to a person involved in the project."

Stanford has 8 million volumes. At the University of Michigan, the librarian involved with this project calculates an average of 340 pages per volume. Let's assume 340 for Stanford also.

8 million times 340 equals 2,720,000,000.

At 100,000 pages per day, it will take Google 27,200 days to do Stanford.

That comes to 74.47 years, whereas press reports are estimating ten years for Stanford. At Michigan, which has 7 million volumes, the librarian involved says it will take just six years.
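The arithmetic above can be double-checked in a few lines of Python, using the figures assumed in this post (8 million Stanford volumes, 340 pages per volume, 100,000 pages scanned per day):

```python
# Back-of-envelope scan-rate math, using this thread's assumed figures.
volumes = 8_000_000          # Stanford's holdings, per the NYT
pages_per_volume = 340       # Michigan librarian's average, assumed for Stanford
pages_per_day = 100_000      # NYT's 50,000/day rate, "eventually doubling"

total_pages = volumes * pages_per_volume   # 2,720,000,000 pages
days = total_pages / pages_per_day         # 27,200 days
years = days / 365.25                      # about 74.5 years
print(total_pages, days, round(years, 2))
```

At that rate, the ten-year press estimates would require roughly seven to eight times the stated scanning throughput.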

GoogleGuy, why don't you just tell us what your anticipated scan rate is for Stanford and for Michigan? If it's going to take 74 years, I might not want to buy any stock, despite the hype I've read over the last few days.
