Forum Moderators: open
NY Times article [nytimes.com]
Google, the operator of the world's most popular Internet search service, plans to announce an agreement Tuesday with some of the nation's leading research libraries and Oxford University to begin converting their holdings into digital files that would be freely searchable over the Web. It may be only a step on a long road toward the long-predicted global virtual library. But the collaboration of Google and research institutions that also include Harvard, the University of Michigan, Stanford and the New York Public Library is a major stride in an ambitious Internet effort by various parties. The goal is to expand the Web beyond its current valuable, if eclectic, body of material and create a digital card catalog and searchable library for the world's books, scholarly papers and special collections.
Particularly after the Google Scholar announcement, I think it's another significant step forward.
[edited by: vitaplease at 9:12 am (utc) on Dec. 14, 2004]
[edit reason] link to original article [/edit]
And it will be years before the project is complete. At Michigan, for example, the library stacks contain about 132 miles of books. Google hopes to get the digitization job at UM done in six years, according to John Wilkin, Michigan associate university librarian. "We feel this is part of the mission of a great public university - reaching out to the public with the resources that we have," he says. Google has a self-interest in the project too, of course. These new library holdings could be a signature resource for a search firm facing increasing competition from rivals Yahoo and Microsoft's MSN.
Google executives declined to comment on exactly how they would go about transferring printed pages into the digital realm. But librarians say that the job surely won't be easy. Many institutions have been trying to do just that for years and have proceeded at the pace of a seventh grader plowing through Tolstoy's "War and Peace."
The University of Virginia, for instance, used foundation funding to digitize 800 volumes of early American fiction. It took them 10 years. "These are waters that libraries have been trying to navigate for years," says Mr. Gibson of Virginia.
How much do you want to bet that, in less than a couple of years, these library results will take up the first 1-3 SERPs for every major keyphrase.
Who owns the copyright on those books? The universities? Google? I think not. There must be millions of writers who are not happy about having their work published for free without being asked.
All of the books that Google is scanning will be those in the public domain. In other words, old.
Much, perhaps most, of the information in the library (and on the web) that people need and use will not be included in the Google plan. Literature, art, and historical works will be, and Google could provide significant competition for sites in those fields. For most sites, though, this plan will not have any effect.
In general, I think that it is a good thing.
One aspect that I haven't seen commented on yet is this: Google are creating a backup of some of the world's great libraries. Afterwards, a fire at one of these libraries (while still a disaster) would be a much smaller loss to mankind.
With that in mind, it's surprising that some of the libraries have only agreed to a relatively small number of books being digitised initially.
Hopefully that's just a toe-dipping exercise, and success will encourage fuller participation.
Oh yes - and another big "Thank you" to Google. Nice to see the private sector doing something you might think governments should have done years ago...
Personally, I'm thrilled that I'll be able to study from those libraries without leaving home! This is amazing news. But I wonder, is G going to give us a few months for free, then begin charging a per-page fee?
And how will they handle the sticky issue of copyright law?
How much do you want to bet that, in less than a couple of years, these library results will take up the first 1-3 SERPs for every major keyphrase.
Somehow I don't think the works of Chaucer, early American fiction, or scientific treatises from the Victorian era will clutter up the SERPs for keyphrases like "debt consolidation," "Viagra," or "London hotels." :-)
Remember micropayments? - Neither do I :)
Imagining now .. a book excerpt appears in G-SERPS .. but it's not a public domain item.
G provides a "click to continue reading" link that's tied into your "G credit account" .. G deducts a micropayment for pageview (whatever) .. splits the booty with the publisher/copyright holder.
The promise of micropayments finally realized .. facilitated by "Google-Pay.com" .. (all right, now I'm feeling woozy)
Hey GoogleGuy .. don't sell those options just yet ;)
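For what it's worth, the split described above is simple enough to sketch. Everything below is imaginary: the function name, the five-cent price, and the 70% publisher share are invented for illustration, since no "Google-Pay" actually exists.

```python
def charge_pageview(balance_cents, price_cents=5, publisher_share=0.7):
    """Deduct one page view from a hypothetical G-credit balance and split the take.

    Returns (new_balance, publisher_cut, google_cut), all in cents.
    """
    if balance_cents < price_cents:
        raise ValueError("insufficient G-credit")
    publisher_cut = round(price_cents * publisher_share)
    google_cut = price_cents - publisher_cut
    return balance_cents - price_cents, publisher_cut, google_cut

# One click-to-continue on a $1.00 balance:
print(charge_pageview(100))  # (95, 4, 1)
```

The interesting design question in any real version would be aggregating these tiny amounts, since per-transaction fees are exactly what killed earlier micropayment schemes.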
[webmasterworld.com...]
[edited by: vitaplease at 8:25 am (utc) on Dec. 15, 2004]
Who owns the copyright on those books? The universities? Google? I think not. There must be millions of writers who are not happy about having their work published for free without being asked.
All of the books that Google is scanning will be those in the public domain. In other words, old.
In the article I read there was talk of revenue sharing with those copyright owners who would allow their books into the system. I'd speculate that many writers will be contacted. Of those books already in the public domain it's only a matter of how much Google is willing to pay for adding them. You can be certain there is a financial interest in this venture... philanthropic as it may sound. Nonetheless, my heart throbs wildly at the prospect of perusing this valuable resource. By the way Google, nice work with the Catalogs.
Imagine every page of every book with a column of Ads by Greeeeeedy next to it.
Maybe I'm cynical, but I just don't buy the 'Do no evil' mantra anymore. Everything Google does has $ attached. Even more so than M$.
And Googleguy, you (or the original Googleguy) used to be helpful. You were a liaison between the 'Google that did no evil' and the oft-at-odds webmaster community, and you did it superbly well. Nowadays you show up to plug new features and point links to Googleblog like this is the Leno show. Wasssup?
I am VERY skeptical that they will be able to convert the books as quickly as they project. I can see a couple of bottlenecks.
The company I worked for sent books to India, among other places, to be scanned. They were sliced up and run through some kind of feeder. Google will not be able to do this--the libraries want non-destructive scanning, and the Times article notes that they plan to do the scanning near or in the libraries, possibly with pages being turned by hand.
Then you have to do the OCR. To answer tomkee's question: OCR programs are phenomenally accurate with standard fonts on clean white paper. They will not be as accurate with old books. How do you fix the error-laden files? Some automation is possible, by running them through a spell-checker, but if you want reasonably clean files you need to have them proofread, by a human being.
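To give a sense of what that automation can and can't do, here's a minimal sketch in Python of the dictionary-based flagging a spell-check pass amounts to. The tiny word list and the function name are invented for illustration; a real pipeline for old books would have to handle ligatures, hyphenation, and archaic spellings, which is exactly why human proofreading stays in the loop.

```python
import re

# Tiny stand-in dictionary; a real pipeline would load a full word list.
DICTIONARY = {"war", "and", "peace", "is", "a", "long", "novel"}

def flag_suspect_words(ocr_text):
    """Return words the OCR pass may have garbled (i.e. not in the dictionary)."""
    words = re.findall(r"[a-z]+", ocr_text.lower())
    return [w for w in words if w not in DICTIONARY]

# "Wai" and "Pcace" mimic classic OCR confusions (r/i and e/c swaps in worn type).
print(flag_suspect_words("Wai and Pcace is a long novel"))  # ['wai', 'pcace']
```

Note what this can't do: it only flags errors, it can't fix them, and it will miss any OCR error that happens to produce a valid word.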
It's a great project, but it will probably take two or three times as long to complete as they expect--or they will have to devote additional resources to it in a year or two when they see that the process is slow.
It's a great project, but it will probably take two or three times as long to complete as they expect--or they will have to devote additional resources to it in a year or two when they see that the process is slow.
Well, if you have to choose between the opinion of someone with billions of dollars and someone on an internet messageboard, you take the former. Anyway, haven't you seen their OCR success with Google Catalogs? Those catalogs contain WAY more varieties of fonts/type than they are going to find in these libraries.
However, I think this is just the tip of what's to come. Google News is more instructive. I think book and magazine publishers will start cutting deals with Google (at the publisher level, not the individual author level as was mentioned earlier) to include their current materials in Google, with Google paying them a share of their Adwords $, and in the case of magazines, providing subscription links. Some works might be walled off as part of a paid service, like Questia, and perhaps called something like Google Premium.
This is when things will get interesting -- when current information is available on Google, just like it is already at Yahoo. That is when web sites will need to fear the decline of free traffic from Google. But 100 year old books don't pose much of a threat.
I think book and magazine publishers will start cutting deals with Google (at the publisher level, not the individual author level as was mentioned earlier) to include their current materials in Google, with Google paying them a share of their Adwords $, and in the case of magazines, providing subscription links. Some works might be walled off as part of a paid service, like Questia, and perhaps called something like Google Premium.
Sounds like Compuserve's database archives, circa 1989. :-)
Clark: Amazon apparently sent copies of the books to India and used cheap labor to digitize.
BReflection: do you have a citation for that?
Here is the Wired article [wired.com] on Amazon's digitization initiative. Amazon released this just about a year ago.
Trying to draw connections with how this will affect Google's ability to search the web only reveals the dramatic difference between the small scope of the tunnel-visioned webmasters out there and the grand scope of Google's mission. Folks here scoffed at "organize all the world's information" as inflated corporate-speak fluff; now we all know that Google actually has that intention.
I am excited that this project will happen in my lifetime. Where do I go to volunteer on the weekends to scan books?
As far as OCR concerns go, if you haven't checked out Google Catalogs then you don't know how far they've come with that technology. Old books will be a breeze -- and the original image of the actual printing of the book can (and I predict will) be preserved.
Yes, I know Google Catalogs. Nice work. Much smaller in scope, and they didn't have to worry about non-destructive scanning....
Just to be clear: I'm not saying they can't do it. I'm just saying I think it will take longer than they are projecting. If personal experience doesn't mean anything to you, then go do a little research into any of the other digitization projects out there, completed or ongoing. How many of them were completed on schedule or, if not complete, are currently on schedule?
I'm not saying they can't do it. I'm just saying I think it will take longer than they are projecting.
The only projections so far are from third party analysts and reporters, from what I can tell after reading the press release [google.com], the blog entry [google.com], and an article or two. Since Google hasn't made any claims about how long they think it will take, I'd guess they know that it could range from a few years to a decade or more and don't want to make difficult predictions at this point.
Google director of project management Susan Wojcicki declined to say how much the project would cost and how long it would take.
The American Library Association, founded in 1876, has been fighting the Justice Department over the Patriot Act. It is not an insignificant organization. There is a long tradition among professional librarians of respecting intellectual freedom and freedom of political thought. Once the ALA studies Google's tracking practices, a single cookie with a unique ID, set across all their services and not expiring until 2038, I think we can expect it to have something to say about how libraries should approach offers from Google and similar engines.
There is no equivalent history of public-sphere responsiveness from Google, and this is where Google falls on its face. What if Google is served a subpoena by the FBI ten years from now, because the FBI wants a list of IP addresses of those who have been reading Karl Marx, along with the chapter and verse accessed by each person, the date/time stamp, all search terms that person used within the last 30 days, and all the Gmail correspondence they thought they had erased? This is state-of-mind evidence sufficient to prove intent to a jury. It's not nearly as vague as records of titles borrowed from a library.
I know what most librarians would say. They'd say that since the Patriot Act of 2001, the ALA has recommended not keeping borrowing records any longer than necessary. It's not a crime to not keep the records, but it is a felony to lie to the FBI. If you have the records, you have to hand them over or be guilty of obstruction of justice.
Look to the ALA to recommend that all contracts between libraries and search engines be written so as to guarantee the anonymity of those who access the material digitized from that library. Otherwise, the ALA might recommend withholding all books and documents with political content or that have political relevance. That would include some of the best political and anarchist philosophy of the 19th century, by the way, which is by now all in the public domain.
What if Google is served a subpoena by the FBI ten years from now, because the FBI wants a list of IP addresses of those who have been reading Karl Marx
FUD. What if the FBI served subpoenas today to discover who'd been searching for Karl Marx on all the search engines? Your concerns aren't relevant to this discussion because the scenario in which such a thing would occur is not connected with the particular format of media indexed by Google or any search engine.
If it would be a problem with searching books in ten years, it'd already be a problem with searching web sites today. Political and anarchist sites abound, adding earlier print works to the mix won't unbalance the universe or destroy privacy.
The ALA should rightly see this as an opportunity to make all published works more freely available to more people. If the most important concern were protecting readers from an imagined future fascist government, then the ALA would recommend burning all books that would be dangerous to have on anyone's "reading list."
2) There is no way to address the problem today except to post on SEO forums and get denounced by pro-Google webmasters with a private interest in promoting ecommerce.
3) With Google approaching libraries, there is, perhaps for the first time, a new political factor involved. The ALA is organized and has influence within the library profession. Google has never, ever had to deal with a powerful organization that has a history of social concern and responsibility.
4) So why not get the ALA involved at this point? You don't make a problem go away by making it many times larger!
The NYT says, "At Stanford, Google hopes to be able to scan 50,000 pages a day within the month, eventually doubling that rate, according to a person involved in the project."
Stanford has 8 million volumes. At the University of Michigan, the librarian involved with this project calculates an average of 340 pages per volume. Let's assume 340 for Stanford also.
8 million times 340 equals 2,720,000,000.
At 100,000 pages per day, it will take Google 27,200 days to do Stanford.
That comes to 74.47 years, whereas press reports are estimating ten years for Stanford. At Michigan, which has 7 million volumes, the librarian involved says it will take just six years.
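The arithmetic above is easy to double-check. All figures are the ones already quoted in this thread (the NYT scan rate and the Michigan librarian's pages-per-volume average), not anything Google has confirmed:

```python
volumes = 8_000_000        # Stanford's holdings
pages_per_volume = 340     # University of Michigan librarian's average
pages_per_day = 100_000    # 50,000/day, "eventually doubling" per the NYT

total_pages = volumes * pages_per_volume
days = total_pages / pages_per_day
years = days / 365.25

print(total_pages, days, round(years, 2))  # 2720000000 27200.0 74.47
```

So either the per-day rate will be far higher than reported, the work will be massively parallelized across scanning stations, or the ten-year estimate is optimistic.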
GoogleGuy, why don't you just tell us what your anticipated scan rate is for Stanford and for Michigan? If it's going to take 74 years, I might not want to buy any stock, despite the hype I've read over the last few days.