Google Books Starts to Become Useful

Short version of this: you can now actually find things in Google Books that you would hope and expect to find. This is partly due to recent changes (I think). Not sure how interesting this is to the SEO/SEM crowd here, but this is a huge change for me.

When Google Books made its big beta debut, it was mostly a fizzle. It contributed to my belief that this was not a company whose guiding vision was "don't be evil". Basically, they promised us the Harvard library and a massive public domain collection, and what they delivered was a massive collection of affiliate links.

At the time, I tried searching on phrases like "Moby Dick", which returned not a book by Herman Melville, not books about a book by Herman Melville, but books on things like Linux shell scripting and business strategy that happened to mention Moby Dick. All you got was a snippet with a few words around the search term and a buy link. Only on page 3 did I find a book that was actually discussing Melville's book and that, too, was a current publication/affiliate link. I was not impressed.

Well, it seems that Google's public domain catalog is growing. They first announced their public domain catalog [googleblog.blogspot.com] back in August of 2006, but the fact was that there wasn't much there and it never came up in any of my searches.

I'm not sure if they finally arrived at DQ in the LC classification (the books I use the most), or if they started integrating book search with normal search results to a higher degree. I suspect it is both. On July 3, according to the Google Books blog, they made the "text layer" available publicly on public domain books. I suspect that they also made it more integrated into search around the same time. This means that the unsearchable image scans have now become searchable.

See:
[booksearch.blogspot.com...]
[googleblog.blogspot.com...]

So I went on a bit of a hunt and started using the author and title (inauthor and intitle) searches to look up the books that I use most frequently. There was, I have to say, a surprisingly high hit rate for full-text PDF versions considering that these are relatively obscure books (e.g. Amédée Roget, Histoire du peuple de Genève, all seven volumes).

I actually think that this is better than the physical book.

- I can download the entire work, which I have done. So now I own it. This book is occasionally findable on the used market (so it's not a rare book), but it's not easy and not cheap. Generally you have to go to a major research library to use this. Now I go to C:\eBooks\... well you get the idea.

- the physical book is not indexed. The Google book is... sort of. They seem to have put the full text in their database using some sort of OCR which can lead to problems, but just now I typed in the search "Comparet inauthor:Roget" and got the page that details the execution of the brothers Comparet (this officially qualifies as obscure information in the extreme). It brings up a PDF of the page in the original work. Voila, a book that has never had an index, suddenly has one. Even better, it actually highlights the search terms on the image scan (these are not, of course, text PDFs).
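
For anyone who wants to script this kind of lookup, here is a minimal Python sketch of how a query like "Comparet inauthor:Roget" is put together. The function name book_query and its parameters are made up for illustration; only the inauthor: and intitle: operators come from the searches described above, and the sketch makes no attempt to handle multi-word author or title values.

```python
from urllib.parse import quote_plus


def book_query(terms="", author=None, title=None):
    """Join free-text terms with optional inauthor:/intitle: operators.

    Hypothetical helper: it only assembles the query string and does
    not contact any search service.
    """
    parts = []
    if terms:
        parts.append(terms)
    if author:
        parts.append(f"inauthor:{author}")
    if title:
        parts.append(f"intitle:{title}")
    return " ".join(parts)


query = book_query("Comparet", author="Roget")
print(query)              # the plain query as typed into the search box
print(quote_plus(query))  # the same query, encoded for use in a URL
```

The encoded form is only what you would paste after a q= parameter; which search URL to attach it to is left open here.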

Some downsides

All is not perfect in Googleland, however. Of course, Google has automated this process and, as is often the case with Google, seems to have automated it a bit too much.

- there is no hand check, as near as I can tell. This means that one book I downloaded was obviously loaded into the scanner incorrectly, and the inside half (the binding side, that is) of the even pages was missing throughout the entire book. Other books have many pages missing. That's unfortunate because those books are no doubt marked as done, and yet they are unusable. As near as I can tell, there is no reporting tool for bad scans.

- OCR has its limits. Searching on Ferrière's Science Parfaite des Notaires (which turns out not to have been scanned yet), I got results for Perriere, Fernere, Pernere and other errors you would expect with OCR. It's tons better than the first time I tried to use OCR and got pages and pages of nothing but "i" and "n". In fact, I would say it is over 90% accurate, but that last 10% can be a bit bothersome. For example, compare the text layer for the search in question (first line) with what the page actually says (second line):

par la faction pcrriniste. Balthnzar Sept avait pris les devants avec Pierre Verna. f J'ouis, rapporte
par la faction perriniste. Balthazar Sept avait pris les devants avec Pierre Verna. « J'ouis, rapporte

Not too bad anyway, and one of the most remarkable events in the history of search for historians since the Trésor de la Langue Française put their full-text database online in the late 1990s (subscription site only, but most major research institutions subscribe).
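
To put a rough number on the OCR sample above, here is a small Python sketch that compares the text-layer line with the corrected line. It uses the standard library's difflib; the variable names are made up for illustration, and a single line is of course no basis for judging the whole catalog.

```python
from difflib import SequenceMatcher

# The two sample lines from the post: OCR text layer vs. the actual text.
ocr = "par la faction pcrriniste. Balthnzar Sept avait pris les devants avec Pierre Verna. f J'ouis, rapporte"
actual = "par la faction perriniste. Balthazar Sept avait pris les devants avec Pierre Verna. « J'ouis, rapporte"

# Character-level similarity between the two lines (0.0 to 1.0).
char_similarity = SequenceMatcher(None, ocr, actual).ratio()

# Word-level comparison: pair the words up in order and keep the ones that differ.
ocr_words = ocr.split()
actual_words = actual.split()
mismatches = [(o, a) for o, a in zip(ocr_words, actual_words) if o != a]
word_accuracy = 1 - len(mismatches) / len(actual_words)

print(f"character similarity: {char_similarity:.2f}")
print(f"word accuracy: {word_accuracy:.2f}")
for o, a in mismatches:
    print(f"  OCR read {o!r}, page says {a!r}")
```

On this sample only three characters are wrong, but they fall in three different words, so a search on an exact word such as a name can miss even when the text layer is mostly right.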


ergophobe
