Google Books Starts to Become Useful

Text layer became available on July 3.

ergophobe

4:18 pm on Jul 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Short version of this: you can now actually find things in Google Books that you would hope and expect to find. This is partly due to recent changes (I think). Not sure how interesting this is to the SEO/SEM crowd here, but this is a huge change for me.

When Google Books made its big beta debut, it was mostly a fizzle. It contributed to my belief that this was not a company whose guiding vision was "don't be evil". Basically, they promised us the Harvard library and a massive public domain collection, and what they delivered was a massive collection of affiliate links.

At the time, I tried searching on phrases like "Moby Dick", which returned not a book by Herman Melville, not books about a book by Herman Melville, but books on things like Linux shell scripting and business strategy that happened to mention the phrase. All you got was a snippet with a few words around the search term and a buy link. Only on page 3 did I find a book that actually discussed Melville's book, and that, too, was a current publication/affiliate link. I was not impressed.

Well, it seems that Google's public domain catalog is growing. They first announced their public domain catalog [googleblog.blogspot.com] back in August of 2006, but the fact was that there wasn't much there and it never came up in any of my searches.

I'm not sure if they finally arrived at DQ in the LC classification (the books I use the most), or if they started integrating book search with normal search results to a higher degree. I suspect it is both. On July 3, according to the Google Books blog, they made the "text layer" available publicly on public domain books. I suspect that they also made it more integrated into search around the same time. This means that the unsearchable image scans have now become searchable.

See:
[booksearch.blogspot.com...]
[googleblog.blogspot.com...]

So I went on a bit of a hunt and started using the author and title (inauthor and intitle) searches to look up the books that I use most frequently. There was, I have to say, a surprisingly high hit rate for full-text PDF versions considering that these are relatively obscure books (e.g. Amédée Roget, Histoire du peuple de Genève, all seven volumes).

I actually think that this is better than the physical book.

- I can download the entire work, which I have done. So now I own it. This book is occasionally findable on the used market (so it's not a rare book), but it's not easy and not cheap. Generally you have to go to a major research library to use this. Now I go to C:\eBooks\... well you get the idea.

- the physical book is not indexed. The Google book is... sort of. They seem to have put the full text in their database using some sort of OCR, which can lead to problems, but just now I typed in the search "Comparet inauthor:Roget" and got the page that details the execution of the brothers Comparet (this officially qualifies as obscure information in the extreme). It brings up a PDF of the page in the original work. Voilà, a book that has never had an index suddenly has one. Even better, it actually highlights the search terms on the image scan (these are not, of course, text PDFs).
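The author and title searches mentioned above are just operators typed into the search box, so they are easy to assemble programmatically. What follows is a minimal sketch, not anything Google publishes: the function name `build_book_query` and its parameters are made up for illustration, and quoting multi-word values is an assumption about how the operators are normally used. Only the `inauthor` and `intitle` operators themselves come from the post.

```python
from urllib.parse import quote_plus


def _quote_if_needed(value):
    """Wrap multi-word values in double quotes so they stay one phrase."""
    return f'"{value}"' if " " in value else value


def build_book_query(terms="", author=None, title=None):
    """Assemble a query string using the inauthor: and intitle: operators.

    Hypothetical helper: it only builds the text you would type into the
    search box, it does not contact any service.
    """
    parts = []
    if terms:
        parts.append(terms)
    if author:
        parts.append(f"inauthor:{_quote_if_needed(author)}")
    if title:
        parts.append(f"intitle:{_quote_if_needed(title)}")
    return " ".join(parts)


query = build_book_query("Comparet", author="Roget")
print(query)              # -> Comparet inauthor:Roget
print(quote_plus(query))  # URL-encoded form, ready to append to a search URL
```

The same helper covers the title case, e.g. `build_book_query(author="Amédée Roget", title="Histoire du peuple de Genève")`.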

Some downsides

All is not perfect in Googleland, however. Of course, Google has automated this process and, as is often the case with Google, seems to have automated it a bit too much.

- there is no hand check, as near as I can tell. This means that one book I downloaded was obviously loaded into the scanner incorrectly, and the inside half (the binding side, that is) of the even pages was missing throughout the entire book. Other books have many pages missing. That's unfortunate because those books are no doubt marked as done, and yet they are unusable. As near as I can tell, there is no reporting tool for bad scans.

- OCR has its limits. Searching on Ferrière's Science Parfaite des Notaires (which turns out not to have been scanned yet), I got results for Perriere, Fernere, Pernere and other errors you would expect with OCR. It's tons better than the first time I tried to use OCR and got pages and pages of nothing but "i" and "n". In fact, I would say it is over 90% accurate, but that last 10% can be a bit bothersome. For example, here is the text layer for the search in question, followed by what the page actually says:

par la faction pcrriniste. Balthnzar Sept avait pris les devants avec Pierre Verna. f J'ouis, rapporte

par la faction perriniste. Balthazar Sept avait pris les devants avec Pierre Verna. « J'ouis, rapporte

Not too bad anyway, and one of the most remarkable events in the history of search for historians since the Trésor de la Langue Française put their full-text database online in the late 1990s (subscription site only, but most major research institutions subscribe).
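The OCR comparison above can be put in rough numbers. The sketch below uses Python's standard `difflib` to score the OCR line against the corrected line, and to show why a search for "Ferrière" could plausibly surface variants like Perriere or Fernere. The helper names (`strip_accents`, `similarity`, `fuzzy_hits`) and the 0.6 cutoff are assumptions chosen for illustration; nothing here reflects how Google actually matches text.

```python
import unicodedata
from difflib import SequenceMatcher


def strip_accents(text):
    """Lower-case the text and drop combining accents (è becomes e)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def similarity(a, b):
    """Return a 0..1 similarity score between two accent-stripped strings."""
    return SequenceMatcher(None, strip_accents(a), strip_accents(b)).ratio()


def fuzzy_hits(wanted, candidates, cutoff=0.6):
    """Keep the candidates whose score against `wanted` reaches the cutoff.

    The cutoff of 0.6 is an arbitrary illustrative choice.
    """
    return [c for c in candidates if similarity(wanted, c) >= cutoff]


# The two lines quoted in the post: OCR text layer, then the real text.
ocr_line = "par la faction pcrriniste. Balthnzar Sept avait pris les devants avec Pierre Verna. f J'ouis, rapporte"
true_line = "par la faction perriniste. Balthazar Sept avait pris les devants avec Pierre Verna. « J'ouis, rapporte"

line_score = similarity(ocr_line, true_line)
variants = fuzzy_hits("Ferrière", ["Perriere", "Fernere", "Pernere", "Melville"])

print(f"line similarity: {line_score:.2f}")
print("plausible OCR variants:", variants)
```

The two quoted lines differ in only three characters, so the line score lands well above 0.9, while the unrelated name falls below the cutoff and is filtered out.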

tedster

12:10 am on Jul 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm really glad to hear about this - book search is not something I've used historically, but now I just might. Thanks.

BigDave

12:44 am on Jul 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Since I'm going back to school, I've been using it a lot over the last year and find it quite useful.

It has a LONG way to go, but for an early version of a product it was incredibly helpful. More often than not, I would use it to find out what book I needed, then I would request it through summit.

Now could you please explain why having a beta product that still has problems would cause you to even bring up the "don't be evil" thing? Just because it didn't do what you wanted, didn't make them "evil".

ergophobe

5:56 pm on Jul 23, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Just because it didn't do what you wanted, didn't make them "evil".

That's correct. The evil part was not that their product was not the product I wanted, it was that their product was not the product they announced. Rather, it was a bait and switch. Basically like MFA-based arbitrage.

Amazon says: "We're a bookstore. You can do limited searches on a limited number of our books, but only enough to help you make an informed purchase, because that's how we make money." I have no problem with that. In fact, I love Amazon. They make money off books. I'm fine with that.

Google said: "Come to Google books. We're putting the world's knowledge online. We are making public domain texts available for search."

What I found at the time was basically a vastly inferior version of Amazon with
- almost no public domain books
- no meaningful or useful search results. As I said, how could someone searching on "Moby Dick" possibly be looking for pages and pages of books on Linux shell programming, marketing, etc, and not literature or history?
- a snippet view vastly inferior to the Amazon "search inside" feature. You got snippets as short and decontextualized as Google's normal search results. Utterly useless when searching books that you can't click through to for fuller information. That works for web pages, but not for a physical text.
- far more controversy than Amazon "search inside" over its potential copyright violations.

So basically, it was the dishonest advertising that bothered me. I don't believe that it was necessarily nefarious in intent; it was another completely bungled release from Google. They had not, at that time, realized that they basically are Microsoft (the 500-pound gorilla) and that they can't just say "See, this is great, because we're Google." They need to deliver on promises, and they were basically delivering the web equivalent of vaporware, that old Microsoft specialty product.