Google Now Using OCR on Scanned PDF Documents

Forum Moderators: Robert Charlton & goodroi



tedster

12:50 am on Oct 31, 2008 (gmt 0)

Interesting change for Google here. They've automated an OCR process that converts scanned images of documents into indexable and searchable text.

In the past, scanned documents were rarely included in search results as we couldn't be sure of their content. We had occasional clues from references to the document-- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe's PDF format.
[googleblog.blogspot.com...]

tedster

12:56 am on Oct 31, 2008 (gmt 0)

It occurs to me that OCR on other images of text may be in play, too - for example, sites that use images to create headlines in a display font. Automated OCR capability has been there for many years, but we've seen little evidence of any effects.

For example, some major sites put boilerplate disclaimers into an image to avoid indexing problems. I hope OCR doesn't complicate things.

I'm still not about to recommend images instead of text documents, but as a searcher it might give me access to information I've been missing.

Receptional Andy

12:58 am on Oct 31, 2008 (gmt 0)

I hope this works better than the OCR I've become accustomed to ;)

I confess to being somewhat underwhelmed by Google's effort to index "non standard" web content. I dread to think how many times I've searched Google, but in all that time I still shy away from clicking on stuff like Word documents or PDFs. If I'm explicitly looking for that content then it's great, but I'll be looking for [filetype:pdf] [google.com] by that point ;)

tedster

4:38 am on Nov 1, 2008 (gmt 0)

Some press coverage also broke on October 30 and 31:

Jason Kincaid, a blogger at TechCrunch, noted that: "Such technology has existed for quite a while, but accuracy has always been an issue -- and the fact that Google is doing it on such a massive scale makes it a very impressive accomplishment. It also opens the doors to much more thorough searching, especially for content that is often found in printed documents (like academic papers)."
ComputerWorld Article [computerworld.com]

tangor

5:26 am on Nov 1, 2008 (gmt 0)

I offer several pdfs, put in that format for a REASON. Just checked google and see that for two of those they offer ViewHTML for those titles. Looks like farmer's friend. Worse, the "title" is totally fubarred...and was used for the PDF listing! Not liking this at all.

CainIV

6:02 am on Nov 1, 2008 (gmt 0)

This could cause tons of problems for many websites.
My sense is that it would work better in real-life practice with an option for webmasters to add tags that explicitly 'ask' for that content to be indexed.

tedster

6:06 am on Nov 1, 2008 (gmt 0)

Even with this OCR technology trying to improve search, it's still a very good idea to pay attention to the embedded meta data in a PDF file. If you or the people who create online PDF documents for you do not know how to do this, it's time to learn how to locate and modify the meta-data.

As the linked articles indicate (and your own experience can verify), accuracy in OCR is still a difficult problem. The Google results from this new adventure are most likely not going to be ideal for quite a while. If you don't want mismatched information, or coffee stains being turned into text, then make sure you take some helpful steps.
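
For anyone wondering what "locating" the metadata looks like in practice, here is a rough Python sketch. It assumes the PDF stores its Info dictionary uncompressed, with plain parenthesised strings, which is often not the case: many files keep metadata in compressed object streams, hex or UTF-16 strings, or XMP, and a proper PDF library is the right tool for those. The function name `read_info_fields` and the sample bytes are made up for illustration.

```python
import re

# Standard Info-dictionary keys from the PDF format.
INFO_KEYS = ("Title", "Author", "Subject", "Keywords")


def read_info_fields(pdf_bytes):
    """Return the plain-text Info fields found in raw PDF bytes.

    Only handles uncompressed literal strings such as "/Title (Report)".
    """
    found = {}
    for key in INFO_KEYS:
        # Backslash escapes are allowed inside the parentheses;
        # unescaped nested parentheses are not handled by this sketch.
        pattern = rb"/" + key.encode("ascii") + rb"\s*\(((?:\\.|[^\\()])*)\)"
        match = re.search(pattern, pdf_bytes)
        if match:
            raw = match.group(1)
            # Undo the escapes for parentheses and backslashes only.
            text = re.sub(rb"\\([()\\])", rb"\1", raw)
            found[key] = text.decode("latin-1")
    return found


# Hypothetical fragment of a PDF produced by a print-to-PDF routine.
sample = b"1 0 obj\n<< /Title (Untitled-1) /Author (scanner) >>\nendobj\n"
print(read_info_fields(sample))
```

If the title that comes back is something like "Untitled-1", that is a hint the creating tool never set it, and it is worth fixing in whatever program produced the file.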

tangor

9:18 am on Nov 1, 2008 (gmt 0)

Good insights, tedster, but I don't use Acrobat to make pdfs... a print routine with only tiny bit available. What do guys like me (and probably a zillion others) do? More thoughts welcome, but I guess it is back to school for the Texican as regards PDF and google and if I wanted it html I'd have done it FIRST!

Muttering...

CainIV

6:01 pm on Nov 1, 2008 (gmt 0)

"If you or the people who create online PDF documents for you do not know how to do this, it's time to learn how to locate and modify the meta-data."

Any links to good resources / reads tedster? :)

newborn

6:18 am on Nov 2, 2008 (gmt 0)

If that is the case when will adsense be available for PDF documents......

phranque

8:38 am on Nov 2, 2008 (gmt 0)

when will adsense be available for PDF documents...

right after google offers to convert to and/or host all pdf documents "for free"...

sun818

7:52 pm on Nov 2, 2008 (gmt 0)

Remember Google Books and Catalog? I wonder if they'll be able to OCR text within images. If I am looking for soup, I recall Google Catalog was able to highlight text within images too!

jimbeetle

8:32 pm on Nov 2, 2008 (gmt 0)

I wonder if they'll be able to OCR text within images.

Am I misunderstanding, or isn't that basically what this thread is about? Not Google indexing PDF documents, but using OCR on scanned documents -- images -- included within a PDF.

tedster

8:36 pm on Nov 2, 2008 (gmt 0)

Jim, you understand correctly. Google has been indexing the text content of a PDF for a very long time.

sun818

2:07 am on Nov 3, 2008 (gmt 0)

I did a search for "Grey Goose". Notice how on the lower right, the words "grey goose" on the bottle are highlighted in yellow? I think it'll be neat if Google will process PDFs with embedded images like this:

[catalogs.google.com...]

tedster

2:35 am on Nov 3, 2008 (gmt 0)

Our policy is not to discuss specific searches here - but we'll make an exception for this one example, since there seems to be some confusion about what this new step is for PDF indexing. It's a clear example of OCR image indexing taken from catalog search. Good original print quality certainly made OCR easier to use in that arena.

ergophobe

5:46 pm on Nov 3, 2008 (gmt 0)

I hope this works better than the OCR I've become accustomed to

They've been using this for Google Books, which are images presented in PDF format, for quite some time. Like any OCR, accuracy depends a lot on the font and the text. If the text contains standard dictionary words in a relatively clean font (not old paper with strong serif fonts) the results are decent. Otherwise, poor.

I'm curious that this is news, because PDF scans from Google Books have been text-searchable for a long time and you have been able to view the text version for at least a year. It is these text versions that have formed the basis of the book SERPs and, with universal search putting book results in the general results, OCR of text image scans has been showing up in the general search results for quite a while too.

I guess this announcement means that they're expanding that usage beyond Google Books to documents discovered "in the wild"? Or is this something different?

wilderness

6:50 pm on Nov 3, 2008 (gmt 0)

In addition, Google has a long history of travelling into directories that are excluded (by request) in robots.txt to retrieve PDF files!

I offer several pdfs, put in that format for a REASON. Just checked google and see that for two of those they offer ViewHTML for those titles. Looks like farmer's friend. Worse, the "title" is totally fubarred...and was used for the PDF listing! Not liking this at all.

Nor am I, tangor!
I've many files that could easily be viewed as text or html, were that my original intent. Why I'm offering a viewing option in the PDF format at all is not something Google or any other bot needs to comprehend or have explained to it.

Has anybody seen their PDFs that have been password encrypted being OCR'd by Google and listed in SERPs?
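
On the robots.txt point, a quick way to double-check what your own rules actually say about a PDF is Python's standard `urllib.robotparser`. This is only a sketch: the rules, the `example.com` host, and the paths are hypothetical, and it tells you what a compliant crawler should do, not what any particular bot really does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules excluding one directory of PDFs.
rules = [
    "User-agent: *",
    "Disallow: /docs/private/",
]

parser = RobotFileParser()
parser.parse(rules)

# A PDF inside the excluded directory versus one outside it.
blocked = parser.can_fetch("Googlebot", "https://example.com/docs/private/manual.pdf")
allowed = parser.can_fetch("Googlebot", "https://example.com/docs/public/manual.pdf")
print(blocked, allowed)
```

Note that the standard-library parser matches on path prefixes only, so wildcard patterns aimed at a file extension are not something this check can confirm.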