Google Now Using OCR on Scanned PDF Documents

     
12:50 am on Oct 31, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Interesting change for Google here. They've automated an OCR process that converts scanned images of documents into indexable and searchable text.

In the past, scanned documents were rarely included in search results as we couldn't be sure of their content. We had occasional clues from references to the document-- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe's PDF format.

[googleblog.blogspot.com...]

12:56 am on Oct 31, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


It occurs to me that OCR on other images of text may be in play, too - for example, sites that use images to create headlines in a display font. Automated OCR capability has been there for many years, but we've seen little evidence of any effects.

For example, some major sites put boilerplate disclaimers into an image to avoid indexing problems. I hope OCR doesn't complicate things.

I'm still not about to recommend images instead of text documents, but as a searcher it might give me access to information I've been missing.

12:58 am on Oct 31, 2008 (gmt 0)

Senior Member

joined:Jan 27, 2003
posts:2534
votes: 0


I hope this works better than the OCR I've become accustomed to ;)

I confess to being somewhat underwhelmed by Google's effort to index "non standard" web content. I dread to think how many times I've searched Google, but in all that time I still shy away from clicking on stuff like Word documents or PDFs. If I'm explicitly looking for that content then it's great, but I'll be looking for [filetype:pdf] [google.com] by that point ;)

4:38 am on Nov 1, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Some press coverage also broke on October 30 and 31:

Jason Kincaid, a blogger at TechCrunch, noted that: "Such technology has existed for quite a while, but accuracy has always been an issue -- and the fact that Google is doing it on such a massive scale makes it a very impressive accomplishment. It also opens the doors to much more thorough searching, especially for content that is often found in printed documents (like academic papers)."

ComputerWorld Article [computerworld.com]

5:26 am on Nov 1, 2008 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:6163
votes: 284


I offer several pdfs, put in that format for a REASON. Just checked google and see that for two of those they offer ViewHTML for those titles. Looks like farmer's friend. Worse, the "title" is totally fubarred...and was used for the PDF listing! Not liking this at all.
6:02 am on Nov 1, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 19, 2004
posts:1939
votes: 0


This could cause tons of problems for many websites.
My sense is that it would work better in real-life practice with an option for webmasters to add tags that explicitly 'ask' for that content to be indexed.
6:06 am on Nov 1, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Even with this OCR technology trying to improve search, it's still a very good idea to pay attention to the embedded metadata in a PDF file. If you or the people who create online PDF documents for you do not know how to do this, it's time to learn how to locate and modify the metadata.

As the linked articles indicate (and your own experience can verify), accuracy in OCR is still a difficult problem. The Google results from this new adventure are most likely not going to be ideal for quite a while. If you don't want mismatched information, or coffee stains being turned into text, then make sure you take some helpful steps.
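
For anyone wondering what "locating" the metadata looks like in practice, here is a rough sketch in Python using only the standard library. It assumes the document-info entries (Title, Author and so on) are stored as plain literal strings in an uncompressed part of the file, which is not true of every PDF; the function name and the sample bytes are made up for illustration. A proper PDF library or your PDF tool's document properties dialog is the more dependable route for actually changing the values.

```python
import re

# Matches entries such as "/Title (Annual Report)" in a PDF Info dictionary.
# Assumption: values are plain literal strings with no nested or escaped
# parentheses. Hex strings, UTF-16 text and compressed objects are not handled.
INFO_ENTRY = re.compile(rb"/(Title|Author|Subject|Keywords)\s*\(([^()\\]*)\)")


def read_info_entries(pdf_bytes: bytes) -> dict:
    """Return the simple document-info entries found in raw PDF bytes."""
    found = {}
    for key, value in INFO_ENTRY.findall(pdf_bytes):
        found[key.decode("ascii")] = value.decode("latin-1")
    return found


# Hypothetical fragment of a PDF file, for illustration only.
sample = b"1 0 obj\n<< /Title (Seed Catalog 2008) /Author (Example Farm) >>\nendobj\n"
info = read_info_entries(sample)
print(info)
```

If the title that comes back is blank or is a leftover file name, that is a likely candidate for what ends up as the listing title, so it is worth fixing at the source.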

9:18 am on Nov 1, 2008 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:6163
votes: 284


Good insights, tedster, but I don't use Acrobat to make pdfs... just a print routine with only a tiny bit of metadata control available. What do guys like me (and probably a zillion others) do? More thoughts welcome, but I guess it is back to school for the Texican as regards PDF and Google, and if I wanted it in html I'd have done it FIRST!

Muttering...

6:01 pm on Nov 1, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 19, 2004
posts:1939
votes: 0


"If you or the people who create online PDF documents for you do not know how to do this, it's time to learn how to locate and modify the metadata."

Any links to good resources / reads tedster? :)

C

6:18 am on Nov 2, 2008 (gmt 0)

Preferred Member

5+ Year Member

joined:Oct 9, 2006
posts:375
votes: 0


If that is the case, when will AdSense be available for PDF documents?
8:38 am on Nov 2, 2008 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10544
votes: 8


when will adsense be available for PDF documents...

right after google offers to convert to and/or host all pdf documents "for free"...
7:52 pm on Nov 2, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 16, 2001
posts:2006
votes: 0


Remember Google Books and Catalog? I wonder if they'll be able to OCR text within images. If I am looking for soup, I recall Google Catalog was able to highlight text within images too!
8:32 pm on Nov 2, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member jimbeetle is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Oct 26, 2002
posts:3292
votes: 6


I wonder if they'll be able to OCR text within images.

Am I misunderstanding, or isn't that basically what this thread is about? Not Google indexing PDF documents, but using OCR on scanned documents -- images -- included within a PDF.
8:36 pm on Nov 2, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Jim, you understand correctly. Google has been indexing the text content of a PDF for a very long time.
2:07 am on Nov 3, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 16, 2001
posts:2006
votes: 0


I did a search for "Grey Goose". Notice how on the lower right, the words "grey goose" on the bottle are highlighted in yellow? I think it'll be neat if Google will process PDFs with embedded images like this:

[catalogs.google.com...]

2:35 am on Nov 3, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Our policy is not to discuss specific searches here - but we'll make an exception for this one example, since there seems to be some confusion about what this new step is for PDF indexing. It's a clear example of OCR image indexing taken from catalog search. Good original print quality certainly made OCR easier to use in that arena.
5:46 pm on Nov 3, 2008 (gmt 0)

Moderator

WebmasterWorld Administrator ergophobe is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 25, 2002
posts:8139
votes: 103


I hope this works better than the OCR I've become accustomed to

They've been using this for Google Books, which are images presented in PDF format, for quite some time. Like any OCR, accuracy depends a lot on the font and the text. If the text contains standard dictionary words in a relatively clean font (not old paper with strong serif fonts) the results are decent. Otherwise, poor.

I'm curious that this is news, because PDF scans from Google Books have been text-searchable for a long time and you have been able to view the text version for at least a year. It is these text versions that have formed the basis of the book SERPs and, with universal search putting book results in the general results, OCR of text image scans has been showing up in the general search results for quite a while too.

I guess this announcement means that they're expanding that usage beyond Google Books to documents discovered "in the wild"? Or is this something different?

6:50 pm on Nov 3, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


In addition, Google has a long history of travelling into directories that are excluded (by request) in robots.txt to retrieve PDF files!
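
For anyone wanting to double-check what a robots.txt rule actually covers before blaming the bot, here is a minimal sketch using Python's standard urllib.robotparser. The rules, the /private-pdfs/ directory and the example.com URLs are made up for illustration. It only shows how a compliant crawler should read the file, not what any particular bot really does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that excludes one directory of PDFs.
rules = """\
User-agent: *
Disallow: /private-pdfs/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler should refuse the first URL and accept the second.
blocked = parser.can_fetch("Googlebot", "https://example.com/private-pdfs/manual.pdf")
allowed = parser.can_fetch("Googlebot", "https://example.com/docs/manual.pdf")
print(blocked, allowed)
```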

I offer several pdfs, put in that format for a REASON. Just checked google and see that for two of those they offer ViewHTML for those titles. Looks like farmer's friend. Worse, the "title" is totally fubarred...and was used for the PDF listing! Not liking this at all.

Nor am I, tangor!
I've many files that could easily be viewed as text or html, were that my original intent. Why I offer a viewing option in the PDF format at all is not something Google or any other bot needs to comprehend or have explained.

Has anybody seen their password-encrypted PDFs being OCR'd by Google and listed in the SERPs?