
Google SEO News and Discussion Forum

    
Google Now Using OCR on Scanned PDF Documents
tedster

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3777195 posted 12:50 am on Oct 31, 2008 (gmt 0)

Interesting change for Google here. They've automated an OCR process that converts scanned images of documents into indexable and searchable text.

"In the past, scanned documents were rarely included in search results as we couldn't be sure of their content. We had occasional clues from references to the document-- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe's PDF format."

[googleblog.blogspot.com...]


 

tedster

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3777195 posted 12:56 am on Oct 31, 2008 (gmt 0)

It occurs to me that OCR on other images of text may be in play, too - for example, sites that use images to create headlines in a display font. Automated OCR capability has been there for many years, but we've seen little evidence of any effects.

For example, some major sites put boilerplate disclaimers into an image to avoid indexing problems. I hope OCR doesn't complicate things.

I'm still not about to recommend images instead of text documents, but as a searcher it might give me access to information I've been missing.

Receptional Andy



 
Msg#: 3777195 posted 12:58 am on Oct 31, 2008 (gmt 0)

I hope this works better than the OCR I've become accustomed to ;)

I confess to being somewhat underwhelmed by Google's effort to index "non standard" web content. I dread to think how many times I've searched Google, but in all that time I still shy away from clicking on stuff like Word documents or PDFs. If I'm explicitly looking for that content then it's great, but I'll be looking for [filetype:pdf] [google.com] by that point ;)

tedster

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3777195 posted 4:38 am on Nov 1, 2008 (gmt 0)

Some press coverage also broke on October 30 and 31:

Jason Kincaid, a blogger at TechCrunch, noted that: "Such technology has existed for quite a while, but accuracy has always been an issue -- and the fact that Google is doing it on such a massive scale makes it a very impressive accomplishment. It also opens the doors to much more thorough searching, especially for content that is often found in printed documents (like academic papers)."

ComputerWorld Article [computerworld.com]


tangor

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member, Top Contributors Of The Month



 
Msg#: 3777195 posted 5:26 am on Nov 1, 2008 (gmt 0)

I offer several pdfs, put in that format for a REASON. Just checked google and see that for two of those they offer ViewHTML for those titles. Looks like farmer's friend. Worse, the "title" is totally fubarred...and was used for the PDF listing! Not liking this at all.

CainIV

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3777195 posted 6:02 am on Nov 1, 2008 (gmt 0)

This could cause tons of problems for many websites.
My sense is that it would work better in real-life practice with an option for webmasters to add tags that explicitly 'ask' for that content to be indexed.

tedster

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3777195 posted 6:06 am on Nov 1, 2008 (gmt 0)

Even with this OCR technology trying to improve search, it's still a very good idea to pay attention to the embedded metadata in a PDF file. If you or the people who create online PDF documents for you do not know how to do this, it's time to learn how to locate and modify the metadata.

As the linked articles indicate (and your own experience can verify), accuracy in OCR is still a difficult problem. The Google results from this new adventure are most likely not going to be ideal for quite a while. If you don't want mismatched information, or coffee stains being turned into text, then make sure you take some helpful steps.
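
For anyone wondering what "locating" the embedded metadata looks like in practice, here is a minimal sketch in Python. It scans raw PDF bytes for the standard document-information entries (Title, Author, Subject, Keywords). It rests on several assumptions: the entries are stored uncompressed, as plain literal strings with no escaped or nested parentheses. Hex-encoded strings, compressed object streams and XMP metadata are not handled, and the sample bytes and the `read_pdf_info` name are hypothetical, for illustration only.

```python
import re

# Matches simple literal-string entries such as "/Title (Annual Report)".
# Assumption: the entries are uncompressed and contain no escaped or
# nested parentheses; hex strings and XMP metadata are ignored.
INFO_ENTRY = re.compile(rb"/(Title|Author|Subject|Keywords)\s*\(([^()\\]*)\)")


def read_pdf_info(data: bytes) -> dict:
    """Return the simple document-information entries found in raw PDF bytes."""
    info = {}
    for key, value in INFO_ENTRY.findall(data):
        # Keep only the first occurrence of each key.
        info.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return info


# Hypothetical fragment of a PDF file, for illustration only.
sample = b"1 0 obj\n<< /Title (Widget Catalog 2008) /Author (Example Co.) >>\nendobj\n"
print(read_pdf_info(sample))
```

If a file comes back with an empty or nonsensical Title, that is the field to fix in whatever tool produced the PDF. A dedicated PDF library would be the safer choice for real files, since this pattern match is deliberately simplistic.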

tangor

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member, Top Contributors Of The Month



 
Msg#: 3777195 posted 9:18 am on Nov 1, 2008 (gmt 0)

Good insights, tedster, but I don't use Acrobat to make PDFs... just a print routine with only a tiny bit of that available. What do guys like me (and probably a zillion others) do? More thoughts welcome, but I guess it is back to school for the Texican as regards PDF and Google. And if I wanted it in HTML I'd have done it that way FIRST!

Muttering...

CainIV

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3777195 posted 6:01 pm on Nov 1, 2008 (gmt 0)

"If you or the people who create online PDF documents for you do not know how to do this, it's time to learn how to locate and modify the metadata."

Any links to good resources / reads tedster? :)

C

newborn

5+ Year Member



 
Msg#: 3777195 posted 6:18 am on Nov 2, 2008 (gmt 0)

If that is the case, when will AdSense be available for PDF documents...

phranque

WebmasterWorld Administrator, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 3777195 posted 8:38 am on Nov 2, 2008 (gmt 0)

"when will adsense be available for PDF documents..."

right after google offers to convert to and/or host all pdf documents "for free"...

sun818

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3777195 posted 7:52 pm on Nov 2, 2008 (gmt 0)

Remember Google Books and Catalog? I wonder if they'll be able to OCR text within images. If I am looking for soup, I recall Google Catalog was able to highlight text within images too!

jimbeetle

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3777195 posted 8:32 pm on Nov 2, 2008 (gmt 0)

"I wonder if they'll be able to OCR text within images."

Am I misunderstanding, or isn't that basically what this thread is about? Not Google indexing PDF documents, but using OCR on scanned documents -- images -- included within a PDF.

tedster

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3777195 posted 8:36 pm on Nov 2, 2008 (gmt 0)

Jim, you understand correctly. Google has been indexing the text content of a PDF for a very long time.

sun818

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3777195 posted 2:07 am on Nov 3, 2008 (gmt 0)

I did a search for "Grey Goose". Notice how, on the lower right, the words "grey goose" on the bottle are highlighted in yellow? I think it'll be neat if Google will process PDFs with embedded images like this:

[catalogs.google.com...]

tedster

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3777195 posted 2:35 am on Nov 3, 2008 (gmt 0)

Our policy is not to discuss specific searches here - but we'll make an exception for this one example, since there seems to be some confusion about what this new step is for PDF indexing. It's a clear example of OCR image indexing taken from catalog search. Good original print quality certainly made OCR easier to use in that arena.

ergophobe

WebmasterWorld Administrator, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3777195 posted 5:46 pm on Nov 3, 2008 (gmt 0)

"I hope this works better than the OCR I've become accustomed to"

They've been using this for Google Books, which are images presented in PDF format, for quite some time. Like any OCR, accuracy depends a lot on the font and the text. If the text contains standard dictionary words in a relatively clean font (not old paper with strong serif fonts) the results are decent. Otherwise, poor.

I'm curious that this is news, because PDF scans from Google Books have been text-searchable for a long time and you have been able to view the text version for at least a year. It is these text versions that have formed the basis of the book SERPs and, with universal search putting book results in the general results, OCR'd text from image scans has been showing up in the general search results for quite a while too.

I guess this announcement means that they're expanding that usage beyond Google Books to documents discovered "in the wild"? Or is this something different?

wilderness

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 3777195 posted 6:50 pm on Nov 3, 2008 (gmt 0)

In addition, Google has a long history of travelling into directories that are excluded (by request) in robots.txt to retrieve PDF files!

"I offer several pdfs, put in that format for a REASON. Just checked google and see that for two of those they offer ViewHTML for those titles. Looks like farmer's friend. Worse, the "title" is totally fubarred...and was used for the PDF listing! Not liking this at all."

Nor am I, tangor!
I have many files that could easily have been offered as text or HTML, were that my original intent. That I am at least offering a viewing option in the PDF format is a choice that should need no comprehension and/or explanation for Google or any other bot.

Has anybody seen their password-encrypted PDFs being OCR'd by Google and listed in the SERPs?
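
On the robots.txt point, one quick way to confirm that a directory of PDFs really is excluded for a given crawler is Python's standard `urllib.robotparser` module. The sketch below is only that: the rules, the directory name and the example.com URLs are all hypothetical, and it reports what the rules say, not what any crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; substitute the contents of your own file.
rules = """\
User-agent: *
Disallow: /private-pdfs/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)


def is_allowed(url: str, agent: str = "Googlebot") -> bool:
    """Report whether the parsed rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


# Hypothetical URLs, for illustration only.
print(is_allowed("https://www.example.com/private-pdfs/manual.pdf"))
print(is_allowed("https://www.example.com/public/brochure.pdf"))
```

If a PDF in an excluded directory still turns up in results, a check like this at least rules out a typo in the robots.txt file itself before you start blaming the bot.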


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved