The thing is, I would like to understand how they do it. All these high-quality links come from PDF files, or so my analysis tool tells me. But when I visit the PDFs I cannot find the specific link, and it would not make sense for my competitor to have a link in these PDFs anyway.
Obviously something sneaky is going on. But what? Anyone know anything about this?
/webjuice
But Google started using a technology called optical character recognition (OCR) to extract text from PDFs from late 2008 onwards.
Basically, it takes snapshots of the PDF pages as input, runs optical character recognition on them, and indexes the resulting text just like regular text.
If it can see the text, wouldn't it be seeing the links too?
If you want the geek details about OCRopus, the open source OCR software that Google sponsors, refer to: [code.google.com...]
(If you have Acrobat Pro 9, you can see the option under Documents => OCR Text Recognition => Recognize Text using OCR)
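To make the "if it can see the text, it sees the links" point concrete, here is a rough Python sketch of pulling URL-like strings out of text that has already been OCR'd. This is only an illustration, not how Google actually does it; the function name `find_urls`, the regex, and the sample text are all made up for the example.

```python
import re

# Assumption: a deliberately simple pattern for http(s) URLs and bare
# "www." hostnames. Real crawlers are certainly more careful than this.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"')\]]+", re.IGNORECASE)


def find_urls(ocr_text):
    """Return URL-like strings found in the text, in order, without duplicates."""
    found = []
    for match in URL_PATTERN.finditer(ocr_text):
        # Drop punctuation that belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:")
        if url not in found:
            found.append(url)
    return found


# Hypothetical OCR output from a scanned PDF page.
sample_text = (
    "Full report at http://www.example.com/report.pdf. "
    "See also www.example.org, page 3."
)
print(find_urls(sample_text))
```

So a URL that is merely printed on the page, without being a clickable link, could still be picked up once the page has been turned into text.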
-AD
In the following search, replace example.com with the domain name in question.
linkdomain:example.com site:.gov
Does it show links from those PDF files?