Is Google correctly indexing very big HTML or PDF documents?

         

doc_herbst

4:30 pm on Jun 5, 2015 (gmt 0)

10+ Year Member



Hello from Paris, France,

I am not an SEO, just the webmaster of my own professional blog and a researcher, and I am missing a crucial piece of information when I search the Web:

Does Google correctly index very large documents, whether HTML or PDF?

So that's two questions in fact :-)

1. Does Google now index an HTML file of any size, all the way to its end?

2. Concerning PDFs, about six years ago I heard that Google:
- cannot or will not index the whole of a PDF document of, say, 200 or 300 pages
- and/or favors, in its ranking, keywords located in the first pages of the PDF
- and/or doesn't index keywords located near the end of the PDF.

Does this limit on PDFs still apply? Do you know of any scientific articles or SEO blog posts about it?

All my research (including the free pages on WebmasterWorld) yielded only very old answers.

Thanks for your insights.

aakk9999

4:38 pm on Jun 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hello doc herbst and welcome to WebmasterWorld!

I have a couple of questions:
- How big do you mean when you say "big"?
- Have you done any testing in this regard, such as searching for a phrase located somewhere towards the end of a PDF document?

doc_herbst

9:01 pm on Jun 5, 2015 (gmt 0)

10+ Year Member



- By big I mean very long in terms of the number of A4 pages and/or large in terms of KB. Let's say 1 MB+ for PDFs.

- I did some testing: Google does have in its index PDF documents of 341, 1735 or even 2000+ pages (with *no* images at all). But:
* a phrase near the end of the 1735-page PDF is not indexed by Google
* and if you don't limit your search with site: and/or filetype:pdf it's impossible to find the result you're looking for in the 30 first results and in most cases I looked up until the 100th result to no avail.

This may be caused by an indexing and/or ranking limit, and/or by the difficulty Google has in ranking highly something lost in the enormous mass of information of a single document. Or by something else?

Hence my post, in order:
- to be sure, to try to distinguish between the causes, and to learn more about it
- to ask the community if anyone knows of a precise limit/length such as the old 101 KB one.

Example 1:
PDF document to find: [curia.europa.eu...] (341 A4 pages, 1 MB)
There is a bibliographic reference on page 41:
Trujillo Herrera, Raúl. El "hecho diferencial" en el tratado constitucional : el caso de las "regiones ultraperiféricas" / por Raúl Trujillo Herrera C.613.1. Noticias de la Unión Europea. Año XXI (2005), no. 251, p. 5-18.
Now let's say you search for this article and want to know who, or which journal or site, cites it. Because you're lazy, you type: Trujillo hecho diferencial constitucional. In vain. Only if you type: Trujillo hecho diferencial constitucional filetype:pdf site:curia.europa.eu will you find the above-mentioned PDF. Otherwise you have to type the full title: Trujillo Herrera, Raúl. El "hecho diferencial" en el tratado constitucional : el caso de las "regiones ultraperiféricas" (but it's highly improbable that anyone would do that).

Example 2:
PDF document to find: [curia.europa.eu...] (1735 A4 pages, 6 MB)
There is a bibliographic reference on page 1718:
Lafuma Emmanuelle. Harcèlement moral et point de départ du délai de prescription, Revue de jurisprudence sociale 2011 p.756-757
You search for this article and you're not lazy this time, so you type: Lafuma Emmanuelle Harcèlement moral et point de départ du délai de prescription. In vain. Even if you add: filetype:pdf site:curia.europa.eu, Google doesn't return the PDF.

Example 3:
The reference "Wyrozumska Anna, 1249, 1285" can be found at the end of [curia.europa.eu...] (341 A4 pages, 1 MB).
The query Wyrozumska Anna 1249 1285 in Google does not return the above-mentioned PDF but this one instead: [curia.europa.eu...], a similar but more recent document that is only 177 A4 pages long and 430 KB in size.
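
For anyone who wants to repeat this kind of test on their own documents, the method used in the three examples (the same phrase tried as loose keywords, as an exact phrase, and restricted with filetype: and site:) can be sketched in Python. This only builds the query strings and search URLs; it does not fetch or parse any results. The function names are made up for illustration, and the search URL with a `q` parameter is assumed to be the usual public one.

```python
from urllib.parse import urlencode


def build_test_queries(phrase, site=None, filetype=None):
    """Return the query variants used to check whether a phrase is findable.

    Hypothetical helper: variants are the loose keywords, the exact phrase
    in double quotes, and the keywords restricted by filetype:/site:.
    """
    variants = [phrase, f'"{phrase}"']
    restricted = phrase
    if filetype:
        restricted += f" filetype:{filetype}"
    if site:
        restricted += f" site:{site}"
    if restricted != phrase:
        variants.append(restricted)
    return variants


def to_search_url(query):
    # Assumption: the standard public search URL with a `q` parameter.
    return "https://www.google.com/search?" + urlencode({"q": query})


queries = build_test_queries(
    "Trujillo hecho diferencial constitucional",
    site="curia.europa.eu",
    filetype="pdf",
)
for q in queries:
    print(to_search_url(q))
```

Running the same three variants for a phrase near the start and a phrase near the end of the same PDF helps separate an indexing limit (even the restricted query fails) from a ranking effect (only the unrestricted queries fail).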

lucy24

12:50 am on Jun 6, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



if you don't limit your search with site: and/or filetype:pdf it's impossible to find the result you're looking for in the 30 first results and in most cases I looked up until the 100th result to no avail.

Interesting. That's the same thing I found while doing some spot-checking of my own-- except there was no question of "top 30" or "top 100" because they're exceedingly unlikely phrases. In fact it makes me a little anxious that the only way I could make things show up at all was with a site: operator, even while they came up with half a dozen different results from certain other sites. (This is an ebook of a public-domain text, so the duplication is perfectly legitimate. Just weird.) I picked the biggest one for checking; the pdf weighs in at 1.8MB. Something like 350 pages in the original book; I've never bothered to check the PDF's page count. But there doesn't seem to be any difference between material from the beginning and the end.

aristotle

12:40 pm on Jun 6, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



- to ask the community if anyone knows of a precise limit/length such as the old 101 KB one.


I'm pretty sure that Google has raised that limit. I have a 156 KB HTML page on one site that appears to be fully indexed.

But Bing may have a lower limit: I recently got a warning from Bing Webmaster Tools that the same page exceeds 125 KB, suggesting it MIGHT not be fully indexed for that reason. I haven't got around to investigating that yet.
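
A rough way to see which of the sizes mentioned in this thread a given page exceeds is sketched below in Python. The two thresholds are only the figures quoted above (the old 101 KB one and the 125 KB Bing warning), not documented limits, and treating 1 KB as 1024 bytes is an assumption. The function name is made up for illustration.

```python
# Sizes quoted in this thread, NOT official or documented limits.
THRESHOLDS_KB = {
    "old 101 KB limit": 101,
    "Bing Webmaster Tools 125 KB warning": 125,
}


def exceeded_thresholds(size_bytes, bytes_per_kb=1024):
    """Return the names of the thread's thresholds that this size exceeds.

    Assumption: 1 KB = 1024 bytes; pass bytes_per_kb=1000 for decimal KB.
    """
    size_kb = size_bytes / bytes_per_kb
    return [name for name, limit in THRESHOLDS_KB.items() if size_kb > limit]


# Stand-in for a real page: 160000 bytes, i.e. about 156 KB.
html = b"<p>x</p>" * 20000
print(len(html), exceeded_thresholds(len(html)))
```

For a real page, replace the stand-in with the raw bytes of the saved HTML file, since the size of the uncompressed source is presumably what such a warning refers to.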