Inferences from Googlebot Page Download Stats in WMT

Forum Moderators: Robert Charlton & goodroi

jerednel

7:18 pm on Jun 27, 2011 (gmt 0)

Let me preface this post with a note from the Webmaster Central article, which states that "Googlebot may crawl more than the first 100KB of text." [google.com]

I take this to mean that the amount of text crawled by Googlebot varies from site to site based on a number of factors.

I also take this to mean that for pages larger than 100KB, some text is not read by Googlebot.

In Webmaster Tools > Diagnostics > Crawl Stats we are given a few graphs, two of which are:

Pages Crawled Per Day

Kilobytes downloaded per day

Am I correct to assume that (KB downloaded per day) / (pages crawled per day) = average KB downloaded per page?

And that all important content and links should be placed before this limit is reached?

Or am I delusional? Thanks!
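For reference, the division being asked about can be sketched as follows. The daily figures are invented for illustration (they are not taken from any real Crawl Stats report), and the function name is hypothetical. Note that the result is only an average over the fetches in the period; nothing here shows that it corresponds to a per-page limit.

```python
# Hypothetical daily figures read off the two Crawl Stats graphs
# (invented for illustration, not real data).
kb_downloaded_per_day = [5200.0, 4800.0, 6100.0]
pages_crawled_per_day = [130, 120, 150]


def avg_kb_per_page(kb_per_day, pages_per_day):
    """Total kilobytes divided by total pages over the same days."""
    total_pages = sum(pages_per_day)
    if total_pages == 0:
        raise ValueError("no pages crawled in this period")
    return sum(kb_per_day) / total_pages


average = avg_kb_per_page(kb_downloaded_per_day, pages_crawled_per_day)
print(f"{average:.2f} KB per crawled page")
```

Summing over several days before dividing avoids giving a day with very few fetches the same weight as a busy day.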

Chrispcritters

7:41 pm on Jun 27, 2011 (gmt 0)

I believe that you are taking the note in a FAQ about "Fetch as Googlebot" out of context.

jerednel

7:49 pm on Jun 27, 2011 (gmt 0)

The note in parentheses refers to Googlebot itself, not the Fetch as Googlebot feature. To say that they "may" fetch more than 100KB is somewhat ambiguous.

tedster

8:16 pm on Jun 27, 2011 (gmt 0)

Yes, the amount of crawling certainly varies from site to site, based on a complex calculation.

You will notice that Googlebot does not crawl every page with the same frequency, not even within a single website. This means that taking the average, as you are doing, will not give you anything all that meaningful, because the average will be weighted by the specific URLs that were crawled.
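A small sketch of that weighting effect, with made-up numbers: the URLs, page sizes, and fetch counts below are all hypothetical. It compares the figure you get from total KB / total fetches with the plain average page size across the site's URLs.

```python
# Hypothetical site: page size in KB and how many times each URL
# was fetched during the reporting period (all values invented).
page_size_kb = {"/": 20.0, "/news": 30.0, "/archive/big-report": 250.0}
fetch_count = {"/": 50, "/news": 45, "/archive/big-report": 5}


def crawl_weighted_average(sizes, fetches):
    """Total KB downloaded divided by total fetches (what the graphs yield)."""
    total_kb = sum(sizes[url] * fetches[url] for url in sizes)
    total_fetches = sum(fetches[url] for url in sizes)
    return total_kb / total_fetches


def per_url_average(sizes):
    """Plain average page size, counting every URL once."""
    return sum(sizes.values()) / len(sizes)


weighted = crawl_weighted_average(page_size_kb, fetch_count)
unweighted = per_url_average(page_size_kb)
print(f"from crawl stats: {weighted:.1f} KB, actual mean page size: {unweighted:.1f} KB")
```

In this made-up case the frequently crawled small pages pull the crawl-stats figure far below the real mean page size, so the number says more about which URLs were fetched than about how much of any one page was read.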

"And that all important content and links should be placed before this limit is reached?"

"Before" doesn't have any context here. Google doesn't start on the home page, then "click" on every link etc. Instead Googlebot downloads a page and records the URLs that it finds there in a back-end database. All the recorded URLs are then put into a prioritized crawling queue by a complex algorithm. Every time Googlebot goes out crawling, it gets its marching orders from that crawling queue.

That's what I mean by "before" not having any context. Just make your pages the minimum-optimum needed for performing their function for visitors.
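To illustrate the queue idea described above, here is a toy sketch. It is not Google's actual algorithm, which is not public: the link graph, the priority scores, and the function name are all hypothetical, and a dictionary lookup stands in for downloading a page.

```python
import heapq

# Hypothetical link graph standing in for real page downloads.
LINKS = {
    "/": ["/about", "/news"],
    "/news": ["/news/story-1", "/"],
    "/about": [],
    "/news/story-1": ["/"],
}

# Hypothetical priority scores (lower = crawl sooner).
PRIORITY = {"/": 0, "/news": 1, "/news/story-1": 2, "/about": 5}


def crawl(seed, max_fetches):
    """Fetch URLs in priority order, not in the order links appear on a page."""
    queue = [(PRIORITY[seed], seed)]
    seen = {seed}
    order = []
    while queue and len(order) < max_fetches:
        _, url = heapq.heappop(queue)      # take the highest-priority URL
        order.append(url)                  # "download" the page
        for link in LINKS.get(url, []):    # record the URLs found on it
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (PRIORITY.get(link, 10), link))
    return order


crawl_order = crawl("/", max_fetches=10)
print(crawl_order)
```

In this toy graph "/about" is the first link on the home page yet it is fetched last, because the queue orders URLs by score rather than by where the link appeared.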