Forum Moderators: open
Google has been crawling my two domains for six months now, and the pattern has been consistent. I'd like to describe what I think happens at Google, and solicit the experiences of others.
As of about four months ago, Google had something like 6,000 PCs networked together. I saw a quote from an employee who said that about ten percent are dedicated to crawling, another ten percent to research, and the rest to handling searches. Let's assume the number of crawler PCs is around 600. The number of searches they handle per day is now up to about 70 million, half from affiliates and half directly. I'm just trying to give you a feel for the scale here; I'm not claiming I've got the inside scoop, and my numbers may be out of date.
The pattern is this: Google starts crawling the two sites slowly. The last crawl started around April 5. One of my domains has a higher PageRank than the other, and this is dramatically reflected in the difference between the rates at which Google crawls each site. For about 7 to 10 days, each site is crawled around the clock. It starts slowly -- once every couple of minutes total, using two crawlers -- and by the end it's a frenzy of crawling. This final frenzy lasts at most two days; this last time it was more like 12 hours. By frenzy, I mean up to three requests per second on the high-PageRank domain, and maybe one every five seconds on the lower-ranked domain (each dynamic page requires a search on our server unless it's cached, and it rarely is for Google's requests). The rate I mention for each site is for all five crawlers combined.
It's a load-killer at the end; I had to install a load thermostat for the Google requests, and start turning them away after a certain load was reached.
Five crawlers work each site, and they tend to be the same five on each of my sites. From what I've read, their crawlers don't talk to each other, and all their machines run the same software, which makes their system easier to scale upward. I think there is some inadvertent duplication among the five crawlers in the short term, but I suspect it gets weeded out later; I haven't analyzed my logs closely enough to say how much they overlap.
To give you an idea of the scale involved: if I do a site:www.my.domain search for my home page (each of my pages has a "back to home page" link at the bottom), a normal Google search from the March crawl gives me 44,000 hits for my first domain and 27,000 for the second.
And then, WHAM! Right at the peak of frenzy, all crawling stops. This time it happened early morning on April 14. Curiously, the crawling stops on both of my sites at the same time. I take this to mean that all of Google has decided that their monthly (actually it's more like every five weeks now; they're slowing down a bit) crawl is over, and it's time to start processing the data. This processing will make it through the pipeline about the time the next crawl starts, or maybe a little later.
The point of all this is that I don't think Google has any sort of quota for my site, but I know they only make it about halfway into my database. The second point is that PageRank has a lot to do with how many resources (crawlers) Google assigns to your site, how soon they start on it, and most importantly, how fast they will hit it.
Each crawler presumably rotates among several sites at any one time to start with. But as the deep-web "to be crawled" queue builds up from a domain like mine for a particular crawler, that crawler pays more attention to that site and hits it more frequently. The ranking in this queue is controlled by PageRank. The reason my high-rank site gets hit more frequently is that the "to be crawled" list from that site is ranked higher than the one from the other site, so it consumes more of the crawlers' attention sooner in the queue.
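If I had to sketch what I think that queue looks like, it would be something like the little C program below. Everything in it -- the struct, the referrer_pagerank field, the URLs and the numbers -- is my own invention for illustration, not anything Google has published; it just shows why URLs linked from a higher-PageRank page would get fetched sooner.

```c
/* Minimal sketch of a PageRank-ordered "to be crawled" queue.
 * All names and numbers here are hypothetical -- this illustrates
 * the ordering idea, not Google's actual crawler. */
#include <stdio.h>
#include <stdlib.h>

struct queued_url {
    const char *url;
    double      referrer_pagerank;  /* rank of the page that linked here */
};

/* qsort comparator: highest referrer PageRank first */
static int by_rank_desc(const void *a, const void *b)
{
    const struct queued_url *ua = a, *ub = b;
    if (ua->referrer_pagerank < ub->referrer_pagerank) return  1;
    if (ua->referrer_pagerank > ub->referrer_pagerank) return -1;
    return 0;
}

int main(void)
{
    struct queued_url queue[] = {
        { "http://low-rank.org/cgi-bin/a.cgi?item=1",  0.15 },
        { "http://high-rank.org/cgi-bin/b.cgi?item=1", 6.20 },
        { "http://high-rank.org/cgi-bin/b.cgi?item=2", 6.20 },
    };
    size_t n = sizeof queue / sizeof queue[0];

    qsort(queue, n, sizeof queue[0], by_rank_desc);

    /* The high-rank domain's URLs float to the front, so they get
     * fetched sooner and more often -- the behavior I see in my logs. */
    for (size_t i = 0; i < n; i++)
        printf("%5.2f  %s\n", queue[i].referrer_pagerank, queue[i].url);
    return 0;
}
```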
Is it generally recognized that PageRank has such a major impact on crawler behavior? I haven't seen this in print before, but it seems to me that my evidence is irrefutable. My two domains are very similar in design, function, and links returned, so PageRank is the only thing that can explain the difference in crawling rates.
This evidence supports observations made by other posters on this forum, to the effect that the best way to get Google to fly to your site and suck it up is to get a link to your main page on some other high-PageRank site. I also have a couple of other domains with static pages, and have noticed that when I link to them from my high-ranked site, Google goes there immediately.
This is just more evidence that the queue for a particular crawler's "to do" list is also ranked according to the PageRank of the referring site.
The phrase "Tyranny of the Majority" comes to mind. I'd hate to be a site just starting out, trying to attract Google's attention. My high-ranked site has been out there for over six years now, and that's the only reason I'm in as good a position as I find myself. My lower-ranked site has been around for just one year.
Great post!
>recognized that PageRank has such a major impact on crawler behavior? I haven't seen this in print before
Have a peek here [webmasterworld.com]. I'd be interested to know if it lines up with your experiences in any way.
Does anyone know the application date (approximately) on Google's patent application for PageRank? Since Larry Page presented all this stuff at conferences in early 1998, he would have one year from that time to file a patent. Otherwise, the entire application is Dead on Arrival at the PTO. It seems to me that if they got in under the wire, it was very close. If they didn't get in under the wire, the PTO should be made aware of the situation.
I have about 500 domains on each server so you can imagine what happens when you get spidered by the majors and then some lesser-known one comes along to make things worse.
...at least that is what I heard:)
"(you don't want) to stop the big boys" ...not that it *can't* be done. ...my bad. I missed all that:)
I was thinking just about google... because googlebot is so intense (and relatively predictable) you could tweak the box just while he is in there.
...yeah, I got you now. sorry about that.
I read "Efficient Crawling Through URL Ordering" by Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, and have some comments on how it is related to my experience with Google.
Their paper discusses several alternative methods of ordering the URL queue of a crawler, and evaluates the extent to which a particular ordering algorithm, or combination of algorithms, improves the efficiency of that crawler.
Their entire premise is based on a definition of page importance that has its foundation in PageRank -- i.e., the more important backlinks a particular page has, the more important it is to crawl that page first. You won't make it through the entire web, so you have to order your crawling priorities.
It's possible to anticipate, to some small extent, the content of a page being considered for queue placement. One way to do this is to look at its proximity to surrounding pages that are known to be important -- for example, other pages in the same directory, or perhaps one or two directories removed, within the same domain.
One can also anticipate the content of a page from the text in the anchor that points to it. There is evidence that Google gives extra weight to anchor text, so one would be well-advised to avoid the overused "Click here" in one's anchors. Instead, spell out in the anchor text itself what the reader will be clicking on. For Google, the value of the word "click" in an anchor is inversely proportional to the number of times it is used on the web -- which makes it pretty low.
All these considerations are put together into formulas, and each is weighted. Experiments were run at Stanford on a small test web to determine the relative crawler efficiency of the various approaches. The predictable academic conclusion is that they're all interesting, and in some cases, some are better than others.
Their scheme is recursive, because the "efficiency" of a crawler is determined by how many "important" pages it finds in a given time, as opposed to how many it would find using a random queue. "Important" is defined by PageRank, which in turn is based on one's prior history of crawling. It all feeds on itself. To put it another way, you can't know the content of a page before you fetch it, so you're stuck with what you can know.
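For reference, the recursion they lean on is the PageRank formula from the published Brin/Page paper: PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where the Ti are the pages linking to A, C(Ti) is the number of outbound links on Ti, and d is a damping factor. Below is a toy C sketch of iterating that formula on a made-up four-page web, just to show how importance feeds on itself; the link graph, damping factor, and iteration count are arbitrary.

```c
/* Toy PageRank iteration, to show how "importance feeds on itself".
 * The formula is from the published paper; the 4-page link graph,
 * damping factor, and iteration count are made up for illustration. */
#include <stdio.h>

#define N 4          /* pages in the toy web */
#define D 0.85       /* damping factor */
#define ITERS 30

int main(void)
{
    /* link[i][j] == 1 means page i links to page j */
    int link[N][N] = {
        { 0, 1, 1, 0 },
        { 0, 0, 1, 0 },
        { 1, 0, 0, 1 },
        { 0, 0, 1, 0 },
    };
    double pr[N], next[N];
    int outdeg[N] = { 0 };

    for (int i = 0; i < N; i++) {
        pr[i] = 1.0;                      /* start everyone equal */
        for (int j = 0; j < N; j++)
            outdeg[i] += link[i][j];
    }

    for (int it = 0; it < ITERS; it++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int i = 0; i < N; i++)   /* every page i linking to j */
                if (link[i][j] && outdeg[i] > 0)
                    sum += pr[i] / outdeg[i];
            next[j] = (1.0 - D) + D * sum;
        }
        for (int j = 0; j < N; j++)
            pr[j] = next[j];
    }

    for (int j = 0; j < N; j++)
        printf("page %d: PR = %.3f\n", j, pr[j]);
    return 0;
}
```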
In my practical experience with Google crawling, PageRank is supreme. It's the only thing that accounts for the radical difference in the crawl rate on my two sites, since all other factors are equal. I'm referring to the PageRank of my two root domains [URLs snipped NFFC]. All dynamic pages are /cgi-bin/XXXXXX.cgi?YYYYYY down from those two roots. One site's primer page of dynamic links is reverse-sorted from the other site's primer page, in the hope that this will increase Google's ability to get through the data; so far I'm unable to anticipate Google's ordering by studying our logs. Theoretically, if both domains had the same PageRank and each got halfway through the data, they'd meet in the middle.
Initially Google's rotation algorithm keeps the load off of any particular server, but after 7 to 10 days of crawling a site like mine, this breaks down. It was not designed for a "deep web" situation. If someday they put algorithms in the process to address the "deep web" load problem, then they wouldn't get nearly as deep into my sites as they currently do.
The alternative for Google is to start using human judgment about which deep web sites are worth extra effort. There is a profound resistance to human intervention at Google, judging from my experience of trying to discuss this issue with them. They want everything automated.
Clearly, Google is cutting off their crawl of the entire web due to time and resource constraints. If they crawled less frequently, then they could crawl deeper. But with the advent of sites such as www.moreover.com, with a turnaround of hours instead of 30 to 60 days, the pressure is surely in the other direction. Google already shows signs of slowing down. This makes their crawler ordering more important, not less important.
Google's Big Crawl consists almost entirely of a popularity contest that was decided in advance of the crawl. It feeds on its own assumptions to a degree that could be self-defeating at some point in the future. To put it another way: If the entire web turns into a landfill, then Google will start to stink first. That's because it will effectively detect and reflect such decline earlier than other engines. Given its position as the search engine of choice, it will inadvertently promote such decline. Their algorithms are, in other words, brilliantly oblivious to content.
There are exceptions: they appear to avoid many porn sites, and they may have decided to give more weight to dot-orgs or dot-edus or dot-govs as opposed to dot-coms. However, their PageRank rule is more significant than these exceptions, except perhaps for porn sites. Top-level domain tuning and porn screening can only go so far toward ensuring quality.
Yet given the task Google has set for itself, how else could you handle it? I predict a general fragmentation of the web into specialty sites, searched by specialty engines that are more flexible. These engines will be staffed by people who, in many cases, will make special arrangements to present unique or high-quality databases in their search results.
That has not happened yet. My deep-web situation is a database called "NameBase" that is entirely unique, of worldwide interest to a broad spectrum of users, and has been developed over 18 years; about 90 percent of it is material that is unavailable elsewhere on the web. No search engine has made inquiries about how to crawl it. In fact, Google's search algorithms are the only ones I've seen that work fairly well for searching it, yet Google did not respond to my proposal to make it easier to absorb this data.
Google is the last word in web-wide searching, but it may also be the last gasp. I wish them luck, though, because they've certainly made my life more interesting.
Edited by: NFFC
I started a discussion of this the other day in a different thread, "Search Engine Spider Identification":
[webmasterworld.com...]
There are two elements to this: a) detection and identification of spiders, and b) what to do next.
For spiders I don't like, I just give them a "Server too busy" message and exit. For spiders I like but that are coming too fast (that's Google), I do the same thing, but revert to honoring the requests once the load drops below my threshold. Maybe I should just insert a delay of a few seconds instead, but since Google is only getting through half my data anyway, who cares?
I think it's important to use "C" rather than another language. Speed is everything. I know I've opened up a can of worms with that opinion!
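To make the idea concrete, here's a stripped-down sketch of the kind of load thermostat I mean -- not my actual production code. It assumes a Linux box (it reads /proc/loadavg), and the 4.0 load cutoff and the "Googlebot" user-agent check are just placeholders.

```c
/* Stripped-down "load thermostat" for a CGI program -- a sketch of
 * the idea only.  Assumes Linux (/proc/loadavg); the LOAD_LIMIT and
 * the "Googlebot" substring check are placeholder values. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LOAD_LIMIT 4.0   /* 1-minute load average cutoff (placeholder) */

static double current_load(void)
{
    double load = 0.0;
    FILE *f = fopen("/proc/loadavg", "r");
    if (f) {
        if (fscanf(f, "%lf", &load) != 1)
            load = 0.0;
        fclose(f);
    }
    return load;
}

int main(void)
{
    const char *agent = getenv("HTTP_USER_AGENT");
    int is_google = agent && strstr(agent, "Googlebot") != NULL;

    /* A spider I like (Google), but only while the box can take it. */
    if (is_google && current_load() > LOAD_LIMIT) {
        printf("Status: 503 Service Unavailable\r\n");
        printf("Content-Type: text/plain\r\n\r\n");
        printf("Server too busy -- please come back later.\n");
        return 0;
    }

    /* ... normal CGI processing (database search, page output) ... */
    printf("Content-Type: text/html\r\n\r\n");
    printf("<html><body>Search results would go here.</body></html>\n");
    return 0;
}
```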
Anyway, let's continue this on that other thread....
Are you saying that the two sites are *exactly* the same? That would be a first in my experience. Even a mirror is a mirror, if you get my point.
The sites are exactly the same once you get past the domain name and the /cgi-bin/XXXX.cgi.
The cgi programs are named differently for each domain, but the QUERY_STRING after the question mark is exactly the same for all 115,000 dynamic pages. The links on any page all use the same domain name as the page itself, so someone on one domain doesn't get thrown into the other domain. I have separate logging for each domain, which makes it easy to see the difference in crawling rates.
Google thinks it's two separate sites, even though the QUERY_STRING and the anchor text -- essentially a repeat of the QUERY_STRING -- are identical on both sites. In fact, they are on the same server and access the same directory for the data.
I have different static pages on one site, as it's a "straight" version for uptight librarians (don't trust anyone under 50, because they didn't experience the 1960s), whereas the big site is in-your-face political. That's the one that's been around for five years. Before that it was on telnet for a year.
I'm talking about how far Google gets into the 115,000 dynamic pages, so the only difference Google sees is my two different domain names, both of which are dot-orgs.
Therefore, it's a safe bet that what's driving the radically different behavior by Google is the different PageRanks of each domain name.
Here's something else that's interesting. This thing is a proper-name index of individuals, corporations, and groups. For individuals, the entries are always in the form surname, first name. There are two reasons it works well with Google's algo and doesn't do well with other bots:
a) Google tracks word position on each page. If you look for John F. Smith without quotes, the words in my anchor -- SMITH JOHN F -- still hit in Google. Moreover, they hit with a very high score, because the three words (yes, Google respects my middle initial) are very close together.
b) My search terms are always in the anchor text, which gives me even more points.
If you look for "John F. Smith" in quotation marks, you won't find it. (This is a bad example; Smith is too common and I won't come out close to the top. But you can see where it will work nicely on less common names.)
Currently I'm getting about 3,000 name searches per day, about half of which are referrals from Google. It gets even messier, because sometimes the "John" portion of the name is farther down the page and separated from the surname you asked for. But the ranking points from word proximity that Google uses prevent this from being too much of a problem.
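I have no way of knowing exactly how Google scores proximity, but the effect behaves roughly like the sketch below: take the word positions of the matched query terms on the page and reward a tight window containing all of them. The function, the numbers, and the scoring formula are purely my own guess, for illustration.

```c
/* My guess at how a proximity bonus might behave -- not Google's
 * actual scoring.  Given one matched position per query term, score
 * the page by how tight the span covering all terms is.  "SMITH JOHN F"
 * in one anchor gives a span of 3; a "John" fifty words away from the
 * surname scores far worse. */
#include <stdio.h>

/* positions[i] = word position of the i-th query term's match */
static double proximity_score(const int *positions, int nterms)
{
    int lo = positions[0], hi = positions[0];
    for (int i = 1; i < nterms; i++) {
        if (positions[i] < lo) lo = positions[i];
        if (positions[i] > hi) hi = positions[i];
    }
    int window = hi - lo + 1;          /* smallest span covering all terms */
    return (double)nterms / window;    /* 1.0 = terms are adjacent */
}

int main(void)
{
    int anchor_hit[]    = { 17, 18, 19 };   /* SMITH JOHN F, all adjacent */
    int scattered_hit[] = { 17, 60, 61 };   /* surname far from first name */

    printf("adjacent:  %.3f\n", proximity_score(anchor_hit, 3));
    printf("scattered: %.3f\n", proximity_score(scattered_hit, 3));
    return 0;
}
```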
Now you can see why Google is the only bot I can let into my cgi-bin. Any other algorithms make mincemeat out of my 18 years of indexing.
Google probably starts by figuring out which terms are searched the most, spidering the highest-ranked sites under those terms, and working its way down.
It's true that Google starts at the top of the PageRank scale and moves downwards as it spiders.
Has anyone noticed a pattern to how it crawls new sites? I mean sites submitted, not just found. Earlier this spring we moved a hundred sites to new domains and it was interesting to watch Googlebot go at it. Just wondered if anyone else had noticed anything peculiar in that regard.