Forum Moderators: open
Google has been crawling my two domains for six months now, and the pattern has been consistent. I'd like to describe what I think happens at Google, and solicit the experiences of others.
As of about four months ago, Google had something like 6,000 PCs networked together. I saw a quote from an employee who said that about ten percent are dedicated to crawling, another ten percent to research, and the rest to handling searches. Let's assume the number of crawler PCs is around 600. The number of searches they handle per day is now up to about 70 million, half from affiliates and half directly. I'm just trying to give you a feel for the scale here; I'm not claiming I've got the inside scoop, and my numbers may be out of date.
The pattern is this: Google starts crawling the two sites slowly. The last crawl started around April 5. One of my domains has a higher PageRank than the other, and this is dramatically reflected in the difference between the rates at which Google crawls each site. For about 7 to 10 days, each site is crawled around the clock. It starts slowly -- once every couple of minutes total, using two crawlers -- and by the end it's a frenzy of crawling. This final frenzy lasts at most two days; this last time it was more like 12 hours. By frenzy, I mean up to three requests per second on the high-PageRank domain, and maybe one every five seconds on the lower-ranked domain (each dynamic page requires a search on our server unless it's cached, and it rarely is for Google's requests). The rate I mention for each site is for all five crawlers combined.
It's a load-killer at the end; I had to install a load thermostat for the Google requests, and start turning them away after a certain load was reached.
Five crawlers work each site, and they tend to be the same five on each of my sites. From what I've read, their crawlers don't talk to each other, and all their machines run the same software, which makes their system easier to scale upward. I think there is some inadvertent duplication among the five crawlers in the short term, but I suspect it gets weeded out later; I haven't analyzed my logs closely enough to say how much they overlap.
To give you an idea of the scale involved: if I do a site:www.my.domain search for my home page (each of my pages has a "back to home page" link at the bottom), a normal Google search from the March crawl gives me 44,000 hits for my first domain and 27,000 for the second.
And then, WHAM! Right at the peak of frenzy, all crawling stops. This time it happened early morning on April 14. Curiously, the crawling stops on both of my sites at the same time. I take this to mean that all of Google has decided that their monthly (actually it's more like every five weeks now; they're slowing down a bit) crawl is over, and it's time to start processing the data. This processing will make it through the pipeline about the time the next crawl starts, or maybe a little later.
The point of all this is that I don't think Google has any sort of quota for my site, but I know they only make it about halfway into my database. The second point is that PageRank has a lot to do with how many resources (crawlers) Google assigns to your site, how soon they start on it, and most importantly, how fast they will hit it.
Each crawler presumably rotates among several sites at any one time to start with. But as the deep-web "to be crawled" queue builds up from a domain like mine for a particular crawler, that crawler pays more attention to that site and hits it more frequently. The ranking in this queue is controlled by PageRank. The reason my high-rank site gets hit more frequently is that the "to be crawled" list from that site is ranked higher than the one from the other site, so it consumes more of the crawlers' attention sooner in the queue.
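If I had to sketch what I think that queue looks like, it would be something like the little C program below. Everything in it -- the struct, the referrer_pagerank field, the URLs and the numbers -- is my own invention for illustration, not anything Google has published; it just shows why URLs linked from a higher-PageRank page would get fetched sooner.

```c
/* Minimal sketch of a PageRank-ordered "to be crawled" queue.
 * All names and numbers here are hypothetical -- this illustrates
 * the ordering idea, not Google's actual crawler. */
#include <stdio.h>
#include <stdlib.h>

struct queued_url {
    const char *url;
    double      referrer_pagerank;  /* rank of the page that linked here */
};

/* qsort comparator: highest referrer PageRank first */
static int by_rank_desc(const void *a, const void *b)
{
    const struct queued_url *ua = a, *ub = b;
    if (ua->referrer_pagerank < ub->referrer_pagerank) return  1;
    if (ua->referrer_pagerank > ub->referrer_pagerank) return -1;
    return 0;
}

int main(void)
{
    struct queued_url queue[] = {
        { "http://low-rank.org/cgi-bin/a.cgi?item=1",  0.15 },
        { "http://high-rank.org/cgi-bin/b.cgi?item=1", 6.20 },
        { "http://high-rank.org/cgi-bin/b.cgi?item=2", 6.20 },
    };
    size_t n = sizeof queue / sizeof queue[0];

    qsort(queue, n, sizeof queue[0], by_rank_desc);

    /* The high-rank domain's URLs float to the front, so they get
     * fetched sooner and more often -- the behavior I see in my logs. */
    for (size_t i = 0; i < n; i++)
        printf("%5.2f  %s\n", queue[i].referrer_pagerank, queue[i].url);
    return 0;
}
```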
Is it generally recognized that PageRank has such a major impact on crawler behavior? I haven't seen this in print before, but it seems to me that my evidence is irrefutable. My two domains are very similar in design, function, and links returned, so PageRank is the only thing that can explain the difference in crawling rates.
This evidence supports observations made by other posters on this forum, to the effect that the best way to get Google to fly to your site and suck it up is to get a link to your main page on some other high-PageRank site. I also have a couple of other domains with static pages, and have noticed that when I link to them from my high-ranked site, Google goes there immediately.
This is just more evidence that the queue for a particular crawler's "to do" list is also ranked according to the PageRank of the referring site.
The phrase "Tyranny of the Majority" comes to mind. I'd hate to be a site just starting out, trying to attract Google's attention. My high-ranked site has been out there for over six years now, and that's the only reason I'm in as good a position as I find myself. My lower-ranked site has been around for just one year.
Great post!
>recognized that PageRank has such a major impact on crawler behavior? I haven't seen this in print before
Have a peek here [webmasterworld.com]. I'd be interested to know if it lines up with your experiences in any way.
Does anyone know the application date (approximately) on Google's patent application for PageRank? Since Larry Page presented all this stuff at conferences in early 1998, he would have one year from that time to file a patent. Otherwise, the entire application is Dead on Arrival at the PTO. It seems to me that if they got in under the wire, it was very close. If they didn't get in under the wire, the PTO should be made aware of the situation.
I have about 500 domains on each server so you can imagine what happens when you get spidered by the majors and then some lesser-known one comes along to make things worse.
...at least that is what I heard:)
"(you don't want) to stop the big boys" ...not that it *can't* be done. ...my bad. I missed all that:)
I was thinking just about google... because googlebot is so intense (and relatively predictable) you could tweak the box just while he is in there.
...yeah, I got you now. sorry about that.
I read "Efficient Crawling Through URL Ordering" by Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, and have some comments on how it is related to my experience with Google.
Their paper discusses several alternative methods of ordering the URL queue of a crawler, and evaluates the extent to which a particular ordering algorithm, or combination of algorithms, improves the efficiency of that crawler.
Their entire premise is based on a definition of page importance that has its foundation in PageRank -- i.e., the more important backlinks a particular page has, the more important it is to crawl that page first. You won't make it through the entire web, so you have to order your crawling priorities.
It's possible to anticipate, to some small extent, the content of a page being considered for queue placement. One way to do this is to look at its proximity to surrounding pages that are known to be important -- for example, other pages in the same directory, or perhaps one or two directories removed, within the same domain.
One can also anticipate the content of a page from the text in the anchor that points to it. There is evidence that Google gives extra weight to anchor text, so one would be well-advised to avoid the overused "Click here" in one's anchors. Instead, spell out in the anchor text itself what the reader will be clicking on. For Google, the value of the word "click" in an anchor is inversely proportional to the number of times it is used on the web -- which makes it pretty low.
All these considerations are put together into formulas, and each is weighted. Experiments were run at Stanford on a small test web to determine the relative crawler efficiency of the various approaches. The predictable academic conclusion is that they're all interesting, and in some cases, some are better than others.
Their scheme is recursive, because the "efficiency" of a crawler is determined by how many "important" pages it finds in a given time, as opposed to how many it would find using a random queue. "Important" is defined by PageRank, which in turn is based on one's prior history of crawling. It all feeds on itself. To put it another way, you can't know the content of a page before you fetch it, so you're stuck with what you can know.
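For reference, the recursion they lean on is the PageRank formula from the published Brin/Page paper: PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where the Ti are the pages linking to A, C(Ti) is the number of outbound links on Ti, and d is a damping factor. Below is a toy C sketch of iterating that formula on a made-up four-page web, just to show how importance feeds on itself; the link graph, damping factor, and iteration count are arbitrary.

```c
/* Toy PageRank iteration, to show how "importance feeds on itself".
 * The formula is from the published paper; the 4-page link graph,
 * damping factor, and iteration count are made up for illustration. */
#include <stdio.h>

#define N 4          /* pages in the toy web */
#define D 0.85       /* damping factor */
#define ITERS 30

int main(void)
{
    /* link[i][j] == 1 means page i links to page j */
    int link[N][N] = {
        { 0, 1, 1, 0 },
        { 0, 0, 1, 0 },
        { 1, 0, 0, 1 },
        { 0, 0, 1, 0 },
    };
    double pr[N], next[N];
    int outdeg[N] = { 0 };

    for (int i = 0; i < N; i++) {
        pr[i] = 1.0;                      /* start everyone equal */
        for (int j = 0; j < N; j++)
            outdeg[i] += link[i][j];
    }

    for (int it = 0; it < ITERS; it++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int i = 0; i < N; i++)   /* every page i linking to j */
                if (link[i][j] && outdeg[i] > 0)
                    sum += pr[i] / outdeg[i];
            next[j] = (1.0 - D) + D * sum;
        }
        for (int j = 0; j < N; j++)
            pr[j] = next[j];
    }

    for (int j = 0; j < N; j++)
        printf("page %d: PR = %.3f\n", j, pr[j]);
    return 0;
}
```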
In my practical experience with Google crawling, PageRank is supreme. It's the only thing that accounts for the radical difference in the crawl rate on my two sites, since all other factors are equal. I'm referring to the PageRank of my two root domains [URLs snipped NFFC]. All dynamic pages are /cgi-bin/XXXXXX.cgi?YYYYYY down from those two roots. One site's primer page of dynamic links is reverse-sorted from the other site's primer page, in the hope that this will increase Google's ability to get through the data; so far I'm unable to anticipate Google's ordering by studying our logs. Theoretically, if both domains had the same PageRank and each got halfway through the data, they'd meet in the middle.
Initially Google's rotation algorithm keeps the load off of any particular server, but after 7 to 10 days of crawling a site like mine, this breaks down. It was not designed for a "deep web" situation. If someday they put algorithms in the process to address the "deep web" load problem, then they wouldn't get nearly as deep into my sites as they currently do.
The alternative for Google is to start using human judgment about which deep web sites are worth extra effort. There is a profound resistance to human intervention at Google, judging from my experience of trying to discuss this issue with them. They want everything automated.
Clearly, Google is cutting off their crawl of the entire web due to time and resource constraints. If they crawled less frequently, then they could crawl deeper. But with the advent of sites such as www.moreover.com, with a turnaround of hours instead of 30 to 60 days, the pressure is surely in the other direction. Google already shows signs of slowing down. This makes their crawler ordering more important, not less important.
Google's Big Crawl consists almost entirely of a popularity contest that was decided in advance of the crawl. It feeds on its own assumptions to a degree that could be self-defeating at some point in the future. To put it another way: If the entire web turns into a landfill, then Google will start to stink first. That's because it will effectively detect and reflect such decline earlier than other engines. Given its position as the search engine of choice, it will inadvertently promote such decline. Their algorithms are, in other words, brilliantly oblivious to content.
There are exceptions: they appear to avoid many porn sites, and they may have decided to give more weight to dot-orgs or dot-edus or dot-govs as opposed to dot-coms. However, their PageRank rule is more significant than these exceptions, except perhaps for porn sites. Top-level domain tuning and porn screening can only go so far toward ensuring quality.
Yet given the task Google has set for itself, how else could you handle it? I predict a general fragmentation of the web into specialty sites, searched by specialty engines that are more flexible. These engines will be staffed by people who, in many cases, will make special arrangements to present unique or high-quality databases in their search results.
That has not happened yet. My deep-web situation is a database called "NameBase" that is entirely unique, of worldwide interest to a broad spectrum of users, and has been developed over 18 years; about 90 percent of it is material that is unavailable elsewhere on the web. No search engine has made inquiries about how to crawl it. In fact, Google's search algorithms are the only ones I've seen that work fairly well for searching it, yet Google did not respond to my proposal to make it easier to absorb this data.
Google is the last word in web-wide searching, but it may also be the last gasp. I wish them luck, though, because they've certainly made my life more interesting.
Edited by: NFFC
I started a discussion of this the other day in a different thread, "Search Engine Spider Identification":
[webmasterworld.com...]
There are two elements to this: a) detection and identification of spiders, and b) what to do next.
For spiders I don't like, I just give them a "Server too busy" message and exit. For spiders I like but that are coming too fast (that's Google), I do the same thing, but revert to honoring the requests once the load drops below my threshold. Maybe I should just insert a delay of a few seconds instead, but since Google is only getting through half my data anyway, who cares?
I think it's important to use "C" rather than another language. Speed is everything. I know I've opened up a can of worms with that opinion!
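To make the idea concrete, here's a stripped-down sketch of the kind of load thermostat I mean -- not my actual production code. It assumes a Linux box (it reads /proc/loadavg), and the 4.0 load cutoff and the "Googlebot" user-agent check are just placeholders.

```c
/* Stripped-down "load thermostat" for a CGI program -- a sketch of
 * the idea only.  Assumes Linux (/proc/loadavg); the LOAD_LIMIT and
 * the "Googlebot" substring check are placeholder values. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LOAD_LIMIT 4.0   /* 1-minute load average cutoff (placeholder) */

static double current_load(void)
{
    double load = 0.0;
    FILE *f = fopen("/proc/loadavg", "r");
    if (f) {
        if (fscanf(f, "%lf", &load) != 1)
            load = 0.0;
        fclose(f);
    }
    return load;
}

int main(void)
{
    const char *agent = getenv("HTTP_USER_AGENT");
    int is_google = agent && strstr(agent, "Googlebot") != NULL;

    /* A spider I like (Google), but only while the box can take it. */
    if (is_google && current_load() > LOAD_LIMIT) {
        printf("Status: 503 Service Unavailable\r\n");
        printf("Content-Type: text/plain\r\n\r\n");
        printf("Server too busy -- please come back later.\n");
        return 0;
    }

    /* ... normal CGI processing (database search, page output) ... */
    printf("Content-Type: text/html\r\n\r\n");
    printf("<html><body>Search results would go here.</body></html>\n");
    return 0;
}
```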
Anyway, let's continue this on that other thread....
Are you saying that the two sites are *exactly* the same? That would be a first in my experience. Even a mirror is a mirror, if you get my point.
The sites are exactly the same once you get past the domain name and the /cgi-bin/XXXX.cgi.
The cgi programs are named differently for each domain, but the QUERY_STRING after the question mark is exactly the same for all 115,000 dynamic pages. The links on any page all use the same domain name as the page itself, so someone on one domain doesn't get thrown into the other domain. I have separate logging for each domain, which makes it easy to see the difference in crawling rates.
Google thinks it's two separate sites, even though the QUERY_STRING and the anchor text -- essentially a repeat of the QUERY_STRING -- are identical on both sites. In fact, they are on the same server and access the same directory for the data.
I have different static pages on one site, as it's a "straight" version for uptight librarians (don't trust anyone under 50, because they didn't experience the 1960s), whereas the big site is in-your-face political. That's the one that's been around for five years. Before that it was on telnet for a year.
I'm talking about how far Google gets into the 115,000 dynamic pages, so the only difference Google sees is my two different domain names, both of which are dot-orgs.
Therefore, it's a safe bet that what's driving the radically different behavior by Google is the different PageRanks of each domain name.
Here's something else that's interesting. This thing is a proper-name index of individuals, corporations, and groups. For individuals, the entries are always in the form surname, first name. There are two reasons it works well with Google's algo and doesn't do well with other bots:
a) Google tracks word position on each page. If you look for John F. Smith without quotes, the words in my anchor -- SMITH JOHN F -- still hit in Google. Moreover, they hit with a very high score, because the three words (yes, Google respects my middle initial) are very close together.
b) My search terms are always in the anchor text, which gives me even more points.
If you look for "John F. Smith" in quotation marks, you won't find it. (This is a bad example; Smith is too common and I won't come out close to the top. But you can see where it will work nicely on less common names.)
Currently I'm getting about 3,000 name searches per day, about half of which are referrals from Google. It gets even messier, because sometimes the "John" portion of the name is farther down the page and separated from the surname you asked for. But the ranking points from word proximity that Google uses prevent this from being too much of a problem.
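I have no way of knowing exactly how Google scores proximity, but the effect behaves roughly like the sketch below: take the word positions of the matched query terms on the page and reward a tight window containing all of them. The function, the numbers, and the scoring formula are purely my own guess, for illustration.

```c
/* My guess at how a proximity bonus might behave -- not Google's
 * actual scoring.  Given one matched position per query term, score
 * the page by how tight the span covering all terms is.  "SMITH JOHN F"
 * in one anchor gives a span of 3; a "John" fifty words away from the
 * surname scores far worse. */
#include <stdio.h>

/* positions[i] = word position of the i-th query term's match */
static double proximity_score(const int *positions, int nterms)
{
    int lo = positions[0], hi = positions[0];
    for (int i = 1; i < nterms; i++) {
        if (positions[i] < lo) lo = positions[i];
        if (positions[i] > hi) hi = positions[i];
    }
    int window = hi - lo + 1;          /* smallest span covering all terms */
    return (double)nterms / window;    /* 1.0 = terms are adjacent */
}

int main(void)
{
    int anchor_hit[]    = { 17, 18, 19 };   /* SMITH JOHN F, all adjacent */
    int scattered_hit[] = { 17, 60, 61 };   /* surname far from first name */

    printf("adjacent:  %.3f\n", proximity_score(anchor_hit, 3));
    printf("scattered: %.3f\n", proximity_score(scattered_hit, 3));
    return 0;
}
```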
Now you can see why Google is the only bot I can let into my cgi-bin. Any other algorithms make mincemeat out of my 18 years of indexing.
Google probably starts by figuring out which terms are searched the most, spidering the highest-ranked sites under those terms, and working its way down.
It's true that Google starts at the top of the PageRank scale and moves downwards as it spiders.
Has anyone noticed a pattern to how it crawls new sites? I mean sites submitted, not just found. Earlier this spring we moved a hundred sites to new domains and it was interesting to watch Googlebot go at it. Just wondered if anyone else had noticed anything peculiar in that regard.