|No incoming links = reduced crawl rate?|
I just noticed that a group of about 17 pages on my site have no incoming links whatsoever, according to various metrics reporting services, and so I went back and checked for referrals in my server logs. Nada, not even Googlebot, not for the past three weeks.
Just wondering what others are seeing. If you watch Googlebot's crawl behavior in your logs, do the pages of your site that have no incoming links get much Googlebot attention?
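For anyone who wants to run the same check on their own logs, here is a minimal sketch. It assumes the server writes Apache-style Combined Log Format and that matching "Googlebot" in the user-agent string is good enough for a first pass (the string can be spoofed, so a reverse DNS check would be needed to be sure). The function names and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Combined Log Format (assumed):
# ip - - [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines):
    """Count requests per path whose user-agent string mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "googlebot" in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits

def uncrawled(paths, hits):
    """Return the paths from `paths` that never show a Googlebot request."""
    return [p for p in paths if hits[p] == 0]

# Two invented log lines: one Googlebot request, one ordinary browser request.
sample = [
    '66.249.66.1 - - [10/Dec/2012:05:31:00 +0000] "GET /a.html HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [10/Dec/2012:05:32:00 +0000] "GET /b.html HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows NT 6.1)"',
]
hits = googlebot_hits(sample)
print(uncrawled(["/a.html", "/b.html"], hits))  # prints ['/b.html']
```

In practice you would pass an open log file instead of `sample` and the list of your unlinked pages instead of the two test paths.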
Do you mean they are at least in the sitemap but not linked to? Otherwise, how would any bot even know they exist? I mean, Gbot pokes around with query strings and site searches and such, but it may take them a while to find the pages on their own.
As far as crawl behavior, I often see Google come pick up all pages in simple alphabetical order, thousands of them - sounds like something they've been collecting for a while from sitemaps, IBLs etc., then sorted in the simplest way possible by URL before they came banging on the site.
That is what I call strange behavior, because it sounds like at some point in the Google system the URL and the date it was first collected are separated from each other, and that just opens up the possibility of errors in identifying the original source, hence scrapers outranking the original source. The source would presumably have an older timestamp, but if you don't keep your timestamps with your URLs and then lose the reference between the two, you've lost a very important piece of info about the URL. The whole sorting-them-by-alphabet thing sounds so very basic and not Google-like. Could they not afford to keep a 10-char epoch timestamp together with the URL?
Sorry, Sarge, I might have gone off on a tangent here about Gbot's "behavior" in general. I can't think of a situation where a page gets completely orphaned and doesn't even show up in the site search and the like. I would guess they've probably already collected those pages and you've missed that initial visit. No-IBL pages must not be important enough to return to often, so that's why you are not seeing more hits after the initial ones.
*One* of the factors in the crawl budget of a site is PageRank, so yeah, more incoming external links to a page means higher PageRank means more crawling.
My experience has been that your site has scheduled deep crawls where Googlebot arrives and follows the various internal links on your site in order to discover new and/or updated content (as well as possible links to other sites). They may or may not crawl all of your pages, depending on the number you have. As jimbeetle indicated, the more PageRank a site has, it seems the deeper they are willing to crawl and the more pages the engine is typically willing to index from your site. But I'm guessing that this phenomenon is more correlation than causation... that the phenomenon has more to do with the following paragraph than simply the overall PageRank number.
On top of your site's scheduled deep crawls, all of the sites that link to your site also have their own scheduled deep crawls. And when Googlebot visits each of those external linking sites for their scheduled crawl, follows their links, and stumbles onto a link from their site to your site... they will request your URL on your site to see if that link is still valid (200), is being redirected (3xx), or no longer exists (4xx). Googlebot often doesn't stop there.
While they are checking that externally linked URL on your site to see if the external inbound link is still good, they will often perform an incremental or partial crawl of your site, crawling other pages on your site in close proximity to the linked page. I'm guessing the number of pages crawled in close proximity to the linked page is based on the amount of PageRank/link juice being passed in via that external link... but who really knows. But this is why having deep links into your site is so important. They help get the deep content around that deep linked page crawled and indexed as well.
The previous paragraph is why, when a site gets redesigned and tons of 301 redirects get implemented, the site typically loses traffic/rankings for some period, often several weeks, before rankings/traffic return. You have to wait for each one of the sites/pages linking to your site to be recrawled one-by-one, the 301 redirect for that specific link to be discovered, and credit for that specific link to be transferred to your new URL. During this transition the old URL which was ranking has fewer and fewer links while the new URL's links grow until all links have been recrawled. PageRank, being a recursive algorithm, also likely takes several crawls of all inbound links before it can be properly calculated (approaches some asymptotic value).
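To illustrate what "recursive" and "approaches some asymptotic value" mean here, below is a rough sketch of the textbook PageRank power iteration. It is not a claim about how Google computes anything today; the damping factor of 0.85, the tolerance, and the tiny four-page link graph are all assumptions picked for the example.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=500):
    """Iterate PageRank on a dict {page: [pages it links to]} until scores settle.

    Returns the final scores and the number of iterations that were run.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for iteration in range(1, max_iter + 1):
        # Every page starts each round with the same small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its score over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank, iteration

# Hypothetical site: "orphan" links out but nothing links to it.
links = {"home": ["a", "b"], "a": ["home"], "b": ["home"], "orphan": ["home"]}
rank, iterations = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 4))
print("iterations:", iterations)
```

The scores only settle after many passes over the whole graph, and the page with no inbound links ends up stuck at the baseline score, which fits the idea that link credit takes several recrawls to work its way through.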
So yes... Googlebot and other crawlers are definitely going to crawl pages with external links more often than pages with only internal links. Those with external links have a chance of getting crawled on your scheduled crawls as well as each time a linking page on another site is crawled. Those with only internal links only have the chance of being crawled as part of your site's scheduled crawls.
A site I managed for 3 years had 5MM backlinks, and Googlebot and the other major crawlers were on that site literally 24x7 crawling. The site had fewer than 5,000 pages and they rarely updated the pages or added new content. So there was no real reason it warranted that much activity.... definitely not the freshness of the content. Instead, it was all of the incremental crawls due to scheduled crawls of the other 5MM external linking sites/URLs that kept the crawler on the site.
From what I've learned, googlebot does not usually crawl a site by following links the way a human user would. Instead, the crawl team LEARNS about other URLs by indexing pages. Then they put those URLs in a crawl list which is prioritized by a complex algorithm. The most common crawl is one where googlebot is "given its orders" from the beginning - a list of URLs to crawl on the site.
|But I'm guessing that this phenomenon is more correlation than causation |
Matt Cutts has confirmed something like that several times. Actually, it's a little bit stronger than simple correlation. Even though there are many factors affecting googlebot's crawl, the biggest determining factor is PageRank. It really is causation that's at work, but not exclusively.
Remember that every URL has its own PageRank score, so PR is not something that a "site" has. This means that the interlinking of your pages has a lot to do with how often any particular page will be crawled. That's because the pages' interlinking will determine how PR is circulated around all the pages of the site. Other factors can be things like the update history of a page's content. If a site tends to publish static pages and not change them, then there's less reason to crawl them frequently.
But a site as a whole does have a crawl budget, at least that's my understanding of the current conventional wisdom. I do remember there was talk about it and also about "using it wisely", as in not creating too many bad URLs that Gbot crawls only to realize that those are not the URLs to index.
|Remember that every URL has its own PageRank score, so PR is not something that a "site" has. |
How the crawl budget is assigned to a site is a bit of a mystery to me, but I do make an assumption that this is a measure of the site's standing with Google overall. Call it homepage PR or something else, but there must have been a reason they put "Crawl Stats" into the "Health" section of WMT.
Complex alphabet sorting algorithm ;) LOL.
|Then they put those URLs in a crawl list which is prioritized by a complex algorithm. |
BTW, how would individual ranks of pages explain alphabetized carpet-crawling? It would seem that the alphabetical order of URLs would have no correlation to the distribution of rank (in any way, shape or form that it's doled out) between the pages. So do they have different prioritized and "OMG-we-dont-know-what-to-do-with-these-urls-but-want-to-index-them-anyway" crawling lists?
|how would individual ranks of pages explain alphabetized carpet-crawling? |
LOL - it wouldn't, and those alphabetized crawls have only ever confused me.
Maybe they start out as a prioritized cherry pick list of the URLs they want to crawl for some reason and it gets alphabetized before they hand googlebot marching orders? That's all I could ever come up with, and it's a stone cold guess. The reason that's my guess is that I've never seen an alphabetized crawl that hits EVERY URL that Google has indexed.
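Just to make that guess concrete, here is a toy sketch of what "cherry pick by priority, then alphabetize before handing over" could look like. It is pure speculation in code form: the priority scores, the batch size, and the function name are all invented, and nothing here is known to match what Google does.

```python
import heapq

def marching_orders(priorities, batch_size):
    """Pick the batch_size highest-priority URLs, then hand them over sorted by URL."""
    chosen = heapq.nlargest(batch_size, priorities, key=priorities.get)
    return sorted(chosen)

# Invented priority scores for four URLs on a hypothetical site.
priorities = {"/zebra": 0.9, "/apple": 0.2, "/mango": 0.7, "/banana": 0.8}
print(marching_orders(priorities, 3))  # prints ['/banana', '/mango', '/zebra']
```

The crawl list comes out alphabetized, yet the low-priority "/apple" never appears in it, which would fit an alphabetized crawl that does not hit every indexed URL.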
|From what I've learned, googlebot does not usually crawl a site by following links the way a human user would. Instead, the crawl team LEARNS about other URLs by indexing pages. Then they put those URLs in a crawl list which is prioritized by a complex algorithm. The most common crawl is one where googlebot is "given its orders" from the beginning - a list of URLs to crawl on the site. |
When you launch large sections of new content, Googlebot most certainly follows links within that section before the pages are indexed. I've done some experiments where I have launched chains of pages where each page links to the next. Googlebot has marched down this chain, one page after another for each of the 1000 pages I had in the chain on my site. All well before the pages were indexed. I call this behavior "Fresh Googlebot".
After this initial crawl, where Google follows all the links it can find and greedily crawls the new pages, it reverts to crawling based on PageRank. Pages with higher PageRank are re-crawled more frequently. I call this behavior "PageRank Googlebot".
I have never personally seen GoogleBot crawl in alphabetical order, but I have seen it crawl in url-length order. For me this happens for a batch of old urls that have no current inbound links but which existed on the site at one point in time. (In my case they all 301 redirect to new urls now.) Googlebot will often crawl 1000 of these pages in a sitting, one right after the other, in url-length order starting with the shortest urls. I call this behavior "Stale Googlebot".
[edited by: tedster at 5:31 am (utc) on Dec 11, 2012]
Your description of "fresh googlebot" makes sense to me. I never tested it so thoroughly, so thanks for that report very much.
In fact, the idea that googlebot has a large vocabulary of crawl behaviors instead of just one makes a whole lot more sense.
@deadsea: good catch! I forgot to mention that one! Also, a combined alphabetical AND URL length order at the same time - downright creepy. It happens on forum sites all the time and is easier to catch when it gets stuck on slight variations of some very popular topic that comes up over and over. I convert URLs into readable form, same as the title but shortened to 60 symbols or less. Perhaps I get caught up in these alpha/length ranges because my URLs are less diverse in size than normal 'cause I cut down all long ones to exactly 60 chars.
|I have never personally seen GoogleBot crawl in alphabetical order, but I have seen it crawl in url-length order. |
Anyway, didn't want to hijack the thread, just to say that both types of crawl are possible and can combine with one another at the same time. And neither seems to have anything to do with rankings, real or PR.
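If anyone wants to spot these runs in their own logs, a small sketch like the following could label a batch of consecutive Googlebot requests. It assumes you have already pulled the requested URLs out of the log in request order; the function name and the sample URLs are made up.

```python
def crawl_order(urls):
    """Label a sequence of crawled URLs by the ordering it follows, if any."""
    if len(urls) < 2:
        return "too short to tell"
    pairs = list(zip(urls, urls[1:]))
    # Each check compares every URL with the one requested right after it.
    alphabetical = all(a <= b for a, b in pairs)
    by_length = all(len(a) <= len(b) for a, b in pairs)
    if alphabetical and by_length:
        return "alphabetical and length"
    if alphabetical:
        return "alphabetical"
    if by_length:
        return "length"
    return "no obvious order"

print(crawl_order(["/abc", "/b", "/c"]))       # prints alphabetical
print(crawl_order(["/zz", "/abc", "/mnop"]))   # prints length
print(crawl_order(["/a", "/b/c", "/b/cd"]))    # prints alphabetical and length
```

A short batch can look sorted by pure chance, so this is only worth trusting on long runs like the 1000-page sittings mentioned earlier.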
I didn't mean to imply in any way that crawlers crawl sites exactly the way humans do. I may have oversimplified my explanation, but didn't feel the need to explain how URLs indexed from a site might be queued up for crawling, get crawled, and get indexed.
And I too have heard Cutts say on numerous occasions that PR has a lot to do with not only how often sites/URLs are crawled, but also how many and which pages from a site are indexed. And yes, internal linking structures (navigation links) DEFINITELY play a part in which pages are crawled and indexed, as they are strong signals as to which pages the webmaster deems most important, second in importance, etc., and they control the flow of most PR around the site.
But I think everyone will agree that there is some scheduled frequency at which a site's URLs are crawled. Over time a site's URLs likely get divided into sets with different crawl frequencies, but when a site is new with no inbound links, the PR of all of the URLs on the site is infinitesimally small. So almost all of its URLs fall in the same "set".
This seems to be blatantly obvious for new sites or those with no or few links. For new sites this frequency might start out at something like once per month. Huge spikes in crawling activity are seen in WMT and server logs periodically (like once every 3-5 weeks), though a few pages (maybe the home page and a few of the 1st-level pages) might get crawled in between. I think these spikes are the scheduled "carpet-crawling" events to which 1script was referring and the scheduled deep crawl events that I referred to, where seemingly most of the URLs known for that site get crawled in a short burst.
This is likely because they were all queued up at the same time based on the crawling algo's schedule for that domain or "set" of pages for that domain. I think these queued up URLs simply become "seed" URLs for the crawling process. But the crawler likely has some liberty to "on the fly" not just crawl those specific seed URLs from the queue, but also to follow links on those seed URLs to crawl other URLs not yet scheduled to be crawled... perhaps under certain circumstances even to crawl, for example, any URL within X hops of that seed URL.
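To show what I mean by "any URL within X hops of that seed URL", here is a small sketch of a breadth-first walk from a seed list with a hop limit. It runs over an in-memory link map rather than fetching anything, and the link map, the seed, and the function name are all hypothetical. Whether Googlebot does anything like this is exactly the part I'm guessing at.

```python
from collections import deque

def crawl_from_seeds(seeds, links, max_hops):
    """Breadth-first walk: visit each seed, then pages within max_hops links of one.

    Returns {url: hops from the nearest seed} for every URL visited.
    """
    visited = {}
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, hops = queue.popleft()
        if url in visited:
            continue
        visited[url] = hops
        # Only follow links while we are still inside the hop limit.
        if hops < max_hops:
            for target in links.get(url, []):
                if target not in visited:
                    queue.append((target, hops + 1))
    return visited

# Hypothetical internal link map for a small site.
links = {
    "/": ["/cat"],
    "/cat": ["/cat/p1", "/cat/p2"],
    "/cat/p1": ["/cat/p1/deep"],
}
print(crawl_from_seeds(["/cat"], links, max_hops=1))
# prints {'/cat': 0, '/cat/p1': 1, '/cat/p2': 1}
```

With `max_hops=0` this degenerates into a plain fetch of the seed list, so the hop limit is the only thing separating the two behaviors in this sketch.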
As you mentioned, freshness definitely plays a factor in crawl frequency. If Googlebot returns for a deep crawl in a month and no new pages or no updated content are discovered... see ya next month (for the most part)! But if hundreds of new pages are discovered, they'll likely return sooner the next time... maybe in two weeks. These adjustments in scheduled crawl frequency continue until the crawling algorithm finds a balance between the rate at which your site generates content and the rate at which they deep crawl your site.
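The back-and-forth adjustment could be pictured with something like the sketch below. The halving and the 1.5x back-off are numbers I made up to roughly match the "a month, then maybe two weeks" example; the real scheduling rules are unknown to me.

```python
def next_crawl_interval(current_days, new_or_changed_pages, min_days=1, max_days=30):
    """Halve the wait after a crawl that found changes; otherwise back off slowly."""
    if new_or_changed_pages > 0:
        proposed = current_days / 2
    else:
        proposed = current_days * 1.5
    # Keep the interval between the assumed lower and upper bounds.
    return max(min_days, min(max_days, proposed))

# Hypothetical history: two crawls that found new pages, then two that found none.
interval = 30
history = []
for changes_found in [200, 150, 0, 0]:
    interval = next_crawl_interval(interval, changes_found)
    history.append(interval)
print(history)  # prints [15.0, 7.5, 11.25, 16.875]
```

A site that never changes stays pinned at the 30-day ceiling here, which is the "see ya next month" case.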
However, I think external links also play a big role in crawl frequency. From what I've seen, they trigger incremental, partial crawls of small sections of your site between those URLs' scheduled crawls. Not only do the known URLs on your site get queued up to be crawled periodically based on their scheduled crawls, but those URLs on other sites that link to your site also get queued up for their scheduled crawling in between your URLs' scheduled crawl events. And when crawling those external URLs that link to your URL, I do believe the crawler often takes the liberty to crawl not only the page on your site being linked to by that external "seed" URL, but one or more pages on your site in close proximity to your URL being linked to.
Honestly, why would they call it a crawler if it were not allowed to "explore" in an attempt to discover new content? It seems they could simply have labeled it a "fetcher" that reads a fixed set of URLs from a queue and fetches the documents at those addresses. I am pretty sure that the prioritized list that gets queued up is simply a seed list to make sure that the various URLs "at least" get crawled every so often, but links from other sites can trigger them to be crawled more frequently.
Perhaps I'm wrong, but that has been my experience. Think I'll do some testing with some brand new domains and track Googlebot activity over the first 6 months as they go from no links to having links to see if anything can be learned.