You read posts here from people with PR3 or PR4 homepages who mention that it's mostly just the index/default pages that get hit, but it often seems that all of their pages are eventually found and indexed... just not visited by Googlebot very often. If someone has a PR3 homepage and a lot of PR1 or PR2 inner pages, then maybe they have some pages that never get indexed... I don't know; perhaps people with personal experience of that will post.
It's very important to have navigation that makes it easy for the bots to find all the pages. Site maps are good for that. If you have a page five clicks away from the index page, and it only has one incoming link, it might take a long time to get found.
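Not from the original post, but a minimal sketch of what "clicks away from the index" means in practice: a breadth-first search over a made-up link graph (the page names and structure here are purely hypothetical) that reports each page's click depth from the index page.

```python
# Minimal sketch (hypothetical site structure): compute each page's click
# depth from the index page with a breadth-first search. Pages that sit
# many clicks deep with few incoming links are the ones a crawler finds last.
from collections import deque

# Each page maps to the pages it links to.
links = {
    "index": ["about", "products"],
    "about": ["contact"],
    "products": ["widget-a"],
    "widget-a": ["widget-a-specs"],
    "widget-a-specs": ["widget-a-history"],
    "widget-a-history": [],
    "contact": [],
}

def click_depths(start="index"):
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first discovery = shortest click path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

for page, depth in sorted(click_depths().items(), key=lambda kv: kv[1]):
    print(f"{depth} clicks: {page}")
```

A site-map page linked from the index effectively pulls every page it lists up to a depth of two clicks, which is the point of the advice above.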
I haven't heard of an upper limit on the number of regular HTML pages that Google will index from any one site.
At a guess, if a site had most links coming into inner pages rather than the index, then the inner pages would have higher PR and get crawled more often than the index.
If you were to split one site in two, you would also be splitting the incoming links, thereby lowering the PR of both sites, so this wouldn't accomplish much, the way I see it.
This all presupposes that crawling is directly related to PageRank, and I don't know if that's entirely true.
To be honest, I have a bit of a flu and my brain is rather fuzzy, so my logic could be flawed.
The limits we are talking about are very large, though: 3,000 for a PR4 site, 50K for a PR6 site, and 70-100K for a PR7 site are the rough numbers I would guess at. I am sure that there will be a more sophisticated system than this behind the scenes, though.
There are also possibly "depth of crawl" limits. For example, it sometimes appears that Googlebot will only index two directory levels down for low PR sites, but this could just be coincidence.
I've read two conflicting ideas: one that says Google sees pages, not sites, and another that says there is a maximum number of pages per site Google will crawl, and that it is based on that site's PageRank. These are obviously contradictory. Which one is right, and is there a total number of pages Google will index?
My experience is that there is indeed a relation between PR and the number of pages that will be crawled. However, this doesn't mean that Google is seeing sites/domains.
In my experience, you will have problems getting pages crawled if the toolbar PR is lower than PR1. The behaviour is therefore page-based (because PR depends on the linking structure), and it doesn't matter whether these pages are on a single site/domain or not.
The number of pages that Google will crawl depends on the incoming PR as well as on your linking structure. If all your incoming links go to the index page (PR x) and you have no outgoing links to third-party pages/sites, the number of pages that get crawled is roughly proportional to 30*x for the worst linking strategy, and roughly proportional to 20^x for a perfect linking structure. (And it doesn't matter whether these pages are on a single site/domain or not, as long as there are no additional incoming links.)
If the links come into different (inner) pages instead, the result is in principle the same: you just add up the PR of all the incoming links. However, deep linking makes it easier to get a flat PR distribution and can therefore increase the number of pages that are crawled.
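To make those two estimates concrete, here is a small Python sketch. The 30*x and 20^x figures are the poster's guessed proportionalities, not anything Google has published; the code simply evaluates them side by side for a range of incoming PR values.

```python
# Rough illustration of the poster's guessed crawl estimates: pages crawled
# scale like 30*x for the worst internal linking and like 20**x for a
# perfect (flat) linking structure, where x is the PR flowing into the site.
def crawl_estimate(pr, linking="worst"):
    if linking == "worst":
        return 30 * pr    # all PR funnelled through a single entry page
    return 20 ** pr       # PR spread flat across the whole site

for pr in range(1, 8):
    print(f"PR{pr}: worst ~{crawl_estimate(pr):>5}, "
          f"perfect ~{crawl_estimate(pr, 'perfect'):>12}")
```

The gap between the linear and the exponential estimate is the whole point: with deep links flattening the PR distribution, the same incoming PR can support orders of magnitude more crawled pages.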
Yes... linking structure is the key. I have a new PR4 site that got 30,000+ pages indexed within three weeks, just because of a curious bot and several strategically placed deep links.
Googlebot seems to get bored if she always has to start at the same place and follow the same line for the milk and cookies every time. She's an explorer, hungry for new stuff, but you have to help her find the way :)
This may mean that Google assigns different datacentres to the regions they feed, or that the separate datacentres have unequal capacities, or that they use different algorithms. Either way, the data held by the centres varies (or is reported as varying).
I have also noticed that Googlebot will crawl a lot more pages than it ever adds. I watched a new PR4 site from launch: it had 22K pages crawled in the first month, but just 3K added. This wasn't due to duplicate content, either.
I'd be curious about this also. We launched a new site right as Florida was unfolding, and all the URLs are different, as we moved from .asp to .htm extensions. I've watched three crawls that got the pages in www3, then www2, and then they disappear; it's happened three times (our home page is PR6 and gets spidered daily). As some pages get PageRank, they seem to stick, so I *think* that's got something to do with it, but even a couple of pages with zero rank are sticking... it's a mystery to me.
Edit: BTW, we've done 301 redirects from all the old pages to the appropriate new pages (a quick way to check them is sketched below).
Brian
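Since the 301 redirects above are exactly the kind of thing worth double-checking, here is a minimal Python sketch for verifying that an old URL actually answers with a 301 and points at the right new page. The domain and URL list are placeholders, not Brian's actual site.

```python
# Minimal sketch: confirm that old .asp URLs return a 301 to the new .htm
# pages. Standard library only; example.com and the URLs are hypothetical.
import urllib.error
import urllib.request

OLD_URLS = [
    "http://www.example.com/products.asp",
    "http://www.example.com/about.asp",
]

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None stops urllib from following the redirect, so the
    # 301 surfaces as an HTTPError we can inspect directly.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)

for url in OLD_URLS:
    try:
        opener.open(url)
        print(url, "-> no redirect (served directly)")
    except urllib.error.HTTPError as e:
        # For a proper 301, e.code is 301 and Location names the new page.
        print(url, "->", e.code, e.headers.get("Location"))
```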