
is there a maximum number of pages google will crawl on a site?

and does it have anything to do with pagerank

         

stargeek

7:46 am on Dec 12, 2003 (gmt 0)

10+ Year Member



I've read two conflicting ideas: one that says Google sees pages, not sites, and another that says there is a maximum number of pages per site Google will crawl, and that it is based on that site's PageRank. These are obviously contradictory. Which one is right, and is there a total number of pages Google will index?

Stefan

6:21 pm on Dec 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The association between homepage PageRank, crawling, and depth into the site is one of frequency rather than number of pages, as far as I know.

You read posts here by people with PR3 or 4 homepages who mention that it's mostly just the index/default pages that get hit, but it often seems that all of their pages are eventually found and indexed... just not visited by googlebot very often. If someone has a PR3 homepage and a lot of PR1 or 2 inner pages, then maybe they have some pages that never get indexed... I don't know, perhaps people with personal experience on that will post.

It's very important to have navigation that makes it easy for the bots to find all the pages. Site maps are good for that. If you have a page 5 clicks away from the index, and it only has one incoming link, it might take a long time to get found.
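
The click-depth point above can be sketched with a small breadth-first search over a hypothetical link graph (all the page names here are made up for illustration): a sitemap page linked from the index gives every page a short path, while a chain of single links buries pages deep.

```python
from collections import deque

# Hypothetical site link graph: each page maps to the pages it links to.
links = {
    "index": ["about", "products", "sitemap"],
    "products": ["widget-a"],
    "widget-a": ["widget-a-specs"],
    "widget-a-specs": ["widget-a-manual"],
    "sitemap": [],
    "about": [],
    "widget-a-manual": [],
}

def click_depth(links, start="index"):
    """Breadth-first search: minimum number of clicks from the index to each page."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

print(click_depth(links))
```

Here "widget-a-manual" ends up 4 clicks from the index with only one path leading to it; listing it on the sitemap page would cut that to 2 clicks and give the bot a second route in.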

I haven't heard of an upper limit on the number of regular html pages that google will index from any one site.

<edit>typo

AjiNIMC

7:28 pm on Dec 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Recently I added 45 pages to my site; within a week all got indexed. My homepage is PR6 and the first level is PR5. I added the pages at the second level.

Initially, when I used to have a PR2 homepage, it was tough to get indexed; it used to take some time.

That's what my experience says.

Aji

stargeek

8:56 pm on Dec 12, 2003 (gmt 0)

10+ Year Member



so then according to that, stefan, google does in fact see "sites".
example: if i have a corpus of pages and i split them between 2 domains of the same pr, google will in fact double the speed with which it indexes the whole body of content.

johnser

9:00 pm on Dec 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



x.com went live 2 weeks ago.
It has approx 200 new links from one PR6 site

Over 8,000 pages on x.com were crawled this week for the first time.
HTH
J

Stefan

9:14 pm on Dec 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



stargeek, I think it can still be seen in terms of pages rather than sites. Most people will probably have most of the incoming links going to the index page, so it gets the highest PR of all the pages on the site. Higher PR pages are crawled more often. Inner pages are getting most of their PR from the index so they have lower PR and are crawled less often.

At a guess, if a site had most links coming into inner pages rather than the index, then the inner pages would have higher PR and get crawled more often than the index.

If you were to split one site, you would also be splitting the incoming links and thereby lowering the PR of both sites so this wouldn't accomplish much, the way I see it.

This all presupposes that crawling is directly related to Page Rank and I don't know if that's entirely true.

To be honest, I have a bit of a flu and my brain is rather fuzzy, so my logic could be flawed.
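
Stefan's reasoning above can be checked with a toy PageRank power iteration (damping factor 0.85, as in the original Brin/Page formulation; the three-page site here is a made-up example): when the external PR all arrives at the index and the inner pages link only back to it, the index settles at a higher score than the inner pages.

```python
# Toy site: index links to two inner pages, which link back to the index.
links = {
    "index": ["inner1", "inner2"],
    "inner1": ["index"],
    "inner2": ["index"],
}
pages = list(links)
pr = {p: 1.0 / len(pages) for p in pages}  # start with a uniform distribution

for _ in range(50):  # power iteration until the values settle
    new = {}
    for p in pages:
        # Each linking page passes on its PR divided among its outgoing links.
        inbound = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
        new[p] = (1 - 0.85) / len(pages) + 0.85 * inbound
    pr = new

print(pr)
```

The index converges to roughly twice the PR of each inner page, which matches the observation that index pages tend to be crawled most often when they collect most of the incoming links.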

stargeek

9:31 pm on Dec 12, 2003 (gmt 0)

10+ Year Member



johnser: your wording at least implies sites, not pages. surely not all of the pages these 200 links are on are pr6 if the "site" (i assume this means the index page) is a pr6. also, are these deep links?
Stefan: that sounds reasonable, and I'd like to hear if anyone else has any anecdotal evidence as to whether depth of crawl is related to pr.

johnser

1:32 am on Dec 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



stargeek: home is pr7, all other 199 internal pages = pr6
J

too much information

2:02 am on Dec 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've got a site with about 125 pages; it's been live for a month. The only indexed page is the homepage, which has a PR2, and everything else on the site is PR1.

It's taking a while to get a deep crawl, but PR seems pretty easy to obtain.

Bobby_Davro

2:35 am on Dec 13, 2003 (gmt 0)

10+ Year Member



I am fairly certain that there are some limits to the number of pages in relation to PageRank, based on my own experience with large sites that don't have all their pages indexed. I am sure that someone here will be able to do a proper study of this, though.

The limits that we are talking about are very large, though: 3,000 for a PR4 site, 50K for a PR6 site, and 70-100K for a PR7 site are the rough numbers I would guess at. I am sure there is a more sophisticated system than this behind the scenes, though.

There are also possibly "depth of crawl" limits. For example, it sometimes appears that Googlebot will only index two directory levels down for low PR sites, but this could just be coincidence.

Stefan

3:04 am on Dec 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Bobby_Davro, that's astounding. Our home page is PR6, and we have a lot of PR5 pages, so we might have a 50k limit. Man, we're just over 160 now... got some room left.

Any other observations along those lines from anyone? Is there a roughly definable max depending on homepage PR?

espeed

3:37 am on Dec 13, 2003 (gmt 0)

10+ Year Member



One of my sites is a strong PR6, and Google has indexed 109,000 pages from it -- googlebot visits 24x7.

stargeek

3:44 am on Dec 13, 2003 (gmt 0)

10+ Year Member



I have a strong PR6 (normally a 7, and will be soon again) with 10k pages on www2. I was wondering what the upper limit for that type of site would be, and I guess I got my rough answer (way more than I need; only about 40-50k). I do get crawled pretty much 24/7, anywhere from 1k to 10k crawler hits a day, although this includes the mediapartners bot.

too much information

4:51 am on Dec 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



from 1k crawler hits a day to 10k

No wonder Googlebot hasn't indexed my entire site yet! ;o)

doc_z

10:04 am on Dec 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've read two conflicting ideas: one that says Google sees pages, not sites, and another that says there is a maximum number of pages per site Google will crawl, and that it is based on that site's PageRank. These are obviously contradictory. Which one is right, and is there a total number of pages Google will index?

My experience is that there is indeed a relation between PR and the number of pages that will be crawled. However, this doesn't mean that Google is seeing sites/domains.

In my experience, you will have problems getting pages crawled if the toolbar PR is lower than PR1. Therefore, the behaviour is page based (because PR depends on the linking structure), and it doesn't matter if these pages are on a single site/domain or not.

The number of pages that Google will crawl depends on the incoming PR as well as your linking structure. If all your incoming links go to the index page (PRx) and you have no outgoing links to third-party pages/sites, the number of pages that get crawled is approximately (proportional to) 30*x for the worst linking strategy, while it is roughly (proportional to) 20^x in the case of a perfect linking structure. (And it doesn't matter if these pages are on a single site/domain or not, as long as there are no additional incoming links.)

If the links come into different (inner) pages, the result is in principle the same; you just have to add up the PR of all incoming links. However, deep linking makes it easier to get a flat PR distribution and can therefore increase the number of pages that are crawled.
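
doc_z's rule of thumb above can be written as a small function (these are the poster's rough, hand-waved estimates, not anything Google has confirmed): a PRx index page supports somewhere between ~30*x crawled pages under the worst linking structure and ~20^x under an ideal one.

```python
# doc_z's rough estimates: crawlable-page range for a site whose index page
# has toolbar PR x, depending on internal linking structure.
def crawlable_pages(x):
    worst = 30 * x    # worst-case linking: roughly linear in PR
    best = 20 ** x    # perfect linking: roughly exponential in PR
    return worst, best

for pr in range(1, 8):
    worst, best = crawlable_pages(pr)
    print(f"PR{pr}: ~{worst} to ~{best:,} pages")
```

The exponential best case is what makes linking structure matter so much in this model: going from PR4 to PR6 multiplies the ideal-case ceiling by 400, which is broadly consistent with the PR6 sites in this thread reporting 50K-109K indexed pages.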

percentages

10:16 am on Dec 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>The number of pages that Google will crawl depends on the incoming PR as well as your linking structure.

Yes... linking structure is the key. I have a new PR4 site that got 30,000+ pages indexed within 3 weeks, just because of a curious bot and several strategically placed deep links.

GoogleBot seems to get bored if she has to always start at the same place and follow the same line for the milk and cookies every time. She's an explorer hungry for new stuff, but you have to help her to find the way :)

Bobby_Davro

2:00 pm on Dec 13, 2003 (gmt 0)

10+ Year Member



It is also worth pointing out that different data centres appear to hold different numbers of pages. For example (a real one), a good PR6 site might have a permanent 70K limit on '-gv' but a 100K limit on '-dc'.

This may mean that Google targets different datacentres to the regions that they feed, or that the separate datacentres have unequal capacities, or that they use different algorithms. Either way the data held by the centres varies (or is reported as varying).

I have also noticed that googlebot will crawl a lot more pages than it ever adds. I watched a new site (PR4) from launch: it had 22K pages crawled in the first month, but just 3K added. This wasn't due to duplicate content either.

johnser

2:31 pm on Dec 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Good point Bobby_Davro - What do you do to encourage pages to stick?

BrianK

4:58 pm on Dec 13, 2003 (gmt 0)

10+ Year Member



>>What do you do to encourage pages to stick?

I'd be curious about this also - we launched a new site right as Florida was unfolding - all the URLs are different as we moved to .htm extensions from .asp. I've watched 3 crawls that got the pages in www3, then www2, and then they disappear - it's happened 3 times (our home page is PR6 and gets spidered daily). As some pages get PageRank, they seem to stick, so I *think* that's got something to do with it, but even a couple of pages with 0 rank are sticking... it's a mystery to me.

edit>> BTW - we've done 301 redirects on all the old pages to the appropriate new pages

Brian

stargeek

7:13 pm on Dec 13, 2003 (gmt 0)

10+ Year Member



wow, great post, thanks for the info. any pointers on "perfect" linking structure vs "worst"?