Forum Moderators: open
What are the factors that determine how many pages get crawled?
Here are the ones I can think of:
PageRank - I have a PR5 right now.
Page Response Time (my site is db driven and sometimes there are db issues)
Any other factors? Thoughts?
Thx
Impressive.
The PR of the individual pages is probably what is most important.
Check your page response times from places other than your own connection. There are lots of places on the web that will do remote traceroutes and such.
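For what it's worth, a quick local check is easy to script while you wait on the remote tools. A minimal sketch (the URL below is a placeholder, not a real site):

```python
import time
import urllib.request

def response_time(url, timeout=10):
    """Fetch a URL and return (HTTP status, elapsed seconds)."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # pull the whole body, like a crawler would
    return resp.status, time.perf_counter() - start

# Hypothetical usage:
# status, secs = response_time("http://www.example.com/")
# print(status, round(secs, 2))
```

Run it from a couple of different machines and compare - if the numbers swing wildly when your db acts up, that's the window where the bot is getting slow pages.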
With a PR5 on your main index page - I would be happy to get what you are getting now....
The question here, though, shouldn't be "how do I get it to crawl more?" but rather, "how do I get it to crawl the right stuff?" I'm sure you've got a ton of pages that you consider worthy, but in reality there are only certain ones that are really going to generate traffic and sales. Hot topics come and go in searches.
It took me six months to nudge the bot in the right direction to crawl the "hot stuff" first and then go through the rest of it as backfill. Just creating a "new and updated" page as a map won't do it, because Google won't really KNOW that that is your "new and updated" page. You've also got to work out a way to teach it that that's the page it needs to use as its seed for the crawl. This is done through internal linking structure. I imagine that most people here would call me crazy when I say that I actually had to make a conscious effort to LOWER the PR of various pages on my site just so Google would know where to crawl.
The site in my profile is the one I'm talking about if you want to have a look, but it's a lot more than just building the pages. Link structure (and quantity) is key. I wish I could be more specific for you, but it's rather a black art that I don't really fully understand myself. Keeping an eye on the habits of the "freshbot" though, will definitely help you decipher the logic (or seeming lack of logic) of the main crawl's direction.
G.
I have no idea if things have changed since, but a few months ago, I had a site which had PageRank entering only through the home page, with various paths leading deeper into the site. Sections of the site where the hierarchy 'ancestors' had fewer links were crawled deeper than sections where the hierarchy 'ancestors' had more links.
Googlebot's behaviour was as though it stopped spidering at approximately the same position on the Toolbar PageRank scale (roughly four notches off the bottom). It's possible that there's a 'didn't finish spidering in time' limit as well as a separate PageRank limit, but it seems more likely that the latter is a function of the former.
This was for pages with the same server response time; others have found that the response time makes a significant difference.
Flex.
The key here isn't to increase the quantity of pages that it'll crawl in a month. That comes naturally as you increase your inbound links and your PR goes up - and going from PR5 to PR6 takes a lot of work.
The key IS to nudge the bot toward the pages that are going to get traffic and/or sales. For me, my site's about movies and soundtracks so I pretty much know a month ahead of time which pages I need Googlebot to crawl because they'll be HOT items that month. For other sites, it may not be as easy to predict what the hot topics would be (For example, how the heck would I know that Pete Townshend was going to be a hot topic right now?)
Figuring out which pages are hot, then getting the bot to crawl all of them, is a lot of work, but it's far less work and a lot more effective than trying to get the bot to crawl more/deeper. Once one masters that (well, you probably can't MASTER it, but once you stumble on the linking/navigation formula that seems to do the trick) then it really doesn't matter that 80% of the site content doesn't get crawled. The stuff that people are going to be looking for frequently DID get crawled. Then they bookmark my site and come back to me directly to search for those hard-to-find and less popular things. :)
G.
I have a slightly different strategy about getting pages indexed, though. Some of our "hot" pages get about 50-100 hits a day, but overall, the "hot" pages probably contribute 20% or less of the overall Google search hits. MANY (75%+) of my hits are "one-offs" on the tens of thousands of other indexed pages. So I have to make sure that as much of this ground as possible gets covered. So basically, I have a paged index of links into the site (in a Yahoo directory structure, sorta).
I've come to the conclusion that I can't worry about internal PageRank - I have internal pages with PR1 that get hit more than ones with PR4. It seems to have more to do with the page title, honestly. I'd rather have more pages in the Google index, even at lower PR.
Thanks!
G.
In terms of response time, I've certainly had Googlebot crash my site (returning several dozen 500 errors). Googlebot seems to lay off for a while, but then comes back and grabs more pages. I can't see any correlation between the crawls where Googlebot had to take a break and the ones that were clean.
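If you want to put numbers on that, you can tally Googlebot's status codes straight out of the access log. A rough sketch assuming an Apache-style combined log format - adjust the regex for whatever your server actually writes:

```python
import re

# Combined log format: ip ident user [date] "request" status bytes "referer" "agent"
LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "([^"]*)" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)

def googlebot_status_counts(lines):
    """Count Googlebot requests per HTTP status code."""
    counts = {}
    for line in lines:
        m = LINE.match(line)
        if m and "Googlebot" in m.group(3):  # group 3 = user agent
            status = m.group(2)
            counts[status] = counts.get(status, 0) + 1
    return counts

# Hypothetical usage:
# with open("access_log") as f:
#     print(googlebot_status_counts(f))
```

A spike of 500s clustered in one crawl window is the "crashed the site" signature; a clean crawl shows mostly 200s.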
As grumpus says, you can clearly help Googlebot understand which pages you want crawled by putting those links on your high-PR pages. For me, I may have 100k favorite pages, but I'm still trying to figure out how to get all 500k pages in the index. I have seen sites with a PR7 that have twice as many pages in the index as mine. Perhaps they have more links to interior pages than I do ... not sure.
I have tried to reverse engineer how Google decides which paths to follow in the main crawl, but without success. I can tell you that it does crawl many pages that end up with a PR1 ranking, so it doesn't stop at (PR-Home - N) or anything like that.
Has anyone ever built a site map that points to 500k pages as spider food?
I've got my site set up so that you can hit every single page through a standard hyperlink, so it's all crawlable. I've banned a good number of pages (i.e. I don't let the bot crawl my "quotes and trivia" page for any movie, but it DOES crawl the main details page for that movie, which has a link to the quotes page). I'd guess that on the database end of my site (not counting the store side at all) there are roughly 600K pages that the bot is invited to crawl. If my PR ever got high enough, I'd like to hope that it might crawl all of them, but I think "time" becomes the limiting factor at some point. To get more pages in the time allowed for the crawl, the bot would have to be more aggressive, which would crash the server more often and end up fetching fewer pages in the end.
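For what it's worth, that kind of banning is just plain prefix rules in robots.txt. The paths below are hypothetical, only to show the shape - the idea is that the quotes/trivia pages are blocked while the movie details pages that link to them stay crawlable:

```
# robots.txt (hypothetical paths)
User-agent: *
Disallow: /movies/quotes/
Disallow: /movies/trivia/
```

Note that these are simple path-prefix matches, so they only work cleanly if the pages you want to ban share a common URL prefix.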
G.
So does there come a point when you need to use robots.txt to steer the 'bot away from pages that you think are less valuable entry points to your site?
How do you know when you've reached that point?
Does this technique run the risk of teaching the 'bot to lose interest earlier?
Twice last year, we moved a large dynamic site from an older box to a brand new box running dual Xeons. In both cases, the number of pages indexed by Google doubled.
The quicker you can serve the pages, the more pages they can grab in the amount of time allotted for your site.
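That tracks with simple arithmetic: if the bot has a fixed time budget for your site and fetches pages one at a time, pages crawled is roughly budget divided by response time. A back-of-the-envelope sketch (my own model, not anything Google has published):

```python
def pages_per_crawl(crawl_seconds, avg_response_seconds):
    """Rough upper bound on pages fetched in a fixed crawl window,
    assuming one fetch at a time and no pauses between requests."""
    return int(crawl_seconds / avg_response_seconds)

# Halving the response time roughly doubles the pages per window:
# pages_per_crawl(3600, 1.0) -> 3600
# pages_per_crawl(3600, 0.5) -> 7200
```

Which would explain the doubling: the new box presumably cut the average response time roughly in half.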
Grumpus: I'll let you know on the next update if my home page PR goes down. Hope not.
Every page on my site can be crawled through a static link, but some are very deep (i.e. a bot starting at the top would have to follow 10 levels to get to the leaf page). My site map question was about whether anyone had created a site map that would let the bot find all 500k pages only 4 levels down (the site map page has 100 links to 100 pages, each of which has 100 links to 100 more pages, etc.).
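The fan-out arithmetic for a map like that checks out. A quick sketch:

```python
def reachable(fanout, depth):
    """Pages reachable within `depth` clicks of a hub page,
    with `fanout` links per page (the hub itself not counted)."""
    return sum(fanout ** d for d in range(1, depth + 1))

# With 100 links per page, 3 clicks below the site map already cover
# 100 + 10,000 + 1,000,000 = 1,010,100 pages -- plenty of room for 500k.
```

So a 100-way map puts every one of the 500k pages within 3 clicks of the hub, well inside the 4-level target.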