Forum Moderators: open
What are the factors that determine how many pages get crawled?
Here are the ones I can think of:
PageRank - I have a PR5 right now.
Page Response Time (my site is db driven and sometimes there are db issues)
Any other factors? Thoughts?
Thx
Impressive.
The PR of the individual pages is probably what is most important.
Check your page response times from places other than your own connection. There are lots of places on the web that will do remote traceroutes and such.
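For what it's worth, a quick local check is easy to script while you wait on the remote tools. A minimal sketch (the URL below is a placeholder, not a real site):

```python
import time
import urllib.request

def response_time(url, timeout=10):
    """Fetch a URL and return (HTTP status, elapsed seconds)."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # pull the whole body, like a crawler would
    return resp.status, time.perf_counter() - start

# Hypothetical usage:
# status, secs = response_time("http://www.example.com/")
# print(status, round(secs, 2))
```

Run it from a couple of different machines and compare - if the numbers swing wildly when your db acts up, that's the window where the bot is getting slow pages.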
With a PR5 on your main index page - I would be happy to get what you are getting now....
The question here, though, shouldn't be "how do I get it to crawl more?" but rather, "how do I get it to crawl the right stuff?" I'm sure you've got a ton of pages that you consider worthy, but in reality there are only certain ones that are really going to generate traffic and sales. Hot topics come and go in searches.
It took me six months to nudge the bot in the right direction to crawl the "hot stuff" first and then go through the rest of it as backfill. Just creating a "new and updated" page as a map won't do it, because Google won't really KNOW that that is your "new and updated" page. You've also got to work out a way to teach it that that's the page it needs to use as its seed for the crawl. This is done through internal linking structure. I imagine that most people here would call me crazy when I say that I actually had to make a conscious effort to LOWER the PR of various pages on my site just so Google would know where to crawl.
The site in my profile is the one I'm talking about if you want to have a look, but it's a lot more than just building the pages. Link structure (and quantity) is key. I wish I could be more specific for you, but it's rather a black art that I don't really fully understand myself. Keeping an eye on the habits of the "freshbot" though, will definitely help you decipher the logic (or seeming lack of logic) of the main crawl's direction.
G.
I have no idea if things have changed since, but a few months ago, I had a site which had PageRank entering only through the home page, with various paths leading deeper into the site. Sections of the site where the hierarchy 'ancestors' had fewer links were crawled deeper than sections where the hierarchy 'ancestors' had more links.
Googlebot's behaviour was as though it stopped spidering at approximately the same position on the Toolbar PageRank scale (roughly four notches off the bottom). It's possible that there's a 'didn't finish spidering in time' limit as well as a separate PageRank limit, but it seems more likely that the latter is a function of the former.
This was for pages with the same server response time; others have found that the response time makes a significant difference.
Flex.
The key here isn't to increase the quantity of pages that it'll crawl in a month. That comes naturally as you increase your inbound links and your PR goes up - and going from PR5 to PR6 takes a lot of work.
The key IS to nudge the bot toward the pages that are going to get traffic and/or sales. For me, my site's about movies and soundtracks so I pretty much know a month ahead of time which pages I need Googlebot to crawl because they'll be HOT items that month. For other sites, it may not be as easy to predict what the hot topics would be (For example, how the heck would I know that Pete Townshend was going to be a hot topic right now?)
Figuring out which pages are hot, then getting the bot to crawl all of them, is a lot of work, but it's far less work and a lot more effective than trying to get the bot to crawl more/deeper. Once one masters that (well, you probably can't MASTER it, but once you stumble on the linking/navigation formula that seems to do the trick) then it really doesn't matter that 80% of the site content doesn't get crawled. The stuff that people are going to be looking for frequently DID get crawled. Then they bookmark my site and come back to me directly to search for those hard-to-find and less popular things. :)
G.
I have a slightly different strategy about getting pages indexed, though. Some of our "hot" pages get about 50-100 hits a day, but overall, the "hot" pages probably contribute 20% or less of the overall Google search hits. MANY (75%+) of my hits are "one-offs" on the tens of thousands of other indexed pages. So I have to make sure that as much of this ground as possible gets covered. So basically, I have a paged index of links into the site (in a Yahoo directory structure, sorta).
I've come to the conclusion that I can't worry about internal PageRank - I have internal pages with PR1 that get hit more than ones with PR4. It seems to have more to do with the page title, honestly. I'd rather have more pages in the Google index, even at lower PR.
Thanks!
G.
In terms of response time, I've certainly had Googlebot crash my site (returning several dozen 500 errors). Googlebot seems to lay off for a while, but then comes back and grabs more pages. I can't see any correlation between the crawls where Googlebot had to take a break and the ones that were clean.
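If you want to put numbers on that, you can tally Googlebot's status codes straight out of the access log. A rough sketch assuming an Apache-style combined log format - adjust the regex for whatever your server actually writes:

```python
import re

# Combined log format: ip ident user [date] "request" status bytes "referer" "agent"
LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "([^"]*)" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)

def googlebot_status_counts(lines):
    """Count Googlebot requests per HTTP status code."""
    counts = {}
    for line in lines:
        m = LINE.match(line)
        if m and "Googlebot" in m.group(3):  # group 3 = user agent
            status = m.group(2)
            counts[status] = counts.get(status, 0) + 1
    return counts

# Hypothetical usage:
# with open("access_log") as f:
#     print(googlebot_status_counts(f))
```

A spike of 500s clustered in one crawl window is the "crashed the site" signature; a clean crawl shows mostly 200s.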
As grumpus says, you can clearly help Googlebot understand which pages you want crawled by putting those links on your high-PR pages. For me, I may have 100k favorite pages, but I'm still trying to figure out how to get all 500k pages in the index. I have seen sites with a PR7 that have twice as many pages in the index as mine. Perhaps they have more links to interior pages than I do ... not sure.
I have tried to reverse engineer how Google decides which paths to follow in the main crawl, but without success. I can tell you that it does crawl many pages that end up with a PR1 ranking, so it doesn't stop at (PR-Home - N) or anything like that.
Has anyone ever built a site map that points to 500k pages as spider food?
I've got my site set up so that you can hit every single page through a standard hyperlink, so it's all crawlable. I've banned a good number of pages (i.e. I don't let the bot crawl my "quotes and trivia" page for any movie, but it DOES crawl the main details page for that movie, which has a link to the quotes page). I'd guess that on the database end of my site (not counting the store side at all) there are roughly 600K pages that the bot is invited to crawl. If my PR ever got high enough, I'd like to hope that it might crawl all of them, but I think "time" becomes the limiting factor at some point. To get more pages in the time allowed for the crawl, the bot would have to be more aggressive, which would crash the server more often and end up fetching fewer pages in the end.
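For what it's worth, that kind of banning is just plain prefix rules in robots.txt. The paths below are hypothetical, only to show the shape - the idea is that the quotes/trivia pages are blocked while the movie details pages that link to them stay crawlable:

```
# robots.txt (hypothetical paths)
User-agent: *
Disallow: /movies/quotes/
Disallow: /movies/trivia/
```

Note that these are simple path-prefix matches, so they only work cleanly if the pages you want to ban share a common URL prefix.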
G.
So does there come a point when you need to use robots.txt to steer the 'bot away from pages that you think are less valuable entry points to your site?
How do you know when you've reached that point?
Does this technique run the risk of teaching the 'bot to lose interest earlier?
Twice last year, we moved a large dynamic site from an older box to a brand new box running dual Xeons. In both cases, the number of pages indexed by Google doubled.
The quicker you can serve the pages, the more pages they can grab in the amount of time allotted for your site.
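That tracks with simple arithmetic: if the bot has a fixed time budget for your site and fetches pages one at a time, pages crawled is roughly budget divided by response time. A back-of-the-envelope sketch (my own model, not anything Google has published):

```python
def pages_per_crawl(crawl_seconds, avg_response_seconds):
    """Rough upper bound on pages fetched in a fixed crawl window,
    assuming one fetch at a time and no pauses between requests."""
    return int(crawl_seconds / avg_response_seconds)

# Halving the response time roughly doubles the pages per window:
# pages_per_crawl(3600, 1.0) -> 3600
# pages_per_crawl(3600, 0.5) -> 7200
```

Which would explain the doubling: the new box presumably cut the average response time roughly in half.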
Grumpus: I'll let you know on the next update if my home page PR goes down. Hope not.
Every page on my site can be crawled through a static link, but some are very deep (i.e. a bot starting at the top would have to follow 10 levels to get to the leaf page). My site map question was about whether anyone had created a site map that would let the bot find all 500k pages only 4 levels down (the site map page has 100 links to 100 pages, each of which has 100 links to 100 more pages, etc.).
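The fan-out arithmetic for a map like that checks out. A quick sketch:

```python
def reachable(fanout, depth):
    """Pages reachable within `depth` clicks of a hub page,
    with `fanout` links per page (the hub itself not counted)."""
    return sum(fanout ** d for d in range(1, depth + 1))

# With 100 links per page, 3 clicks below the site map already cover
# 100 + 10,000 + 1,000,000 = 1,010,100 pages -- plenty of room for 500k.
```

So a 100-way map puts every one of the 500k pages within 3 clicks of the hub, well inside the 4-level target.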