pageoneresults - 3:18 pm on Jul 26, 2006 (gmt 0)
Here's the way I see Googlebot working on a large site like Wiki...

Bot 1 rips through the site and will index anything and everything. This bot is smart enough to follow sort queries and all sorts of other stuff that will cause issues.

Bot 2 comes around (at a later time) and does a comparison for dup content. It now has to determine which of the dup content to keep. Which one it keeps seems to be related to the number of inbound links and/or PR the page has.

I think in Wiki's case, you'll see duplicate listings appear sporadically. It takes a bit of time for Googlebot to process all of that data and "do the right thing". I think that's why we see such wild fluctuations in the page counts when doing site: searches. Google is "continually" merging and purging. ;)

In reference to case issues, Google's smart and understands that there could be a case sensitive URI structure. So it ends up indexing both the upper and lower case versions but will eventually purge one of them, usually the upper case version, unless of course your site is case sensitive.

This is where harnessing the bot comes into play: preventing the indexing of sort queries, case issues, anything that should NOT be getting indexed and/or followed. There are all sorts of ways to implement these strategies too. You gotta be careful though! ;)

Think of it this way. Let's say that you only had one chance a month of getting Googlebot to index your site. And let's say that there was a limit on the number of pages it would index. Wouldn't you want to make sure that the bot was not bouncing around your site generating sort queries, etc.?
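To make the "harnessing the bot" idea a bit more concrete, here's a rough Python sketch of one possible approach, not the only one. It works out a single canonical version of a URL by stripping sort parameters and lowercasing the path, then picks a robots meta value so only that version is offered for indexing. The parameter names in `SORT_PARAMS`, the function names, and the example.com URLs are all made up for illustration, and lowercasing the path assumes your server is NOT case sensitive.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names; substitute whatever your site uses for sorting.
SORT_PARAMS = {"sort", "order", "dir"}


def canonical_url(url: str) -> str:
    """Return the one version of a URL that should be indexed."""
    parts = urlsplit(url)
    # Drop sort-style parameters, keep everything else in its original order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SORT_PARAMS
    ]
    # Lowercasing the path assumes the server treats paths case-insensitively.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path.lower(), urlencode(kept), "")
    )


def robots_meta(url: str) -> str:
    """Pick a robots meta value: only the canonical version gets indexed."""
    return "index,follow" if url == canonical_url(url) else "noindex,follow"


if __name__ == "__main__":
    for candidate in (
        "http://example.com/widgets/blue?page=2",
        "http://example.com/Widgets/Blue?sort=price&page=2",
    ):
        print(candidate, "->", robots_meta(candidate))
```

The value from `robots_meta` would go into the page's robots meta tag. Whether you use that, a redirect to the canonical URL, or a robots.txt rule depends on your setup, so treat this as a starting point and test it before you let the bot loose on it.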