lucy24 - 11:13 pm on Feb 4, 2013 (gmt 0)
That is a darned interesting question. What is even more interesting is that when I tested it for myself, the number of hits in a search was bigger than the "total number indexed".
Let me lay out my actual numbers and then people can compare.
"Total indexed": 232*
Search for site:example.com with blank search terms: "about 264" results, of which 202 were displayed, ending with the notice:
"In order to show you the most relevant results, we have omitted some entries very similar to the 202 already displayed."
If I now go to "include omitted results", the total jumps to... not 264 but 209.
As a follow-up, I did the same thing, this time giving "the" as the search string. The estimated total now drops to 246. (I have some pages that are not in English, but definitely not eighteen of them, let alone eighteen that don't include the word "the" in boilerplate somewhere.) This time the displayed total is 203, with no "very similar" or "repeat search" option.
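For anyone who wants to line the numbers up side by side, here is a quick Python tally. It only restates the counts quoted above; the variable names are my own labels, not anything the search engine or Webmaster Tools reports.

```python
# Counts as quoted in this post; the labels are made up for illustration.
total_indexed = 232        # "Total indexed"
estimate_blank = 264       # "about 264" for the blank site: search
shown_blank = 202          # results actually displayed
shown_with_omitted = 209   # displayed after "include omitted results"
estimate_the = 246         # estimate when searching for "the"
shown_the = 203            # displayed when searching for "the"

# Gaps between the various figures.
gaps = {
    "estimate minus total indexed": estimate_blank - total_indexed,
    "estimate minus displayed (with omitted)": estimate_blank - shown_with_omitted,
    "extras revealed by 'include omitted'": shown_with_omitted - shown_blank,
    "estimate lost by adding 'the'": estimate_blank - estimate_the,
}

for label, value in gaps.items():
    print(f"{label}: {value}")
```

The third gap (7) is the set of extra entries worth hunting down by hand; the fourth (18) is the "eighteen" pages that supposedly lack the word "the".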
At this point curiosity got the better of me and-- you knew this was coming, didn't you?-- I saved those 14 pages (30 items per page) from the two versions of the empty search for offline scrutiny.
The first extra comes in on pg3; it's the pdf version of an html page. The second and third are on pg5; both are in roboted-out directories. (I don't mind if people know the pages exist, I just don't want the content indexed.) But at this point there was some further glitch, because the with-omitted-pages version of pg5 has only 27 items instead of 30. The three missing items neatly account for three more duplicates-- again pdf versions of html pages. The seventh is also absent, although I can now figure out what it should be.
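I did the comparison by eye, but the same thing could be sketched in a few lines of Python. This assumes you have already pulled the result URLs out of the saved pages into two lists, in listing order. The function names and the URLs below are hypothetical, made up purely to show the idea; they are not my real pages.

```python
PER_PAGE = 30  # items per results page, as in the saved pages


def extra_results(collapsed, with_omitted):
    """URLs that appear only in the with-omitted listing, in listing order."""
    seen = set(collapsed)
    return [url for url in with_omitted if url not in seen]


def results_page(url, listing, per_page=PER_PAGE):
    """1-based results page on which a URL appears in a listing."""
    return listing.index(url) // per_page + 1


# Hypothetical data: 90 invented html pages, plus one invented pdf twin
# slipped into the with-omitted listing.
collapsed = [f"http://example.com/page{i}.html" for i in range(1, 91)]
with_omitted = list(collapsed)
with_omitted.insert(70, "http://example.com/page5.pdf")

extras = extra_results(collapsed, with_omitted)
for url in extras:
    print(url, "first appears on results page", results_page(url, with_omitted))
```

A plain set difference would find the same extras, but keeping the listing order makes it easy to say which results page each one turned up on.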
Further interesting question: When you feed in nothing but a site name, how does the algorithm decide what order to list things in? The top-level index pages come first-- with one interesting exception. These are followed by some-- but not all-- of the pages that come up most often in searches. After that, it's anyone's guess. As are the out-of-expected-sequence pages.
* About a month back, there was an interesting hiccup where the "not selected" number went sharply down while the "total indexed" number went slightly up. All other numbers ("ever crawled", roboted-out and I forget the rest) stayed the same. So I'm left with about twelve pages that simply vanished into the ether, disappearing from Not Selected but not resurfacing in Indexed.