If they needed a deep, deep crawl to even be seen, those pages are no threat to you.
I don't know whether Google has gone to a 64-bit design (1.84*10^19 indexable pages) or is now using sets of 32-bit caches in each data center. Or they could even be using some other methodology.
BTW, to actually keep track of a googol of pages (10^100), they would need to be using 333-bit numbers. :) Not to mention, we webmasters would need to get a LOT more prolific.
-- Rich
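For anyone who wants to check the arithmetic in the post above, here is a minimal sketch. The helper name `bits_needed` is made up for illustration, and it assumes each page simply gets one distinct integer ID.

```python
def bits_needed(n_pages: int) -> int:
    """Smallest bit width whose ID space (2**bits values) covers n_pages distinct pages."""
    if n_pages < 1:
        raise ValueError("n_pages must be at least 1")
    # IDs run from 0 to n_pages - 1, so we need enough bits for the largest ID.
    return max(1, (n_pages - 1).bit_length())


if __name__ == "__main__":
    print(2 ** 32)                 # ID space of a 32-bit design
    print(2 ** 64)                 # ID space of a 64-bit design, about 1.84 * 10**19
    print(bits_needed(10 ** 100))  # bits required to number a googol of pages
```

Python integers are arbitrary precision, so the googol case is computed exactly rather than through floating-point logarithms.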
I took your advice and reviewed the original white paper. Figure 3, associated with sections 4.2.5 and 4.2.6, would seem to indicate the docIDs of the Stanford era of Google were 27 bits in length, not the 32 I had so rashly presumed. Five bits were being used for some sort of hit count ("nhits").
-- Rich
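To make the 27 + 5 split concrete, here is a rough sketch of packing a docID and a hit count into one 32-bit word. The field widths come from the post above; the field order (docID in the high bits) and the function names are assumptions for illustration, not something taken from the paper.

```python
DOCID_BITS = 27
NHITS_BITS = 5
DOCID_MAX = (1 << DOCID_BITS) - 1  # largest docID that fits in 27 bits
NHITS_MAX = (1 << NHITS_BITS) - 1  # largest hit count that fits in 5 bits


def pack_entry(docid: int, nhits: int) -> int:
    """Pack docid (27 bits) and nhits (5 bits) into one 32-bit integer."""
    if not 0 <= docid <= DOCID_MAX:
        raise ValueError("docid does not fit in 27 bits")
    if not 0 <= nhits <= NHITS_MAX:
        raise ValueError("nhits does not fit in 5 bits")
    # Assumed layout: docid in the high 27 bits, nhits in the low 5 bits.
    return (docid << NHITS_BITS) | nhits


def unpack_entry(word: int) -> tuple[int, int]:
    """Recover (docid, nhits) from a word produced by pack_entry."""
    return word >> NHITS_BITS, word & NHITS_MAX


if __name__ == "__main__":
    word = pack_entry(12345, 7)
    print(hex(word), unpack_entry(word))
    print(DOCID_MAX + 1)  # number of distinct docIDs a 27-bit field can hold
```

Under this layout a 27-bit field tops out at 2^27 distinct docIDs, well short of what a full 32-bit ID would allow.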
If you look at the history of that number, I think the quickest they updated it was 3 months, and 6+ months is not unheard of.
As for the 32-bit myth, do you really think that it would take them all that long to fix that if it was a problem?
They have some mighty fine hackers at the plex, and I would be incredibly shocked if they have not had a typedef of the index value in use from the start. Combine that with code reviews and it is extremely likely that all that would be involved is changing that one line of code, recompiling, rebuilding the index, testing and shipping the new index and code to the datacentres.
Of all those steps, shipping out the new index would take the longest, a couple of weeks at most.
I thought this myth was buried months ago.
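The "one line of code" argument can be illustrated with a small sketch. This is a Python stand-in for the C typedef idea and says nothing about how Google's code is actually written; `DOCID_FORMAT`, `encode_docid` and `decode_docid` are hypothetical names. The point is only that when every reader and writer goes through one definition, widening the ID is a single edit followed by a rebuild of the stored data.

```python
import struct

# Hypothetical single point of definition for the docID width, playing the
# role a C typedef would. "<I" is a 32-bit unsigned little-endian integer;
# changing this one line to "<Q" would widen every docID to 64 bits.
DOCID_FORMAT = "<I"
DOCID_SIZE = struct.calcsize(DOCID_FORMAT)  # bytes per stored docID


def encode_docid(docid: int) -> bytes:
    """Serialize a docID using whatever width DOCID_FORMAT defines."""
    return struct.pack(DOCID_FORMAT, docid)


def decode_docid(raw: bytes) -> int:
    """Deserialize a docID written by encode_docid."""
    return struct.unpack(DOCID_FORMAT, raw)[0]


if __name__ == "__main__":
    raw = encode_docid(4_000_000_000)
    print(DOCID_SIZE, decode_docid(raw))
```

Anything already written at the old width would still have to be regenerated, which matches the "rebuilding the index" step in the post.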