|Google now twice as hard?|
Now that Google has nearly doubled its index to 8,058,044,651 pages, does this mean we have twice as much competition to optimise against?
If so, thanks a big one G!
Your comments appreciated.
No, it is just the opposite - it's easier to optimize now that all those PR1 and PR2 pages that never made it into the index are there... think about it.
You're in the same place. If they needed a deep, deep crawl to even be seen, those pages are no threat to you.
|If they needed a deep, deep crawl to even be seen, those pages are no threat to you. |
I'm not sure crawling was the issue. The 4-billion-page threshold was likely a byproduct of using 32-bit numbers to keep track of pages. They just couldn't index more than 4,294,967,296 (2^32) pages; there was no straightforward way to reference more than that.
I don't know whether Google has gone to a 64-bit design (1.84*10^19 indexable pages) or is now using sets of 32-bit caches in each data center. Or they could even be using some other methodology.
BTW, to actually keep track of a googol of pages (10^100), they would need to be using 333-bit numbers. :) Not to mention, we webmasters would need to get a LOT more prolific.
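For anyone who wants to check the arithmetic in the posts above, here is a minimal sketch. It assumes nothing about how Google actually stores page IDs; `capacity` and `bits_needed` are hypothetical helper names made up for illustration.

```python
# Sanity check of the bit-width figures quoted in this thread.
# Nothing here reflects Google's real design; the helper names are
# hypothetical and exist only for this illustration.

def capacity(bits: int) -> int:
    """Number of distinct page IDs an unsigned integer of this width can hold."""
    return 2 ** bits


def bits_needed(pages: int) -> int:
    """Smallest unsigned width that can number `pages` pages as 0 .. pages-1."""
    return max(1, (pages - 1).bit_length())


if __name__ == "__main__":
    print(capacity(32))                # 32-bit ceiling
    print(f"{capacity(64):.2e}")       # 64-bit ceiling
    print(bits_needed(8_058_044_651))  # the new front-page count
    print(bits_needed(10 ** 100))      # a googol of pages
```

The new front-page count of 8,058,044,651 no longer fits in 32 bits (it needs 33), which is why the 32-bit theory came up in the first place.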
If they are looking for so many more pages, deep crawls, etc., then they could always try adding all my pages to their index. A couple of my sites have hundreds of pages not included.
GOOGLE NEVER USED A 32-BIT NUMBER TO REPRESENT DOCUMENTS. READ THE ORIGINAL WHITE PAPER.
They should have worried about providing good results for the first 4 billion before expanding. They can add 8 trillion pages and it won't matter if their results are all crap.
I took your advice and reviewed the original white paper. Figure 3, associated with sections 4.2.5 and 4.2.6, would seem to indicate the docIDs of the Stanford era of Google were 27 bits in length, not the 32 I had so rashly presumed. Five bits were being used for some sort of hit count ("nhits").
When they stalled out at 4 billion pages, I always felt it was safe to assume they had a 32-bit problem somewhere in their system.
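To make the "27 bits of docID plus 5 bits of nhits" reading concrete, here is a rough sketch of packing two such fields into a single 32-bit word. The layout is only an assumption based on how the figure is described in this post; the constant and function names are hypothetical, and this is not code from the paper or from Google.

```python
# Hypothetical layout: upper 27 bits = docID, lower 5 bits = nhits.
# This only illustrates the reading of the figure discussed in the thread.

DOCID_BITS = 27
NHITS_BITS = 5
MAX_DOCID = (1 << DOCID_BITS) - 1   # largest docID that fits in 27 bits
MAX_NHITS = (1 << NHITS_BITS) - 1   # largest hit count that fits in 5 bits


def pack(doc_id: int, nhits: int) -> int:
    """Combine a docID and a hit count into one 32-bit word."""
    if not 0 <= doc_id <= MAX_DOCID:
        raise ValueError("doc_id does not fit in 27 bits")
    if not 0 <= nhits <= MAX_NHITS:
        raise ValueError("nhits does not fit in 5 bits")
    return (doc_id << NHITS_BITS) | nhits


def unpack(word: int) -> tuple[int, int]:
    """Split a 32-bit word back into (doc_id, nhits)."""
    return word >> NHITS_BITS, word & MAX_NHITS


if __name__ == "__main__":
    word = pack(123_456, 7)
    print(unpack(word))
    print(MAX_DOCID + 1)  # number of distinct docIDs under this layout
```

Under this layout a 27-bit docID tops out at 2^27 distinct documents, far short of the billions in the index, so whatever the current system does, it cannot be this exact scheme.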
Google has been steadily increasing the number of pages indexed. They just haven't bothered to update the number on their front page.
If you look at the history of that number, I think the quickest they updated it was 3 months, and 6+ months is not unheard of.
As for the 32-bit myth, do you really think that it would take them all that long to fix that if it was a problem?
They have some mighty fine hackers at the plex, and I would be incredibly shocked if they have not had a typedef of the index value in use from the start. Combine that with code reviews and it is extremely likely that all that would be involved is changing that one line of code, recompiling, rebuilding the index, testing, and shipping the new index and code to the datacentres.
Of all those steps, shipping out the new index would take the longest, a couple of weeks at most.
I thought this myth was buried months ago.
27 bits would be only 134 million pages (2^27 = 134,217,728). :)
Close review of the original white paper will indicate that the docID size and/or length isn't even specified.