
Google News Archive Forum

Google now twice as hard?

 7:38 pm on Nov 11, 2004 (gmt 0)

Now that Google has nearly doubled its index to 8,058,044,651 pages, does this mean we have twice as much competition to optimise against?

If so, thanks a big one G!

Your comments appreciated.



 2:00 pm on Nov 20, 2004 (gmt 0)

No, it is just the opposite. It is easier to optimize now that all those PR1 and PR2 pages that never made it into the index are there... think about it.


 4:16 pm on Nov 20, 2004 (gmt 0)

You're in the same place. If they needed a deep, deep crawl to even be seen, those pages are no threat to you.


 5:42 pm on Nov 20, 2004 (gmt 0)

If they needed a deep, deep crawl to even be seen, those pages are no threat to you.

I'm not sure crawling was the issue. The 4 billion-page threshold was likely a byproduct of using 32-bit numbers to keep track of pages. They just couldn't index more than 4,294,967,296 (2^32) pages; there was no straightforward way to reference more than that.

I don't know whether Google has gone to a 64-bit design (1.84*10^19 indexable pages) or is now using sets of 32-bit caches in each data center. Or they could even be using some other methodology.

BTW, to actually keep track of a googol of pages (10^100), they would need to be using 333-bit numbers. :) Not to mention, we webmasters would need to get a LOT more prolific.

-- Rich
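
The figures in the post above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper names `max_ids` and `bits_needed` are made up for illustration and say nothing about how Google actually numbers pages.

```python
def max_ids(bits: int) -> int:
    """Number of distinct IDs an unsigned integer of this width can hold."""
    return 2 ** bits


def bits_needed(count: int) -> int:
    """Smallest width, in bits, that gives `count` items distinct IDs."""
    # IDs run from 0 to count - 1, so the width is set by the largest ID.
    return max(1, (count - 1).bit_length())


print(max_ids(32))                # the 32-bit ceiling quoted in the post
print(f"{max_ids(64):.3e}")       # the 64-bit ceiling, in scientific notation
print(bits_needed(10 ** 100))     # width needed for a googol of pages
print(bits_needed(8_058_044_651)) # width needed for the new index size
```

The last line shows why the new page count matters to this argument: 8,058,044,651 no longer fits in a 32-bit counter.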


 6:10 pm on Nov 20, 2004 (gmt 0)

If they are looking for so many more pages, deep crawls, etc., then they could always try adding all my pages to their index. A couple of my sites have hundreds of pages not included.


 9:51 pm on Nov 20, 2004 (gmt 0)




 4:05 am on Nov 21, 2004 (gmt 0)

They should have worried about providing good results for the first 4 billion before expanding. They can add 8 trillion pages and it won't matter if their results are all crap.


 4:18 am on Nov 22, 2004 (gmt 0)


I took your advice and reviewed the original white paper. Figure 3, associated with sections 4.2.5 and 4.2.6, would seem to indicate the docIDs of the Stanford era of Google were 27 bits in length, not the 32 I had so rashly presumed. Five bits were being used for some sort of hit count ("nhits").

-- Rich
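
For anyone wondering what a 27-bit docID sharing a word with a 5-bit hit count would look like, here is a rough sketch. The field order (docID in the high bits, nhits in the low bits) and the names `pack` and `unpack` are assumptions for illustration; the post only gives the two field widths.

```python
DOCID_BITS = 27  # width of the docID field, per the post
NHITS_BITS = 5   # width of the hit-count field, per the post


def pack(doc_id: int, nhits: int) -> int:
    """Pack a docID and a hit count into one 32-bit word (assumed layout)."""
    if not 0 <= doc_id < (1 << DOCID_BITS):
        raise ValueError("doc_id does not fit in 27 bits")
    if not 0 <= nhits < (1 << NHITS_BITS):
        raise ValueError("nhits does not fit in 5 bits")
    return (doc_id << NHITS_BITS) | nhits


def unpack(word: int) -> tuple[int, int]:
    """Split a packed 32-bit word back into (doc_id, nhits)."""
    return word >> NHITS_BITS, word & ((1 << NHITS_BITS) - 1)


word = pack(123_456, 7)
print(unpack(word))       # round-trips to the original pair
print(1 << DOCID_BITS)    # how many distinct docIDs 27 bits allows
```

With this layout the docID field tops out at 2^27 = 134,217,728 distinct values, far short of 4 billion.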


 8:02 pm on Nov 22, 2004 (gmt 0)

When they stalled out at 4 billion pages, I always felt it was safe to assume they had a 32-bit problem somewhere in their system.


 8:21 pm on Nov 22, 2004 (gmt 0)

Google has been steadily increasing the number of pages indexed. They just haven't bothered to update the number on their front page.

If you look at the history of that number, I think the quickest they updated it was 3 months, and 6+ months is not unheard of.

As for the 32-bit myth, do you really think that it would take them all that long to fix that if it was a problem?

They have some mighty fine hackers at the plex, and I would be incredibly shocked if they have not had a typedef of the index value in use from the start. Combine that with code reviews and it is extremely likely that all that would be involved is changing that one line of code, recompiling, rebuilding the index, testing, and shipping the new index and code to the datacentres.

Of all those steps, shipping out the new index would take the longest, a couple of weeks at most.

I thought this myth was buried months ago.
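
The typedef point can be illustrated with a small sketch. Python's `ctypes` fixed-width integers stand in for a C typedef here; the alias `DocId` and the helper `as_doc_id` are hypothetical and are not taken from any real Google code.

```python
import ctypes

# Hypothetical alias playing the role of a C typedef: every place that
# stores a document ID goes through DocId, so widening it is a one-line edit.
DocId = ctypes.c_uint32  # change to ctypes.c_uint64 to widen the ID


def as_doc_id(n: int, id_type=None) -> int:
    """Return n as it would be stored in the given fixed-width ID type."""
    id_type = id_type or DocId
    # ctypes integer types do no overflow checking; out-of-range values wrap.
    return id_type(n).value


print(as_doc_id(2 ** 32 - 1))               # largest ID that fits in 32 bits
print(as_doc_id(2 ** 32))                   # one past the limit wraps around
print(as_doc_id(2 ** 32, ctypes.c_uint64))  # fits once the type is widened
```

As the post says, changing the alias is the easy part; anything already written out with the old width still has to be rebuilt.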


 11:40 pm on Nov 22, 2004 (gmt 0)

27 bits would be only about 134 million pages (2^27 = 134,217,728). :)

Close review of the original white paper will indicate that the docID size and/or length isn't even specified.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved