Forum Moderators: open
Now non-image URLs have an ID stored in 4 bytes, so Google is running out of IDs for stored pages.
Once there are no more URLs coming back 'not found' to be deleted from the index, the total number of non-image files indexed (including the 3,083,324,652 HTML pages) will soon reach 4,294,967,296. After that, Google will stop adding new URLs found in indexed pages, as well as new URLs submitted for indexing.
They are now considering a reconstruction of the data tables, which involves expanding the ID field to 5 bytes.
This will add 2 bytes for every word indexed, multiplying the total index size by about 1.17.
The procedure will require 1,000 new page-index servers and additional storage for temporary tables.
They are hoping to make the change gradually, server by server.
Completing the process will take up to one year, after which the main URL index will be switched to 5-byte IDs.
Until then, new URLs will not be indexed, except those that take the place of URLs returned 'not found' and deleted from the index.
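A rough sanity check on those numbers (a throwaway Python sketch; the page counts are just the figures quoted above, nothing confirmed):

    # Capacity of a 4-byte (32-bit) unsigned ID field.
    max_ids = 2 ** (8 * 4)        # 4,294,967,296
    html_pages = 3083324652       # HTML page count quoted above
    print("4-byte ID space:", max_ids)
    print("IDs left before overflow:", max_ids - html_pages)  # 1,211,642,644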
[just a guess but who knows]
[edited by: re5earcher at 12:41 pm (utc) on June 7, 2003]
A 5-byte ID field would give 1,048,575 unique IDs, which as far as I know means that only 1,048,575 URLs could be indexed using such a system.
The current index has 3,083,324,652 pages (URLs), which requires 8 bytes to store all the unique IDs. That means there is still room for 1,211,642,643 URLs to be indexed before Google has to increase the number of bytes used to store unique IDs.
This is assuming the ids are stored in simple binary form.
Perhaps a DB pro can provide us with more exact figures.
Whoops, sorry mixed up my bits and bytes :(
[edited by: bridge98 at 1:20 pm (utc) on June 7, 2003]
A 5-byte field would yield 2^40 possibilities; simplified, that means approx. 4 billion times 256.
>The current index has 3,083,324,652 pages (URLs), which requires 8 bytes to store all the unique IDs
8 bytes would yield 2^64 possibilities, which would be 4 billion times 4 billion.
>This is assuming the ids are stored in simple binary form.
Until we have working fuzzy-logic computers, binary form is the only way to store anything.
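To put concrete numbers on the byte widths being discussed (a minimal Python sketch, nothing Google-specific):

    # Unique IDs representable by an unsigned field of n bytes: 2 ** (8 * n)
    for n_bytes in (4, 5, 8):
        print(n_bytes, "bytes ->", 2 ** (8 * n_bytes), "unique IDs")

    # 4 bytes -> 4294967296            (2^32, ~4 billion)
    # 5 bytes -> 1099511627776         (2^40, ~4 billion * 256)
    # 8 bytes -> 18446744073709551616  (2^64, ~4 billion * 4 billion)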
>i just assumed this post was a weak joke.
Nope! I have heard a similar rumour from a reliable source.
Anyone care to sit down and think why the last crawl was LOST?
Because it overflowed 4 billion (the 32-bit limit for an unsigned integer), and very relevant pages were left out.
Sounds like a horde of PhDs can make very trivial mistakes as well... anyway, there are more PhDs in Microsoft's reception hall than in all of Google. That doesn't necessarily mean good products...
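For what it's worth, here is what "overflowing 4 billion" looks like: a toy Python sketch of a 32-bit counter wrapping (purely illustrative, since nobody outside Google knows how their IDs are actually assigned):

    MASK_32 = 0xFFFFFFFF  # largest value a 4-byte unsigned field can hold

    def next_id(current):
        """Increment a 32-bit unsigned counter, wrapping silently on overflow."""
        return (current + 1) & MASK_32

    print(next_id(4294967295))  # prints 0 -- the counter wraps back to zero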
[edited by: bolitto at 1:16 pm (utc) on June 7, 2003]
Right from the guys at data center NCC-1701!
However, everybody who works with large databases that have to be scalable knows that one of the first things to lay out is an architecture that avoids ID-length overflow (using combined ID indexes, etc.).
Although some people say the PhDs at Google have the IQ of monkeys, I'm convinced they don't. So this "news" is a funny joke and a blind guess (didn't re5 even say as much?!), nothing more.
Yeah, re5, give 'em some breadcrumbs and let 'em discuss what bread it was... he, he, he... :)
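The "combined ID indexes" point deserves a number, though: a composite key (say, a shard ID plus a per-shard counter; the names here are mine, not Google's) postpones overflow without widening any single field. A hypothetical Python sketch:

    def make_doc_id(shard, local_id):
        """Composite key: 16-bit shard ID plus 32-bit per-shard counter."""
        assert 0 <= shard < 2 ** 16 and 0 <= local_id < 2 ** 32
        return (shard, local_id)

    # 2^16 shards * 2^32 IDs each = 2^48 addressable documents in total.
    print(2 ** 16 * 2 ** 32)  # 281474976710656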
While such a revelation would certainly explain a lot, it is still inconceivable that Google would stake its very existence on such a huge mistake.
Consequently, sooner or later, speculation was bound to start. Questions have been ongoing for weeks (we've all seen the 'Google is broke' threads).
No, I have no idea whether this story is correct or not. The problem is that against a background which gives every indication of problems, an allegation of a specific nature cannot be easily dismissed, especially when it sounds tenable and could explain some of the current non-happenings.
Yes, Google's or GoogleGuy's response would be very interesting.
Google projected this change five years ago, but now is the time to do it, and while it happens the index is a mess...
What's the big fuss about? Google is changing, and it's happening on every production server for the public to see, because it HAD to happen on the production servers some day.
Yeah, they lost the last crawl because they didn't expect it to overflow yet, so NOW is the time to upgrade to a system they have already been testing for a long time.
This is what's happening, guys; what's all the emotional stuff about?
Delaying the announcement of the hoax for as long as the Dominic Update has overrun is a stylish addition...
Nicely done - almost a British sense of humour there!
Regards from Britain...
DerekH