Forum Moderators: open
Now non-image URLs each have an ID stored in 4 bytes, so Google is running out of IDs for stored pages.
Once URLs returned 'not found' are no longer being deleted from the index, the total number of non-image files indexed will soon reach 4,294,967,296, including 3,083,324,652 HTML pages. After that Google will stop adding new URLs discovered in indexed pages, as well as new URLs submitted for indexing.
They are now considering a reconstruction of the data tables, which involves expanding the ID field to 5 bytes.
This will add 2 bytes for every word indexed, multiplying the total index size by about 1.17.
The procedure will require 1,000 new page index servers plus additional storage for temporary tables.
They are hoping to make this change gradually, server by server.
Completing the process will take up to one year, after which the main URL index will be switched to 5-byte IDs.
Until then, new URLs will not be indexed, except those put in place of URLs returned 'not found' and deleted from the index.
[just a guess but who knows]
[edited by: re5earcher at 12:41 pm (utc) on June 7, 2003]
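The arithmetic behind this guess is easy to check. A quick sketch - note the 12-byte posting-record size is purely an assumption, picked only because it reproduces the claimed 1.17x growth:

```python
# Back-of-envelope arithmetic for the 32-bit docID theory above.
# The 12-byte posting-record size is a guess, not a known Google figure.

MAX_IDS = 2 ** 32                   # IDs a 4-byte field can hold
print(MAX_IDS)                      # 4294967296

HTML_PAGES = 3_083_324_652          # figure from Google's home page
print(MAX_IDS - HTML_PAGES)         # 1211642644 slots left for other documents

POSTING_BYTES = 12                  # hypothetical bytes per indexed word hit
EXTRA_BYTES = 2                     # claimed extra bytes per hit after the change
growth = (POSTING_BYTES + EXTRA_BYTES) / POSTING_BYTES
print(round(growth, 2))             # 1.17
```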
Sorry... but in the search industry recent events are a BIG story. It is EXTREMELY important to a lot of people, many of whose livelihoods depend upon it.
If it's just a fuss to you, don't read the thread. Simple.
It's ancient!
>Google projected this change 5 years ago ...
Tell me the official source, please.
>This is what's happening guys, ...
Tell me the official source, please.
If you guess please flag your posts as a guess! See the flag of re5's #1 post!
[just a guess but who knows]
Yah, who knows ....
>Is a period of 2 months not too short for such a drastic change?
Ask GoogleGuy - afaik, he's the one at the google plex who has to frequently check the pages within the index - at least at the beginning of a new update. Don't know how long it takes him to count all 3,083,324,652 pages.
Yes we do, but we choose to respect his anonymous but valid status as a Google rep here on his own time - often from home - to contribute to the community. E.g.: you don't become a senior member here with his caliber of posts on a PR stunt or a lark.
>But I'm sure he could confirm or deny this.
He could, but he won't because that is inside mission critical info that the competition shouldn't have.
Ummm, it's fairly easy to increase the size of a pointer given the core is reportedly done in ML...
Sorry, I didn't post a smiley to flag my post as an obvious joke.
>Do what GoogleGuy suggested - write content, not forum posts!
Sometimes writing forum posts is more fun - especially if things are getting confusing. ;)
So? If I had some burning info about Google where do you think I would post it? Member or no member. I don't think that has anything to do with it.
It's speculation, like all stories, until we have some form of confirmation. But it is interesting nonetheless.
I'm not quite sure why this thread has disappeared from the Active List on my PC (it's still under Google News). Of course it could be a software error at WebmasterWorld, or it could be my PC, or it could be a conspiracy!
Assuming this theory to be correct, could the change to accommodate the extra data be carried out within 2 months?
In my last job as an IT project manager, I worked at a shipping company that was running out of bills of lading numbers. Over time as the company grew, it developed more shipments than the largest number its BL field could hold.
I had the project to expand the BL number. It took a large team of people and was pretty complex. We had to identify and change every screen, every hard copy report, every transaction record for every internal and external application on every piece of code in the company at every location throughout the world, and change all of the software. Vendors, customers and government agencies who exchanged data with us electronically also had to make changes in their software to match the increased BL field size. Sometimes people had lost the source code for programs so whole programs had to be rewritten or reconstructed from the load modules.
At a shipping company the bill of lading number is everywhere. Just the analysis took months, because not all the field names were standardized and every system called the BL field something slightly different, so even finding all of the places to change took a long time.
I don't know if it is true that Google is having the same type of problems, but if they are, this kind of project usually does not have a quick fix. The only interim fix is to reuse numbers as soon as records get deleted, but that only helps so much.
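The interim fix described above - reusing numbers as soon as records get deleted - amounts to a free-list allocator. A minimal sketch, purely illustrative and not anything either company actually ran:

```python
# Sketch of an ID allocator that recycles deleted IDs before minting
# new ones -- the interim fix described above. Illustrative only.

class IdAllocator:
    def __init__(self, max_id):
        self.max_id = max_id        # e.g. 2**32 - 1 for a 4-byte field
        self.next_fresh = 0         # next never-used ID
        self.free = []              # IDs handed back by deleted records

    def allocate(self):
        if self.free:               # reuse a deleted ID first
            return self.free.pop()
        if self.next_fresh > self.max_id:
            raise OverflowError("ID space exhausted")
        new_id = self.next_fresh
        self.next_fresh += 1
        return new_id

    def release(self, id_):         # called when a record is deleted
        self.free.append(id_)

ids = IdAllocator(max_id=2**32 - 1)
a = ids.allocate()
ids.release(a)
assert ids.allocate() == a          # the deleted ID is handed out again
```

As the poster notes, this only helps while deletions keep pace with new records; it buys time but does not widen the field.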
While such a revelation would certainly explain a lot - it is still inconceivable that Google would stake its very existence on such a huge mistake.
A shipping company with net worth in the billions of dollars and a huge IT department almost had to stop accepting new business because of a problem like this. Then they had smaller but somewhat similar problem with Y2k a few years later! So I can see it happening anywhere pretty easily.
[webmasterworld.com...]
I think, looking back at the original Google paper, there are some interesting quotes: "It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents." - and they call those numbers extraordinary. Quadruple that number and you have 4 billion. A safe bet in 1996, but the question is whether Google has changed the size of docID since then.
The bit sizes of the 20th century google database are here :
[www-db.stanford.edu...] - It mentions 4, 5 and 8 bits for various data.
I believe those servers still use 32-bit processors. The only way to expand the number of "records" in the database would be to use long integers, and doing that would slow the whole thing down.
I believe this theory is technically possible.
At once, I was overcome by a sick feeling welling up in the pit of my stomach - aarrghhh! - I'm going to hit my own 4 gig limit and my precious little world on my computer will end.
Or perhaps it won't - perhaps I can go over 4 gigs and still retrieve my data from my disc...
And perhaps google will be the same.
But I love these "thought virus" infections the forum keeps suffering with - almost a kind of, if I can use the word, spam content in an otherwise good forum!
Yes, what this forum needs is a good spam filter. Perhaps google have some ideas on that <grin>
DerekH
It probably is a bit complex to change the size of some fields, but not THAT complex that it has to screw up deepbot.
I know Solaris/SPARC is 64 bits wide, and I use a 128-bit-wide processor myself right here on my desk. But I am not at all sure Google uses Solaris/SPARC servers.
I always thought 4 billion was the largest value that can be held in a single word without overflow on a 32-bit chip - the unsigned range of a 32-bit word tops out at 2^32, about 4.29 billion, whatever the instruction set.
Yidaki,
I came to this conclusion from reading Python newsgroups for a while now. I have never run into anything close to a 4-billion-record database myself. ;)
[edited by: Macguru at 4:35 pm (utc) on June 7, 2003]
Furthermore, when designing a new system, making the decision between 3 bytes and 4 bytes for IDs was very hard, and we settled on 3 bytes in the end (2^24, about 16.7 million IDs). You may think that is stupid, but once you've worked with a large DB you will realise every bit saved counts. And Google is a large DB indeed.
Also I'd like to point out that this has nothing to do with 32-bit limits, memory address limits or CPU bandwidth, as you're not going to enumerate the IDs in memory - heck, they're not even on the same computer, or even in the same country. You're also not doing a great deal of maths on the IDs themselves, so the CPU performance of 32-bit vs 64-bit hasn't got much to do with ID lengths. The biggest problem is that you have to store the damn things all over the place. The ID is the one bit of data that gets duplicated a lot - everywhere the documents need to be referred to, such as logs, linking info, keywords and so on. So adding 1 byte to the IDs could add 4GB*x to each and every server, potentially, and I'd see that as the real problem for an "upgrade" of such a large DB.
You might be able to hold out for a while with splitting techniques, where, say, all documents in a certain region have 4-byte IDs and you duplicate those IDs across the various "regions" - sort of like good old DOS segmentation (now perhaps you understand why THAT was done) - but it's complex and adds a lot of problems of its own.
While I've not heard of this from anywhere, from my own experience I can vouch that it can very well happen - although I expected Google to be smarter than me ;)
SN
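SN's point about ID width comes down to powers of 256: each extra byte multiplies the number of addressable documents by 256 (3 bytes is 2^24, about 16.7 million), and also adds one byte to every stored copy of every ID. A quick illustration:

```python
# Distinct IDs each field width can address, and the raw size of one
# complete set of IDs at that width. Illustrative arithmetic only.

for width_bytes in (3, 4, 5):
    capacity = 2 ** (8 * width_bytes)            # same as 256 ** width_bytes
    one_copy_gib = width_bytes * capacity / 2**30
    print(width_bytes, capacity, round(one_copy_gib, 2))
```

The jump from 4 to 5 bytes buys 256 times the ID space, which is why nobody in the thread suggests anything wider.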
Sure, it makes sense. But it is not the reason for Dominic. As others have said, changing the width of a single field is trivial; it would not have required propagation of the entire DB to all the datacenters. This could be done at each datacenter in a fraction of the time.
*IF* this was an issue (which I doubt), then Google decided to take the opportunity to change a bunch of other data to make adding future capabilities even easier.
Not when it is used in 10,000 places (by 1,000 different names) throughout a company and its partner applications. Changing a field on a database isn't where all of the work is. For example, you can change a database field from 7 to 8 digits, but if all of the screens that use that field only allow for 7 positions, the number will be truncated when it appears on the screen. Or if the temporary fields within programs that refer to that field only have 7 digits, then the number will get truncated internally somewhere.
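That truncation failure mode is easy to reproduce. A hypothetical sketch (the function and the 7-digit width are made up for illustration): a display slot sized for 7 digits silently drops the leading digit of an 8-digit number:

```python
# A fixed 7-character screen field silently truncating an 8-digit number.
# Hypothetical example of the failure mode described above.

def render_bl(bl_number, field_width=7):
    # Old screen code sized for 7 digits keeps only the last field_width chars.
    return str(bl_number)[-field_width:]

print(render_bl(9999999))    # '9999999' -- still fits
print(render_bl(10000000))   # '0000000' -- the leading 1 is silently lost
```

The database accepts the wider number fine; it is every downstream consumer with a hard-coded width that quietly corrupts it.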
In fact, I can't think of many places where people have to deal directly with document IDs, especially since it's an automated system, where IDs could be dropped and documents re-added with new IDs each update.
As I said earlier, for a system this size, storage and indexing space is a major issue. After all, IDs are needed in indexes as well as in all relations. Having an ID in an index (or several) multiplies that extra byte by some factor, and of course again by 4 billion. Now 4GB might not be an issue, but 4GB*4 or *10 might be, especially if large fractions need to be stored on each machine.
Imagine the IDs of 4 billion documents stored twice on each machine - that's 32GB (4 bytes * 2 * 4 billion). Now add 1 byte to each ID and you suddenly get another 8GB, a significant fraction of a 90GB disk. Now, IDs aren't all stored on each computer, but they are in all relations, and every record in each table will somehow be linked to a document.
SN
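The figures in that example can be checked directly (taking "4 billion" literally and using decimal gigabytes, as the post does):

```python
# Verifying the storage figures above: two copies of 4 billion IDs per machine.
IDS = 4_000_000_000
COPIES = 2
gb_4byte = 4 * COPIES * IDS / 1e9     # 4-byte IDs
gb_5byte = 5 * COPIES * IDS / 1e9     # 5-byte IDs
print(gb_4byte, gb_5byte - gb_4byte)  # 32.0 8.0
```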
That is how I see it too. But sounds like a pretty valid theory. Especially considering those facts :
1) 4,294,967,296 is the 32-bit value limit.
2) Google's home page has claimed "Searching 3,083,324,652 web pages" for a while now, plus non-standard documents.
3) Back in January, a Google founder said he hoped to expand to 10 billion records this year.
This plus some other facts leads me to conclude they simply ran out of space. I am not sure whether they hit an integer overflow during the process. Maybe they had a major glitch trying to upgrade.
As bolitto asks in msg # 5
Anyone care to sit down and think why the last crawl was LOST?
>This plus some other facts lead to conclude they simply ran out of space.
Although I suppose I can *follow* your argument that they've run out of space, I don't subscribe to it at all. Indeed, if a Google founder claims they hope to expand to 10 billion this year, it sounds like they already have a structured, planned way forward. Else why say it?
I really don't understand why everyone is suddenly panicking - this post wasn't even in the back of anyone's mind 24 hours ago, and now people are saying that it's a major crisis and there's no way forward and Google is going to collapse and and and...
Well, GoogleGuy, if you've got any sense, you'll be sitting with your feet up enjoying the ride. Here in the UK, it's about time to pour a nice glass of chilled white wine and put my feet up. I hope GoogleGuy does the same in a few hours.
Such panic - and from nowhere!
Yours, with a content and panic-free smile
DerekH