Forum Moderators: open
Now non-image URLs each have an ID stored in 4 bytes, so Google is running out of IDs for stored pages.
Once URLs returned 'not found' are no longer being deleted from the index, the total number of non-image files indexed will soon reach 4,294,967,296, including 3,083,324,652 HTML pages. After that Google will stop adding new URLs discovered in indexed pages, as well as new URLs submitted for indexing.
They are now considering a reconstruction of the data tables, which involves expanding the ID field to 5 bytes.
This will add 2 bytes for every word indexed, multiplying the total index size by about 1.17.
The procedure will require 1,000 new page index servers plus additional storage for temporary tables.
They are hoping to make this change gradually, server by server.
Completing the process will take up to one year, after which the main URL index will be switched to 5-byte IDs.
Until then, new URLs will not be indexed, except those put in place of URLs returned 'not found' and deleted from the index.
[just a guess but who knows]
[edited by: re5earcher at 12:41 pm (utc) on June 7, 2003]
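The arithmetic behind this guess is easy to check. A quick sketch - note the 12-byte posting-record size is purely an assumption, picked only because it reproduces the claimed 1.17x growth:

```python
# Back-of-envelope arithmetic for the 32-bit docID theory above.
# The 12-byte posting-record size is a guess, not a known Google figure.

MAX_IDS = 2 ** 32                   # IDs a 4-byte field can hold
print(MAX_IDS)                      # 4294967296

HTML_PAGES = 3_083_324_652          # figure from Google's home page
print(MAX_IDS - HTML_PAGES)         # 1211642644 slots left for other documents

POSTING_BYTES = 12                  # hypothetical bytes per indexed word hit
EXTRA_BYTES = 2                     # claimed extra bytes per hit after the change
growth = (POSTING_BYTES + EXTRA_BYTES) / POSTING_BYTES
print(round(growth, 2))             # 1.17
```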
Sorry... but in the search industry recent events are a BIG story. It is EXTREMELY important to a lot of people, many of whose livelihoods depend upon it.
If it's just a fuss to you, don't read the thread. Simple.
It's ancient!
>Google projected this change 5 years ago ...
Tell me the official source, please.
>This is what's happening guys, ...
Tell me the official source, please.
If you guess please flag your posts as a guess! See the flag of re5's #1 post!
[just a guess but who knows]
Yah, who knows ....
>Is a period of 2 months not too short for such a drastic change?
Ask GoogleGuy - afaik, he's the one at the google plex who has to frequently check the pages within the index - at least at the beginning of a new update. Don't know how long it takes him to count all 3,083,324,652 pages.
Yes we do, but we choose to respect his anonymous but valid status as a Google rep here on his own time - often from home - to contribute to the community. E.g.: you don't become a senior member here with his caliber of posts on a PR stunt or a lark.
>But I'm sure he could confirm or deny this.
He could, but he won't because that is inside mission critical info that the competition shouldn't have.
Ummm, it's fairly easy to increase the size of a pointer given the core is reportedly done in ML...
Sorry, I didn't post a smiley to flag my post as an obvious joke.
>Do what GoogleGuy suggested - write content, not forum posts!
Sometimes writing forum posts is more fun - especially if things are getting confusing. ;)
So? If I had some burning info about Google where do you think I would post it? Member or no member. I don't think that has anything to do with it.
It's speculation, like all stories, until we have some form of confirmation. But it is interesting nonetheless.
I'm not quite sure why this thread has disappeared from the Active List on my PC (it's still under Google News). Of course it could be a software error at WebmasterWorld, or it could be my PC, or it could be a conspiracy!
Assuming this theory to be correct, could the change to accommodate the extra data be carried out within 2 months?
In my last job as an IT project manager, I worked at a shipping company that was running out of bills of lading numbers. Over time as the company grew, it developed more shipments than the largest number its BL field could hold.
I had the project to expand the BL number. It took a large team of people and was pretty complex. We had to identify and change every screen, every hard copy report, every transaction record for every internal and external application on every piece of code in the company at every location throughout the world, and change all of the software. Vendors, customers and government agencies who exchanged data with us electronically also had to make changes in their software to match the increased BL field size. Sometimes people had lost the source code for programs so whole programs had to be rewritten or reconstructed from the load modules.
At a shipping company the bill of lading number is everywhere. Just the analysis took months, because not all the field names were standardized and every system called the BL field something slightly different, so even finding all of the places to change took a long time.
I don't know if it is true that Google is having the same type of problems, but if they are, this kind of project usually does not have a quick fix. The only interim fix is to reuse numbers as soon as records get deleted, but that only helps so much.
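The interim fix described above - reusing numbers as soon as records get deleted - amounts to a free-list allocator. A minimal sketch, purely illustrative and not anything either company actually ran:

```python
# Sketch of an ID allocator that recycles deleted IDs before minting
# new ones -- the interim fix described above. Illustrative only.

class IdAllocator:
    def __init__(self, max_id):
        self.max_id = max_id        # e.g. 2**32 - 1 for a 4-byte field
        self.next_fresh = 0         # next never-used ID
        self.free = []              # IDs handed back by deleted records

    def allocate(self):
        if self.free:               # reuse a deleted ID first
            return self.free.pop()
        if self.next_fresh > self.max_id:
            raise OverflowError("ID space exhausted")
        new_id = self.next_fresh
        self.next_fresh += 1
        return new_id

    def release(self, id_):         # called when a record is deleted
        self.free.append(id_)

ids = IdAllocator(max_id=2**32 - 1)
a = ids.allocate()
ids.release(a)
assert ids.allocate() == a          # the deleted ID is handed out again
```

As the poster notes, this only helps while deletions keep pace with new records; it buys time but does not widen the field.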
While such a revelation would certainly explain a lot - it is still inconceivable that Google would stake its very existence on such a huge mistake.
A shipping company with net worth in the billions of dollars and a huge IT department almost had to stop accepting new business because of a problem like this. Then they had smaller but somewhat similar problem with Y2k a few years later! So I can see it happening anywhere pretty easily.
[webmasterworld.com...]
I think, looking back at the original Google paper, there are some interesting quotes: "It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents." - and they call those numbers extraordinary. Quadruple that number and you have 4 billion. A safe bet in 1996, but the question is whether Google has changed the size of docID since then.
The bit sizes of the 20th century google database are here :
[www-db.stanford.edu...] - It mentions 4, 5 and 8 bits for various data.
I believe those servers still use 32-bit processors. The only way to expand the number of "records" in the database would be to use long integers, and doing that would slow the whole thing down.
I believe this theory is technically possible.
At once, I was overcome by a sick feeling welling up in the pit of my stomach - aarrghhh! - I'm going to hit my own 4 gig limit and my precious little world on my computer will end.
Or perhaps it won't - perhaps I can go over 4 gigs and still retrieve my data from my disc...
And perhaps google will be the same.
But I love these "thought virus" infections the forum keeps suffering with - almost a kind of, if I can use the word, spam content in an otherwise good forum!
Yes, what this forum needs is a good spam filter. Perhaps google have some ideas on that <grin>
DerekH
It probably is a bit complex to change the size of some fields, but not THAT complex that it has to screw up deepbot.
I know Solaris/SPARC is 64 bits wide, and I use a 128-bit-wide processor myself right here on my desk. But I am not at all sure Google uses Solaris/SPARC servers.
I always thought 4 billion was the largest value that can be held in a single word without overflow on a 32-bit chip - the unsigned range of a 32-bit word tops out at 2^32, about 4.29 billion, whatever the instruction set.
Yidaki,
I came to this conclusion from reading Python newsgroups for a while now. I have never run into anything close to a 4-billion-record database myself. ;)
[edited by: Macguru at 4:35 pm (utc) on June 7, 2003]
Furthermore, when designing a new system, making the decision between 3 bytes and 4 bytes for IDs was very hard, and we settled on 3 bytes in the end (2^24, about 16.7 million IDs). You may think that is stupid, but once you've worked with a large DB you will realise every bit saved counts. And Google is a large DB indeed.
Also I'd like to point out that this has nothing to do with 32-bit limits, memory address limits or CPU bandwidth, as you're not going to enumerate the IDs in memory - heck, they're not even on the same computer, or even in the same country. You're also not doing a great deal of maths on the IDs themselves, so the CPU performance of 32-bit vs 64-bit hasn't got much to do with ID lengths. The biggest problem is that you have to store the damn things all over the place. The ID is the one bit of data that gets duplicated a lot - everywhere the documents need to be referred to, such as logs, linking info, keywords and so on. So adding 1 byte to the IDs could add 4GB*x to each and every server, potentially, and I'd see that as the real problem for an "upgrade" of such a large DB.
You might be able to hold out for a while with splitting techniques, where, say, all documents in a certain region have 4-byte IDs and you duplicate those IDs across the various "regions" - sort of like good old DOS segmentation (now perhaps you understand why THAT was done) - but it's complex and adds a lot of problems of its own.
While I've not heard of this from anywhere, from my own experience I can vouch that it can very well happen - although I expected Google to be smarter than me ;)
SN
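SN's point about ID width comes down to powers of 256: each extra byte multiplies the number of addressable documents by 256 (3 bytes is 2^24, about 16.7 million), and also adds one byte to every stored copy of every ID. A quick illustration:

```python
# Distinct IDs each field width can address, and the raw size of one
# complete set of IDs at that width. Illustrative arithmetic only.

for width_bytes in (3, 4, 5):
    capacity = 2 ** (8 * width_bytes)            # same as 256 ** width_bytes
    one_copy_gib = width_bytes * capacity / 2**30
    print(width_bytes, capacity, round(one_copy_gib, 2))
```

The jump from 4 to 5 bytes buys 256 times the ID space, which is why nobody in the thread suggests anything wider.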
Sure, it makes sense. But it is not the reason for Dominic. As others have said, changing the width of a single field is trivial; it would not have required propagation of the entire DB to all the datacenters. This could be done at each datacenter in a fraction of the time.
*IF* this was an issue (which I doubt), then Google decided to take the opportunity to change a bunch of other data to make adding future capabilities even easier.
Not when it is used in 10,000 places (by 1,000 different names) throughout a company and its partner applications. Changing a field on a database isn't where all of the work is. For example, you can change a database field from 7 to 8 digits, but if all of the screens that use that field only allow for 7 positions, the number will be truncated when it appears on the screen. Or if the temporary fields within programs that refer to that field only have 7 digits, then the number will get truncated internally somewhere.
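That truncation failure mode is easy to reproduce. A hypothetical sketch (the function and the 7-digit width are made up for illustration): a display slot sized for 7 digits silently drops the leading digit of an 8-digit number:

```python
# A fixed 7-character screen field silently truncating an 8-digit number.
# Hypothetical example of the failure mode described above.

def render_bl(bl_number, field_width=7):
    # Old screen code sized for 7 digits keeps only the last field_width chars.
    return str(bl_number)[-field_width:]

print(render_bl(9999999))    # '9999999' -- still fits
print(render_bl(10000000))   # '0000000' -- the leading 1 is silently lost
```

The database accepts the wider number fine; it is every downstream consumer with a hard-coded width that quietly corrupts it.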
In fact, I can't think of many places where people have to deal directly with document IDs, especially since it's an automated system, where IDs could be dropped and documents re-added with new IDs each update.
As I said earlier, for a system this size, storage and indexing space is a major issue. After all, IDs are needed in indexes as well as in all relations. Having an ID in an index (or several) multiplies that extra byte by some factor, and of course again by 4 billion. Now 4GB might not be an issue, but 4GB*4 or *10 might be, especially if large fractions need to be stored on each machine.
Imagine the IDs of 4 billion documents stored twice on each machine - that's 32GB (4 bytes * 2 * 4 billion). Now add 1 byte to each ID and you suddenly get another 8GB, a significant fraction of a 90GB disk. Now, IDs aren't all stored on each computer, but they are in all relations, and every record in each table will somehow be linked to a document.
SN
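The figures in that example can be checked directly (taking "4 billion" literally and using decimal gigabytes, as the post does):

```python
# Verifying the storage figures above: two copies of 4 billion IDs per machine.
IDS = 4_000_000_000
COPIES = 2
gb_4byte = 4 * COPIES * IDS / 1e9     # 4-byte IDs
gb_5byte = 5 * COPIES * IDS / 1e9     # 5-byte IDs
print(gb_4byte, gb_5byte - gb_4byte)  # 32.0 8.0
```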
That is how I see it too. But sounds like a pretty valid theory. Especially considering those facts :
1) 4,294,967,296 is the 32-bit value limit.
2) Google's home page has claimed "Searching 3,083,324,652 web pages" for a while now, plus non-standard documents.
3) Back in January, a Google founder said he hoped to expand to 10 billion records this year.
This plus some other facts leads me to conclude they simply ran out of space. I am not sure whether they hit an integer overflow during the process. Maybe they had a major glitch trying to upgrade.
As bolitto asks in msg # 5
Anyone care to sit down and think why the last crawl was LOST?
>This plus some other facts lead to conclude they simply ran out of space.
Although I suppose I can *follow* your argument that they've run out of space, I don't subscribe to it at all. Indeed, if a Google founder claims they hope to expand to 10 billion this year, it sounds like they already have a structured, planned way forward. Else why say it?
I really don't understand why everyone is suddenly panicking - this post wasn't even in the back of anyone's mind 24 hours ago, and now people are saying that it's a major crisis and there's no way forward and Google is going to collapse and and and...
Well, GoogleGuy, if you've got any sense, you'll be sitting with your feet up enjoying the ride. Here in the UK, it's about time to pour a nice glass of chilled white wine and put my feet up. I hope GoogleGuy does the same in a few hours.
Such panic - and from nowhere!
Yours, with a content and panic-free smile
DerekH