Forum Moderators: open
Now non-image URLs have an ID stored in 4 bytes, so Google is running out of IDs for stored pages.
Once there are no more URLs coming back 'not found' to be deleted from the index, the total number of non-image files indexed (including the 3,083,324,652 HTML pages) will soon reach 4,294,967,296. After that, Google will stop adding new URLs found in indexed pages, as well as new URLs submitted for indexing.
They are now considering a reconstruction of the data tables, which involves expanding the ID field to 5 bytes.
This will add 2 bytes for every word indexed, multiplying the total index size by about 1.17.
The procedure will require 1,000 new page-index servers and additional storage for temporary tables.
They are hoping to make the change gradually, server by server.
Completing the process will take up to one year, after which the main URL index will be switched to 5-byte IDs.
Until then, new URLs will not be indexed, except those that take the place of URLs returned 'not found' and deleted from the index.
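A rough sanity check on those numbers (a throwaway Python sketch; the page counts are just the figures quoted above, nothing confirmed):

    # Capacity of a 4-byte (32-bit) unsigned ID field.
    max_ids = 2 ** (8 * 4)        # 4,294,967,296
    html_pages = 3083324652       # HTML page count quoted above
    print("4-byte ID space:", max_ids)
    print("IDs left before overflow:", max_ids - html_pages)  # 1,211,642,644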
[just a guess but who knows]
[edited by: re5earcher at 12:41 pm (utc) on June 7, 2003]
A 5-byte ID field would give 1,048,575 unique IDs, which as far as I know means that only 1,048,575 URLs could be indexed using such a system.
The current index has 3,083,324,652 pages (URLs), which requires 8 bytes to store all the unique IDs. That means there is still room for 1,211,642,643 URLs to be indexed before Google has to increase the number of bytes used to store unique IDs.
This is assuming the ids are stored in simple binary form.
Perhaps a DB pro can provide us with more exact figures.
Whoops, sorry mixed up my bits and bytes :(
[edited by: bridge98 at 1:20 pm (utc) on June 7, 2003]
A 5-byte field would yield 2^40 possibilities; simplified, that means approx. 4 billion times 256.
>The current index has 3,083,324,652 pages (URLs), which requires 8 bytes to store all the unique IDs
8 bytes would yield 2^64 possibilities, which would be 4 billion times 4 billion.
>This is assuming the ids are stored in simple binary form.
Until we have working fuzzy-logic computers, binary form is the only way to store anything.
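To put concrete numbers on the byte widths being discussed (a minimal Python sketch, nothing Google-specific):

    # Unique IDs representable by an unsigned field of n bytes: 2 ** (8 * n)
    for n_bytes in (4, 5, 8):
        print(n_bytes, "bytes ->", 2 ** (8 * n_bytes), "unique IDs")

    # 4 bytes -> 4294967296            (2^32, ~4 billion)
    # 5 bytes -> 1099511627776         (2^40, ~4 billion * 256)
    # 8 bytes -> 18446744073709551616  (2^64, ~4 billion * 4 billion)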
>i just assumed this post was a weak joke.
Nope! I have heard a similar rumour from a reliable source.
Anyone care to sit down and think why the last crawl was LOST?
Because it overflowed 4 billion (the 32-bit limit for an unsigned integer), and very relevant pages were left out.
Sounds like a horde of PhDs can make very trivial mistakes as well... anyway, there are more PhDs in Microsoft's reception hall than in all of Google. That doesn't necessarily mean good products...
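For what it's worth, here is what "overflowing 4 billion" looks like: a toy Python sketch of a 32-bit counter wrapping (purely illustrative, since nobody outside Google knows how their IDs are actually assigned):

    MASK_32 = 0xFFFFFFFF  # largest value a 4-byte unsigned field can hold

    def next_id(current):
        """Increment a 32-bit unsigned counter, wrapping silently on overflow."""
        return (current + 1) & MASK_32

    print(next_id(4294967295))  # prints 0 -- the counter wraps back to zero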
[edited by: bolitto at 1:16 pm (utc) on June 7, 2003]
Right from the guys at data center NCC-1701!
However, everybody who works with large databases that have to be scalable knows that one of the first things to lay out is an architecture that avoids ID-length overflow (using combined ID indexes, etc.).
Although some people say the PhDs at Google have the IQ of monkeys, I'm convinced they don't. So this "news" is a funny joke and a blind guess (didn't re5 even say as much?!), nothing more.
Yeah, re5, give 'em some breadcrumbs and let 'em discuss what bread it was... he, he, he... :)
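The "combined ID indexes" point deserves a number, though: a composite key (say, a shard ID plus a per-shard counter; the names here are mine, not Google's) postpones overflow without widening any single field. A hypothetical Python sketch:

    def make_doc_id(shard, local_id):
        """Composite key: 16-bit shard ID plus 32-bit per-shard counter."""
        assert 0 <= shard < 2 ** 16 and 0 <= local_id < 2 ** 32
        return (shard, local_id)

    # 2^16 shards * 2^32 IDs each = 2^48 addressable documents in total.
    print(2 ** 16 * 2 ** 32)  # 281474976710656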
While such a revelation would certainly explain a lot, it is still inconceivable that Google would stake its very existence on such a huge mistake.
Consequently, sooner or later, speculation was bound to start. Questions have been ongoing for weeks (we've all seen the 'Google is broke' threads).
No, I have no idea whether this story is correct or not. The problem is that against a background which gives every indication of problems, an allegation of a specific nature cannot be easily dismissed, especially when it sounds tenable and could explain some of the current non-happenings.
Yes, Google's or GoogleGuy's response would be very interesting.
Google projected this change five years ago, but now is the time to do it, and while it happens the index is a mess...
What's the big fuss about? Google is changing, and it's happening on every production server for the public to see, because it HAD to happen on the production servers some day.
Yeah, they lost the last crawl because they didn't expect it to overflow yet, so NOW is the time to upgrade to a system they have already been testing for a long time.
This is what's happening, guys; what's all the emotional stuff about?
Delaying the announcement of the hoax for as long as the Dominic Update has overrun is a stylish addition...
Nicely done - almost a British sense of humour there!
Regards from Britain...
DerekH