


I think Google reached its ID capacity limit?

     
12:04 pm on Jun 7, 2003 (gmt 0)

New User

10+ Year Member

joined:June 7, 2003
posts:5
votes: 0


Google has reached its data indexing capacity of 4,294,967,296 (2^32) URLs.

Non-image URLs each have an ID stored in 4 bytes, so Google is now running out of IDs for stored pages.

Once there are no more URLs returned 'not found' to delete from the index, the total number of non-image files indexed will reach 4,294,967,296 (including 3,083,324,652 HTML pages); after that, Google will stop adding new URLs found on indexed pages as well as new URLs submitted for indexing.

They are now considering a reconstruction of the data tables that involves expanding the ID fields to 5 bytes.

This would add 2 bytes per word indexed, multiplying the total index size by roughly 1.17.

This procedure will require 1000 new page index servers and additional storage for temporary tables.

They are hoping to make this change gradually, server by server.

Completing the process will take up to one year; after that, the main URL index will be switched to the 5-byte IDs.

Until then, new URLs will not be indexed, except those that take the place of URLs returned 'not found' and deleted from the index.

[just a guess but who knows]

[edited by: re5earcher at 12:41 pm (utc) on June 7, 2003]
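(Editor's note: for anyone who wants to check the arithmetic behind the 2^32 figure, here is a minimal, purely illustrative sketch in Python. The function name and the assumption of a plain unsigned integer ID are the editor's own; nothing here describes Google's actual data structures.)

```python
# Purely illustrative: the capacity of fixed-width document IDs.
# Assumes a plain unsigned integer ID of the stated width; nothing here
# reflects Google's actual storage format.

def max_ids(num_bytes: int) -> int:
    """Number of distinct IDs an unsigned field of num_bytes can hold."""
    return 2 ** (8 * num_bytes)

four_byte_cap = max_ids(4)      # 4,294,967,296 -- the 2^32 ceiling in the post
five_byte_cap = max_ids(5)      # 1,099,511,627,776 (2^40), i.e. 256x more room
claimed_pages = 3_083_324_652   # page count quoted from Google's homepage

print(f"4-byte IDs:  {four_byte_cap:,}")
print(f"5-byte IDs:  {five_byte_cap:,}")
print(f"Headroom under 4-byte IDs: {four_byte_cap - claimed_pages:,}")
```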

12:09 pm on June 7, 2003 (gmt 0)

New User

10+ Year Member

joined:June 5, 2003
posts:10
votes: 0


I'm new around here. Who are internal sources?
12:11 pm on June 7, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 5, 2003
posts:380
votes: 0


That 3,083,324,652 number has been like that for ages. And since then, new pages have been added.

So you just got that number off the homepage?

12:16 pm on June 7, 2003 (gmt 0)

Senior Member from MY 

WebmasterWorld Senior Member vincevincevince is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Apr 1, 2003
posts:4847
votes: 0


if what you say is true it's very scary. i guess people will start "ID hogging" by creating vast numbers of low-content placeholder pages so that future content additions will be indexed. i wonder what GoogleGuy will say about this?
12:18 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member macguru is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Dec 30, 2000
posts:3300
votes: 0


Welcome to the board re5earcher!

It's been weeks since we've had any REAL Google news around here.

Thanks!

>>Who are internal sources?

Someone who does not want to be identified at Google, I guess...

12:20 pm on June 7, 2003 (gmt 0)

Senior Member from MY 

WebmasterWorld Senior Member vincevincevince is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Apr 1, 2003
posts:4847
votes: 0


sorry to cast doubt on this... but surely google could have predicted index growth rates a year ago and would have started this upgrade then? the rate of indexing new pages is surely something they can control?
12:27 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member chiyo is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 21, 2000
posts:3170
votes: 0


i just assumed this post was a weak joke.
12:49 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 3, 2003
posts:1633
votes: 0


Well, if they get rid of the (2^28) 404 pages currently indexed, there would be plenty of numbers left...

:)

12:50 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 3, 2001
posts:1609
votes: 0


If google lost a billion pages, who would notice? You can't get past 1,000 results anyway.
12:58 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 25, 2003
posts:970
votes: 0


i just assumed this post was a weak joke.

Is it not? :)

1:01 pm on June 7, 2003 (gmt 0)

New User

10+ Year Member

joined:Aug 13, 2002
posts:10
votes: 0


I'm not much of a DB programmer, so here are my thoughts on the posted figures.

A 5 byte id field would give 1048575 unique id's, which as far as I know means that only 1048575 urls could be indexed using such a system.

The current index has 3,083,324,652 pages (urls) which requires 8 bytes to store all the unique ids. This means there is still room for 1,211,642,643 urls to be indexed before Google has to increase the number of bytes used to store unique ids.

This is assuming the ids are stored in simple binary form.

Perhaps a DB pro can provide us with more exact figures.
Whoops, sorry, mixed up my bits and bytes :(

[edited by: bridge98 at 1:20 pm (utc) on June 7, 2003]

1:13 pm on June 7, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:June 6, 2003
posts:67
votes: 0


>A 5 byte id field would give 1048575 unique id's.

A 5-byte field would yield 2^40 possibilities; simplifying, that's approximately 4 billion times 256.

>The current index has 3,083,324,652 pages (urls) which requires 8 bytes to store all the unique ids

8 bytes would yield 2^64 possibilities, which would be 4 billion times 4 billion.

>This is assuming the ids are stored in simple binary form.

Until we have working fuzzy-logic computers, binary form is the only way to store anything.

>i just assumed this post was a weak joke.

Nope! I have heard a similar rumour from a reliable source.

Anyone care to sit down and think about why the last crawl was LOST?

Because it overflowed 4 billion (the 32-bit computing limit for an integer) and very relevant pages were left off.

Sounds like a horde of PhDs can make very trivial mistakes as well... anyway, there are more PhDs in Microsoft's reception hall than in all of Google. That doesn't necessarily mean good products...

[edited by: bolitto at 1:16 pm (utc) on June 7, 2003]
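(Editor's note: the "overflow" speculated about above is a fixed-width counter wrapping around once it passes 2^32. The toy snippet below, written by the editor, only demonstrates that wraparound with an explicit 32-bit mask; it makes no claim about how Google's crawler actually assigned IDs.)

```python
# Toy demonstration of 32-bit unsigned wraparound, the failure mode being
# speculated about above. Python ints are unbounded, so we mask to 32 bits
# to mimic a fixed-width counter.

MASK_32 = 0xFFFFFFFF  # 2^32 - 1, the largest 4-byte unsigned value

def next_id(current: int) -> int:
    """Increment a 4-byte unsigned counter, wrapping silently at 2^32."""
    return (current + 1) & MASK_32

last_valid = 2**32 - 1      # 4,294,967,295
print(next_id(last_valid))  # 0 -- the counter wraps back to zero
```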

1:14 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 24, 2002
posts:1130
votes: 0


A 5 byte id field would give 1048575 unique id's.

Perhaps a DB pro can provide us with more exact figures.

I'm not a 'DB pro', but I don't understand what the number 1048575 has to do with 5 bytes. With 5 bytes you have 5x8 = 40 bits. That means 2^40 = 1,099,511,627,776 different IDs.
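(Editor's note: a quick check of both figures. 1048575 is 2^20 - 1, the largest value that fits in five hex digits, which may be where the earlier mix-up came from; five bytes really do give 2^40 distinct values, as stated above. Purely a verification snippet.)

```python
# Quick sanity check of the two numbers discussed above.
print(2**20 - 1)     # 1048575 -- largest value in five hex *digits* (20 bits)
print(hex(1048575))  # 0xfffff -- five hex digits
print(2**40)         # 1099511627776 -- distinct values in five *bytes* (40 bits)
```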

1:18 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 3, 2003
posts:1633
votes: 0


There are 10 types of person in this world. Those who understand binary, and those who don't.
1:20 pm on June 7, 2003 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38251
votes: 111


No offense, but where is this info coming from?
1:26 pm on June 7, 2003 (gmt 0)

Full Member

10+ Year Member

joined:Apr 24, 2003
posts:216
votes: 0


This sounds like someone's idea of a hoax. I doubt it will get any mileage.
1:26 pm on June 7, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 5, 2003
posts:380
votes: 0


Brett, we don't know who GoogleGuy is either. But I'm sure he could confirm or deny this.
1:29 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member macguru is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Dec 30, 2000
posts:3300
votes: 0


>>But I'm sure he could confirm or deny this.

eaden, I see you are expecting a lot from Google's public relations.

1:32 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 3, 2001
posts:1609
votes: 0


>There are 10 types of person in this world. Those who understand binary, and those who don't.

Okay, I'll admit it, I don't. What I would like to know is: how do you get 3 billion of anything on an 80 gig hard drive?

1:37 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 24, 2002
posts:1130
votes: 0


.. ATW stopped showing their current page counts a while back

The last update of the ATW counter was only 2 months ago (March 1st), from 2,112,188,990 to 2,142,833,819. But maybe they use a signed long (2^31 = 2,147,483,648).
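(Editor's note: the signed/unsigned distinction mentioned here comes down to one bit: a 4-byte signed integer tops out at 2^31 - 1, an unsigned one at 2^32 - 1. A tiny illustration of the headroom that would leave ATW, assuming, as the poster does, a signed 32-bit counter; this says nothing about AlltheWeb's real implementation.)

```python
# Ceilings for 4-byte counters, signed vs. unsigned.
signed_max = 2**31 - 1     # 2,147,483,647
unsigned_max = 2**32 - 1   # 4,294,967,295

atw_count = 2_142_833_819  # last counter value quoted in the post above
print(signed_max - atw_count)  # 4,649,828 IDs of headroom if the counter is signed
```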

1:38 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member macguru is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Dec 30, 2000
posts:3300
votes: 0


>>How do you get 3 billion of anything on an 80 gig hard drive?

john316, the last thing I heard is that the Google database was stored in RAM. Doesn't that make you drool?

1:39 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 8, 2002
posts:2015
votes: 0


>where is this info coming from?

Right from the guys at data center ncc 1701!

However, everybody who works with large databases that have to be scalable knows that one of the first things to lay out is an architecture that avoids ID-length overflow (using combined ID indexes, etc.).

Although some people say the PhDs at Google have the IQ of monkeys, i'm convinced they don't. So this "news" is a funny joke and a blind guess (didn't re5 even mention that?!) - nothing more.

Yeah, re5, give 'em some bread crumbs and let 'em discuss what bread it was .. he, he, he ... :)

1:48 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 2, 2003
posts:724
votes: 0



On November 6, 2002, Google posted '3,083,324,652 web pages' on its site. If there is any truth to this post, Google has had more than 7 months' warning to introduce stronger spam filters, 404 eliminations, etc. - as it concurrently developed a long-term solution.

While such a revelation would certainly explain a lot - it is still inconceivable that Google would stake its very existence on such a huge mistake.

1:50 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 25, 2001
posts:661
votes: 1


Re5earcher is hiding outside, waiting for the kerboom.

You know - you fire up the Illudium Q-36 Explosive Space Modulator - run outside, put your hands over your ears - and wait for the Kerboom.

25 responses in 1.5 hours is a pretty good kerboom re5earcher - you can come back in now....

1:51 pm on June 7, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Apr 18, 2003
posts:618
votes: 0


takagi: I believe from reading your earlier posts that you have all the publicly available data for pages indexed over time. What do you think about the projected number of current pages for G from that data? (I was not paying much attention to ATW - looks like they are still okay.)

mat

1:52 pm on June 7, 2003 (gmt 0)

Preferred Member from IT 

10+ Year Member

joined:Apr 5, 2002
posts:633
votes: 0


Ah, but be sure to leave a good breadcrumb trail to follow when you come back in, else those durn acid enemas and PIR flash-bangs'll get ya.
2:00 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 8, 2002
posts:2015
votes: 0


>I've created a few ram disks on the mac, and watched apps fly...very impressive.

Oh yes! john, i've been dreaming for a long time of running my databases right from a RAM disc. Unfortunately, i'd need a 12 gig RAM disc ... and a Mac OS that could handle it. :(

2:08 pm on June 7, 2003 (gmt 0)

Senior Member

joined:Nov 20, 2000
posts:1336
votes: 0


The problem is that there is a problem. We haven't had an update in yonks, and are running on a core winter dbase in summer. There is still no sign that an update is round the corner, and Deepbot hasn't been seen in months.

Consequently, sooner or later, speculation was bound to start. Questions have been ongoing for weeks (we've all seen the 'Google is broke' threads).

No, I have no idea whether this story is correct or not. The problem is that against a background which gives every indication of problems, an allegation of a specific nature cannot be easily dismissed, especially when it sounds tenable and could explain some of the current non-happenings.

Yes, Google's or GoogleGuy's response would be very interesting.

2:12 pm on June 7, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:June 6, 2003
posts:67
votes: 0


It was probably planned from the very start, and this is the right time to move past the 32-bit limit.

Google projected this change 5 years ago, but now is the time to do it, and while it happens the index is a mess...

What's the big fuss about... Google is changing, and it's happening on every production server for the public to see, because it HAD to happen on the production servers some day.

Yeah, they lost the last crawl because they didn't expect it to overflow now, so NOW is the time to upgrade to a system they have already been testing for a long time.

This is what's happening, guys; what's all the emotional stuff about?

2:19 pm on June 7, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 23, 2003
posts:801
votes: 0


Although allanp73 said that April Fool's Day was ages ago, I still think it's the greatest joke I've heard in a long time!

Delaying the announcement of the hoax for as long as the Dominic Update has overrun is a stylish addition...

Nicely done - almost a British sense of humour there!

Regards from Britain...
DerekH

This 128-message thread spans 5 pages.