Forum Moderators: open


I think google reached its ID capacity limit?

         

re5earcher

12:04 pm on Jun 7, 2003 (gmt 0)

10+ Year Member



Google has reached its data-indexing capacity of 4,294,967,296 (2^32) URLs.

Each non-image URL has an ID stored in 4 bytes, so Google is now running out of IDs for stored pages.

Once there are no more URLs being returned 'not found' and deleted from the index, the total number of non-image files indexed will soon reach 4,294,967,296 (including 3,083,324,652 HTML pages). After that, Google will stop adding new URLs found in indexed pages, as well as new URLs submitted for indexing.

They are now considering a reconstruction of the data tables that involves expanding the ID field to 5 bytes.

This would result in an additional 2 bytes per word indexed, multiplying the total index size by about 1.17.

This procedure would require 1,000 new page-index servers plus additional storage for temporary tables.

They are hoping to make this change gradually, server by server.

Completing the process will take up to one year; after that, the main URL index will be switched to the 5-byte IDs.

Until then, new URLs will not be indexed, except those put in place of URLs returned 'not found' and deleted from the index.
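The ID-space numbers in the post are easy to check; here's a quick sketch of the arithmetic (the 1.17 growth factor is the poster's own estimate, not something derived here):

```python
# ID space of a 4-byte (32-bit) unsigned field vs. a 5-byte (40-bit) one.
four_byte_ids = 2 ** 32
five_byte_ids = 2 ** 40

print(four_byte_ids)  # 4294967296 -- the 4,294,967,296 limit quoted above
print(five_byte_ids)  # 1099511627776 -- 256x more headroom

# Page count claimed on Google's home page at the time, per the thread:
claimed_pages = 3_083_324_652
print(four_byte_ids - claimed_pages)  # 1211642644 IDs left for everything else
```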

[just a guess but who knows]

[edited by: re5earcher at 12:41 pm (utc) on June 7, 2003]

killroy

7:01 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not infinity, just "what we need plus one". I once designed a table with IDs 0-9999, since there were 15,000 possible candidate records, of which I figured 5,000 would drop out.

On large systems you cannot afford to "leave plenty of space just in case"; you have to design tight. I mean, Google is something that 10 years ago lots of folks would've said can't be done (like what folks say about instant PR today).

Also, if such a limit is reached while nobody's watching, it can REALLY **** things up, such as losing or corrupting an update. Furthermore, a Google update is not like a progress bar that somebody watches. It's a huge distributed operation happening all over the place that no single person can oversee or know the state of, so missing something in the update before the last one can easily happen. Some folks seem to think that because Google created a big bad system, they must be infallible... in fact, all practical big bad systems in operation today are majorly flawed. Such is the nature of today's IT ;)

SN

PS: again, not saying this has happened, just pointing out that it could've easily, more than most people believe.

IITian

7:10 pm on Jun 7, 2003 (gmt 0)

10+ Year Member



killroy:

My guess is that Google knew about this problem a long time ago. Their page count stopped around 3 billion. You are guessing that the last update could have been a bad one and they could have lost data; my guess is that it could have been a "fake" one! ;)

Google was working on its problems, but the webmasters were waiting for the update, so it decided to run a fake Dominic crawl. This kept people from getting impatient and spreading rumors about it. Now, as is true with most large systems, their problems didn't get resolved in time, and they are not even pretending! Just speculation. :)

Macguru

7:14 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



DerekH,

I am certainly not part of the panicking crowd. I find this thread absolutely fascinating.
I believe it is a very logical theory about the Dominic brouhaha. I consider it to have a less negative impact than other things we could imagine.

>>just pointing out that it could've easily, more than most people believe

He he! reminds me of what happened to Altavista a few years back.

eaden

7:21 pm on Jun 7, 2003 (gmt 0)

10+ Year Member



But it sounds like a pretty valid theory, especially considering these facts:

1) 4,294,967,296 is the 32-bit value limit.

2) Google's home page has claimed "Searching 3,083,324,652 web pages" for a while now, plus non-standard documents.

3) Back in January, a Google founder said he hoped to expand to 10 billion records this year.


and these:

* Freshbot doesn't go through the usual deepbot indexing process and may not generate extra docIDs, so this would also explain why "freshie has been crawling deeply in addition to normal freshie duties."

* Why this update is SO much work for Google, and going to take so long.

* Talk from GG about a "worst-case" backup.

Also, I'd like to point out that originally (before editing) the poster of the first message in this thread wrote "from inside sources" rather than "just a guess".

And I'm not saying this is an unexpected problem for google, but it may just be what this update is about.

Macguru

7:25 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>talk from GG about "worst-case" backup

Oops, I missed that one! Do you remember if it is in the "understanding Dominic" thread?

killroy

7:32 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hehe no panic at all, just a discussion of a very real issue of a large scale database. Not Google in particular, but using google as an example.

No more, no less, for me at least. I just love this stuff ;) And of course the idea that the great Google might suffer the same troubles as I once did. So I'm not exactly unbiased either ;)

SN

eaden

7:36 pm on Jun 7, 2003 (gmt 0)

10+ Year Member




>>talk from GG about "worst-case" backup
Oops, I missed that one! Do you remember if it is in the "understanding Dominic" thread?

In the Is Freshbot now Deepbot? thread
[webmasterworld.com...]

If I were doing what it's suggested Google is doing, I would be talking just like GG in that thread! "Cross verification, worst-case backup, cautious about the update, switch to as a safety net..."

DerekH

7:39 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, so you're still having a panic while I sip my white wine and let the veggie chili bubble away...

How about a different hypothesis.
Not "Dominic has gone wrong"
and
Not "They've hit 4 gigs and it's broken."

Tell you what, I'll start a *different* rumour...
The REASON Dominic has taken so long is that they've just got rid of the 4 gig limit.
After all, the database *is* a different format, everyone's already said that.

So my theory is...
The 4 gig limit isn't a problem, Dominic was *specifically* brought in to solve it.

Back to my wine while you prove me wrong! <wink>
Regards
DerekH

Macguru

7:48 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks a lot eaden!

>>Back to my wine while you prove me wrong! <wink>

Cheers!

DerekH, if it was brought in to solve that, the least I can say is it wasn't exactly a seamless solution. <wink><wink>

Have to go now, the Gizmo Quiz in 10 minutes...

killroy

7:52 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



DerekH> That's pretty much what we've been trying to say... nobody suggested Google was caught with their pants down ;)

SN

GoogleGuy

7:57 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I could have a fun time with this one. Instead I'll just say nah. I like Giacomo's suggestion best: maybe a separate forum for SE-fiction. We could put the conspiracy theories there too. :)

Sally Stitts

8:00 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Please check this out:
[geocities.com...]

Thanks,
Sally

<--EDIT:
Eaden, Right! I don't know why, but I get the same thing.
The link is correct.
I found that it works if you cut and paste it to your browser. Must be some kind of WebmasterWorld, or maybe SBC-Yahoo glitch.-->

[edited by: Sally_Stitts at 8:29 pm (utc) on June 7, 2003]

eaden

8:02 pm on Jun 7, 2003 (gmt 0)

10+ Year Member



Sally, that link doesn't work now, have a mirror?

GG, there has been a lot said in this thread, so 'nah' could refer to anything...

I take it from your reply that the number of bytes in a page-index field hasn't changed from 4 to 5 in the last 6 months or so, and that nothing of this sort is happening.

[edited by: eaden at 8:03 pm (utc) on June 7, 2003]

killroy

8:03 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm not paranoid! Google IS out to lower my ranking!

And just cos GoogleGuy says it ain't so doesn't mean he isn't in on their scheme!

PS: how wide ARE your ID fields? 5 bytes? 8?

SN

IITian

8:07 pm on Jun 7, 2003 (gmt 0)

10+ Year Member



Killroy: 5 or 8?
Answer is very obvious. Just look at who started this thread!

GoogleGuy

8:54 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I was answering the title of this thread, eaden. Did anyone catch the IP address of that masked re5earcher? ;) (just kidding)

DerekH

9:00 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



GoogleGuy wrote
"I was answering the title of this thread, eaden. Did anyone catch the IP address of that masked re5earcher? ;) (just kidding) "

No more masked than others on this list, Mr G :-)

Kackle

10:51 pm on Jun 7, 2003 (gmt 0)



Well, I rather like the unsigned integer rollover theory at 4.2 billion for a four-byte ID. It explains what I saw on the April 11 update. About 15,000 out of my 50,000+ pages were suddenly turned into URL-only links, meaning that Google knew about them (and yes, they were crawled too), but didn't put them in the index. They had been in the index reliably for almost two years. It was bizarre and this is the first glimmer of an explanation I've seen.

Maybe these 15,000 pages lost their ID numbers when other pages were written over the same IDs, and that's why they ended up as URLs that Google could display, but not reference in its index.
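The overwrite idea can be illustrated in miniature. This is a hypothetical sketch of an allocator whose counter gets truncated to a 4-byte field, not a description of Google's actual scheme:

```python
MASK = 0xFFFFFFFF  # what survives storage in a 4-byte unsigned field

def stored_doc_id(counter: int) -> int:
    """Hypothetical: the ID actually written to the 32-bit field."""
    return counter & MASK

old_page = stored_doc_id(15_000)             # allocated long ago
new_page = stored_doc_id(2**32 + 15_000)     # allocated after the counter rolled over

# The two distinct documents now collide on the same stored ID,
# so the newer record can silently overwrite the older one.
assert old_page == new_page == 15_000
```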

Anyone else see weird stuff on April 11?

Yidaki

11:23 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Did anyone catch the IP address of that masked re5earcher?

ROFL! Welcome at ncc GGuy! :)

Polarisman

11:41 pm on Jun 7, 2003 (gmt 0)

10+ Year Member



I for one am pleased to say that G had enough room for our new (three weeks old) site before they ran out of room.

Not only were they kind enough to squeeze us in, they even managed to give us many fine placements in the SERPS for our keywords.

The SEO game brings new meaning to the phrase "survival of the fittest". Darwin would be proud.

bolitto

4:51 am on Jun 8, 2003 (gmt 0)

10+ Year Member



Guys I've had a sudden change of heart.

Have a look at this

bvte6I7cVAZy

It's a URL ID; from looking at the IDs generated to speed up the cache lookup, it's 12 alphanumeric characters. It is shown for the cache of any page on a SERP.

26 lower-case characters (correct?), 26 upper-case characters, and 10 digits make 62 possible characters in each of the 12 positions, so the total number of combinations would, in theory, be 62^12: more than enough for our measly 4 billion pages.

Well, so much for the rumors I've heard; I guess our theory dies here.
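For reference, the key-space comparison works out as claimed, assuming the cache key really is 12 case-sensitive alphanumeric characters:

```python
import string

# 26 lower-case + 26 upper-case + 10 digits = 62 symbols per position
alphabet = string.ascii_letters + string.digits
assert len(alphabet) == 62

key_space = 62 ** 12      # possible 12-character keys
doc_id_space = 2 ** 32    # possible 4-byte docIDs

print(key_space)          # 3226266762397899821056, vastly more than 4.3 billion
assert key_space > doc_id_space
```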

Chris_D

4:56 am on Jun 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Sally_Stitts,

If you do this search in Google, you'll find the mirror of the site you are looking for. Or at least you'll be able to read the Google cache in html...

"c-6.powers_of_2.4.pdf"

See - Google's not broken for the general public - it's just 'under construction' - and the major impact is on sites that have been added, added to, or changed in the past 9 weeks or so. And if those sites were 'news sites', they aren't affected. So it's a small % of the 3 billion.

BTW - try the same search to find Sally's mirror at AllOverFAST...

Don't you all think its unusual that re5earcher hasn't reposted at the scene of the crime? As I said about 4 pages back - he's hiding outside (with Marvin the Martian) - waiting for the 'kerboom'.

Chris_D

bolitto

5:05 am on Jun 8, 2003 (gmt 0)

10+ Year Member



Just as a side note: IF my theory is right, would 62^12 be larger than a googol?

IITian

5:47 am on Jun 8, 2003 (gmt 0)

10+ Year Member



...would 62^12 be larger than a Googol?

Since 62 is less than 100, 62^12 is less than 100^12 = (10^2)^12 = 10^24 which is less than 10^100 = 1 googol.
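The bound is easy to verify directly; a one-liner under the same definition (googol = 10^100):

```python
googol = 10 ** 100

# 62 < 100, so 62^12 < 100^12 = 10^24, which is far below 10^100.
assert 62 ** 12 < 100 ** 12 == 10 ** 24 < googol
print(len(str(62 ** 12)))  # 22 digits, nowhere near a googol's 101
```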

Brett_Tabke

11:47 am on Jun 11, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



>>As others have said, changing the width of a single field is trivial.

It may be, but it would have dramatic consequences on a DB of this size, something users would have noticed right away.

Where's this "python" stuff coming from? The only thing public that I've heard is that the core is pure ML, with the ranking routines in C on top, glued here and there with the internet duct tape (Perl).

> How do you get 3 billion of anything on an
> 80 gig hard drive?

That is probably the most important statement in the whole thread. The index can be expanded, but you can only cram so much of it onto a disk that size. Reportedly, that is the size of G's first 10k boxes.
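The quoted question has a quick arithmetic answer: spread across one 80 GB disk, 3 billion documents leave only a few dozen bytes apiece, so a single box can hold at most a shard of a compressed index and pointers, never the pages themselves:

```python
disk_bytes = 80 * 10**9   # one 80 GB drive
documents = 3 * 10**9     # the ~3 billion pages on the home page counter

bytes_per_doc = disk_bytes / documents
print(bytes_per_doc)  # ~26.7 bytes of budget per document on a single box
```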

> john316, last thing I have heard is that Google
> database was stored in RAM.

Not the db, just the index file (key pointer file).

> Is a period of 2 months not too short for
> such a drastic change?

Exactly - could be why the April crawl was tossed.

Then again, a switch from monthly indexing under the old system to daily indexing via freshbot is a monumental change.


What if a very public and visible company saw that it had no critics in its market space? Every marketer knows that nature abhors a vacuum; the market generally fills the available space with crap if you don't fill it yourself.

What would you do if you were them? Wouldn't you want to fill that space with your own arguments and support, even if negative in appearance, just to create the appearance of critics?

Personally, I'd fill it with shallow, indefensible arguments and hire a self-promotion expert to promote it from an "anti" site. I'd get them to go out and talk about things like cookies, or promote a great social-welfare agenda from the Ralph Nader crowd that is no longer cool. I'd get them to think up conspiracy theories and to RF (ex: Donald Segretti wrote the Humphrey letter in '68) anything they could. [google-watch.org...]

eaden

12:14 pm on Jun 11, 2003 (gmt 0)

10+ Year Member



Interesting article, Brett. As mentioned, at some point we can be sure that the DocID was 4 bytes; the question is whether it was changed long ago or is changing now.

GoogleGuy hasn't denied that this is what they are doing or have been doing, so it's still a possibility. Of course Google hasn't reached its ID capacity limit: the limit would be increased before that ever happened.

trillianjedi

12:15 pm on Jun 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think that's probably stretching it a little far, Brett.

Interesting thread, this one. Can anyone clarify the software technology Google is using: Python, C, asm, etc.?

This is for my own curiosity not any revelations.

Thanks,

TJ

brotherhood of LAN

12:21 pm on Jun 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>>>Can anyone clarify the software technology that GG are using - python, C, asm etc?

Last I read, their crawler was in Python; I think Brett mentioned more above about the actual index.

Also, they have their own way of encoding things (not Huffman)... both of these facts are in the dino-BackRub paper ;)

>>>>4 bytes/32 bits

This search [google.com] has increased by 400 million over the last month; if it passes 4 billion, say goodbyte to your 4 bytes :)

Chris_R

12:23 pm on Jun 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hopefully this Google Watch summary page will encourage more discussion of what's going on at Google, even if webmasterworld.com won't.

From google-watch

Hmmm, I wonder what pages he is reading.

There are probably more theories on this site (and almost as much "evidence") about this supposed Google bug than about the Kennedy assassination.

Yeah - I am sure this is the reason....

amazed

12:26 pm on Jun 11, 2003 (gmt 0)

10+ Year Member



Kackle, I saw the same thing and was thinking along the lines of capacity being allocated according to PageRank.