Forum Moderators: open

Message Too Old, No Replies

I think google reached its ID capacity limit?


re5earcher

12:04 pm on Jun 7, 2003 (gmt 0)

10+ Year Member



Google has reached its data-indexing capacity of 4,294,967,296 (2^32) URLs.

Non-image URLs have an ID stored in 4 bytes, so Google is now running out of IDs for stored pages.

Once no more URLs are being returned 'not found' and deleted from the index, the total number of non-image files indexed will reach 4,294,967,296 (including 3,083,324,652 HTML pages). After that, Google will stop adding new URLs from indexed pages, as well as new URLs submitted for indexing.

They are now considering a reconstruction of the data tables, which involves expanding the ID fields to 5 bytes.

This would add 2 bytes per word indexed, multiplying the total index size by about 1.17.

This procedure will require 1000 new page index servers and additional storage for temporary tables.

They are hoping to make this change gradually server by server.

Completing the process will take up to one year, after which the main URL index will be switched to the 5-byte IDs.

Until then, new URLs will not be indexed, except those put in place of URLs returned 'not found' and deleted from the index.

[just a guess but who knows]

[edited by: re5earcher at 12:41 pm (utc) on June 7, 2003]

Gonzalez

2:08 pm on Jun 11, 2003 (gmt 0)

10+ Year Member



Hi everybody.

I read the whole thread so far and I think some of you might like to read this:

[computer.org...]

This URL has been posted previously on WW. It reveals some of the G technology.

See you.

Allergic

2:36 pm on Jun 11, 2003 (gmt 0)

10+ Year Member



There is definitely a major glitch somewhere. This morning I saw in the SERPs a page I modified back in February that had been gone until yesterday.
A new Wayback Machine for Google ;-)

werty

3:38 pm on Jun 11, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



This would make a bit of sense. I just like that I have something new to explain to people who question why pages were lost or taken out of the index.

Personally, I feel like this would have been a giant mistake by Google when they were planning their system, but they may have never thought the site would grow as fast as it has.

I think it is pointless to worry about it, since it is out of my hands. I am happy to know that there MAY be an explanation for the results Dominic has brought me. Maybe I will sleep a little better now. (c:

Tapolyai

3:52 pm on Jun 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not that it makes a difference since GoogleGuy already dispelled the technical issue, but if Google used MySQL, for example, as DB storage, they could use a BIGINT for indexing, which would give them 18,446,744,073,709,551,615 key values. If they used a double the number goes to 1.7976931348623157E+308 (yes that's 1 with 308 zeros after it).

That is of course if they threw everything in a single DB. And why would they do that, right? So, figure that they not just have a single DB, but a whole farm of servers each with multiple DBs. The possibilities are endless! :)

BigDave

4:20 pm on Jun 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why is there a concentration on 32-bit words of page IDs when google runs a distributed system?

From what I have been able to tell, google has the index spread through several machines. In that case it would make sense for the page ID to be a struct instead of a long.

struct {
    unsigned int machine;  /* which index server holds the document */
    unsigned long id;      /* document ID local to that machine */
} pageID;

This would actually make sense from a speed perspective.

IITian

4:19 pm on Jun 11, 2003 (gmt 0)

10+ Year Member



I would argue that the Web didn't grow "organically" and is in fact shaped to a large extent by Google. Three of its main effects can be seen in the following. If you know of more, that would be nice too.

1. The number of web pages will grow exponentially until it exceeds Google's ID capacity.

Comments: I believe we have more pages than we would have if Google were not around, because:
1. More pages help us increase the PR of selected pages.
2. More pages allow us to Google-optimize each page for specific keywords.

2. The number of links per page will go up exponentially until all the web pages in the universe are linked to all the other web pages in the universe.

Comments: We are still far from there, but since linking is the core philosophy behind Google, we have more links than we would have if Google were not around. I can envision a future where an automated program will crawl the web, checking files on sites for permission to exchange links, and then automatically add reciprocal links.

3. The total time to crawl all the web pages and compute their PRs will exceed whatever schedule Google had in mind for updating its databases.

Final comment: Google is just begging to be broken. ;)

Kackle

5:07 pm on Jun 11, 2003 (gmt 0)



Why is there a concentration on 32-bit words of page IDs when google runs a distributed system? From what I have been able to tell, google has the index spread through several machines. In that case it would make sense for the page ID to be a struct instead of a long.

I think the reason is that you want the ID to be as short as possible. Every unique word in every web page repeats this ID. Sure, once you have the ID you must use it to look up other metrics for the page. But for efficiency of the front-end inverted index that comes up with the hit list, you must have a short ID. Once the hit list gets ranked into your 10 SERPs for the next page, you look up other page data for those 10 pages. (Or 100 pages, or whatever, eventually trimmed to 10 SERPs once you consider on-page factors -- at least you are no longer dealing with a huge front-end inverse index at that point, but rather with a manageable subset of "hits.")

Sure, the index is spread over more than one machine. Each of the 15,000 Linux boxes has the same basic software configuration, but each must also have a config file telling that machine which specific function it performs based on which block of data it can access. There is no other way to handle all that data. But the software is the same on each box, so installation of new boxes is easy. You just set the config file for what you want that machine to do, load the proper data, and it does it. It's a parallel system in the sense that it's as modular and redundant as possible, but it's still broken down into specific tasks.

As far as using a structure instead of the 32 bits, I don't think so. It's much faster to just mask out some bits and route the ID to the machine needed based on the bits selected through masking. You have all the granularity you need through masking.

In the inverse index, I suspect the docIDs after each word are ordered by PageRank. This would not take any extra space (although it takes more processing once a month), and it would make the PageRank portion of the algo virtually automatic from that point forward.

eaden

7:59 pm on Jun 12, 2003 (gmt 0)

10+ Year Member



This is now mentioned on the Google weblog, though it is pointed out there that GoogleGuy has denied this.

>> Not that it makes a difference since GoogleGuy already dispelled the technical issue

If you read his posts in this thread, you will actually see that he doesn't deny that Google is changing from a 4-byte ID to a 5-byte ID. He denies that they have reached their limit - which would only happen if they didn't increase it first.

I also read the link Brett posted, and it points out a change that doesn't leave an [edited by] tag, in which Brett removes the "this is a bogus thread" part, so I dunno... makes me wonder.

bolitto: The cache is separate from the index, as others can testify. That ID could be a machine name or anything, but if it were the DocID it wouldn't need the URL. Also, we know that it is a number, not a string.

Kackle

12:06 am on Jun 13, 2003 (gmt 0)



You'd need the services of a Byzantine theologian to figure out what GoogleGuy means. And this assumes that GG knows the answer. It's possible that "new algorithms" is a cover story for most Google employees, as well as the rest of us. They're not yacking about 4-byte overflows at the Googleplex water cooler, and not at the local bar either. It's the sort of thing that employees don't talk about if they're a "team player."

bcc1234

12:41 am on Jun 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If they used a double the number goes to 1.7976931348623157E+308 (yes that's 1 with 308 zeros after it).

He-he, no it's not :)

It would be interesting to know what they use to manage their data. I doubt it's MySQL, but does anybody know?

[edited by: bcc1234 at 12:42 am (utc) on June 13, 2003]

Dolemite

12:41 am on Jun 13, 2003 (gmt 0)

10+ Year Member



You'd need the services of a Byzantine theologian to figure out what GoogleGuy means.

LOL....so true.

And this assumes that GG knows the answer. It's possible that "new algorithms" is a cover story for most Google employees, as well as the rest of us. They're not yacking about 4-byte overflows at the Googleplex water cooler, and not at the local bar either. It's the sort of thing that employees don't talk about if they're a "team player."

Could be. Still, I find it hard to believe that all those PhDs wouldn't be able to figure out a more transparent solution than missing an update or two. It seems like you could temporarily purge the lowest-PR pages - just cull them from the database for later reinsertion - allowing new pages to be added even while transitioning to the new system. I'm sure I'm underestimating the complexity there, but it sounds possible.

In any case, I will be interested to know what actually has been going on (if we ever find out, that is). The fact that I can't think of much that would require google to miss an update makes me think that either something broke or we're in for a big change in the next real update. Everything I know about google tells me not to expect drastic changes, though.

shaadi

8:54 am on Jun 16, 2003 (gmt 0)

10+ Year Member



Is it possible that, if there are not many IDs left, two or more pages will have the same ID? Is this what is happening in my case?

[webmasterworld.com...]

driesie

11:22 am on Jun 16, 2003 (gmt 0)

10+ Year Member



I don't think it's technically possible that 2 pages get the same ID. That would make the db corrupt.

shaadi

1:29 pm on Jun 16, 2003 (gmt 0)

10+ Year Member



I don't think it's technically possible that 2 pages get the same ID. That would make the db corrupt.

driesie, but this is what has happened! A mix-up in internal IDs on Google's side.

[webmasterworld.com...]

driesie

3:17 pm on Jun 16, 2003 (gmt 0)

10+ Year Member



I'm fascinated by this!

How can 2 IDs get mixed up? That must be a pretty major bug, I'd imagine.

Did Google give you any answers yet? It's been going on for a while, hasn't it?

Hollywood

5:05 pm on Jun 16, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



Just my opinion, but I highly doubt Google did not plan ahead for something like this; with the technology around today and Google's partners, they figured this in a year ago or more.

I say no way! --> 100%

This is not the reason at all! This is too easy to plan ahead for.

~Hollywood

killroy

6:39 pm on Jun 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ever done this yourself? What is your authority to say this sort of thing is easy?

SN

seekanddestroy

6:51 pm on Jun 16, 2003 (gmt 0)



If this isn't complete drivel, then the logical short-term solution for Google would be to play hardball with spam algos etc. and only index pages with a PR above a certain number. I guess that would not be possible with a monthly update, but perhaps if they switched to rolling updates/PR calculations they could do it.

Just my two pence worth!

;O)

Spica

1:18 pm on Jul 3, 2003 (gmt 0)

10+ Year Member



There seems to be a 12-character ASCII string associated with pages that are in the Google index. I had never noticed that before, but you can see it if you do the following:

1) Search for www.mysite.com
2) If you then click on "find web pages that link to www.mysite.com", you will see that the query is:
Searched for pages linking to [12 characters]:www.mysite.com

Doesn't that say something about the way Google indexes pages?

Note: doing the same search for a new page that was recently picked up by freshbot (and can be found in the SERPS), I got the result:
Sorry, no information is available for the URL www.mysite.com/veryrecentpage.htm
However, it offers to "find web pages that contain the term 'www.mysite.com/veryrecentpage.htm'", and it knows which page on my site links to this very recent page. This suggests that fresh new pages are stored differently from permanently indexed pages. Perhaps these pages will "stick" permanently only once they are assigned their own unique 12-character coded ID...?

Pricey

1:22 pm on Jul 3, 2003 (gmt 0)

10+ Year Member



The mysteryman topic got dug up :P

I just looked at the 12 char ASCII, I never noticed that either.

Seems that freshie simply lays a path for deepbot to crawl.

Dayo_UK

1:47 pm on Jul 3, 2003 (gmt 0)



The interesting thing is that the code does store the page or URL - try doing the above as Spica suggests and change the domain name after the alphanumeric string.

You will find that

link: (12 chars):www.mydomain.com

returns the same results as

link: (12 chars):www.hisdomain.com

Weird.

mipapage

2:34 pm on Jul 3, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Spica,

That's an interesting find. Gonna dig into it.


OT
You should think about starting a new thread on this as it's interesting, and not at all to do with the thread title or topic, no?

heini

2:37 pm on Jul 3, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Mmm, what's so weird about it? It simply is the internal ID for any given url. People being unfortunate enough to get their url mixed up by Google are quite familiar with that phenomenon.

Dayo_UK

2:42 pm on Jul 3, 2003 (gmt 0)



Heini

Ok, thanks, still a relative Newbie

heini

2:44 pm on Jul 3, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>still a relative Newbie
Aren't we all? Next thing you know is someone comes along and proves you totally wrong...:)
The above however is my understanding of that ID.

Spica

6:06 pm on Jul 3, 2003 (gmt 0)

10+ Year Member



Heini:

I am not sure I understand what you mean by "internal ID". Are you saying that this number is not assigned by Google? Then how is this number generated? Is there a code associated with it?

Could you please explain for all of us newbies your understanding of what this number is? Thanks!

James_Dale

6:28 pm on Jul 3, 2003 (gmt 0)

10+ Year Member



Yes, it is assigned by Google. I'm lucky enough to have two (count 'em) right now! One for domain.com and another for www.domain.com.

Yippee!

bolitto

6:33 pm on Jul 3, 2003 (gmt 0)

10+ Year Member



I posted the ASCII key observation earlier in this thread [webmasterworld.com...]

And my post has been analyzed here [google-watch.org...] - here's the quote from his page about my theory:

"One poster looked at Google's URL for their cache copies, and concluded that the string of 12 alphanumeric characters, upper plus lower case, gave Google 62 to the 12th power for their web page ID, which leaves plenty of room for expansion."

Edited.

Kackle

7:05 pm on Jul 3, 2003 (gmt 0)



Best estimates are that on the average, each docID is used twice per word per page. That's because they have two inverted indexes. One is "fancy" and the other is "plain."

The average number of words per web page is 300. Here are the space requirements for the docID if we assume 4 bytes, 12 bytes, and 20 bytes, for 4 billion web pages:

4 bytes: 300 × 4 billion × 8 = 9.6 × 10^12 bytes (about 10 terabytes)

12 bytes: 300 × 4 billion × 24 = 2.88 × 10^13 bytes (about 29 terabytes)

20 bytes: 300 × 4 billion × 40 = 4.8 × 10^13 bytes (about 48 terabytes)

If you were designing a search engine, how many bytes would you choose for your docID?

hutcheson

9:32 pm on Jul 3, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>If you were designing a search engine, how many bytes would you choose for your docID?

Um, eight?

And if I ever displayed that docID, I'd use printable ASCII characters (which would take about, um, 12 characters) rather than hexadecimal (16 characters) or decimal (~21 characters).

This 128 message thread spans 5 pages: 128