Forum Moderators: open
Non-image URLs currently have their IDs stored in 4 bytes, so Google is running out of IDs for stored pages.
Once no more URLs are being returned as 'not found' and deleted from the index, the total number of non-image files indexed will soon reach 4,294,967,296 (including 3,083,324,652 HTML pages); after that, Google will stop adding new URLs discovered on indexed pages, as well as new URLs submitted for indexing.
They are now considering restructuring the data tables, which involves expanding the ID fields to 5 bytes.
This would add 2 extra bytes per indexed word, multiplying the total index size by about 1.17.
This procedure will require 1000 new page index servers and additional storage for temporary tables.
They are hoping to make this change gradually server by server.
Completing the process will take up to one year; after that, the main URL index will be switched to the 5-byte IDs.
Until then, new URLs will not be indexed, except those that take the place of URLs returned as 'not found' and deleted from the index.
[just a guess but who knows]
[edited by: re5earcher at 12:41 pm (utc) on June 7, 2003]
I read the whole thread so far and I think some of you might like to read this:
[computer.org...]
This URL has been posted previously on WW. It reveals some of the G technology.
See you.
Personally, I feel like this would have been a giant mistake by Google when they were planning their system, but they may have never thought the site would grow as fast as it has.
I think it is pointless to worry about it, since it is out of my hands. I am happy to know that there MAY be an explanation for the results Dominic has brought me. Maybe I will sleep a little better now. (c:
That is of course if they threw everything in a single DB. And why would they do that, right? So, figure that they not just have a single DB, but a whole farm of servers each with multiple DBs. The possibilities are endless! :)
From what I have been able to tell, google has the index spread through several machines. In that case it would make sense for the page ID to be a struct instead of a long.
#include <stdint.h>

struct {
    uint32_t machine;  /* which index server holds the page */
    uint32_t id;       /* document ID local to that machine */
} pageID;
This would actually make sense from a speed perspective.
1. The number of web pages will grow exponentially until it exceeds the ID capacity of Google.
Comments: I believe we have more pages than there would have been if Google were not around, because
1. More pages help us in increasing PR of selected pages.
2. More pages allow us to Google-optimize each page for specific keywords.
2. Number of links per page will go up exponentially till all the web pages in the universe are linked to all the other web pages in the universe.
Comments: We are still far from there, but since linking is the core philosophy behind Google, we have more links than there would have been if Google were not around. I can envision a future where an automated program will crawl the web, checking files on sites to get permission to exchange links, and then automatically add reciprocal links.
3. The total time to crawl all the web pages and compute their PRs will exceed whatever schedule Google had in mind for updating its databases.
Final comment: Google is just begging to be broken. ;)
Why is there a concentration on 32-bit page IDs when Google runs a distributed system? From what I have been able to tell, Google has the index spread across several machines. In that case it would make sense for the page ID to be a struct instead of a long.
I think the reason is that you want the ID to be as short as possible. Every unique word in every web page repeats this ID. Sure, once you have the ID you must use it to look up other metrics for the page. But for efficiency of the front-end inverted index that comes up with the hit list, you must have a short ID. Once the hit list gets ranked into your 10 SERPs for the next page, you look up other page data for those 10 pages. (Or 100 pages, or whatever, eventually trimmed to 10 SERPs once you consider on-page factors -- at least you are no longer dealing with a huge front-end inverted index at that point, but rather with a manageable subset of "hits.")
Sure, the index is spread over more than one machine. Each of the 15,000 Linux boxes has the same basic software configuration, but each must also have a config file telling that machine which specific function it performs based on which block of data it can access. There is no other way to handle all that data. But the software is the same on each box, so installation of new boxes is easy. You just set the config file for what you want that machine to do, load the proper data, and it does it. It's a parallel system in the sense that it's as modular and redundant as possible, but it's still broken down into specific tasks.
As far as using a structure instead of the 32 bits, I don't think so. It's much faster to just mask out some bits and route the ID to the machine needed based on the bits selected through masking. You have all the granularity you need through masking.
In the inverted index, I suspect the docIDs after each word are ordered by PageRank. This would not take any extra space (although it takes more processing once a month), and it would make the PageRank portion of the algo virtually automatic from that point forward.
>> Not that it makes a difference since GoogleGuy already dispelled the technical issue
If you read his posts in this thread you will actually see that he doesn't deny that Google is changing from a 4-byte ID to a 5-byte ID. He denies that they have reached their limit - which would only happen if they didn't increase it first.
I also read the link Brett posted, and it points out a change that doesn't leave an [edited by] tag - one in which Brett removed the "this is a bogus thread" part. So I dunno... makes me wonder.
bolitto: The cache is separate from the index, as others can testify. That ID could be a machine name or anything, but if it were the DocID it wouldn't need the URL. Also, we know that it is a number, not a string.
You'd need the services of a Byzantine theologian to figure out what GoogleGuy means.
LOL....so true.
And this assumes that GG knows the answer. It's possible that "new algorithms" is a cover story for most Google employees, as well as the rest of us. They're not yacking about 4-byte overflows at the Googleplex water cooler, and not at the local bar either. It's the sort of thing that employees don't talk about if they're a "team player."
Could be. Still, I find it hard to believe that all those PhD's wouldn't be able to figure out a more transparent solution than missing an update or two. Seems like you could temporarily purge the lowest PR pages, just cull them from the database for later reinsertion, allowing new pages to be added even while transitioning to the new system. I'm sure I'm underestimating the complexity there, but it sounds possible.
In any case, I will be interested to know what actually has been going on (if we ever find out, that is). The fact that I can't think of much that would require google to miss an update makes me think that either something broke or we're in for a big change in the next real update. Everything I know about google tells me not to expect drastic changes, though.
[webmasterworld.com...]
I don't think it's technically possible that 2 pages get the same ID. That would make the db corrupt.
driesie, but this is what has happened! A mix-up in internal IDs on Google's side.
[webmasterworld.com...]
Just my two pence worth!
;O)
1) Search for www.mysite.com
2) If you then click on "find web pages that link to www.mysite.com", you will see that the query is:
Searched for pages linking to [12 characters]:www.mysite.com
Doesn't that say something about the way Google indexes pages?
Note: doing the same search for a new page that was recently picked up by freshbot (and can be found in the SERPS), I got the result:
Sorry, no information is available for the URL www.mysite.com/veryrecentpage.htm
However, it offers me to "find web pages that contain the term "www.mysite.com/veryrecentpage.htm", and it knows which page on my site links to this very recent page. This suggests that fresh new pages are stored differently than permanently indexed pages. Perhaps these pages will "stick" permanently only once they are assigned their own 12-character unique coded ID...?
You will find out that
link: (12 Chars):www.mydomain.com
returns the same results
as
link: (12 Chars):www.hisdomain.com
Weird.
Ok, thanks, still a relative Newbie
And my post has been analyzed here [google-watch.org...] - here's the quote from his page about my theory:
"One poster looked at Google's URL for their cache copies, and concluded that the string of 12 alphanumeric characters, upper plus lower case, gave Google 62 to the 12th power for their web page ID, which leaves plenty of room for expansion."
Edited.
The average number of words per web page is 300. Here are the space requirements for the docID if we assume 4 bytes, 12 bytes, and 20 bytes, for 4 billion web pages:
4 bytes: 300 * 4 billion * 8 = 9.6 * 10^12 (10 terabytes)
12 bytes: 300 * 4 billion * 24 = 2.88 * 10^13 (29 terabytes)
20 bytes: 300 * 4 billion * 40 = 4.8 * 10^13 (48 terabytes)
If you were designing a search engine, how many bytes would you choose for your docID?