Forum Moderators: open
Non-image URLs now have an ID stored in 4 bytes, so Google is running out of IDs for stored pages.
Once there are no more URLs being returned 'not found' and deleted from the index, the total number of non-image files indexed will soon reach 4,294,967,296, including 3,083,324,652 HTML pages. After that, Google will stop adding new URLs found in indexed pages, as well as new URLs submitted for indexing.
They are now considering a reconstruction of the data tables, which involves expanding the ID fields to 5 bytes.
This would add 2 bytes for every word indexed, multiplying the total index size by about 1.17.
The procedure would require 1,000 new page-index servers and additional storage for temporary tables.
They are hoping to make this change gradually, server by server.
Completing the process will take up to one year, after which the main URL index will be switched to the 5-byte IDs.
Until then, new URLs will not be indexed, except those put in place of URLs returned 'not found' and deleted from the index.
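The arithmetic behind the post is easy to check. A quick sketch of the numbers (the 12-byte posting-entry size is my own assumption, chosen only because it happens to reproduce the quoted 1.17 factor; the post doesn't say where that figure comes from):

```python
# The docID ceiling described above: a 4-byte unsigned integer.
four_byte_ids = 2 ** 32
assert four_byte_ids == 4_294_967_296  # the number cited in the post

# A 5th byte multiplies the ID space by 256.
five_byte_ids = 2 ** 40
assert five_byte_ids == 1_099_511_627_776  # roughly 1.1 trillion pages

# The 1.17 growth factor would follow if a posting entry were about
# 12 bytes before the change (an assumption, not stated in the post):
growth = (12 + 2) / 12
print(round(growth, 2))  # 1.17
```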
[just a guess but who knows]
[edited by: re5earcher at 12:41 pm (utc) on June 7, 2003]
Instead of 8 bytes, you'd use 4 bytes. No question about it. Then if you want to display or preserve the ID somewhere, like in a URL, you would use the 4 bytes to retrieve the long version of the ID, using a conversion algo. You can insert this alphanumeric long version into the cache copy URL and use it to retrieve the 4 bytes later. This conversion process is cheap, because it doesn't have to be done all that often. Certainly not as often as the intensive amount of processing required on the inverted indexes, which are immediately consulted with every new query coming into the front end of Google.
If you don't do it this way, your boss would fire you for wasting all that money, and the search engine would go bankrupt. All you'd have left to defend yourself is a whole bunch of postings on WebmasterWorld.
I really don't understand why it's so hard for everyone to agree that a 4-byte docID makes a lot of sense. The only time it stops making sense is when you have to move beyond 4 bytes in order to expand beyond 4.2 billion web pages.
This isn't rocket science, folks.
Most of this is way over my head - but what would be the implications of this (aside from Google running out of space to store new pages)? Would it be a software modification that would be required (split the pageID across two 4-byte memory addresses) - or would they need to invest in major new hardware/databases etc?
Also,
Why was the link:xyz: code left on the page? At first I thought it was an oversight (unlikely). Having it there, though (and for the 'cached' link), would speed up the Google servers, as they do not have to convert the URL into its pageID.
That sounds reasonable to me.
The software changes are the big problem. It's safe to assume that the docID is involved in a lot of their custom software, and a lot of definitions of field lengths in various tables.
Phasing in the changes is another problem. You cannot intermix the old and new software. In other words, you can't have the new software grabbing a fifth byte for the docID if that fifth byte is grabbed from a table that is based on the old 4-byte system.
I don't know why they have those characters in the link command. But I do know that this is some other section of Google. The inverted indexes, where it's very important to have a short docID, don't have link information in them. That happens elsewhere in the scheme of things.
It's possible that they had the fifth byte all installed many months ago. But in that case, I fail to understand why GoogleGuy cannot simply say, "I have been authorized by our CEO, and this is confirmable through our public relations office, to announce that we successfully installed the expansion to five bytes for the docID many months ago, and are in no danger of running out of docID space. Therefore, rumors of a docID problem are completely unfounded. Please call xxx-xxx-xxxx if you want confirmation."
Instead, GoogleGuy says he told a Google engineer about the four-byte docID story, and this engineer almost fell out of his chair laughing. End of comment. End of story. No one goes on record at Google, Inc. who is personally identifiable.
That's why I continue to think there's something to the theory.
Treat 4-byte IDs as if the 5th byte is 0; that's the first set of 4 billion IDs. Then the second set (new IDs) will have the new byte set to 1, and so on.
That still leaves the upgrading of all the tables on all the servers, though ;)
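The suggestion above amounts to treating the existing 4-byte IDs as the low 32 bits of a wider ID space, with the new byte as a block prefix. A minimal sketch (function names are mine, purely for illustration):

```python
def widen(doc_id_4: int, high_byte: int = 0) -> int:
    """Embed a 4-byte docID in a 5-byte space; existing IDs keep high byte 0."""
    assert 0 <= doc_id_4 < 2 ** 32 and 0 <= high_byte < 256
    return (high_byte << 32) | doc_id_4

def split(doc_id_5: int) -> tuple:
    """Recover (high byte, low 4 bytes) from a 5-byte docID."""
    return doc_id_5 >> 32, doc_id_5 & 0xFFFFFFFF

# Existing IDs are unchanged numerically (the "first set of 4 billion")...
assert widen(12345) == 12345
# ...while the second block of 4 billion IDs starts at high byte 1.
assert widen(0, high_byte=1) == 2 ** 32
```

The catch is exactly the one noted above: every table and every piece of software that assumed a fixed 4-byte field still has to be upgraded to carry the fifth byte.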
SN
>If you don't do it this way, your boss would fire you for wasting all that money, and the search engine would go bankrupt. All you'd have left to defend yourself is a whole bunch of postings on WebmasterWorld.
Had you not neglected to add in the cost of the table to do the ID conversions, your analysis would still be flawed by the mathematical characteristics of those "short-id-to-long-id" conversions. One-to-MANY isn't reversible!
>I really don't understand why it's so hard for everyone to agree why a 4-byte docID makes a lot of sense. The only time it stops making sense is when you have to move beyond 4 bytes in order to expand beyond 4.2 billion web pages.
Going through the pain of a key redesign can ruin your whole day; professional software engineers learn to avoid it. And you might contemplate the possibility that Google may have been DEALING with over 4.2B pages for a long time now, even though not including them all in the index.
>This isn't rocket science, folks.
That's OK. Rockets crash sometimes too.
As a programmer, I can say _that_ story rings true as a bell. I bet they're still chuckling. I thought it was pretty funny myself.
I'm no rocket scientist, but I have had to deal with some programs that ran orders of magnitude slower than necessary specifically because of design features shared by the interesting almost-algorithm described above. And when that happens, the original designer is a running gag around the water coolers for months.
Sure it is. You must be thinking of a one-way hash. We're not talking about a one-way hash. It may look one-way when you see the alphanumerics in the URL, but that's only because you aren't privy to how it was put together. It's fully reversible.
I can take a four-byte integer and convert it to decimal. I can take a decimal and convert it to 32 bits of ones and zeros.
I can take a four-byte integer and convert it to hex. I can take hex and convert it back to 32 bits of ones and zeros.
I can take a four-byte integer and convert it to my very own numbering system consisting of upper and lower-case alphanumerics, plus the underscore and hyphen, giving me a base-64 system that will survive in a URL on the Internet. And I can take this and convert it back to 32 bits of ones and zeros. And if my special number is six characters long, I can cover over 68 billion (64^6) web pages. If it's longer, I can include other data -- perhaps a pointer for the location of the cache copy, for example, or some flags.
You don't need a lookup table for any of these conversions. All you need is a few lines of code. And I don't know the first thing about rockets. The point is that I can do all this and still use a 4-byte or 5-byte docID and save all those terabytes of RAM where such savings count the most.
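Those "few lines of code" really are few. Here is one way to do the base-64 conversion described above, using exactly the URL-safe alphabet the post proposes (upper- and lower-case letters, digits, underscore, hyphen); the alphabet ordering and function names are my own choices:

```python
import string

# 26 upper + 26 lower + 10 digits + '_' + '-' = 64 URL-safe symbols.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "_-"

def encode(doc_id: int, width: int = 6) -> str:
    """Turn an integer docID into a URL-safe base-64 string. No lookup
    table: each character is just 6 bits of the number."""
    out = []
    for _ in range(width):
        out.append(ALPHABET[doc_id & 63])  # low 6 bits
        doc_id >>= 6
    return "".join(reversed(out))

def decode(token: str) -> int:
    """Reverse the conversion back to the integer docID."""
    n = 0
    for ch in token:
        n = (n << 6) | ALPHABET.index(ch)
    return n

assert decode(encode(4294967295)) == 4294967295  # round-trips a full 4-byte ID
assert 64 ** 6 == 68719476736                    # over 68 billion, as claimed
```

Six characters cover 64^6 IDs, so a 4-byte internal docID fits with room to spare, and the conversion is fully reversible, as the post argues.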
The docID has to be small because for every word in every search query that comes into Google, you have to collect the docIDs behind that word in the inverted index before you can do any processing. About the only way to shorten the process is to have the docIDs ordered behind each word (maybe according to PageRank?). That way, you can stop collecting after you reach a certain number of docID matches, with the confidence that you have scraped the cream off of the index already.
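A toy sketch of that early-termination idea, assuming the postings really are stored best-first (how Google actually orders its posting lists is not public):

```python
def collect_top(postings, limit):
    """Scan a posting list assumed to be stored best-first (say, ordered
    by PageRank) and stop once `limit` docIDs are collected -- everything
    after that point can only rank lower, so the cream is already off."""
    hits = []
    for doc_id in postings:
        hits.append(doc_id)
        if len(hits) == limit:
            break  # early termination: no need to scan the rest
    return hits

# Postings for one word, best document first (hypothetical docIDs):
word_postings = [40213, 77, 910552, 31, 60000, 12]
assert collect_top(word_postings, 3) == [40213, 77, 910552]
```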
For multiple-word queries, it gets even hairier than that. Each docID from the matches on the first word has to be compared against the matches from the second word. Then you have to consider the position on the page of those words.
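That comparison step is typically a merge-intersection of posting lists. A sketch, assuming both lists are kept in the same consistent docID order (a common inverted-index layout, but an assumption here, not something the post states):

```python
def intersect(a, b):
    """Merge-intersect two posting lists sorted by docID.
    Runs in O(len(a) + len(b)) with a single pass over each list."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])  # docID contains both words
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical postings for two query words:
assert intersect([1, 3, 5, 9], [2, 3, 9, 12]) == [3, 9]
```

This is also why a small docID pays off twice: the lists being merged are shorter in bytes, and more of them fit in RAM at once.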
The Term Vector Database: Fast Access to Indexing Terms for Web Pages by Raymie Stata, Krishna Bharat, Farzin Maghoul, which can be found here [www9.org].
Here are some of the hilarious observations by these authors:
"Rather than deal directly with URLs, the Connectivity Server uses a set of densely-packed integers to identify pages."
"Recall that page identifiers are a dense set of integers."
"To avoid wasting space, we pack vector records densely."
"Functions in the Connectivity Server convert between these integers and text URLs. In our work with the Connectivity Server, these identifiers have proven more convenient to handle in code than text URLs."
"Notice the use of integers to represent terms; as with page IDs in the Connectivity Server, we find these to be more convenient to manipulate than text strings."
"The page ID of the vector is stored in the first 4-bytes of the vector's record."
Hilarious stuff. Bharat is a member of the research staff at Google and has a Ph.D. in computer science. I fell out of my chair laughing at this paper, because he could have used a 12-byte ID, but instead he used a 4-byte ID.
(Hey everyone, I'm being sarcastic. Bharat obviously knows what he is doing.)
Here's my question: What right does GoogleGuy have to spin things the way he does, and escape all accountability for his spinning? That's the question that the moderators have to answer in order to preserve WebmasterWorld's integrity.