Forum Moderators: open
Currently, each non-image URL has an ID stored in 4 bytes, so Google is now running out of IDs for stored pages.
Once there are no more URLs being returned 'not found' and deleted from the index to free up IDs, the total number of non-image files indexed will soon reach 4,294,967,296 (including 3,083,324,652 HTML pages); after that, Google will stop adding new URLs from indexed pages, as well as new URLs submitted for indexing.
They are now considering reconstructing the data tables, which involves expanding the ID fields to 5 bytes.
This would add an extra 2 bytes per word indexed, multiplying the total index size by roughly 1.17.
This procedure will require 1000 new page index servers and additional storage for temporary tables.
They are hoping to make this change gradually server by server.
Completing the process will take up to one year; after that, the main URL index will be switched to 5-byte IDs.
Until then, new URLs will not be indexed, except those put in place of URLs returned 'not found' and deleted from the index.
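For anyone who wants to sanity-check the arithmetic behind this theory, here's a quick Python sketch. Every number in it is either the poster's guess or the page count from Google's home page at the time — nothing from inside sources, and the 100 TB index size is purely hypothetical:

```python
# Back-of-the-envelope numbers for the 4-byte docID theory.
# All figures are the thread's guesses, not Google's.
MAX_DOCIDS_4_BYTE = 2 ** 32            # 4,294,967,296: ceiling of a 4-byte ID
HTML_PAGES_CLAIMED = 3_083_324_652     # "Searching ... web pages" on google.com

print(f"4-byte ID ceiling : {MAX_DOCIDS_4_BYTE:,}")
print(f"HTML pages claimed: {HTML_PAGES_CLAIMED:,}")
print(f"remaining headroom: {MAX_DOCIDS_4_BYTE - HTML_PAGES_CLAIMED:,}")

# The claimed 1.17x index growth if each indexed word entry gains
# 2 extra bytes (the poster's estimate, not a derivation):
GROWTH_FACTOR = 1.17
index_size_tb = 100                    # hypothetical current index size, TB
print(f"new index size ~= {index_size_tb * GROWTH_FACTOR:.0f} TB")
```

At the claimed 3.08 billion pages, that leaves only about 1.2 billion IDs of headroom — which is why the "stopped around 3 billion" observation later in the thread fits the theory so neatly.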
[just a guess but who knows]
[edited by: re5earcher at 12:41 pm (utc) on June 7, 2003]
On large systems you cannot afford to "leave plenty of space just in case" — you gotta design tight. I mean, Google is something that 10 years ago lots of folks would've said couldn't be done. (Like what folks say about instant PR today.)
Also, if such a limit is reached while "nobody's watching", it can REALLY **** things up, such as losing or corrupting an update. Furthermore, a Google update is not like a progress bar that somebody watches. It's a huge distributed operation happening all over the place that no single person can oversee or know the state of. So missing it until the last moment can easily happen. Some folks seem to think that just because Google created a big bad system, they're infallible... in fact, all practical big bad systems in operation today are majorly flawed. The nature of today's IT ;)
SN
PS: again, not saying this has happened, just pointing out that it could've easily, more than most people believe.
My guess is that Google knew about this problem a long time ago. Their page count stopped around 3 billion. Now you are guessing that the last update could have been a bad one and they could have lost data; my guess is that it could have been a "fake" one! ;)
Google was working on its problems, but the webmasters were waiting for the update, so it decided to run a fake Dominic crawl. This kept people from getting impatient and spreading rumors about it. Now, as is true with most large systems, their problems didn't get resolved in time. Now they are not even pretending! Just speculation. :)
I am certainly not part of the panicking crowd. I find this thread absolutely fascinating.
I believe it is a very logical theory about the Dominic brouhaha. I consider it to have a less negative impact than other things we could imagine.
>>just pointing out that it could've easily, more than most people believe
He he! reminds me of what happened to Altavista a few years back.
But it sounds like a pretty valid theory, especially considering these facts:
1) 4,294,967,296 is the 32-bit value limit
2) Google's home page has claimed "Searching 3,083,324,652 web pages" for a while now, plus non-standard documents
3) Back in January, a Google founder claimed he hoped to expand to 10 billion records this year
* Freshbot doesn't go through the usual deepbot indexing process and may not generate extra docIDs, which would also explain why "freshie has been crawling deeply in addition to normal freshie duties."
* Why this update is SO much work for Google, and going to take so long.
* talk from GG about "worst-case" backup
Also I'd like to point out that originally (before editing), the poster of the first message in this thread said "from inside sources", not "just a guess".
And I'm not saying this is an unexpected problem for google, but it may just be what this update is about.
No more, no less, for me at least. I just love this stuff ;) And of course the idea that the great Google might suffer the same troubles as I once did. So I'm not exactly unbiased either ;)
SN
>>talk from GG about "worst-case" backup
Oops, I missed that one! Do you remember if it was in the "understanding Dominic" thread?
If I were doing what Google is suggested to be doing, I would be talking like GG in that thread: "Cross verification, worst-case backup, cautious about the update, switch to as a safety net..."
How about a different hypothesis.
Not "Dominic has gone wrong"
and
Not "They've hit 4 gigs and it's broken."
Tell you what, I'll start a *different* rumour...
The REASON Dominic has taken so long is because they've just got rid of the 4 gig limit.
After all, the database *is* a different format, everyone's already said that.
So my theory is...
The 4 gig limit isn't a problem, Dominic was *specifically* brought in to solve it.
Back to my wine while you prove me wrong! <wink>
Regards
DerekH
Thanks,
Sally
EDIT:
Eaden, right! I don't know why, but I get the same thing.
The link is correct.
I found that it works if you cut and paste it into your browser. Must be some kind of WebmasterWorld, or maybe SBC-Yahoo, glitch.
[edited by: Sally_Stitts at 8:29 pm (utc) on June 7, 2003]
GG, there has been a lot said in this thread, so 'nah' could refer to anything...
I take it by your reply that the number of bytes in a page index field hasn't changed from 4 to 5 in the last 6 months or so, and nothing of this sort is happening.
[edited by: eaden at 8:03 pm (utc) on June 7, 2003]
Maybe these 15,000 pages lost their ID numbers by getting overwritten by other pages using the same IDs, and that's why they ended up as URLs that Google could display, but not reference in its index.
Anyone else see weird stuff on April 11?
Not only were they kind enough to squeeze us in, they even managed to give us many fine placements in the SERPS for our keywords.
The SEO game brings new meaning to the phrase "survival of the fittest". Darwin would be proud.
Have a look at this
bvte6I7cVAZy
It's a URL ID. From looking at the IDs generated to speed up the cache lookup, it's 12-character alphanumeric. It is shown for the cache of any page on a SERP.
26 lowercase characters (correct?), 26 uppercase characters, and 10 digits make 62 possible characters for each of the 12 positions, so the total number of combinations would, in theory, be 62^12 — more than enough for our measly 4 billion pages.
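A quick Python check of that combinatorics claim. Note the 62-symbol, 12-position alphabet is inferred from the observed cache-ID format, not anything Google has documented:

```python
# Capacity of a 12-character ID drawn from a 62-symbol alphabet
# (26 lowercase + 26 uppercase + 10 digits -- an assumption based
# on the observed cache-ID format, not Google's actual spec).
alphabet_size = 26 + 26 + 10           # 62 symbols per position
id_length = 12

id_space = alphabet_size ** id_length  # 62^12, roughly 3.2 * 10^21
docid_space = 2 ** 32                  # the 4-byte docID ceiling

print(f"62^12 combinations: {id_space:,}")
print(f"2^32 docID limit  : {docid_space:,}")
print(f"the cache-ID space is {id_space // docid_space:,}x larger")
```

So if those cache strings really are the internal IDs, the 4-byte theory is dead on arrival — which is exactly the point being made here.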
Well so much for the rumors I've heard, I guess our theory dies here.
If you do this search in Google, you'll find the mirror of the site you are looking for. Or at least you'll be able to read the Google cache in html...
"c-6.powers_of_2.4.pdf"
See - Google's not broken for the general public - it's just 'under construction' - and the major impact is on sites that have been added, added to, or changed in the past 9 weeks or so. And if those sites were 'news sites', they aren't affected. So it's a small % of the 3 billion.
BTW - try the same search to find Sally's mirror at AllOverFAST...
Don't you all think it's unusual that re5earcher hasn't reposted at the scene of the crime? As I said about 4 pages back - he's hiding outside (with Marvin the Martian) - waiting for the 'kerboom'.
Chris_D
>>As others have said, changing the width of a single field is trivial.
It may be, but it will have dramatic consequences on a DB of this size. Something users would have noticed right away.
Where's this "python" stuff coming from? The only thing public that I've heard is the core is pure ML with the rankings routines C on top and glued here-n-there with the internet duct tape (Perl).
> How do you get 3 billion of anything on an
> 80 gig hard drive?
That is probably the most important statement in the whole thread. The index can be expanded, but you can only cram so much of it onto a disk that size. Reportedly, that is the size of G's first 10k of boxes.
> john316, last thing I have heard is that Google
> database was stored in RAM.
Not the db, just the index file (key pointer file).
> Is a period of 2 months not too short for
> such a drastic change?
Exactly - could be why the April crawl was tossed.
Then again, a switch from monthly indexing under the old system to daily indexing via freshbot is a monumental change.
What if a very public and visible company saw that it had no critics in its market space? Every marketer knows that nature abhors a vacuum - the market generally fills the available space with crap if you don't fill it yourself.
What would you do if you were them? Wouldn't you want to fill that space with your own arguments and support - even if negative in appearance just to give the appearance of critics?
Personally, I'd fill it with shallow, indefensible arguments and hire a self-promotion expert to promote it from an "anti" site. I'd get them to go out and talk about things like cookies, or promote a great social-welfare agenda from the Ralph Nader crowd that is no longer cool. I'd get them to think up conspiracy theories and to RF (ex: Donald Segretti wrote the Humphrey letter in '68) anything they could. [google-watch.org...]
GoogleGuy hasn't denied this is what they are/have been doing, so it's still a possibility. Of course Google hasn't reached its ID capacity limit - the limit would be increased before that ever happens.
Last I read, their crawler was in Python; I think Brett mentioned more above about the actual index.
Also, they have their own way of encoding things (not Huffman)... both of these facts are in the dino-Backrub paper ;)
>>>>4 bytes/32 bits
This search [google.com] has increased by 400 million over the last month; if it passes 4 billion, say goodbyte to your 4 bytes :)
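To make that 4-byte ceiling concrete, here's a generic Python illustration of what happens when a counter packed into 4 unsigned bytes hits 2^32 — nothing to do with Google's actual storage format, just the arithmetic:

```python
import struct

# Pack a counter into 4 bytes as an unsigned 32-bit big-endian value
# (a generic illustration of a fixed-width ID field, not Google's format).
def to_docid(n):
    return struct.pack(">I", n)

max_id = 2 ** 32 - 1                   # 4,294,967,295: the last usable ID
print(to_docid(max_id).hex())          # 'ffffffff'

# One more page and the value no longer fits in 4 bytes:
try:
    to_docid(max_id + 1)
except struct.error as e:
    print("overflow:", e)
```

Whether real code would raise an error, silently wrap to 0, or clobber an existing record depends entirely on how the field is handled — which is exactly the "while nobody's watching" failure mode mentioned earlier in the thread.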
Hopefully this Google Watch summary page will encourage more discussion of what's going on at Google, even if webmasterworld.com won't.
From google-watch
Hmmm, I wonder what pages he is reading.
There are probably more theories on this site (and almost as much "evidence") about this supposed Google bug than about the Kennedy assassination.
Yeah - I am sure this is the reason....