Why does the 'Google Lag' exist?

Trying to understand its purpose.

bakedjake

1:43 am on Sep 29, 2004 (gmt 0)

I had an in-depth discussion this weekend with some friends about the sandbox. Every theory on how to beat it kept coming back to one central problem: no one is sure why it exists.

I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.

I don't buy the explanation that it's intended to be a method of stopping spam. Why? One, it is doing too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that there are already multiple ways of beating the sandbox that all of those spammers are aware of, it doesn't make sense anymore.

So, why does the sandbox exist?

The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?

isitreal

10:49 pm on Oct 2, 2004 (gmt 0)

Critter, have you read the original white papers? Give them a read again if it's been a while. They were published not that long before Google went fully live, and this stuff doesn't get changed yearly. But it does need to get changed.

entry. Most likely the URL's identifier is the question mark portion.

I read this argument about a year ago. It wasn't impressive then, and it's not impressive now.

Critter

10:59 pm on Oct 2, 2004 (gmt 0)

Yes, I've read the original white papers.

Where does it say anything about 32 bit integers in there?

Point is, even *if* there was some 32 bit value for document ids (dubious) it would take *nothing* to assign a version number to files, update the id to a longer value, and update the SE and crawl programs to recognize the version of the files as they served pages/crawled.

SlyOldDog

11:15 pm on Oct 2, 2004 (gmt 0)

With all due respect, unless you are sitting at Google and know their architecture, I don't think you can say much about what would take *nothing* to do.

When you are networking large numbers of computers there have to be factors that normally don't come into play that many of us would not even think about (I can't speak for you of course).

Critter

11:27 pm on Oct 2, 2004 (gmt 0)

Alright, alright. You're all correct then.

Google probably *does* use 32 bit integer document IDs and most likely the entire cluster runs on Commodore 64s.

If there's anyone over at Google reading this thread they're peeing themselves laughing right now.

Scarecrow

11:46 pm on Oct 2, 2004 (gmt 0)

I'm a convert to the idea that Google is migrating to 64-bit Linux with a 64-bit file system. Presently they have a "virtual" 64-bit file system that involves lots of Ethernet links and distribution networking behind the scenes to go out and fetch the data that make up each chunk. With the addressing power of a real 64-bit system, Google would improve performance all across the system, and quite dramatically. If the cost is as low as isitreal says, then it's a complete no-brainer to migrate toward 64-bit computing. Look what they have to go through with their present 32-bit system:

"Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunk servers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range."
"The Google File System," by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Critter, a single inverted index consists of one docID per word per web page. Look at the space required for 4 bytes, assuming an average web page of 300 words. I've multiplied by two here because Google uses an average of two docIDs per word per page. That's because they have both a "fancy" and a "plain" inverted index, and also because the docID is used elsewhere in the system:

4 bytes: 300 * 4 billion * 4 * 2 = 9.6 terabytes

The first thing that happens when a search is requested is a lookup in the inverted index. To distribute this load, multiple copies of this index probably exist in each data center. Multiply the above by some unknown number.

This is a lot of space. That's just the space issue connected with the docID, whether it's all in memory or all on hard disk. Everything I've ever read about inverted indexes mentions the importance of compression. But no amount of compression will get you more than 4.29 billion unique values out of 4 bytes (32 bits).
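For anyone who wants to check the arithmetic above, here is a minimal sketch in Python. The function name and parameter names are made up for illustration; the inputs (4 billion pages, 300 words per page, two docIDs per word) are the estimates from this post, not known Google figures.

```python
def inverted_index_bytes(pages, words_per_page, docid_bytes, docids_per_word):
    """Raw docID storage for one copy of the index, before any compression."""
    return pages * words_per_page * docid_bytes * docids_per_word

PAGES = 4_000_000_000    # estimate used in this thread
WORDS_PER_PAGE = 300     # assumed average page length
DOCIDS_PER_WORD = 2      # "fancy" plus "plain" index, per the post

four_byte_total = inverted_index_bytes(PAGES, WORDS_PER_PAGE, 4, DOCIDS_PER_WORD)
five_byte_total = inverted_index_bytes(PAGES, WORDS_PER_PAGE, 5, DOCIDS_PER_WORD)

print(four_byte_total / 1e12)  # 9.6 (terabytes, decimal)
print(five_byte_total / 1e12)  # 12.0

# Number of distinct values a 4-byte (32-bit) docID can hold.
max_32bit_ids = 2 ** 32
print(max_32bit_ids)           # 4294967296
```

Going from a 4-byte to a 5-byte docID adds about 25% to this raw figure under the same assumptions.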

Now add the performance issue of the extra CPU cycles to fetch an expanded docID. (We're not even talking about calculating PageRank, because I think Google realized this was dead 18 months ago.)

Moving to 64-bit computing makes a lot of sense. They can define a new 5-byte integer type in the math library if they want to save space. But the point is, a 64-bit CPU could fetch this new type in one pass instead of two, and you don't take a performance hit.
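A rough sketch of what a 5-byte docID buys you, in Python. The helper names are hypothetical and this only illustrates the byte layout and the ID range; it says nothing about how fast a real CPU would fetch such a value.

```python
DOCID_BYTES = 5  # hypothetical 40-bit docID, as suggested above

def pack_docid(docid: int) -> bytes:
    """Store a docID in exactly 5 bytes; raises OverflowError if it doesn't fit."""
    return docid.to_bytes(DOCID_BYTES, "big")

def unpack_docid(raw: bytes) -> int:
    """Recover the integer docID from its 5-byte form."""
    return int.from_bytes(raw, "big")

# Largest ID that fits in 5 bytes: about 1.1 trillion, versus 4.29 billion for 4 bytes.
max_docid = 2 ** (8 * DOCID_BYTES) - 1
print(max_docid)  # 1099511627775

# The first ID that no longer fits in 32 bits round-trips fine in 5 bytes.
first_beyond_32bit = 2 ** 32
packed = pack_docid(first_beyond_32bit)
print(len(packed), unpack_docid(packed))  # 5 4294967296
```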

Critter

11:59 pm on Oct 2, 2004 (gmt 0)

Replicate the inverted index 4 times, distribute it over 2,500 machines and that's roughly 15GB per machine (9.6 terabytes * 4 / 2,500). That's not to say that 2,500 machines isn't a low estimate. There's a lot of distribution at Google, we know that much already. Adding another couple thousand machines so that they can take advantage of a larger DocID isn't much of a stretch. Also it's reasonable to think that the *machine id* that the index is stored on is part of the DocID, further expanding the possibilities.

The lookups have to be painfully slow no matter which way you slice it, judging by the numbers you get at the top of the results page (typical values fall in the 0.15 to 0.4 seconds range). The Google searches are so slow they *have* to distribute things around just to keep up with the requests. At 0.25 seconds average for a lookup and the 4,000 or so searches per second they probably get during peak periods, they'll need 1,000 machines just to handle the load.
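The load estimate is just arrival rate times time per lookup. A minimal sketch, assuming each machine works on one lookup at a time (that assumption, and the function name, are mine for illustration; the 4,000 per second and 0.25 second figures are the guesses from this post):

```python
import math

def machines_needed(searches_per_second, seconds_per_search, concurrent_per_machine=1):
    """Machines required to keep up, given how many lookups each can run at once."""
    in_flight = searches_per_second * seconds_per_search  # lookups in progress at any moment
    return math.ceil(in_flight / concurrent_per_machine)

print(machines_needed(4000, 0.25))  # 1000
```

If a machine could overlap several lookups, the count drops in proportion, so 1,000 is an upper figure under these assumptions.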

In my view a reported 10,000 machines at the plex and elsewhere easily handles a distributed inverted index/repository/etc with a larger DocId.

arthurdaley

12:07 am on Oct 3, 2004 (gmt 0)

I don't think it's handling lookups that is the main problem when adding more than 4.2 billion pages to the index. The biggest bottleneck would be calculating pagerank. At the moment, using the 32 bit system, they can store each page id as a straightforward integer. Adding a workaround to handle more than the 32 bit limit would drastically impair the speed of calculating PR.

Critter

12:23 am on Oct 3, 2004 (gmt 0)

Solution to pagerank: Do iterations over time, as things are crawled/stored. Don't do them all at once. Then pages move up and down in the index slowly over time, not all at once in a "dance".

Critter

12:24 am on Oct 3, 2004 (gmt 0)

I've got a semi-intelligent question:

It seems to me that pagerank, with its "iterations", would be well-suited to calculus, as the pagerank for a particular page or pages clearly would approach a "limit".

Anyone ever done anything with this?
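To make the "limit" idea concrete, here is a small sketch of iterating pagerank until the values stop changing. It uses the commonly published normalized form with a 0.85 damping factor on a made-up three-page graph; it is an illustration under those assumptions, not Google's actual implementation, and it assumes every link target also appears as a key in the graph.

```python
def pagerank(links, damping=0.85, tol=1e-9, max_iter=1000):
    """Repeat the pagerank update until total change drops below tol."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for iteration in range(1, max_iter + 1):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            return rank, iteration
    return rank, max_iter

# Hypothetical toy graph: a links to b and c, b links to c, c links to a.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks, iterations = pagerank(toy_links)
print(iterations)
print(ranks)
```

Each pass changes the values less than the one before, which is the "approaching a limit" behavior described above.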

isitreal

1:31 am on Oct 3, 2004 (gmt 0)

In about 2000, Google was at 6000 machines, with, I think, about 1 billion pages indexed. Obviously hard drives have jumped up in size, so each machine can store more data.

Oh, Critter, you really need to go back and reread the thing before making the types of comments you're making. Either your memory is playing tricks on you, or you just skimmed over this:

"Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits."

Is there something about that sentence that is unclear? Two types of hits, each 2 bytes. That's 2+2 bytes, that's 4 bytes.
