Why does the 'Google Lag' exist?

Trying to understand its purpose.

bakedjake

1:43 am on Sep 29, 2004 (gmt 0)

I had some in-depth discussion this weekend with some friends about the sandbox. Every theory on how to beat it kept coming back to one central problem - no one is sure why it exists.

I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.

I don't buy the explanation that it's intended to be a method of stopping spam. Why? One, it is doing too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that there are already multiple ways of beating the sandbox that all of those spammers are aware of, it doesn't make sense anymore.

So, why does the sandbox exist?

The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?

Critter

3:02 am on Oct 3, 2004 (gmt 0)

Oh man, you're kidding me right?

The two bytes they're talking about are used for the *position of the word* in the document, and have nothing to do with the DocId.

Furthermore, plain hits and fancy hits have some of their bits used for capitalization and such, so the number of bits available for position information is further reduced. If you read further in the paragraph you quoted, you'll see that there are 12 bits of position information (out of 16) for plain hits, and 8 bits of position information (again, out of 16) for fancy hits.

Hits are included in forward barrels, which are shown in the second figure near the aforementioned paragraph. Forward barrel records start with a docid and wordid, then are filled with "hits", each of which is in the hit list format outlined above. The document id length, you'll notice in the paper, is not specified.
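To make the bit budget concrete, here is a rough sketch of how a 16-bit hit could be packed. The field widths (1 capitalization bit, 3 font-size bits, 12 position bits for a plain hit; 1 capitalization bit, font size set to 7 as a flag, 4 type bits, 8 position bits for a fancy hit) follow the paper's description. The order of the fields inside the 16 bits, the clamping of oversized positions, and all function names are assumptions made for illustration only.

```python
# Sketch of the two 16-bit hit layouts. Field WIDTHS follow the paper;
# the field ORDER and the clamping rule are assumptions for illustration.

PLAIN_POS_BITS = 12   # position bits in a plain hit
FANCY_POS_BITS = 8    # position bits in a fancy hit
FANCY_FLAG = 0b111    # font-size value reserved to mark a fancy hit


def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack a plain hit: 1 cap bit | 3 font-size bits | 12 position bits."""
    if not 0 <= font_size < FANCY_FLAG:
        raise ValueError("font_size must be 0..6; 7 marks a fancy hit")
    # Assumption: positions that do not fit are clamped to the largest value.
    position = min(position, (1 << PLAIN_POS_BITS) - 1)
    return (int(capitalized) << 15) | (font_size << 12) | position


def pack_fancy_hit(capitalized: bool, hit_type: int, position: int) -> int:
    """Pack a fancy hit: 1 cap bit | flag 111 | 4 type bits | 8 position bits."""
    if not 0 <= hit_type < 16:
        raise ValueError("hit_type must fit in 4 bits")
    position = min(position, (1 << FANCY_POS_BITS) - 1)
    return (int(capitalized) << 15) | (FANCY_FLAG << 12) | (hit_type << 8) | position


def unpack_hit(hit: int) -> dict:
    """Decode a 16-bit hit back into its fields."""
    capitalized = bool(hit >> 15)
    font_size = (hit >> 12) & 0b111
    if font_size == FANCY_FLAG:
        return {"kind": "fancy", "capitalized": capitalized,
                "type": (hit >> 8) & 0xF, "position": hit & 0xFF}
    return {"kind": "plain", "capitalized": capitalized,
            "font_size": font_size, "position": hit & 0xFFF}
```

Either way the whole hit fits in two bytes, and none of those bits say anything about the docid, which lives in the enclosing barrel record.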

If you don't understand the paper (clearly, you don't) please don't quote from it. :)

isitreal

3:10 am on Oct 3, 2004 (gmt 0)

Sorry critter, my mistake, it's getting late. Thanks for the correction though. The point is to get a working hypothesis that explains the most in the simplest way, same old same old. You're right, it's not specified, I spaced that one. However, since we can pretty easily see for ourselves the state of the page index count, it's a moot point I'd say.

Assuming a 32 bit docID is hardly unreasonable given a 32 bit OS and processor, and the current page count on the g index page. Since the numbers add up fine, I don't have any real problem with it, plus there's the real bonus that I can explain certain things to my own satisfaction, then go onto more interesting problems, since there's very little I can do about this current situation.
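For anyone who wants to check the arithmetic behind that assumption, a minimal sketch follows. The 32-bit width is the hypothesis under discussion, not something the paper states, and the function names are made up for this example. Plug in whatever page count the index page currently shows.

```python
def max_doc_ids(bits: int) -> int:
    """Number of distinct document ids an unsigned field of this width can hold."""
    return 1 << bits


def fits(page_count: int, bits: int = 32) -> bool:
    """True if page_count documents can each get a unique id of the given width."""
    return page_count <= max_doc_ids(bits)


# A 32-bit field allows 4,294,967,296 distinct ids.
print(f"{max_doc_ids(32):,}")
```

A reported count just under that ceiling is consistent with the hypothesis, but it does not prove it.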

The test of the assumption is how well it explains the behavior you're seeing, that's up to anyone to do for themselves, you know, think for yourself and all that quaint stuff. I'm not trying to prove any point, it's not necessary for me, my livelihood doesn't depend on being right or wrong about basic points like this, I don't have any ego involvement in this either, it's a purely pragmatic thing.

SlyOldDog

6:50 am on Oct 3, 2004 (gmt 0)

>>It seems to me that pagerank, with its "iterations" would be well-suited to calculus, as the pagerank for a particular page or pages clearly would approach a "limit".

For calculus you need a formula. That formula would change for each web page would it not? Different numbers of backlinks, different damping factor. And as soon as a backlink is added or removed, it changes again.

I don't think using calculus would make it any faster than running a few iterations. Let's face it, the number is not worth much anyhow, so you don't need to know it exactly.
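To illustrate the "few iterations" point, here is a small sketch of the textbook iterative PageRank calculation on a toy graph. It is not Google's production code. The graph, the function name, and the handling of pages without outlinks are assumptions for the example. In the published formulation the damping factor is a single constant (commonly 0.85) rather than a per-page value, and that is what the sketch uses.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Iterative PageRank on a dict {page: [pages it links to]}.

    Assumes every link target is also a key of `links`.
    Returns the final ranks and the total change made by each iteration.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    deltas = []
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outlinks spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        deltas.append(sum(abs(new_rank[p] - rank[p]) for p in pages))
        rank = new_rank
    return rank, deltas


toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks, deltas = pagerank(toy_links)
print(ranks)
print(deltas[0], deltas[-1])  # the per-iteration change shrinks as ranks settle
```

The change per pass shrinks by at least the damping factor each time, so a modest number of passes already gives values that are close to the limit.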

claus

1:17 am on Oct 4, 2004 (gmt 0)

...there are 10 types of people in this world...

Geez... 183 posts, two hours of reading, and it got me ...here?

Although the Google capacity debate is interesting (and i see some signs that could be interpreted to support theories above, although i personally interpret them in an entirely different direction) i do not believe that this debate is on topic here.

Even if Google had thousands of "indexes", this still does not change that the affected sites are indexed, in one index or another. As others have remarked, this is a question of ranking.

The topic is Why does the 'Google Lag' exist?

To answer that question, you would have to establish what the "Google Lag" is first. Then, of course, you would have to agree that it does exist. For simplicity, let's assume that it does exist.

Then, what is it? The thing most people here tend to agree on is that:

New sites (ie. pages on new sites) get indexed as usual, but
they are, sometimes, quite slow to get PR (which isn't that unusual nowadays), and
regardless of PR they seem to rank okay for "uncompetitive queries", and
they don't rank well for "competitive queries"

Ie. "competing just got tougher - for new sites". Now, the question "Why does this exist?" could be discussed along the lines of eg.

Why does this (seem to) happen for new sites only?
Would this be intentional or would it be a side effect of something else?
What do "old sites" have that "new sites" don't?
Why would anyone want to design a mechanism that did exactly this?
If somebody designed a mechanism that did something else, but had this as one of its effects, what could that mechanism possibly be?

There have been a few interesting suggestions already, imho. Personally, i lean towards the "side effect" theory, eg. that this is a symptom, not the real issue, or "just the tip of the iceberg", so to speak.

As to "Why", that's close to a no-brainer to me, as i'm convinced that it's all about improving the quality of search, believe it or not ;)

The "How" question is the more interesting one, imho - and it just might lead to another definition of what it really is. Just to make one point in that direction: "How could search be improved if new sites rank badly for a while?", or "What types of search improvements would introduce a lag/latency in the ranking of new sites?"

So, back on topic... i hope...

Oliver Henniges

7:09 am on Oct 4, 2004 (gmt 0)

Sry for probably having driven this thread OT.

> What do "old sites" have that "new sites" don't?

> they are, sometimes, quite slow to get PR

Sry again for not doing research on that, but is there at least ANY new site that got PR during the past three months?

> Why would anyone want to design a mechanism that did exactly this?

Because it probably was not designed on purpose but is a hint of a severe technical problem within Google's present architecture. This is the reason why I have chosen this thread for my question, because I thought it might be a simple and reasonable answer to "why does the lag exist?"

Only popping in here and there sporadically and not really getting entangled in experts' discussions in the past two years, I have, with growing astonishment, watched a strange trend towards speculating about the most absurd facets of specific filters, constraints or punishments in Google's algorithms.

On a very fundamental level all this speculating comes quite close to trying to find a solution for the halting problem of a Turing machine, and sometimes reminds me of the interpreting Delphic priests of ancient Greece.

With Google now being on the markets we are faced with quite a different situation. It does not only imply that Google has more money and power than ever before, but also that she is much more vulnerable and reliant on such things as public (= customers') opinion, press releases, rumours and the like. The times when webmasters had to beg "please googlegirl help" might be over. Heads up.

Anyways: EOT for the "four-byte-aspect" within this thread, and if Scarecrow feels there is new evidence for discussing this more deeply, he might start a new one on that.

Oliver

BeeDeeDubbleU

7:58 am on Oct 4, 2004 (gmt 0)

GOOGLE FLAW

(tell the press) ;)

charlier

8:17 am on Oct 4, 2004 (gmt 0)

One point about 'new' sites: it also seems to affect sites that are new only in the domain name. I moved an old site and all its links to a new domain and boom, gone from the SERPs for terms that I was on the first page for on the old domain. This has been so since I moved the site in July: all pages indexed, backlinks showing (>10000), top spot for very minor keywords and not in the top 1000 for the major keywords. Also, this isn't a real 'money' competitive area.

randle

2:00 pm on Oct 4, 2004 (gmt 0)

The thing I keep coming back to when contemplating all the theories, especially "it's a side effect", is that for all practical purposes no one has ever gotten out of this thing. It is not just a prolonged entry into the results for competitive terms, it's a brick wall.

Ever since March 2004, the great majority of us have had no success launching a new site and then gaining organic traffic from the very key words the site was designed for. I know some have claimed to have gotten out of the sandbox, and my hat's off to them, but those cases are rare.

Many of the fine explanations put forth don't address this aspect; in seven months no one has gained significant organic traffic from a newly launched site.

renee

3:28 pm on Oct 4, 2004 (gmt 0)

>>"no one has ever gotten out of this thing. ....but those cases are rare."

I believe my explanation answers these questions. since this thing is due to an out-of-capacity problem in google's main index, the only way new sites can get in will be if some old sites are removed. and since google cannot remove too many of these too fast, you can see why only very few new sites get in.

i also think google randomly selects which sites get out of the main index and randomly selects which new sites get into the main index. this explains why some sites get in in a few months and others take much longer.

saying that a site gets into the main index means it now has the potential of ranking in the serps (to answer those who say it is a ranking issue, not an indexing issue).

claus, thanks for regurgitating so elegantly the ideas i've posted earlier. good job.

leveldisc

3:49 pm on Oct 4, 2004 (gmt 0)

So, back on topic... i hope...

So do I.

I believe my explanation answers these questions. since this thing is due to an out-of-capacity problem in google's main index, the only way new sites can get in will be if some old sites are removed. and since google cannot remove too many of these too fast, you can see why only very few new sites get in.

Back off topic then. I don't believe anyone has had any problems getting sites indexed.
