Why does the 'Google Lag' exist?

Trying to understand its purpose.

bakedjake

1:43 am on Sep 29, 2004 (gmt 0)

I had some in-depth discussion this weekend with some friends about the sandbox. Every theory on how to beat it kept coming back to one central problem - no one is sure why it exists.

I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.

I don't buy the explanation that it's intended as a method of stopping spam. Why? One, it does too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that there are already multiple ways of beating the sandbox that all of those spammers are aware of, the explanation no longer makes sense.

So, why does the sandbox exist?

The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?

Stark

4:19 pm on Oct 4, 2004 (gmt 0)

Whilst the whole idea of running out of docIDs sounds very unlikely to me (and I'm not really that qualified to comment) the main argument against it seems to be that sites are in the index fine, it's ranking that is the problem - therefore the docIDs proposal is false.

However, what if the lack of IDs only occurred in relation to PR calculations? Pages would still be indexed ok, but lack of docIDs for PR values would mean that they would basically have no PR assigned. This would then possibly tally with the lack of a toolbar PR update, and explain the indexing, but not ranking?

Or am I talking out of my a*se?

SlyOldDog

4:40 pm on Oct 4, 2004 (gmt 0)

As pointed out above a few times.

renee

5:20 pm on Oct 4, 2004 (gmt 0)

>>Back off topic then. I don't believe anyone has had any problems getting sites indexed.

let me repeat. if my theory is correct, the problem is getting sites into the MAIN INDEX. just like pages in the supplemental index are in the "index" but do not rank. why is this so difficult to understand?

once you accept that the sandbox is a separate index (JUST LIKE the supplemental index) then all the symptoms fall into place!

randle

5:21 pm on Oct 4, 2004 (gmt 0)

Stark,

What you're saying makes sense to me. I don't know what to make of the whole docID theory but as far as Page Rank is concerned that is a common theme with the sites we have in the sandbox. None of them has been granted PR, I don't believe. Now it's possible they have, and I just don't know about it, as the tool bar is a suspect character these days.

However, we have four sites launched since March 2004 and none has shown any green on the tool bar and all are most definitely in the sandbox. For one of them that's seven months not being granted something that we used to be able to obtain fairly quickly.

dirkz

5:27 pm on Oct 4, 2004 (gmt 0)

> What does "old sites" have, that "new sites" doesn't?

We're talking about new sites ranking for all sorts of phrases but *not* for competitive ones.

If the lag springs from intention, this could make sense: What is so exciting about the 10 millionth widget site [take the most competitive medical term you can imagine]?

Most of the time the answer is "nothing". You can't expect anything groundshaking.

Whereas for niches, that is less competitive phrases, you can't predict anything. You wouldn't want to put a filter on something you don't know at all. Because every now and then, a phrase suddenly appears in Google's index that has never been there before.

leveldisc

5:42 pm on Oct 4, 2004 (gmt 0)

> let me repeat. if my theory is correct, the problem is getting sites into the MAIN INDEX. just like pages in the supplemental index are in the "index" but do not rank.

I have much older sites affected by the sandbox. These were indexed well over a year ago. SEO on these sites started in April / May with little effect, except they all rank in the top ten for 'allin' searches - common behaviour for sandboxed seo'd sites.

leveldisc

5:49 pm on Oct 4, 2004 (gmt 0)

> What is so exciting about the 10 millionth widget site [take the most competitive medical term you can imagine]?

Absolutely nothing. These are also the most spammed terms.

Previously the sandbox was thought to last 2-3 months. Maybe this was just the initial application of the effect. I doubt a 2-3 month delay would stop anyone, particularly those who prefer the murkier side of seo.

BillyS

7:49 pm on Oct 4, 2004 (gmt 0)

Here is my guess on the lag... because this is how I would want to design a system this large...

Google is maintaining multiple databases and, depending on the query, it redirects the query to the appropriate database. The primary database responds to a list of "common" queries and is updated less frequently than the other databases. Google does this for performance reasons - fast response, low system demand.

The secondary database responds to any query not found on the "common" query list. This database is updated more frequently than the primary database and contains a much larger set of data.

Google does this because it wants fast response to any query the end user submits. Why waste horsepower on the common stuff? It's the Pareto Principle and it makes for efficient database design.

The lag exists because Google has a threshold that a site must meet before it is contained in the primary database.

- Some sites are always in there because they beat the threshold by a long shot.
- Some sites bounce in and out because they are on the edge of acceptance.
- New sites have a hard time making the threshold.

Google wants to limit the number of pages in the primary database, so the threshold can move each month. The better the primary database, the harder it gets for new sites to gain entry.
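To make the guess concrete, here is a minimal sketch of the routing and threshold idea in Python. It is only a toy model of the theory described above, not anything known about how Google works: the class, the method names, the scores and the example sites are all hypothetical, and the search step does no real text matching.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TwoTierIndex:
    """Toy model of the two-database guess; every name here is hypothetical."""

    common_queries: set[str]   # the list of "common" queries
    threshold: float           # score a site needs to enter the primary database
    primary: dict[str, float] = field(default_factory=dict)    # site -> score
    secondary: dict[str, float] = field(default_factory=dict)  # site -> score

    def add_site(self, site: str, score: float) -> None:
        # Every site lands in the larger, frequently updated secondary database.
        self.secondary[site] = score

    def rebuild_primary(self) -> None:
        # The infrequent update: only sites at or above the threshold get in.
        self.primary = {
            site: score
            for site, score in self.secondary.items()
            if score >= self.threshold
        }

    def search(self, query: str) -> list[str]:
        # Common queries are answered from the primary database only;
        # anything else falls through to the secondary database.
        source = self.primary if query in self.common_queries else self.secondary
        return sorted(source, key=source.get, reverse=True)


index = TwoTierIndex(common_queries={"widgets"}, threshold=0.5)
index.add_site("old-site.example", 0.9)
index.add_site("new-site.example", 0.2)
index.rebuild_primary()

common_results = index.search("widgets")
uncommon_results = index.search("blue left-handed widgets")
print(common_results)
print(uncommon_results)
```

In this model the new site shows up for the uncommon query but is missing for the common one, which is the symptom people describe. Raising `threshold` before each `rebuild_primary()` call would mimic the moving bar.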

Google even tells us how large the primary database is, it's at the bottom of their query page. That language is much better than...

Searching 4,285,199,774 web pages unless you submit an uncommon query. In that case we look to our secondary database, which holds even more web pages.

Here's more for you 32 versus 64 bit folks:
Google does this because they have a large investment in 32 bit machines and they want to use those computers. The secondary database is a 64 bit design using recently purchased machines that are more expensive and computationally more powerful. However, they do not have enough of these machines to support the sheer number of "common" queries they receive.
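For the 32 bit angle, a quick arithmetic check shows how close the page count quoted above sits to the ceiling of a 32 bit counter. This assumes docIDs are unsigned 32 bit integers, which is only an assumption; nothing in this thread confirms it.

```python
# Assumption: docIDs are unsigned 32-bit integers (not confirmed anywhere).
DOC_ID_BITS = 32
max_doc_ids = 2 ** DOC_ID_BITS      # largest number of distinct 32-bit IDs

# Page count quoted from the bottom of Google's query page.
pages_reported = 4_285_199_774

headroom = max_doc_ids - pages_reported
print(f"{max_doc_ids:,} possible IDs")
print(f"{headroom:,} unused")
```

Under that assumption the reported count uses all but about 9.8 million of the 4,294,967,296 possible IDs.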

SlyOldDog

8:18 pm on Oct 4, 2004 (gmt 0)

Nice theory, BillyS, but why update the smaller index less often than the big one? Sounds like a misallocation of resources. Especially if the small index is the one most people see (most common searches).

caveman

8:38 pm on Oct 4, 2004 (gmt 0)

Got the threshold part right, however. ;-)
