Why does the 'Google Lag' exist?
Trying to understand its purpose.
bakedjake




msg:112198
 1:43 am on Sep 29, 2004 (gmt 0)

I had some in-depth discussion this weekend with some friends about the sandbox. Every theory on how to beat it kept coming back to one central problem - no one is sure why it exists.

I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.

I don't buy the explanation that it's intended to be a method of stopping spam. Why? One, it's doing too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that there are already multiple ways of beating the sandbox that all of those spammers are aware of, it doesn't make sense anymore.

So, why does the sandbox exist?

The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?

 

Critter




msg:112378
 3:02 am on Oct 3, 2004 (gmt 0)

Oh man, you're kidding me right?

The two bytes they're talking about are used for the *position of the word* in the document, and have nothing to do with the DocId.

Furthermore, plain hits and fancy hits have some of their bits used for capitalization and such, so the number of bits available for position information is further reduced. If you read further in the paragraph you quoted, you'll see that there are 12 bits of position information (out of 16) for plain hits, and 8 bits of position information (again, out of 16) for fancy hits.

Hits are included in forward barrels, which are shown in the second figure near the aforementioned paragraph. Forward barrel records start with a docID and wordID, then are filled with "hits", each of which is in the hit list format outlined above. The document ID length, you'll notice, is not specified in the paper.
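A rough Python sketch of that 16-bit hit packing (the field widths follow the paper's description; the exact bit ordering within the word is just an assumption for illustration):

    # Sketch of the 16-bit "hit" packing described in the Brin/Page paper.
    # Field widths follow the paper; the bit order within the word is assumed.
    def pack_plain_hit(capitalized, font_size, position):
        # Plain hit: 1 bit capitalization, 3 bits font size, 12 bits position.
        position = min(position, 4095)          # larger positions are collapsed
        return (int(capitalized) << 15) | ((font_size & 0x7) << 12) | position

    def pack_fancy_hit(capitalized, hit_type, position):
        # Fancy hit: 1 bit capitalization, font size field set to 7, 4 bits type, 8 bits position.
        position = min(position, 255)
        return (int(capitalized) << 15) | (0x7 << 12) | ((hit_type & 0xF) << 8) | position

    print(hex(pack_plain_hit(True, 2, 100)))    # 0xa064 (still only 16 bits; no room for a docID)
    print(hex(pack_fancy_hit(False, 1, 30)))    # 0x711e

Either way, the 16 bits are spent entirely on the hit itself; the docID lives in the barrel record, not in the hit.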

If you don't understand the paper (clearly, you don't) please don't quote from it. :)

isitreal




msg:112379
 3:10 am on Oct 3, 2004 (gmt 0)

Sorry Critter, my mistake; it's getting late. Thanks for the correction, though. The point is to get a working hypothesis that explains the most in the simplest way, same old same old. You're right, it's not specified; I spaced on that one. However, since we can pretty easily see the state of the page index count for ourselves, it's a moot point, I'd say.

Assuming a 32-bit docID is hardly unreasonable given a 32-bit OS and processor, and the current page count on the G index page. Since the numbers add up fine, I don't have any real problem with it, plus there's the real bonus that I can explain certain things to my own satisfaction and then go on to more interesting problems, since there's very little I can do about the current situation.
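To spell out that arithmetic (a quick sketch in Python; the 32-bit docID is itself only an assumption, not something Google has confirmed):

    # Why a 32-bit docID lines up with the page count shown on Google's homepage.
    max_docids = 2 ** 32                 # 4,294,967,296 distinct IDs in an unsigned 32-bit integer
    reported_pages = 4_285_199_774       # "Searching 4,285,199,774 web pages"
    print(max_docids - reported_pages)   # 9,767,522 IDs of headroom left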

The test of the assumption is how well it explains the behavior you're seeing; that's for anyone to do for themselves, you know, think for yourself and all that quaint stuff. I'm not trying to prove any point. My livelihood doesn't depend on being right or wrong about basic points like this, and I don't have any ego involvement in this either; it's a purely pragmatic thing.

SlyOldDog




msg:112380
 6:50 am on Oct 3, 2004 (gmt 0)

>>It seems to me that pagerank, with its "iterations" would be well-suited to calculus, as the pagerank for a particular page or pages clearly would approach a "limit".

For calculus you need a formula. That formula would change for each web page, would it not? Different numbers of backlinks, different damping factor. And as soon as a backlink is added or removed, it changes again.

I don't think using calculus would make it any faster than running a few iterations. Let's face it, the number is not worth much anyhow, so you don't need to know it exactly.
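For reference, the "few iterations" being discussed look roughly like this (a minimal power-iteration sketch over a hypothetical four-page web; the link graph is made up, and 0.85 is just the commonly cited damping factor):

    # Minimal PageRank power iteration; values settle toward a limit after a few dozen passes.
    links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}   # page -> pages it links to (hypothetical)
    n, d = 4, 0.85                                   # number of pages, damping factor
    pr = [1.0 / n] * n

    for _ in range(50):
        new = [(1 - d) / n] * n
        for page, outlinks in links.items():
            for target in outlinks:
                new[target] += d * pr[page] / len(outlinks)
        pr = new

    print([round(x, 4) for x in pr])                 # the values stop changing: that's the "limit"

Adding or removing a backlink just changes the link graph and you re-run the iterations; no per-page closed-form formula is needed.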

claus




msg:112381
 1:17 am on Oct 4, 2004 (gmt 0)

...there are 10 types of people in this world...

Geez... 183 posts, two hours of reading, and it got me ...here?

Although the Google capacity debate is interesting (and i see some signs that could be interpreted to support theories above, although i personally interpret them in an entirely different direction) i do not believe that this debate is on topic here.

Even if Google had thousands of "indexes", this still does not change that the affected sites are indexed, in one index or another. As others have remarked, this is a question of ranking.

The topic is Why does the 'Google Lag' exist?

To answer that question, you would have to establish what the "Google Lag" is first. Then, of course, you would have to agree that it does exist. For simplicity, let's assume that it does exist.

Then, what is it? The thing most people here tend to agree on is that:

  • New sites (ie. pages on new sites) get indexed as usual, but
  • they are, sometimes, quite slow to get PR (which isn't that unusual nowadays), and
  • regardless of PR they seem to rank okay for "uncompetitive queries", and
  • they don't rank well for "competitive queries"

Ie. "competing just got tougher - for new sites". Now, the question "Why does this exist?" could be discussed along the lines of eg.

  • Why does this (seem to) happen for new sites only?
  • Would this be intentional, or would it be a side effect of something else?
  • What do "old sites" have that "new sites" don't?
  • Why would anyone want to design a mechanism that did exactly this?
  • If somebody designed a mechanism that did something else, but had this as one of its effects, what could that mechanism possibly be?

There have been a few interesting suggestions already, imho. Personally, i lean towards the "side effect" theory, eg. that this is a symptom, not the real issue, or "just the tip of the iceberg", so to speak.

As to "Why", that's close to a no-brainer to me, as i'm convinced that it's all about improving the quality of search, believe it or not ;)

The "How" question is the more interesting one, imho - and it just might lead to another definition of what it really is. Just to make one point in that direction: "How could search be improved if new sites are ranking bad for a while?", or "What types of search improvements would introduce a lag/latency in the ranking of new sites?"

So, back on topic... i hope...

Oliver Henniges




msg:112382
 7:09 am on Oct 4, 2004 (gmt 0)

Sorry for probably having driven this thread OT.

> What do "old sites" have that "new sites" don't?

PR

> they are, sometimes, quite slow to get PR

Sorry again for not doing the research on that, but has there been even ONE that got PR during the past three months?

> Why would anyone want to design a mechanism that did exactly this?

Because it probably wasn't designed on purpose at all, but is rather a hint of a severe technical problem within Google's present architecture. This is the reason why I chose this thread for my question: I thought it might be a simple and reasonable answer to "why does the lag exist?"

Only popping in sporadically here and there and not really getting entangled in experts' discussions over the past two years, I have watched, with growing astonishment, a strange trend towards speculating about the most absurd facets of specific filters, constraints or punishments in Google's algorithms.

On a very fundamental level all this speculating comes quite close to trying to find a solution to the halting problem for a Turing machine, and sometimes reminds me of interpreting Delphic priests in ancient Greece.

With Google now being on the public markets we are faced with quite a different situation. It does not only imply that Google has more money and power than ever before, but also that it is much more vulnerable and reliant on such things as public (i.e. customer) opinion, press releases, rumours and the like. The times when webmasters had to beg "please GoogleGirl, help" might be over. Heads up.

Anyway: EOT for the "four-byte aspect" within this thread; if Scarecrow feels there is new evidence worth discussing more deeply, he might start a new thread on that.

Oliver

BeeDeeDubbleU




msg:112383
 7:58 am on Oct 4, 2004 (gmt 0)

GOOGLE FLAW

(tell the press) ;)

charlier




msg:112384
 8:17 am on Oct 4, 2004 (gmt 0)

One point about 'new' sites: it also seems to affect sites that are new only in the domain name. I moved an old site and all its links to a new domain and, boom, gone from the SERPs for terms I was on the first page for with the old domain. It has been like this since I moved the site in July: all pages indexed, backlinks showing (>10,000), top spot for very minor keywords, and not in the top 1000 for the major keywords. Also, this isn't even a really competitive 'money' area.

randle




msg:112385
 2:00 pm on Oct 4, 2004 (gmt 0)

The thing I keep coming back to when contemplating all the theories, especially "it's a side effect", is that for all practical purposes no one has ever gotten out of this thing. It is not just a prolonged entry into the results for competitive terms, it's a brick wall.

Ever since March 2004, for the great majority of us, no one has had success launching a new site and then gaining organic traffic from the very keywords they designed the site for. I know some have claimed to have gotten out of the sandbox, and my hat's off to them, but those cases are rare.

Many of the fine explanations put forth don't address this aspect: in seven months no one has gained significant organic traffic from a newly launched site.

renee




msg:112386
 3:28 pm on Oct 4, 2004 (gmt 0)

>>"no one has ever gotten out of this thing. ....but those cases are rare."

I believe my explanation answers these questions. Since this thing is due to an out-of-capacity problem in Google's main index, the only way new sites can get in is if some old sites are removed. And since Google cannot remove too many of these too fast, you can see why some new sites get in, but very few.

I also think Google randomly selects which sites drop out of the main index and which new sites get into the main index. This explains why some sites get in within a few months and others take much longer.

Saying a site gets into the main index means it now has the potential of ranking in the SERPs (to answer those who say it is a ranking issue, not an indexing issue).

Claus, thanks for regurgitating so elegantly the ideas I posted earlier. Good job.

leveldisc




msg:112387
 3:49 pm on Oct 4, 2004 (gmt 0)

So, back on topic... i hope...

So do I.

I believe my explanation answers these questions. Since this thing is due to an out-of-capacity problem in Google's main index, the only way new sites can get in is if some old sites are removed. And since Google cannot remove too many of these too fast, you can see why some new sites get in, but very few.

Back off topic then. I don't believe anyone has had any problems getting sites indexed.

Stark




msg:112388
 4:19 pm on Oct 4, 2004 (gmt 0)

Whilst the whole idea of running out of docIDs sounds very unlikely to me (and I'm not really that qualified to comment), the main argument against it seems to be that sites get into the index fine and it's ranking that is the problem; therefore the docID proposal is false.

However, what if the lack of IDs only occurred in relation to PR calculations? Pages would still be indexed OK, but a lack of docIDs for PR values would mean they basically had no PR assigned. Would that possibly tally with the lack of a toolbar PR update, and explain the indexing but not the ranking?

Or am I talking out of my a*se?

SlyOldDog




msg:112389
 4:40 pm on Oct 4, 2004 (gmt 0)

As pointed out above a few times.

renee




msg:112390
 5:20 pm on Oct 4, 2004 (gmt 0)

>>Back off topic then. I don't believe anyone has had any problems getting sites indexed.

Let me repeat. If my theory is correct, the problem is getting sites into the MAIN INDEX, just like pages in the supplemental index are in the "index" but do not rank. Why is this so difficult to understand?

Once you accept that the sandbox is a separate index (JUST LIKE the supplemental index), all the symptoms fall into place!

randle




msg:112391
 5:21 pm on Oct 4, 2004 (gmt 0)

Stark,

What you're saying makes sense to me. I don't know what to make of the whole docID theory, but as far as Page Rank is concerned, that is a common theme with the sites we have in the sandbox. None of them has been granted PR, I believe. Now it's possible they have and I just don't know about it, as the toolbar is a suspect character these days.

However, we have four sites launched since March 2004, and none has shown any green on the toolbar, and all are most definitely in the sandbox. For one of them that's seven months without being granted something we used to be able to obtain fairly quickly.

dirkz




msg:112392
 5:27 pm on Oct 4, 2004 (gmt 0)

> What do "old sites" have that "new sites" don't?

We're talking about new sites ranking for all sorts of phrases but *not* for competitive ones.

If the lag springs from intention, this could make sense: What is so exciting about the 10 millionth widget site [take the most competitive medical term you can imagine]?

Most of the time the answer is "nothing". You can't expect anything groundshaking.

Whereas for niches, that is, less competitive phrases, you can't predict anything. You wouldn't want to put a filter on something you don't know at all, because every now and then a phrase suddenly appears in Google's index that has never been there before.

leveldisc




msg:112393
 5:42 pm on Oct 4, 2004 (gmt 0)

Let me repeat. If my theory is correct, the problem is getting sites into the MAIN INDEX, just like pages in the supplemental index are in the "index" but do not rank.

I have much older sites affected by the sandbox. These were indexed well over a year ago. SEO on these sites started in April/May with little effect, except that they all rank in the top ten for 'allin' searches, which is common behaviour for sandboxed, SEO'd sites.

leveldisc




msg:112394
 5:49 pm on Oct 4, 2004 (gmt 0)

What is so exciting about the 10 millionth widget site [take the most competitive medical term you can imagine]?

Absolutely nothing. These are also the most spammed terms.

Previously the sandbox was thought to last 2-3 months. Maybe that was just the initial application of the effect. I doubt a 2-3 month delay would stop anyone, particularly those who prefer the murkier side of SEO.

BillyS




msg:112395
 7:49 pm on Oct 4, 2004 (gmt 0)

Here is my guess on the lag... because this is how I would want to design a system this large...

Google is maintaining multiple databases and, depending on the query, it redirects the query to the appropriate database. The primary database responds to a list of "common" queries and is updated less frequently than the other databases. Google does this for performance reasons: fast response, low system demand.

The secondary database responds to any query not found on the "common" query list. This database is updated more frequently than the primary database and contains a much larger set of data.

Google does this because it wants fast response to any query the end user submits. Why waste horsepower on the common stuff? It's the Pareto Principle and it makes for efficient database design.

The lag exists because Google has a threshold that a site must meet before it is contained in the primary database.

- Some sites are always in there because they beat the threshold by a long shot.
- Some sites bounce in and out because they are on the edge of acceptance.
- New sites have a hard time making the threshold.

Google wants to limit the number of pages in the primary database, so the threshold can move each month. The better the primary database, the harder it gets for new sites to gain entry.

Google even tells us how large the primary database is; it's at the bottom of their query page. That language is much better than...

Searching 4,285,199,774 web pages unless you submit an uncommon query. In that case we look to our secondary database, which holds even more web pages.
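In code terms, the routing being described would look something like this (a toy sketch; the query list and index names are hypothetical, not anything Google has confirmed):

    # Toy sketch of a two-tier index: "common" queries hit a smaller, less frequently
    # refreshed primary index; everything else falls through to a larger secondary one.
    COMMON_QUERIES = {"widgets", "blue widgets", "cheap widgets"}   # hypothetical "common" list

    def route_query(query):
        if query.lower().strip() in COMMON_QUERIES:
            return "primary index (smaller, refreshed less often, fast to serve)"
        return "secondary index (larger, refreshed more often)"

    print(route_query("blue widgets"))                         # primary index
    print(route_query("left-handed titanium widget spanner"))  # secondary index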

Here's more for you 32 versus 64 bit folks:
Google does this because they have a large investment in 32 bit machines and they want to use those computers. The secondary database is a 64 bit design using recently purchased machines that are more expensive and computationally more powerful. However, they do not have enough of these machines to support the sheer number of "common" queries they receive.

SlyOldDog




msg:112396
 8:18 pm on Oct 4, 2004 (gmt 0)

Nice theory, BillyS, but why update the smaller index less often than the big one? Sounds like a misallocation of resources, especially if the small index is the one most people see (most common searches).

caveman




msg:112397
 8:38 pm on Oct 4, 2004 (gmt 0)

Got the threshold part right, however. ;-)

BillyS




msg:112398
 8:47 pm on Oct 4, 2004 (gmt 0)

Nice theory, BillyS, but why update the smaller index less often than the big one?

Updated less frequently to stabilize the results and conserve resources. Common queries should have been answered many times over, no need to rush in with new answers.

It also makes for a better user experience. They type in a common two-word query and get virtually the same results a month later. The end user appreciates this because it allows them to find things again. This raises confidence in the results, thereby creating loyalty to Google.

graywolf




msg:112399
 8:50 pm on Oct 4, 2004 (gmt 0)

OK, for all of the people who say Google has reached capacity, how is it that you can add a new page to an "old" website and it ranks right away?

webdude




msg:112400
 8:55 pm on Oct 4, 2004 (gmt 0)

Here's more for you 32 versus 64 bit folks:
Google does this because they have a large investment in 32 bit machines and they want to use those computers. The secondary database is a 64 bit design using recently purchased machines that are more expensive and computationally more powerful. However, they do not have enough of these machines to support the sheer number of "common" queries they receive.

So how much are these machines? The 2 guys just got 64 million for the IPO. It would seem they could spring for the hardware.

renee




msg:112401
 9:45 pm on Oct 4, 2004 (gmt 0)

>>Ok for all of the people who say Google has reached capacity, how is it that you can add a new page to an "old" website and rank right away?

Let me hazard a guess. At the time G ran out of capacity, its solution was to create the supplemental index. It got to the point that so many pages were being added, particularly by new sites, that it became unreasonable to just shove pages into the supplemental index claiming they only qualify for "weird" queries, as GG claimed. So Google had to create another solution: a new index where it can quarantine new sites/pages.

Since old sites remain in the main index, new pages added to them also go into the main index, and therefore participate in the PageRank algorithm and are able to rank. However, note that pages of old sites continue to disappear to make room for these new pages from old sites. That's the reason why Google has not updated the "©2004 Google - Searching 4,285,199,774 web pages" figure, which obviously applies to the main index. So the main index continues to be out of capacity.

I have a fairly large group of sites and I've been adding a significant number of pages. However, I've noticed that my total number of pages in the main index (excluding supplementals) is not increasing at the same rate as new pages are being added. I don't believe Google limits the number of pages by domain; it's just that my group of sites is exhibiting the law of averages.

BeeDeeDubbleU




msg:112402
 9:45 pm on Oct 4, 2004 (gmt 0)

The facts that we do know for sure ...

Fact 1. New sites get indexed within a day or two.

Fact 2. New pages on existing sites get indexed the same way (and get found.)

Think about it. There is no real evidence to suggest that this is a capacity problem. This is surely not why it exists.

Now is it a Google defect? That's another story ...

renee




msg:112403
 10:01 pm on Oct 4, 2004 (gmt 0)

>>Fact 1. New sites get indexed within a day or two.

YES. They go to the sandbox index (or database).

>>Fact 2. New pages on existing sites get indexed the same way (and get found.)

YES. They go to the main index (or database); that's why they participate in the PageRank calculation and are able to rank in the SERPs!

This is a solution to the capacity problem in the same way that the supplemental index was created as a solution to the same problem. See my post above.
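Put as a sketch, that model amounts to the following (purely speculative; the function and flags are hypothetical and only illustrate the idea of the sandbox as a separate index, like the supplemental index):

    # Which index a newly crawled page lands in under this theory.
    def assign_index(site_already_in_main_index, main_index_full):
        if site_already_in_main_index:
            return "main index"      # new pages on old sites join PageRank and can rank
        if main_index_full:
            return "sandbox index"   # new sites are quarantined until space frees up
        return "main index"

    print(assign_index(site_already_in_main_index=False, main_index_full=True))   # sandbox index
    print(assign_index(site_already_in_main_index=True,  main_index_full=True))   # main index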

BillyS




msg:112404
 10:05 pm on Oct 4, 2004 (gmt 0)

So how much are these machines? The 2 guys just got 64 million for the IPO. It would seem they could spring for the hardware.

The point is not how much the new machines will cost. The point is that they do not have sufficient reason to abandon the "old" 32 bit machines.

cabbie




msg:112405
 10:45 pm on Oct 4, 2004 (gmt 0)

Really nice theories, BillyS and Renee.
I have no clue whether they are right or not, but you have baffled me with enough science to make it sound plausible.

leveldisc




msg:112406
 10:56 pm on Oct 4, 2004 (gmt 0)

OK Renee.

In your model

1. How come I get a new site A to rank above old site B for some searches, but it's the other way round for other searches?

2. Why do new sites appear at the top of serps for the allin commands?

3. Why did my PR get updated in April for a sandboxed site?

4. Why do sites in the sandbox index appear in link:www.oldsite.com results from the main index?

and so on.

Marcia




msg:112407
 11:09 pm on Oct 4, 2004 (gmt 0)

>>4. Why do sites in the sandbox index appear in link:www.oldsite.com results from the main index?

Exactly. A link from a domain not even registered until July 1, 2004 shows up for link:www.mysite.com

5. Why are sites registered long ago, indexed and ranking for well over a year, now exhibiting some of the identical symptoms as the sandboxed sites, except that their PR shows because of having been in the index prior to the TBPR lag?

What is the common denominator (or denominators) between the sandbox and Florida?

renee




msg:112408
 11:33 pm on Oct 4, 2004 (gmt 0)

>>1. How come I get a new site A to rank above old site B for some searches, but it's the other way round for other searches?

There are two possibilities:
- If new site A is truly in the sandbox index, then the only way both old and new sites appear in the same SERPs is if the query is non-competitive. I have seen queries yielding 25,000+ results which are a mixture of supplemental and non-supplemental pages. When this happens, the ordering/ranking of the SERPs is not PR-based (it can't be, since supplementals do not have PR!). In the example above with 25,000+ results, the number one spot went to a supplemental page. This would explain what you see if the sandbox is a separate index like the supplemental.
- The other possibility is that the new site is not in the sandbox index. If it has PR and shows up in backlinks, then it is not in the sandbox.

>>2. Why do new sites appear at the top of serps for the allin commands?

Again, it depends on how many results you get with the query. Some of my allin searches show supplemental results, which is an indication that Google is not using the main index solely for those specific SERPs. I keep referring to the supplemental index because we know for sure that those pages are not in the main index; I'm using the behaviour of supplemental pages (i.e. a separate index) as a model for the behaviour of the sandbox index.

>>3. Why did my PR get updated in April for a sandboxed site?

If it has PR, it is not in the sandbox. The problems you have ranking in the SERPs are due to other reasons: penalties, filters, being out-SEO'd, etc.

>>4. Why do sites in the sandbox index appear in link:www.oldsite.com results from the main index?

If a page appears as a backlink, then it is not in the sandbox. As in #3, look for other reasons.
