Why does the 'Google Lag' exist?

Trying to understand its purpose.

         

bakedjake

1:43 am on Sep 29, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I had some in-depth discussion this weekend with some friends about the sandbox. Every theory on how to beat it kept coming back to one central problem - no one is sure why it exists.

I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.

I don't buy the explanation that it's intended to be a method of stopping spam. Why? One, it is doing too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that there are already multiple ways of beating the sandbox that all of those spammers are aware of, it doesn't make sense anymore.

So, why does the sandbox exist?

The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?

renee

1:24 am on Oct 2, 2004 (gmt 0)

10+ Year Member



hallelujah to you arthurdaley.

jnmconsulting - yes this is true re the "+the" query. but if you look at the google home page, as touched on by slydog above, it says:

"©2004 Google - Searching 4,285,199,774 web pages"

and this number has been at this level for months now. maybe google engineers just forgot to update their home page.

also the 5b+ results for the "+the" query have been at that level for a year now. does it mean no pages with "the" have been added in a year's time?

Marval

1:49 am on Oct 2, 2004 (gmt 0)

10+ Year Member



The number of pages indexed cannot be the answer, because we get new pages included every day - many of us put new pages on existing sites and they are included within 2-7 days. This only affects sites that are on new domains, or on older domains that never had a site (ever, including the archive.org listings), and it has been affecting sites since last fall.
I've heard people theorize that the page limit number may be staying the same and that the new pages we make on existing sites are replacing older pages that haven't been updated. That doesn't sound very likely, as it would take a massive amount of horsepower on already hard-working servers. That's not Google's way of doing things - they look for easier, streamlined ways of doing things these days.

dauction

3:03 am on Oct 2, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This "sandboxing" all started with Google going public. It is related to that process, imo.

The easiest way to show strong profits is to cut back on labor, hardware, and investments in research.

It could be that the investment bankers/firms wanted "assurances" prior to the IPO that G would show immediate, continuing PROFITS. These investment firms cannot profit themselves unless G's stock rises, so if G reports a loss or stagnant profits in its very first 10-Q, the stock price drops and the big investment firms are not going to be happy...

So maybe G is in a conservative mode. Maybe G isn't allowed to make changes on a dime anymore. Maybe changes are stuck in the board room now?

Maybe sites are sandboxed because the expense of adding new sites, and all that surrounds those entries, is enough to negatively affect earnings?

Isn't this the strategy that MSN used for the longest time? They had high profit margins on their search because they cut back on the expense of updating and adding new sites. That works short term, but they played it so long that now they are having to totally rebuild search.

I don't know, just something to consider. Maybe G just can't shoot from the hip anymore. They are now a public company and every move will need layers of approval...

SlyOldDog

8:44 am on Oct 2, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Maybe this, maybe that, maybe not...

The only fact is the number on their home page, which has not changed since February 17th. The number is just under 10 million pages short of their theoretical limit of 2^32, so that still allows them to add new pages to existing sites in minute quantities.
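
For anyone who wants to check the arithmetic behind the "theoretical limit", here is a minimal sketch. It assumes the limit people mean is 2^32 (the 32-bit theory discussed in this thread, not anything Google has confirmed) and uses the home page figure quoted earlier in the thread. The names `LIMIT_32BIT`, `HOMEPAGE_COUNT` and `headroom` are made up for illustration.

```python
# Assumption: the "theoretical limit" is the number of distinct values
# a 32-bit unsigned integer can hold. This is the thread's theory only.
LIMIT_32BIT = 2 ** 32

# Page count shown on the Google home page, as quoted in this thread.
HOMEPAGE_COUNT = 4_285_199_774


def headroom(count, limit=LIMIT_32BIT):
    """Return how many more pages fit before `count` reaches `limit`."""
    return limit - count


gap = headroom(HOMEPAGE_COUNT)
print(f"32-bit limit:   {LIMIT_32BIT:,}")
print(f"home page says: {HOMEPAGE_COUNT:,}")
print(f"headroom:       {gap:,}")
print(f"headroom share: {gap / LIMIT_32BIT:.4%}")
```

If the cap were real, the headroom printed here is all the room there would be for net new pages, which is why the number on the home page gets so much attention.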

Ciml already debunked most of the theories here. In my opinion correctly.

The Adwords theory doesn't hold water either. New entrants will be likely to spend less on Adwords than bigger organizations who are already in the index and get removed. Existing sites in the index will already have built significant online businesses that need supporting. For many of them it's pay-or-die as opposed to pay-to-play for new entrants.

BeeDeeDubbleU

9:05 am on Oct 2, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Pardon me for seeming bemused but isn't it obvious? For some time now (about a year) a theory has existed that Google would not be able to index more than 2^32 pages.

Am I missing something? Google is still indexing all of these new sites, at least in my case. The problem is one of ranking - not indexing.

Oliver Henniges

10:34 am on Oct 2, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Quite a while ago I suggested my 32 bit theory, but calum once said he didn't believe it to be the case. The figure of 4.28 billion web pages indexed (i.e. valued in PageRank) remains just below the 4.29 billion given by 2^32. And now - with Google remaining there for three months - I'd regard this theory as worth discussing again.

I am definitely not an expert, but I believe the PR algorithm is heavily based on a 32-bit hardware architecture. As far as I know, PR is calculated by approximation through about 100 iterations over the 4.29 billion by 4.29 billion matrix, which means a huge number of calculations.
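
To make the "approximation through iterations" idea concrete, here is a textbook-style sketch of PageRank by power iteration on a four-page toy web. It is not Google's production code. The damping factor of 0.85 is the value suggested in the original Brin and Page paper; the function name `pagerank`, the `toy_web` graph and the even redistribution of rank from pages with no outbound links are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by repeated redistribution of rank.

    `links` maps each page to the list of pages it links to.
    Only existing links are visited, so the work per iteration
    grows with the number of links, not with pages squared.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "d" links out but nobody links to it.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

Note that the loop never builds the full pages-by-pages matrix. Because most pages link to only a handful of others, the link structure is sparse, so the cost of each pass depends on how many links exist.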

Note how much Larry Page and Sergey Brin emphasize the factor of speed in their original paper, and I do not think this only concerns request-traffic on the net.

Below this figure you might work with a "long" index variable on the matrix. Above it you need at least a "double." I am no expert on C++ or processor technology either, but I suggest the difference implies much more than just doubling calculation time. It seems to require a complete restructuring of the algorithm, and it might well be the case that it is even impossible to find a solution for this problem at the current state of technology.
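
As a rough illustration of what running out of a 32-bit index variable means, the sketch below simulates an unsigned 32-bit slot in Python. Whether Google's index actually used 32-bit page IDs is pure speculation from this thread. On typical 32-bit platforms a C or C++ `long` is 32 bits wide, and the usual wider replacement is a 64-bit integer rather than a floating-point `double`. The helper names `wrap_uint32` and `fits_uint32` are hypothetical.

```python
import struct

UINT32_MAX = 2 ** 32 - 1  # largest value an unsigned 32-bit slot can hold


def wrap_uint32(doc_id):
    """Simulate storing doc_id in an unsigned 32-bit slot (high bits are lost)."""
    return doc_id & 0xFFFFFFFF


def fits_uint32(doc_id):
    """Return True if doc_id can be packed into exactly 4 bytes."""
    try:
        struct.pack("<I", doc_id)
        return True
    except struct.error:
        return False


last_ok = UINT32_MAX
first_bad = UINT32_MAX + 1

print(fits_uint32(last_ok), fits_uint32(first_bad))
# An ID one past the limit silently collides with ID 0 if it is truncated.
print(wrap_uint32(first_bad))
# A 64-bit slot holds the same ID, at twice the storage per ID.
print(len(struct.pack("<I", last_ok)), len(struct.pack("<Q", first_bad)))
```

Going to 64-bit IDs doubles the bytes per ID, which on 32-bit hardware also means more than one machine word per index operation. That is a cost, but it is a different thing from the problem being unsolvable.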

I have put quite a lot of effort into improving my website's PR in the past three months so please, please falsify my theory.

Thx Oliver

[edited by: ciml at 10:40 pm (utc) on Oct. 7, 2004]

mfishy

11:45 am on Oct 2, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The sandbox/google lag/whatever is a ranking, not an indexing issue. It is absolutely critical to understand that.

Yes, to say anything else would be to discuss a different phenomenon

Marcia

12:17 pm on Oct 2, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>The sandbox/google lag/whatever is a ranking, not an indexing issue. It is absolutely critical to understand that.

>>Yes, to say anything else would be to discuss a different phenomenon

They are absolutely, positively in the index, just like any other pages. And they're cached just like any others.

This is G o o g l e's text-only cache of [example.com...] as retrieved on Sep 9, 2004 12:29:04 GMT.
G o o g l e's cache is the snapshot that we took of the page as we crawled the web.
The page may have changed since that time. Click here for the current page without highlighting.
Click here for the full cached page with images included.

The pages get updated (verified by changing "last updated" on the pages), and in fact as new pages are added to the sites they're indexed as well. One has had a fresh date for the homepage. One is even showing up in the backlinks for another site. They just won't rank for anything.

mfishy

12:28 pm on Oct 2, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It just means you're not able to compete with all the other sites that are better SEOed!

That would be odd, considering that it is my old sites currently ranking. Maybe I am growing old and out of touch and cannot even compete with my "younger self" :)

A blind man could see that newer sites are being treated differently than older ones, and a 3-year-old can get a new site INDEXED.

arthurdaley

12:34 pm on Oct 2, 2004 (gmt 0)

10+ Year Member



>>The sandbox/google lag/whatever is a ranking, not an indexing issue. It is absolutely critical to understand that.

>Yes, to say anything else would be to discuss a different phenomenon

Depends on the intended semantics of 'indexing'. It's certainly not a spidering problem, as they have spidered many pages which are not in the index/searchable database. But if there is a limitation on the number of pages in the index and old pages must be kicked out before new ones can enter, then that could be called an 'indexing' issue, since the issue is caused by limitations in the size of the index/searchable database.

Also, as I understand it, the sandbox and the Google lag are related but not the same thing. The sandbox affects pages already in the index, seemingly penalising some of them. The lag is a time period delaying the acceptance of pages not yet in the index.

This 354 message thread spans 36 pages.