|Why does the 'Google Lag' exist?|
Trying to understand its purpose.
I had some in-depth discussion this weekend with some friends about the sandbox. Every theory on how to beat it kept coming back to one central problem - no one is sure why it exists.
I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.
I don't buy the explanation that it's intended to be a method of stopping spam. Why? One, it is doing too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that there are already multiple ways of beating the sandbox that all of those spammers are aware of, it doesn't make sense anymore.
So, why does the sandbox exist?
The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?
"That's a HUGE risk"
I think it would be riskier to have an entire legion of webmasters who have your number. SEO newbies and novices had learned how to game Google with ease. If you don't figure out a way to make your algo a mystery, you'll soon be dead in the water. IMO.
|why would they spend time completely redesigning their algo and recreating the index when they arguably have the best search engine already? |
Lots of companies and businesses have a tendency to become complacent; they lose the drive that got them to the top to begin with. A successful company will always be striving to be better.
A lot of top companies have hit a tree while looking in the rearview mirror at the competition.
That was a quote from Newsweek a while back.
I'm not the most technical webmaster, but Google looks like an engine that is within 0.002% of a docID limit.
Surely, Ye Olde Time Google, without constraints, would have come up with infinitely more elegant solutions.
I vote for the new index theory, guessing we'll see it early 2005.
I read a couple of articles on this subject, but I cannot find them now. My understanding was that with some minor changes they would be able to overcome the DocID limit that was initially set up in the beginning. In order to do this they have to reindex... looks like this is what they are doing. Does anyone know of any articles that explain the DocID limitations in detail?
"i think that pagerank is almost dead in the water"
Pagerank's importance may be minimized now, and may be minimized even more in the future, but I don't see it ever being dead. You have to have some fundamental way of doing the macro-sorting of webpages. Now, once you get beyond the big sort, you refine it further with relevancy criteria such as keywords, titles, links, anchor text.
If pagerank ever completely died, there'd be no way to do business startups on the web and get them going inside of a number of years (minus paying for advertising such as adwords). Even info sites would take years to ever get noticed particularly since they generally don't solicit links but get them organically over a period of years.
Of course, any really serious mucking around with the manner of indexing might have adsense and google-revenue implications
My take on this process is that Google has been doing some serious "data mining" for the last couple of years as it develops its algo and engine...and tracks all sectors of its SERPs...
Data mining is all about "trending" the data...looking for patterns and anomalies..
One consistent pattern would be ... that established Web sites with a history should show "normalized" behavior..which simply means..that these sites are addressing their respective "audience" through content initiatives...building link relationships (and these would typically occur through a one-to-one type process...not hundreds of new links suddenly showing up on the radar)..
and so on...
Today...when a new site that is trying to "suddenly" compete in a competitive area shows up in the SERPs with hundreds of inbound links, a huge number of content pages but no real history...this raises a red flag as far as Google is concerned...this would fall under the "anomalies" aspect of data mining..
It's a no-brainer for Google to see new sites and what they bring to the table...and also to track the competitive sectors and watch for this type of "SPAM" behavior...
or patterns to design responses to..
Google is simply trying to meter the process some with the "sandbox" initiative...and you can bet that the big established companies are "communicating" their needs to Google through "legislative type channels" (think Washington, D.C. and how Bills and Laws are developed and implemented through "lobbying")...
Google has to find a way to control this process as they continue to expand their global/multi-language reach or the SERPs will become a sea of useless information...
The collateral damage has to be an accepted factor in all of this...and yes...this means that sites that don't use apparent "SEO tactics" and simply want to address and serve their "visitor base" may be affected by these algo changes (and in some cases we know that Google will manually look into really aggressive situations)
The other argument is that Google is looking for ways to generate more revenue...so this "sandbox" process may be forcing some advertisers to step into the PPC thing while their "new" content is in the "silicon queue box"
(now back to my automated content generator and link spamming software...oooohaaahaaa! - just kidding)
Of course, any really serious mucking around with the manner of indexing might have adsense and google-revenue implications....
Think about update Florida. If you searched for "real leather sofa",
#1 was a porn site and #3 was a porn site! I have lists upon lists of bad searches.
One company using subs took the top 300 slots, yes 300!
Did it affect AdWords, AdSense, or how long an engineer spends on his lunch break.... I think not.
Google knows they have holes in the index, so there are ways people can buck the system and earn a buck.... They need to fix these holes. IMO every time they have plugged a hole they have made another one, but that's common when firefighting problems. As an ex-database programmer, I know that sometimes all the fixes and patches cloud what you are trying to achieve... it just makes sense to get a clean DB and run a brand new algo on it. Can anyone remember what happened when they tried to integrate the OCR spider into the natural spider LOL?
I've joined this thread rather late. You may have already answered this point:
"site wide links will push you into the sandbox if you're on the edge."
Why? Is it the loss of PR?
Midhurst, I dunno why, I just know it did for me. I have established sites with site wides that work just fine, but for new sites it seems to be like aiming a howitzer at your own penny loafers.
OK you are not sure why.
Have a look at the more recent posts in "can a load of new pages hurt"
This is also about the sandbox effect.
Do you want to respond?
PR diminution. LSI and new content.
Have a look.
It certainly isn't a penalty box, that's a misnomer. Penalties are for doing something wrong, and there's nothing wrong about a site being new.
I'm in complete agreement on this statement:
Data mining is all about "trending" the data...looking for patterns and anomalies
So anyone who uploads a large amount of new content, say 30% in one hit, is taking the risk the new content will be viewed as anomalous.
If the new content is examined for LSI and is thought to look like puff and wind it gets stuffed into the sandbox, yes?
We're all at it, aren't we? Creating new content to impress Google that we are an authority site, but not being prepared to commission work from known authorities, instead plagiarising other people's work, modifying it, and chucking it on the web. I think Google has the measure of that little wheeze and has tightened up on LSI recently.
If your definition of PageRank is that insane recursive formula defined by Page and Brin in 1998, which by 2002 took days to calculate after a full crawl of the entire web, then that's already gone. It disappeared in April 2003.
But what is PageRank in the larger scheme of ranking? It's a number that ranks the importance of the page, that is assigned without respect to the search terms that may be used to pull up the page. The key thing about this number is that it can be precomputed. Then the docIDs in the inverted indexes can be sorted by this number. That means you only have to scrape off the top of the docIDs for a search term -- just deep enough to satisfy the searcher's request for 10 to 100 links. You don't have to look at 99 percent of your index for most searches.
After you scrape off the top docIDs for a search, then you look at how each document relates to the search terms, using other algorithms. But this initial sort in the inverted indexes is probably the most crucial efficiency algorithm in Google's entire system.
Now this initial "PageRank" number certainly does not have to be the pure link calculation it was originally. Links are an obvious indication of importance, but the calculation doesn't have to be pure or recursive. If you did a seat-of-your-pants link calculation, you might want to consider other factors also. Remember, all these factors would blend into a number that is precomputed -- before you even construct the inverted indexes for searching. The inverted indexes are sorted on this number.
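To make the "scrape off the top" idea concrete, here is a minimal sketch of an inverted index whose posting lists are pre-sorted by a precomputed, query-independent score. Everything in it is an assumption for illustration: the tiny corpus, the score values, and the names `static_rank`, `build_index` and `top_candidates` are hypothetical, and nobody outside Google knows how the real index is laid out.

```python
from collections import defaultdict
from itertools import islice

# Hypothetical precomputed importance scores (docID -> score), invented for illustration.
static_rank = {1: 0.9, 2: 0.1, 3: 0.5, 4: 0.7}

# Hypothetical tiny corpus (docID -> text).
docs = {
    1: "leather sofa store",
    2: "cheap sofa",
    3: "leather jacket",
    4: "real leather sofa",
}

def build_index(docs, static_rank):
    """Build an inverted index whose posting lists are sorted by static rank, best first."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index[term].append(doc_id)
    for term in index:
        index[term].sort(key=lambda d: static_rank[d], reverse=True)
    return dict(index)

def top_candidates(index, term, k):
    """Read only the first k docIDs of a posting list; the tail is never looked at."""
    return list(islice(index.get(term, []), k))

index = build_index(docs, static_rank)
candidates = top_candidates(index, "sofa", 2)
print(candidates)
```

Because the sort happens once at index-build time, the per-query cost depends on k and not on how many documents contain the term. Relevance scoring against the search terms would then run only on the returned candidates.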
One thing that comes to mind is some measurement of the quality of a page in the context of the site. The original PageRank never looked at the site as a whole. But the more you know about the site, the more you know about the quality of pages that make up the site. Is the site spammy? Is it a .gov, .edu, or .org where the spam problem is less? Is it a new site? If new, does it have thousands of pages already? Is the site commercial or informational? If commercial, is it an affiliate site?
What if Google started keeping information on the nature of sites, and used this to weight the "PageRank" of the pages on that site? This would probably be the best approach to fighting spam.
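As a rough sketch of what weighting a page score by site-level information could look like: the multipliers below are invented, the hostnames are placeholders, and `site_weight`, `site_of` and `blended_score` are hypothetical names. This is only one guess at how such a blend might be computed before the index is built, not a description of anything Google has confirmed.

```python
from urllib.parse import urlparse

# Hypothetical per-site multipliers; values are made up for illustration only.
site_weight = {
    "example.edu": 1.2,                # assumed: older, low-spam site
    "new-affiliate.example.com": 0.3,  # assumed: new commercial affiliate site
}

def site_of(url):
    """Return the hostname a page belongs to."""
    return urlparse(url).hostname

def blended_score(url, page_score, site_weight, default=1.0):
    """Scale a page's own score by what is known about its site (unknown sites get default)."""
    return page_score * site_weight.get(site_of(url), default)

print(blended_score("http://example.edu/paper.html", 0.5, site_weight))
print(blended_score("http://new-affiliate.example.com/buy.html", 0.5, site_weight))
```

The point of the sketch is that the site table is far smaller than the page table, so a per-site factor is cheap to maintain and can still be folded into a single precomputed number per page.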
In the Florida update, they tried to do something on the other end of the pipeline. Florida was an on-the-fly filter that was applied after the search terms were collected from the searcher. It didn't work too well. Maybe the semantic stuff was overrated internally at Google, by some engineers who had influence.
Now they may be working on the pre-computed part of the algorithm. I think they'll still call it "PageRank" (at least until all the lockups expire in five months and they all dump their stock), but it's going to be something more than PageRank. I suspect the logical direction is to evaluate the page as a member of a site. There are many fewer sites than there are pages, and it might be workable.
Something else I'll throw in here. My site, a 129,000 page nonprofit site, got a special crawl over the Labor Day weekend. It was special because it was manually dispatched. I know this because they grabbed all the pages, didn't ask for anything that was 404, and didn't ask for any of the sitemap pages. Every crawled page was sorted -- they crawled from the shortest URL to the longest URL. The only way they could have done a crawl this clean would be to either study my sitemap pages, or take my CSV dump of the deep page URLs, parse out that field, and re-sort it.
I've never seen a crawl like this in four years. They crawled for 36 hours. Only two IP addresses were used. About every 25 minutes, they'd hit the site for around 2 minutes only. It was very methodical. The peak fetch rate I recorded was 40 pages per second. Yes, per second -- even though almost all pages are very small, and are all static, this tripped my load alarms. I survived and let them do their thing.
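For what it's worth, the figures above hang together on the back of an envelope. The calculation below assumes each of the 129,000 pages was fetched exactly once within the 36 hours and takes "about every 25 minutes" and "around 2 minutes" at face value, so treat the result as a rough estimate only.

```python
total_pages = 129_000          # site size from the post
crawl_minutes = 36 * 60        # 36 hours of crawling
cycle_minutes = 25             # one burst roughly every 25 minutes
burst_seconds = 2 * 60         # each burst lasts roughly 2 minutes

bursts = crawl_minutes / cycle_minutes            # number of bursts in the crawl
active_seconds = bursts * burst_seconds           # total time actually spent fetching
avg_rate_in_burst = total_pages / active_seconds  # average pages per second while fetching
pages_per_burst = total_pages / bursts            # average pages per burst

print(f"bursts: {bursts:.1f}")
print(f"average rate inside a burst: {avg_rate_in_burst:.1f} pages/s")
print(f"pages per burst: {pages_per_burst:.0f}")
```

Under those assumptions the average rate inside a burst comes out at roughly 12 pages per second, comfortably below the 40 per second peak, so the numbers are at least consistent with one full pass over the site.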
Why did I go off-topic to mention this? Because I'm not sure it's off-topic. I think it might be evidence that Google is no longer exclusively looking at the web as a bunch of pages, but as pages that belong to sites.
This could explain the sandbox effect.
So far, by the way, there is no evidence that this special crawl has kicked in.
If Google were to go into content analysis and site themes, how would searchers find the granular information they're looking for? And wouldn't everyone's response be to harmonize their site for their core keyword phrases, and be resistant to developing new content? That seems like a dead end.
Jake's original post was about why the sandbox exists and there have been some interesting suppositions:
* new pages are held off to stabilize the database - meaning Google thinks it's weak
* adding too many new pages or site wide links from other web sites could result in longer term sandbox time
* that the sandbox has some sort of dynamic activity going on within it - pages disappear or increase in ranking even if they are way down in the SERPs
* some report that they had a new site show a PR
* that Google is spidering (panic mode) for a completely new index coming in 2005
* that Google is assessing algorithm criteria against one another
* and one controversial comment that the sandbox doesn't apply to internal links.
I'd say the best speculation is that Google has lost confidence in the index quality and is developing/testing a new algorithm on a fresh set of data.
I'd like to hear from someone whose pages have come out of the sandbox and are now doing well.
Just curious...what does WebmasterWorld have against the term "sandbox"? Why is it a forbidden term here?
I don't see sandbox as being forbidden.
|and one controversial comment that the sandbox doesn't apply to internal links |
How is my comment controversial? Nobody else even acknowledged that I said it. I can add new spam pages all day to an existing site and they get ranked real fast. I have had over a thousand spam pages get ranked in one week. I put the same pages on a new site and they may never get indexed.
[edited by: ogletree at 9:43 pm (utc) on Sep. 29, 2004]
There have been numerous threads that have changed from "Sandbox" to a different wording.
I don't know if sandbox is the best word, but I definitely don't think that Penaltybox fits it at all.
Like it was said earlier in this thread, these aren't penalties.
|I'd like to hear from someone whose pages have come out of the sandbox and are now doing well. |
... and so would I! Any takers?
What percentage of sites submitted during the last eight or nine months have slipped out of the "holding pen"?
Just one that slipped out in May. I just kept doing business as usual and in early May it slipped out.
Here's why the "sandbox" word isn't liked post #22
Let's focus on the topic I've posted.
Please start a new thread if you'd like to discuss other things not pertinent to why the sandbox may exist, such as performance of sites now out, or the name of the phenomenon which is unimportant.
As agerhart and Marcia mentioned, it is NOT a penalty, and I'll get the title changed. Thanks.
|"Most webmasters don't have a commercial interest. They do it for fun and to help other people, not for money." |
This is one of the most infantile and silly posts I've ever seen.
Webmasters get into sites because it IS fun. And challenging. But the truth is, to develop a superior info site, one that, ultimately reaches many people and serves a ton of useful information------YOU HAVE TO BE SEEN IN SERPS. And that involves hard work aka seo. There's nothing wrong with it.
True, but there is a line which, if you cross it, is akin to spamming. And if you cross it, your argument about being interested in the website per se no longer holds; it would just amount to interest in money. I have nothing against money, or against building up a site of 10K pages from scratch in a day and putting it up.
But still, don't you think there is a difference between the 10K-in-one-day site and the 10K-in-one-year site? Gaming is for the short term. As in not 2-3 months, but maybe a year or two before someone does a one-up, or maybe you just run out of ideas, or the game is changed (don't kid yourself that you will always be the best in all games). But when you build a site for the long term you can rest assured that the time you could be spending on the golf course now, if you were spamming, will be repaid in the future. There are 2 kinds of people. The nomads, who can live and adapt, and I mean live well. That's a gift. They are what we call spammers now.
The other kind are the settlers, who will settle in with a plan for the future, willing to build something that will benefit a few people apart from themselves and last longer.
I prefer the second (I have nothing against the first, and no, I am not saying that because I am amidst a pack of them), as I believe that it is the settlers who are the reason that so many opportunities for others (including you nomads) exist.
One side effect is that we do get a little round and soft as time goes on, while the nomads are always lean and mean. Result? Change hurts the settlers the most. But no change is ever brought about by a nomad. It's always another settler with another idea.
Well, I guess I can start wearing my metal hat, waiting for the stones.
>I'd like to hear from someone whose pages have come out of the sandbox and are now doing well.<
There lies the prob
How do you get trapped in the sb? Some existing sites of mine have no problem: they are indexed and perform OK in good old-fashioned Google style. But new sites, going back 6 months in some cases, are well and truly quicksanded.
Around 5 or 6 months ago our webhost decided to shut down, closed the door and made off with a bunch of money (not only mine but many others'; it was big news at the time). That's beside the point... anyhow, by the time I could set up a new account, transfer domains etc., about two weeks had elapsed, with loads of 404s. At the time I was doing OK to very good in g. Now these sites are near to nothing in g. Other sites hosted with another company were not affected.
So it seems to me that, although the sites I moved are not new, to g they are.
I must admit this sb only drew my attention a month or so ago, cos I couldn't understand how I lost so many places. Sod's law, the best sites for income were hosted with the guys who did a runner.
No sob story, I'm optimistic and remain so. But I do realise the difficulty in trying to explain this to biz owners; some guys think it's a "scam", and in reality how can you blame them?
My own theory;
Not many have mentioned a "link sb" rather than a site sandbox.
I believe the sb is related to incoming links.
An existing established site seems to be unaffected: add a new page, provide links from the existing site (site map, index page, blah), and it is indexed and ranked within days, no prob (at least in my experience).
The only drawback is the content must be related (imho g is good at that now, otherwise the spammers would be in heaven).
However, make a new site and it's a different ball game.
As an example, I made a new site (OK, completely off topic) with thousands of links from existing pages (PR7 to PR3, maybe? Who knows PR now). The site ranks nowhere after 4 months.
Incoming keyword anchor text seems to be ignored for new sites; hopefully this may mean an end to guest book spammers.
At this time g does recognise guest book spamming as legit; I think things will change, and hopefully quickly.
Dave, I think you're half right. I propose the idea that the "Google Lag" is not an active evaluation of the site in question, as in "If site possesses attribute X, accept it, else do not".
related sites queries show that Google often groups sites together that have links from multiple, unrelated sites pointing to all of them.
Example: I run a site about widgets. A webmaster halfway around the world runs a site about naked cows.
We each go out and get links from well known web directories. 2000 of them, let's say. And pretend they're one page directories with a bunch of links on them.
Visualize: directoryApage points to widgets and naked cows. directoryBpage points to widgets and naked cows. directoryCpage points to widgets and naked cows. Wash, rinse, repeat 2000 times.
Suddenly, Google thinks our sites are related, just because we appear on so many pages together.
What if the "Google Lag" is not evaluating attributes of individual sites, rather, the evaluation is happening on "groups" of multiple links?
Most SEOers here are reporting problems with the "Google Lag". Most SEOers here are getting links from the same sites/pages, using the same linkdev techniques.
Why would they do this? If they did, think about how easy it might be to categorize sites, without the need for a human edited directory.
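The widgets and naked cows example is basically co-citation counting, and it can be sketched in a few lines. This is only an illustration of that general technique under made-up data: the `outlinks` table and the function name `cocitation_counts` are hypothetical, and nothing here is confirmed to be what Google's related-sites feature actually does.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(outlinks):
    """Count, for every pair of sites, how many pages link to both of them."""
    counts = Counter()
    for targets in outlinks.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical one-page directories and the sites they link to.
outlinks = {
    "directoryA": ["widgets", "nakedcows"],
    "directoryB": ["widgets", "nakedcows"],
    "directoryC": ["widgets", "nakedcows", "gadgets"],
}

counts = cocitation_counts(outlinks)
for pair, n in counts.most_common():
    print(pair, n)
```

Repeat the directory pattern 2000 times and the widgets/naked cows pair dwarfs every other pair, even though the two sites have nothing in common. The same counts could be used to group sites that share a link-building footprint.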
>>What if the "Google Lag" is not evaluating attributes of individual sites, rather, the evaluation is happening on "groups" of multiple links?
Why would it be happening on just new sites?
Midhurst >> I think he was referring to receiving a sitewide link, not placing one on your site.
Term Sandbox >> [webmasterworld.com...] Message 12 & 14
Jake, I'd buy this if nothing ever came out. However, I have had a site come out. First the site got PR, then approximately 1 month later it came out.
First PR then emergence. I wonder if no sites will emerge from the sandbox until after the next PR update.
From my experience this definitely does not affect just new sites, although new sites seem more prone.
At the same time, new sites that seem to systematically not get hit are ones that spam blogs and guestbooks. These new "sites" rank very well right away (with their white bars revealing they are post June 1 sites).
What can we conclude/guess about that? Well, obviously spamming blogs gets you tons of links from all sorts of IPs/domains/hosts/etc. So a volume of links from unrelated *low-quality* sites makes you avoid lag time (or at least makes it far more likely). But a volume of links from a small number of high-quality domains doesn't (and of course a few links from low-quality domains doesn't either). And then, it is near impossible to get a volume of links from 1000+ high-quality topical domains, so it is hard to compare the effect of this sort of great/diverse linking to the blog spamming link volume.
One oversimplified conclusion though would be: something new achieves a PR5+, goes into lag time. This would explain why the new sites ranking well are those that have thousands of anchor text links, but those links only make the page PR4 or less.
I don't think that is right either though, as sites avoiding lag time include those that blog spam but also buy highish PR links.
Still, I lean toward thinking lag time exists to combat "fake quality", that is, sites buying high quality links to pretend to be of quality. If that is part of the reason, obviously Google hasn't just thrown the baby out with the bathwater, they have launched it with a howitzer.
(I'll also say that I am pretty sure lag time was created in part to help stabilize the SERPs pre-IPO. Why it continues to exist is more puzzling.)
>Still, I lean toward thinking lag time exists to combat "fake quality", that is, sites buying high quality links to pretend to be of quality.<
So please explain how can g tell the difference between "quality" and "rubbish" according to PR?
not to mention if they are paying (ok some make it obvious)
jeez, nassa cut.. that new link to me please -;
Maybe g checks out all links to "rubbish sites"
I don't think so!
Take a look at all "big sites" most provide links to "rubbish sites"