
Google SEO News and Discussion Forum

Does the "sandbox" Only Affect Phrases Containing Popular Words?
If the phrase has no words over 70-80 million results, does sandbox apply?
ciml




msg:770272
 6:35 pm on Mar 10, 2005 (gmt 0)

While discussing [webmasterworld.com] a most interesting analysis of Google's number-of-results figures [aixtal.blogspot.com], I speculated that Google might use a smaller index for popular words, in a manner similar to that explained in a pre-Google Backrub paper.

Liane took this idea further, and suggested that this might explain the sandbox.

So without getting into specifics, what is the view on sandbox applying to phrases that have no words with less than 80 million results?

Keep in mind that many phrases with few results contain at least one word with more than 80 million.

<added>
"that have no words with less than 80 million" should be "that have no words with more than 80 million". Thanks Liane for spotting the error.

[edited by: ciml at 3:33 pm (utc) on Mar. 12, 2005]
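As a concrete illustration of the (corrected) question, here is a toy check in Python; the per-word result counts are invented placeholders, since real Google figures can't be queried here:

```python
# Toy check of the hypothesis: is any single word in the phrase over the
# ~80 million results mark? (Counts below are invented placeholders.)

RESULT_COUNTS = {          # hypothetical result counts per word
    "blue": 120_000_000,
    "widget": 4_000_000,
    "cleaner": 35_000_000,
}

THRESHOLD = 80_000_000

def contains_popular_word(phrase: str) -> bool:
    """True if any single word in the phrase exceeds the threshold."""
    return any(RESULT_COUNTS.get(word, 0) > THRESHOLD
               for word in phrase.lower().split())

# Under the hypothesis, the first phrase would be sandbox-prone, the second not.
print(contains_popular_word("blue widget"))     # True  (blue > 80M)
print(contains_popular_word("widget cleaner"))  # False (neither word > 80M)
```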

 

grail




msg:770273
 7:12 pm on Mar 10, 2005 (gmt 0)


What you describe here is something I would agree with; I've seen this. I could certainly subscribe to this theory of the sandbox.

The inability to rank anywhere for certain words yet ranking for many other phrases with 'alternative' words.

ChrisCBA




msg:770274
 9:46 pm on Mar 10, 2005 (gmt 0)

I can definitely agree that certain words seem to trigger it more than others.

My main phrase is 4 words long, and is a pretty niche subject. But one of the words within that phrase has 100 million results. Though my site isn't directly related to that 1 word alone... it appears that because that word is included in my phrase, I still suffer the ranking effects of it.

mrMister




msg:770275
 10:17 pm on Mar 10, 2005 (gmt 0)

Two words for you:

Bayesian

Filtering

stargeek




msg:770276
 12:18 am on Mar 11, 2005 (gmt 0)

Is the sandbox still in effect?
I've seen people talk about launching new domains like it's not.

Marval




msg:770277
 1:49 am on Mar 11, 2005 (gmt 0)

I've often posted that I don't believe the sandbox actually exists as such; rather, it is the result of a particular filter instituted back during the heyday of buying expired domains. That filter basically compares a new domain's number of "prompt new incoming links" against an average "latent incoming link average", based on the history of either that specific domain (which seems far fetched, computing-power wise) or the entire "new site" population.
It is the only way I can explain some new domains having no problem whatsoever getting into the index within 1-2 of the "old" update cycles while other domains take 6 cycles to get in.
This is also based on testing only a few hundred domains, so it might not be a good statistic over the whole web; it covers only those few hundred domains bought and developed since the discussion of the sandbox started. The other factor that I believe is in play is that these domains were all developed with non-dynamic content, so they do not represent any of the new dynamic/generated pages for "new sites" out there.
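A minimal sketch of the kind of link-velocity comparison Marval describes (purely illustrative; the weekly counts, population average, and spike factor are invented, not anything Google has confirmed):

```python
# Hypothetical link-velocity filter: flag new domains whose early ("prompt")
# link growth far outpaces the average for the "new site" population.

def average_weekly_links(weekly_counts: list[int]) -> float:
    """Average links gained per week over the observed period."""
    return sum(weekly_counts) / len(weekly_counts) if weekly_counts else 0.0

def looks_suspicious(new_domain_weekly_links: list[int],
                     population_weekly_average: float,
                     spike_factor: float = 10.0) -> bool:
    """True if the domain's prompt link rate is far above the latent norm."""
    prompt_rate = average_weekly_links(new_domain_weekly_links)
    return prompt_rate > spike_factor * population_weekly_average

# A brand-new domain picking up ~500 links/week against a population average
# of 12 links/week would trip this filter.
print(looks_suspicious([480, 520, 510], population_weekly_average=12.0))  # True
```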

tenerifejim




msg:770278
 1:21 pm on Mar 11, 2005 (gmt 0)

So without getting into specifics, what is the view on sandbox applying to phrases that have no words with less than 80 million results?

Is anybody else thinking Bingo!?

BillyS




msg:770279
 2:19 pm on Mar 11, 2005 (gmt 0)

I've been involved with the creation of some very large Oracle databases and about four months ago described why I thought the sandbox existed.

No matter how large Google is, all databases are faced with the dilemma of optimizing for depth of query or speed - not both. For example, a customer information system is optimized for speed. The system has to return information fast to the representative, but what is given up is the complexity of information you can get from the system. You can only ask the system for information that has been pre-identified.

In contrast, a decision support system or data warehouse can look into a database and return to the user many different types of information, sometimes referred to as information cubes. Since these systems are built for query flexibility, they are slower to respond than a customer information system.

When faced with this decision, Google has lots of options, one of which could be to use two databases or indexes. When the user enters a search term, the query is directed to the primary index and results are returned from a pre-built set of results if the phrase or word exists in that database. If not, then it is routed to the secondary database where the query is run live.

The first index does not require much horsepower since it is a subset of a larger index and the results are predetermined. The more complex queries or infrequently used queries that are directed to the second index are where the horsepower is needed.

This theory could be at least part of the explanation for the sandbox effect. For example, there may be rules or filters that Google applies to specifically exclude certain sites from the primary index. This would make the results on common phrases more stable and make the database easier to maintain.
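A rough Python sketch of the two-index routing described here (speculative, assuming a precompiled store for common phrases and a live fallback; none of the names below correspond to anything Google has published):

```python
# Hypothetical two-tier query routing: common phrases are served from a
# precompiled primary index; everything else is computed live against the
# full (secondary) index.

PRIMARY_INDEX = {
    # phrase -> prebuilt, already-ranked result list
    "hotels in california": ["siteA.example", "siteB.example"],
}

def live_search(query: str) -> list[str]:
    """Stand-in for the expensive on-the-fly query against the full index."""
    return [f"live-result-for-{query.replace(' ', '-')}"]

def search(query: str) -> list[str]:
    precompiled = PRIMARY_INDEX.get(query)
    if precompiled is not None:
        return precompiled       # cheap: results were built ahead of time
    return live_search(query)    # expensive: run the query live

print(search("hotels in california"))     # served from the primary index
print(search("widget obscuredistrict"))   # falls through to the live search
```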

Nikke




msg:770280
 2:48 pm on Mar 11, 2005 (gmt 0)

So maybe those seeing the Search Harder option are looking at this cached index or whatever you'd like to call it?

The Search Harder option would take them into a deeper search?

shri




msg:770281
 4:43 pm on Mar 11, 2005 (gmt 0)

80 million?

So, the term mortgages would not be sandboxed (it reports 31M results)?

bakedjake




msg:770282
 4:47 pm on Mar 11, 2005 (gmt 0)

Two words for you:

Bayesian

Filtering

Do you even know what those two words mean?

ciml: I still think it's a measure of the competitiveness of a term (as defined by the number of results returned for an allinanchor query).

caveman




msg:770283
 5:20 pm on Mar 11, 2005 (gmt 0)

BillyS, what you describe - as a general matter of computation and speed - seems more related to the supplemental index pages, which IMHO G has categorized as 'not very necessary' (i.e., relegated to: 'show these feeble pages if nothing else helps').

The fact that pages with too many special characters typically get dropped into the supp index is instructive, IMO. But I don't see delivery speed as having much to do with the algo elements and filters collectively referred to as 'sandboxing.'

======

As for the notion of 'popular words,' the idea of crossing result sets of individual kw's is fascinating.

But I'm not at all sure how the related notion of search volume / kw volume helps explain the number of no-show sites whose main searches fall underneath the sorts of volume thresholds that would logically have to exist to hold this theory together.

I don't believe this is related to real-time assessment; seems predetermined to me, though I still hate the idea of a literal blacklist.

RS_200_gto




msg:770284
 5:47 pm on Mar 11, 2005 (gmt 0)

After some research, I came to a small conclusion about our search term in the Google SERPs and why we are not in them. Our term (Results 1 - 10 of about 1,350,000 for widgetemployment) appears to be filtered, holding (widget) employment hostage, so there may be rules or filters that Google applies to specifically exclude certain sites or search terms from the primary index.

Liane




msg:770285
 6:00 pm on Mar 11, 2005 (gmt 0)

CIML, I am confused. Did you mean to say "with no words with more than 80 million results" or did you mean to say "no words with less than 80 million results"?

If the figure has to be 80 million or more... after thinking about this for a couple of hours while doing several very unsophisticated tests on the fly, I don't think this has much to do with the sandbox after all.

There are just too many examples of sandboxed sites and pages (again, if there is such a thing) that fall into a far smaller subset of results, where the competitiveness is under even 20 million results.

BillyS




msg:770286
 6:33 pm on Mar 11, 2005 (gmt 0)

BillyS, what you describe - as a general matter of computation and speed - seems more related to the supplemental index pages, which IMHO G has categorized as 'not very necessary' (i.e., relegated to: 'show these feeble pages if nothing else helps').
The fact that pages with too many special characters typically get dropped into the supp index, is instructive IMO. But I don't see delivery speed as having much to do with the algo elements and filters collectively referred to as 'sandboxing.'

Your view of the sandbox is too limited. Google can be the greatest SE in the world from a "finding what I need" standpoint, but the mechanics and speed of delivering that information are just as important.

When Google designs its database, it is not the frequency of occurrence that matters (number of results). It is the frequency of query. More frequent or common queries should be handled efficiently from a CPU standpoint. That is, the results are precompiled.

Google, MSN, they all realize this. Why do you think that they only return 1,000 or 250 results? Just because they have 80 million websites that qualify does not mean they are even thinking of delivering those results to the end user.

For the oddball queries, the search needs to be done "on the fly." This is computationally expensive.
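One way to picture "frequency of query, not frequency of occurrence" is a sketch like the following (speculative illustration; the thresholds and the 1,000-result cap are assumptions drawn from the discussion, not a known design):

```python
from collections import Counter

# Speculative illustration: queries seen often enough get their results
# precompiled (and capped), while rare queries are always computed on the fly.

query_log = Counter()            # how often each query has been seen
precompiled = {}                 # query -> prebuilt, capped result list
PRECOMPILE_AFTER = 1_000         # invented threshold
MAX_STORED_RESULTS = 1_000       # matches the cap users actually see

def handle_query(query: str, compute_live) -> list[str]:
    query_log[query] += 1
    if query in precompiled:
        return precompiled[query]          # cheap, stable, prebuilt answer
    results = compute_live(query)          # expensive on-the-fly search
    if query_log[query] >= PRECOMPILE_AFTER:
        # Store only the top N; nobody is shown 80 million results anyway.
        precompiled[query] = results[:MAX_STORED_RESULTS]
    return results
```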

Liane




msg:770287
 7:33 pm on Mar 11, 2005 (gmt 0)

Why do you think that they only return 1,000 or 250 results? Just because they have 80 million websites that qualify does not mean they are even thinking of delivering those results to the end user.

Enter geotargeting and Google Italy, Spain, Canada, Texas, Barbuda & Antigua ... etc. etc.

caveman




msg:770288
 7:41 pm on Mar 11, 2005 (gmt 0)

Your view of the sandbox is too limited.

I've been involved with the creation of some very large Oracle databases and about four months ago described why I thought the sandbox existed ... there may be rules or filters that Google applies to specifically exclude certain sites from the primary index. This would make the results on common phrases more stable and make the database easier to maintain ...

You think the (so-called) sandbox is mainly a db maintenance issue? And my view is too limited? Hehe.

BillyS most of what you state is obvious; of course G is concerned about computational efficiency. But do you really believe that G is incapable of handling all those queries to the point that they are not showing thousands of newer sites? To make their db easier to maintain? Wall St. would love that. ;-) This has nothing to do with the (so-called) sandbox, or the topic of this thread.

But I'm with Liane (again)... I'm confused. When ciml (and Jake) make posts along the same lines you gotta pay attention. I just don't see how the so-called sandbox relates to high-volume kw occurrence only. Seems like too much evidence to the contrary.

=====

Def: "Sandbox [webmasterworld.com]"

steveb




msg:770289
 8:42 pm on Mar 11, 2005 (gmt 0)

Do searches for people's names. You need nowhere near 80 million results (for either word individually) to get frozen. I also haven't seen any difference between the equivalent of Mark Elstonstiles versus Ricardo Elstonstiles.

BillyS




msg:770290
 11:39 pm on Mar 11, 2005 (gmt 0)

BillyS most of what you state is obvious; of course G is concerned about computational efficiency. But do you really believe that G is incapable of handling all those queries to the point that they are not showing thousands of newer sites? To make their db easier to maintain? Wall St. would love that. ;-) This has nothing to do with the (so-called) sandbox, or the topic of this thread.

caveman -

To answer your question - G does not have the computational power. What do you think, that they are doing one search a day through 8 billion records? They are handling hundreds of millions of queries, and Wall Street couldn't give a flying fox about databases.

It's not about maintaining a database, it is about delivering fast and accurate results. Wall Street might like to hear they have thousands of servers, but they fall asleep when engineers start talking about star schemas and normalized databases.

This has everything to do with this topic: there are limits to everything, and the sandbox is another limitation. If you were designing a database this size, would you let just any site into your index if it did not pass a filter?

And why in the world would an engineer create an index based on the number of times a word or phrase appears? Think of the end game, then you know where to start. Those of you who have built databases in the terabyte range know what I am talking about.

Powdork




msg:770291
 1:09 am on Mar 12, 2005 (gmt 0)

This theory has holes. Holes the size of really big holes. This theory holds no water.
It is the colander of theories.;)
Just poking fun, but what steveb said above is true. It may be something defined or triggered individually by a single word within the query, but certainly not at that level of volume. I would suggest there may be another parameter associated with a keyword's popularity that is used - something like the level of search activity for the words within the query, or the activity of the query as a whole.

caveman




msg:770292
 1:38 am on Mar 12, 2005 (gmt 0)

If you were designing a database this size, would you let just any site in your index if it did not pass a filter?

Site? No. Page? No.

But you keep stating the obvious. We all know there must be ways to deliver relevant results at fast speeds (as they mainly do). Your arguments are good ones for a supplemental index that stores pages of secondary importance, to be used only when an insufficient number of pages from the main index address a query (and gee, there is one of those!).

But "fast accurate results" are not good arguments for excluding sites in the way that sandboxing has done, because you're essentially saying G is not capable of delivering high quality pages quickly if those pages reside on sites introduced after Arpil '04 (or, that G chooses not to show those pages for technical/engineering reasons). It's ... unlikely.

#8: This would make the results on common phrases more stable and make the database easier to maintain.

#19: It's not about maintaining a database, it is about delivering fast and accurate results.

Ummm, oh, nevermind. ;-)

===========

Reminder: Topic of this thread is "Does the "sandbox" Only Affect Phrases Containing Popular Words?" ... Not, "Is the sandbox caused by the desire to maintain fast and accurate results?" :)

===========

OK, so some say it's search result or search term volume or popularity, and others (like me) don't get how this dovetails with the notion that the vast majority of sites for a while did not see much daylight.

I don't see any correlation AT ALL between those that never got sandboxed, and those that did, WRT the nature of the kw's involved. None. (I've been told by some very smart people who know more than I about this that the correlation is there; I just have not seen enough to be convinced for myself...too many other variables, like backlinks).

BillyS




msg:770293
 2:14 am on Mar 12, 2005 (gmt 0)

#8: This would make the results on common phrases more stable and make the database easier to maintain.

#19: It's not about maintaining a database, it is about delivering fast and accurate results.

Ummm, oh, nevermind. ;-)

caveman, your cute ways are lost on me ;-)

These statements say the same thing, just in reverse order. Google wants stable and fast results; maintenance of the database is a secondary consideration, but one that remains important.

I have experience building an LVDB and have spent many hours reading on the topic. Some people don't have anything close to that experience, yet seem to be experts on how they are constructed.

And you seem to be confusing the "supplemental" index with my thoughts on how Google works. To be honest, I am not sure a "supplemental" index or database actually exists. When I see "supplemental" results, these appear to be pages that no longer exist as they once did and are marked as such in a database - old pages or pages of questionable quality, that's all, which are returned to the user as a last resort.

Again, on the topic here:

Does the "sandbox" Only Affect Phrases Containing Popular Words?

If the phrase has no words over 70-80 million results, does sandbox apply?

Intelligent people would never build a search engine that looked at word frequency. They would look at query popularity or frequency of query. The sandbox is about returning quality results on frequently queried terms, not frequently appearing words.

BillyS




msg:770294
 2:23 am on Mar 12, 2005 (gmt 0)

OK, so some say it's search result or search term volume or popularity, and others (like me) don't get how this dovetails with the notion that the vast majority of sites for a while did not see much daylight.

I don't see any correlation AT ALL between those that never got sandboxed, and those that did, WRT the nature of the kw's involved. None. (I've been told by some very smart people who know more than I about this that the correlation is there; I just have not seen enough to be convinced for myself...too many other variables, like backlinks).

===========

Reminder: Topic of this thread is "Does the "sandbox" Only Affect Phrases Containing Popular Words?" ... Not, "I just have not seen enough to be convinced for myself...too many other variables, like backlinks" :)

===========

Are you starting a new topic? Not sure what backlinks have to do with the poster's question. ;-(

2by4




msg:770295
 3:52 am on Mar 12, 2005 (gmt 0)

BillyS, thanks for posting on this; I've missed having guys with decent db experience here. I don't see caveman and you really having any particular differences on this question, unless you want to get into nitpicking details, which are always fun to argue about. Interesting stuff, though. I'd like to hear more on this question; this whole thread is making an unusual amount of sense to me.

McMohan




msg:770296
 7:33 am on Mar 12, 2005 (gmt 0)

How about a newer angle to the whole story? Of course, it's based on the premise that CIML has raised.

Assuming Google frowns upon SEOs who try to influence its ranking, Google would try to find a way out. And the way out is a sandbox.

What might be the factors that signal to Google that a site is being worked on by SEOs, so that it can send the site to the sandbox? There may be many factors, but I am highlighting two. These are intuitive, not based on any hard evidence.

1. Pace of backlinks for new sites/old sites without many backlinks. A spurt in rate will signal it.
2. Now this point is dear to me: the anchor text of the IBLs. If the site is linked to with money keywords (by word frequency or search popularity, you guys decide), this would be a clear signal to Google that the spurt in link acquisition is not natural (like an event hosted by the company or a cool new tool introduced by the site), but deliberately done by SEOs to influence the rank for the money keyword in the anchor text (a rough sketch of such a check follows at the end of this post).
Before you ask: won't competitors then do it, if not you? Simple - why would a competitor bother to sabotage you if you are a new site and thus not ranked anywhere? Then it must have been done by you, and you go to the sandbox.

Then what do you do to avoid this? Just get a couple of links with your (unique) company name in the anchor and wait till you are at least ranked within 200-300. Get a few more links with company name + keyword. Wait till you are ranked within 100. Do this at a slow rate, till you are sure that you have missed the sandbox, and then go all out. (I personally haven't seen an established, well-ranking site go into the sandbox for getting many links in a short span of time or with money words in the anchor. This may very well be a shield by Google to protect you from competitor sabotage.)

Again, this is a personal view and not based on hard evidence. But I'm currently in the middle of testing it out :)
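A rough sketch of the anchor-text check described above (the "money keyword" list and the notion of a suspicious ratio are assumptions for illustration only):

```python
# Hypothetical check: what share of a new site's inbound-link anchor texts
# contain "money" keywords rather than the brand/company name?

MONEY_KEYWORDS = {"cheap", "buy", "best", "mortgage", "insurance"}  # invented list

def money_anchor_ratio(anchor_texts: list[str]) -> float:
    if not anchor_texts:
        return 0.0
    money_hits = sum(
        1 for anchor in anchor_texts
        if any(word in MONEY_KEYWORDS for word in anchor.lower().split())
    )
    return money_hits / len(anchor_texts)

# A brand-new site whose anchors are mostly money terms looks "worked on".
anchors = ["cheap blue widgets", "buy widgets", "Acme Widget Co", "best widget prices"]
print(money_anchor_ratio(anchors))   # 0.75 -> suspicious for a site with no history
```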

shri




msg:770297
 12:47 pm on Mar 12, 2005 (gmt 0)

I agree with Jake. This is not so much about frequency of words in the database as it is about competitiveness and sectors.

Travel / location-specific SERPs are definitely quarantined or evaluated differently now than they were about 13 or 14 months ago. So are many financial sectors. Both of these have far fewer than 80 million pages.

We've got plenty of evidence based on searches starting with a "two word city": neither word1 nor word2, nor the phrase word1 word2, has more than 56 million results. Yet even phrases which have fewer than 20K pages (say, word1 word2 obscuredistrict) were difficult to rank for with a new site.

Also, I'm not sure if my earlier example with "mortgages" was correct; while it appears about 30 million times, "mortgage" appears about 129 million times, so that might be an invalid exception.

mrMister




msg:770298
 3:06 pm on Mar 12, 2005 (gmt 0)

Two words for you:
Bayesian

Filtering

Do you even know what those two words mean?

From what I've seen, I get the suspicion that there is a database of web pages that are grouped together in some way (maybe spammy pages, maybe there is some other criterion for selecting them). New sites are passed through the Bayesian filter. If that determines that the pages within them have a reasonable amount of similarity to the pages in the database, then the site is sandboxed.

This would go some way to explaining the keywords theory, where sites on a certain subject are sandboxed.
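A toy sketch of the Bayesian similarity check being speculated about here (the corpora and threshold are invented for the example; this is a naive-Bayes-style score, not anything Google has confirmed using):

```python
import math
from collections import Counter

# Toy naive-Bayes-style score: how similar is a new page's wording to a
# reference corpus of "sandbox-trigger" pages versus ordinary pages?

trigger_corpus = ["cheap widget deals best widget prices", "buy widgets online cheap"]
normal_corpus = ["history of widget manufacturing in europe", "widget museum opening hours"]

def token_counts(corpus):
    counts = Counter()
    for doc in corpus:
        counts.update(doc.split())
    return counts

def log_likelihood(tokens, counts, total):
    vocab = len(counts) + 1                      # Laplace smoothing
    return sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)

def looks_like_trigger_page(page_text, threshold=0.0):
    tokens = page_text.split()
    trig, norm = token_counts(trigger_corpus), token_counts(normal_corpus)
    score = (log_likelihood(tokens, trig, sum(trig.values()))
             - log_likelihood(tokens, norm, sum(norm.values())))
    return score > threshold     # positive score: closer to the trigger corpus

print(looks_like_trigger_page("cheap widget prices online"))         # True
print(looks_like_trigger_page("widget museum opening hours today"))  # False
```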

theBear




msg:770299
 4:57 pm on Mar 12, 2005 (gmt 0)

Folks, you might be interested in looking at how this puppy works.

It can be found at sourceforge and is called wordindex.

It is a fast retrieval system based on the words in a file. The indexer is kinda interesting as well.

Note it doesn't have any way of rating things, but it sure is good and fast at finding stuff.
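For anyone curious, a minimal sketch of a word-based inverted index of the kind being pointed at here (this is not wordindex itself, just an illustration of the idea; like the original, it does retrieval only, no ranking):

```python
from collections import defaultdict

# Minimal inverted index: map each word to the set of documents containing it,
# so a lookup is a dictionary hit rather than a scan of every file.

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # word -> {doc ids}

    def add(self, doc_id, text):
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, query):
        """Return docs containing every word in the query (AND semantics)."""
        sets = [self.postings.get(w, set()) for w in query.lower().split()]
        return set.intersection(*sets) if sets else set()

idx = InvertedIndex()
idx.add("doc1", "blue widget reviews")
idx.add("doc2", "red widget prices")
print(idx.search("widget reviews"))   # {'doc1'}
```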

BillyS you keep right on yacking. You are making a lot of sense to this old timer.

I swear the faster we go the further behind we get. Oh and I still see the same mistookes happening over and over. So we aren't getting any smarter.

webhound




msg:770300
 5:22 pm on Mar 12, 2005 (gmt 0)

Yeah, I've been thinking this for months and months now. The more competitive the term, the tighter the filter - however that is determined. Could be # of results, could be category, who knows. But I do know that we have always come up for the more obscure phrases and not the main kws. Haven't since January '03.

Interestingly enough one of our sites just popped onto the first page for some pretty sweet terms just this morning. This site had previously been buried.

The only changes we've made were to the title of the site and to the inbound links - previously, all inbound links went to the \ root directory and there were only 3 different variations of the anchor text. Now all inbound links go to the index page and the anchor text is varied as much as possible.

So we are sitting here scratching our heads as to why this site would come out of the "sandbox" while none of the other sites have.

BillyS




msg:770301
 5:52 pm on Mar 12, 2005 (gmt 0)

Let's pretend that I am on the right track and there are one or more databases that Google uses to return results. I am going out on a limb here and stating that Google is competent and does not make mistakes, only compromises...

Spam is a problem for Google. Spammers can easily get hold of popular keywords or phrases. Google also creates a primary results database that is fast and is based on frequently queried terms. Ahh, a good fit for the engineers - right? Exclude certain websites from the primary index and you can return fast, stable and spam-free results. The casualty will be new websites of high quality - right?

But why should the engineer care about that? If a search term is popular, then it has probably been asked and answered millions of times. If you're asking for information on "hotels in california", then how stale can sites greater than 1 year old be? (Pretending the filter is based on time.)

But how to keep spammers out of the primary index or database - sandbox sites? That is the challenge of the engineer. How can they mark a page or site to indicate that this does not qualify for inclusion in the primary index?

One way is to create a composite filter that results in a "spam score." They have a lot of information, so this is quite easy. The problem for us is to reverse engineer what the score might be based on to prevent it from tripping - if that is even possible.
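A hedged sketch of the sort of composite "spam score" being hypothesized here (every signal name, weight, and threshold below is invented for illustration; nothing in it is a known Google formula):

```python
# Hypothetical composite spam score: combine a few weighted, normalized signals
# and compare against a threshold to decide primary-index eligibility.

SIGNAL_WEIGHTS = {
    "link_spurt":         0.4,  # sudden burst of new inbound links
    "money_anchor_ratio": 0.3,  # share of inbound anchors using "money" keywords
    "domain_age_penalty": 0.3,  # newer domains score higher (riskier)
}

def spam_score(signals: dict[str, float]) -> float:
    """Weighted sum of signals, each assumed to be normalized to 0..1."""
    return sum(SIGNAL_WEIGHTS[name] * signals.get(name, 0.0)
               for name in SIGNAL_WEIGHTS)

def eligible_for_primary_index(signals: dict[str, float],
                               threshold: float = 0.5) -> bool:
    return spam_score(signals) < threshold

new_site = {"link_spurt": 0.9, "money_anchor_ratio": 0.8, "domain_age_penalty": 1.0}
old_site = {"link_spurt": 0.1, "money_anchor_ratio": 0.2, "domain_age_penalty": 0.0}
print(eligible_for_primary_index(new_site))  # False -> effectively "sandboxed"
print(eligible_for_primary_index(old_site))  # True
```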

Personally, I think that Google is giving us two pieces of information to work with already:

1. - brand new websites seem to avoid the sandbox in their first few weeks of existence.

2. - The link: command does not return all the links that Google is aware of. It appears to be broken - or is it?

When I am involved in problem solving at work, the first thing I like to do is list the facts. Then see if they start to group naturally. Then start to ask the "why?" There is a lot of good information that can be shared since it appears many sites were just released. If we keep taking the same approach, we will keep coming up with a dead end.
