Forum Moderators: Robert Charlton & goodroi


Nailing down the "sandbox"

How deep is the sand? Who has to play there?


suidas

10:51 pm on Jan 17, 2005 (gmt 0)

10+ Year Member



I've seen a lot of messages about the sandbox, but none of them are clear about how major the effect is. Recently someone responded to a why-isn't-my-site-number-one request with:

If your site is less than a year old you are likely sandboxed.

I can't believe most sites under a year's age are in some sort of penalty box. Google would be useless. So, I want to know:

1. Are all sites sandboxed, or do certain traits (like affiliate links, low content) trigger it?
2. How long does it last?
3. How variable is the duration?
4. How do you know your site is being sandboxed?
5. Does the effect taper off or is it a binary thing?
6. What gets you out of the sandbox? Is it merely time or do good links or whatever speed it up?

Thanks.

Luckasoft

10:06 am on Feb 1, 2005 (gmt 0)

10+ Year Member



---My question is: all of those millions of new sites that have been created in the last 9 months, do you really think they will be in the top 10 results, knocking down sites that were established many years ago?

I am telling you, when I search using keywords + trademark (which is also the domain name), my site is still nowhere. I get tons of descriptions of our product on "old" sites instead.

---if you don't like Google, focus your new or "sandboxed" sites on traffic from Yahoo or MSN.

Thank God I did. And I'm getting some traffic.

robster124

10:37 am on Feb 1, 2005 (gmt 0)

10+ Year Member



I can't understand you guys who are suggesting that the whole sandbox phenomenon is a capacity problem and pointing to sentences in their IPO brochure that tell share buyers that there is a risk that Google may not be able to cope with extra data etc.

When I do site:mysandboxedsite.com - it's there! It's there in the Google databanks! The problem is that they choose not to use this data in constructing SERPs.

Please correct me if I'm missing something important here - but surely this is not indicative of a capacity problem...

pr0purgatory

10:47 am on Feb 1, 2005 (gmt 0)

10+ Year Member



And here's the fun one, I stuck my site into the sandbox on purpose, I want to see how it works first hand. It's lasted longer than I was hoping, oh well...

2by4 I have 2 questions for you...

1.) Are you completely INSANE? :P

and

2.) What exactly did you do to get your site IN to the "sandbox"?

lammert

3:42 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



robster124: You are saying exactly what I was thinking while reading this thread over the last few days. Google has two indexes - they have admitted it themselves - but they have no problem mixing these two indexes in one search query. In fact, if you search on no-money keywords, sandboxed sites rank high; as soon as one money keyword is involved, they sink to the bottom. I have not seen much difference in the query time displayed for the two situations. Searching more than one index for a query does not seem to take much more computing effort than searching only the primary index.

So Google has a query system which is capable of searching both indexes at once, and also capable of giving one index more priority than the other depending on keywords. From a software point of view this is actually more difficult than a query system that treats both indexes equally all the time.

There may be index problems storing all web pages in one index, but there are no query problems with more than 2^32 pages spread over more than one index. They handle this situation transparently for the searcher. Normal and supplemental results can be present in one SERP.

Which leads IMHO to just one conclusion: the sandbox is there intentionally, not by accident or because of index capacity problems.

europeforvisitors

3:43 pm on Feb 1, 2005 (gmt 0)



Doesn't this make the sandbox story just a wee bit more worthy of media coverage?

No. It's perfectly normal for a company to include worst-case scenarios in its prospectus. If a cruise line included a statement that one of its ships could sink, would you take that to mean they were worried about the seaworthiness of their vessels?

You have to look at the bigger picture. If G wanted to increase revenue via AdWords channels, why not sandbox ALL sites, including OLD established sites with deep pockets?


I believe that this is very obvious. If they applied it to all sites it would have been noticed instantly and would have created really bad publicity. Doing it with new sites lets it happen gradually and eliminates risk.

If Google really wanted to put the squeeze on potential AdWords advertisers (an allegation that gets repeated here several times a day), they could simply reweight their algorithm to favor information pages over boilerplate affiliate pages, e-commerce catalog or order pages, and "scraper" directories. Such a change would be true to Google's corporate mission statement, it would improve the quality of search results in the eyes of most users, and any boost to AdWords revenues could be defended as a happy coincidence.

lammert

4:43 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For those who are still convinced that the sandbox is the result of index problems because the number of stored web pages went above 2^32, I would advise reading the document at [computer.org...] published in March-April 2003 by the IEEE. This is a scientific paper written by three Google engineers that goes into detail about their index system.

Google inverts the document index into a word index, i.e. for every possible search word they maintain a list of document entries that contain that word. If there were a 2^32 index limit, it would not be a 2^32 document limit but a 2^32 keyword limit. There is also no sign that their index has a 2^32 (4 gigabyte) size limit. I quote:

The raw documents comprise several tens of terabytes of uncompressed data, and the inverted index resulting from this raw data is itself many terabytes of data.

If you read this document - it is very interesting - you will see that they had just one thing in mind: scalability, not a fixed index with 2^32 entries. In 2003 they had "a few thousand machines per cluster". If you need to serve twice as many web pages, simply double the number of machines and you get the same query time.

if ( sandbox != index capacity problem ) then sandbox = intentional; endif
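The word-to-document inversion described above can be sketched in a few lines. This is a toy example with made-up documents and IDs, not Google's actual data structures:

```python
# Toy inverted index: map each word to the set of documents containing
# it, then answer a multi-word query by intersecting posting sets.
# All documents and IDs here are invented for illustration.
from collections import defaultdict

docs = {
    1: "new site about blue widgets",
    2: "old established site about widgets",
    3: "travel information site",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        inverted[word].add(doc_id)

def search(*words):
    """Return the IDs of documents containing every query word."""
    result = None
    for word in words:
        postings = inverted.get(word, set())
        result = postings if result is None else result & postings
    return sorted(result if result else [])

print(search("site", "widgets"))  # -> [1, 2]
```

The point of the inversion is that a query never scans documents; it only intersects the posting lists for its words, which is why the number of distinct words, not pages, would bound such an index.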

trillianjedi

5:00 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



if ( sandbox != index capacity problem ) then sandbox = intentional; endif

Or an unintentional side effect of something else which was intentional.

TJ

lammert

5:20 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



trillianjedi: I agree that the sandbox itself might not have been the first intention, so maybe I put it too strongly, but a technical index issue is very unlikely. As this thread is about nailing down the sandbox, what options are left?

Google's goal is to make all information available through the web. If they don't see e-commerce sites as being as valuable as editorial sites, that might be a reason to give one site better rankings than another. Older sites get the benefit of the doubt. If this is the case, then no one who is in the sandbox with a site type Google doesn't like will ever come out, unless the site's content changes. In this case the sandbox itself is intentional.

On the other hand, if the sandbox is there to identify and eliminate spam sites, then genuine sites - both e-commerce and editorial - would see their rankings increase after some months in the sandbox. That would be a situation where the sandbox is only a side effect of spam elimination.

We see that many sites have been in the sandbox for quite some time now, longer than what I would expect to be necessary for deciding whether a site is genuine or spam. This makes me think the sandbox is indeed intentional.

xcomm

7:27 pm on Feb 1, 2005 (gmt 0)

10+ Year Member



... clouding the waters...

Lammert,

Google's goal is to make all information available through the web.

False. Google's goal is to make as much money as it can through AdSense/AdWords to maintain its survival.

But with its reckless management gambling so high, it seems well on the way to ruining its fundamentals as a search engine.

BTW:
There are a lot of once-big players out there crying about having lost developer support...

lammert

8:12 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



xcomm, I don't want to argue about the main goal of Google. I thought it was this - and I really hope they are there primarily for the information, not the money - but if you have other thoughts I respect them.

What I would like to discuss is the index theory and its relation to the sandbox. 2by4 has dominated this thread, posting about (t)his theory nearly every three messages, and I am really looking forward to reading his comments on my posts.

TaylorAtCTS

8:46 pm on Feb 1, 2005 (gmt 0)

10+ Year Member



lammert, Google is a business; their goal is to make money. Their mission statement and their actual goal are different. The mission statement is what they tell the public they try to do; their goal as a business is to make money.

lammert

8:51 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



TaylorAtCTS, thank you for saying this a second time, but it is not the main point of my posts. I would prefer a reaction to the index vs. sandbox issue.

This is a thread where the last 150 posts were hijacked by the idea that Google has been facing massive capacity problems for more than a year because of some sort of 2^32 issue. Unfortunately the only reactions I get are about minor issues in my posts, not about the real thing.

Sorry that I posted real-world information to this thread. I guess most of you prefer speculation and conspiracy theories to the truth.

- Lammert

Scarecrow

9:21 pm on Feb 1, 2005 (gmt 0)

10+ Year Member



For those who are still convinced that the sandbox is the result of index problems because the number of stored web pages went above 2^32, ... blah, blah

The 2^32 limit relates to docIDs, not wordIDs. These are two different types of IDs.

Here's another fun paper [www9.org], co-authored by a Google researcher:

"Rather than deal directly with URLs, the Connectivity Server uses a set of densely-packed integers to identify pages."

"Recall that page identifiers are a dense set of integers."

"To avoid wasting space, we pack vector records densely."

"Functions in the Connectivity Server convert between these integers and text URLs. In our work with the Connectivity Server, these identifiers have proven more convenient to handle in code than text URLs."

"Notice the use of integers to represent terms; as with page IDs in the Connectivity Server, we find these to be more convenient to manipulate than text strings."

"The page ID of the vector is stored in the first 4-bytes of the vector's record."

Now look at The Anatomy of a Large-Scale Hypertextual Web Search Engine [www-db.stanford.edu] that describes the original Google architecture.

Look about halfway into the paper, where it says, 4.2.6 Forward Index. There is a diagram on the right side of the page. They show the docID as 27 bits. In the text they explain that the docID is combined with the length of the hit list. In other words, in the beginning they only had a 27-bit docID. Obviously, they had to break off the 5 bits for the length of the hit list into a separate byte fairly soon after that, because 27 bits gives them only 134,217,728 unique docIDs instead of the 4.2 billion we all assume.
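The arithmetic behind those docID widths is easy to check. A quick sketch, with nothing Google-specific beyond the bit counts mentioned above:

```python
# Unique documents addressable by a docID of a given bit width.
for bits in (27, 32):
    print(f"{bits}-bit docID -> {2 ** bits:,} unique documents")
# 27-bit docID -> 134,217,728 unique documents
# 32-bit docID -> 4,294,967,296 unique documents
```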

There is no question whatsoever that for years Google used a 32-bit docID. The question is when did they change it, if they did?

Sure, you can have two, three, or many indexes. The problem is that you run into difficulties merging them into a coherent ranking formula unless every unique web page has a unique docID. That is exactly what we're seeing today.

2by4

9:36 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"This is a thread where the last 150 posts were hijacked by the idea that Google is already more than a year facing massive capacity problems because of some sort of 2^32 issue. Unfortunately the only reactions I get are about minor issues in my posts, not about the real thing."

Happy? You got two. The threads weren't 'hijacked', and we're not talking about 'conspiracies' - what is up with this kind of talk? The subject is the sandbox. All of the issues brought up are directly related to the sandbox. The sandbox is of great interest to people; that's why these threads get so long. The subject has never veered from the sandbox. Different causes and explanations for the sandbox have been offered, as well as a possible way out, useful if you are in possession of an authority-type link farm like optiplex is. Some posters insist on maintaining that a system that is increasingly starting to look very broken is not broken. Why? No idea. I think some people need to believe in something infallible, and have chosen Google to occupy this position. God works better, or Buddha, but if Google does it for you, go for it.

lammert, that's a good article. But it doesn't really say anything that new, although it does make one thing clear:
Google will not use hardware that is not cheap, which explains the lag in fully upgrading the systems, I think. Price/performance/heat is critical to them. The prospectus quote given shows extremely clearly that not only is Google planning on updating their systems this coming year, or next at the latest, they do not have total faith that the process will succeed. Just how much concrete evidence is required to drive this point home? How much failure will it take until people in general will admit it? Obviously far more than I thought. That is not a standard disclaimer; that's a very carefully written passage that is covering Google's as$ from the lawsuits people suggested. Buyer beware.

It's the docIDs that are the restricting factor, if I remember correctly. If you read back through that you'll see it. And it's the docIDs that were written with 2^32 limitations. That's why, if you were watching, Google's home page count sat stuck at that number all last year. Since this fact was self-evident, I find it hard to argue it with anyone unless they can offer a reasonable explanation for why that number was stuck there. Many people have offered such explanations, but for some reason they just won't stick.

The terabytes of data are the actual content that is retrieved using the primary index - someone correct me if I'm wrong here, I'm sure you will. There's no restriction on how much data they can store; there's a restriction on how many documents can be indexed within a single index.

So let's see where we are now: Google's prospectus has no meaning. The stuck pages-indexed count has no meaning. The overnight doubling of the index has no meaning. Google representatives telling someone directly that they run 2 main indexes has no meaning. I have to conclude that there is nothing anyone can say to some of the posters on WebmasterWorld to get them to comprehend the reality sitting in front of their faces. Maybe Google is doing more PR work than I first thought; either that or some people just didn't learn any critical thinking skills in school.

<added>Scarecrow, LOL, we posted at the same time.

[edited by: 2by4 at 9:54 pm (utc) on Feb. 1, 2005]

SEOPTI

9:43 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I see a problem with domains which are not in the sandbox going back into the sandbox all the time.

This is really frustrating. You can't change the title or content of a site, because the domain goes back into the sandbox.

2by4

10:03 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"I have 2 questions for you...
1.) Are you completely INSANE? :P
and
2.) What exactly did you do to get your site IN to the "sandbox"?"

Well put. 1: hard to tell, they say the insane think that they are sane, so you'd have to ask someone else.

2. About the opposite of what optirex recommends.

I had to rebrand it anyway, so I figured, hey, what better time than now, I want to really see the sandbox first hand, LOL, and I am seeing it. Why not, traffic was jumping up really fast, everything I put on it was hitting top 10 in a few days, what better time to do a rebranding... Actually, I really thought that the 301's would cut down the wait as google realized it was the same site, but no such luck.

Again, when I did that I didn't expect that Google was as seriously broken as I'm now starting to realize it is, I was looking at maybe 4-6 months, but now I realize that it's not a question of time, it's a question of when google upgrades to the new version, I'll call it google II. I've sacrificed this site before, oddly enough both times google has been directly at fault for that, but in two totally different ways.

[edited by: 2by4 at 10:09 pm (utc) on Feb. 1, 2005]

nzmatt

10:03 pm on Feb 1, 2005 (gmt 0)

10+ Year Member



Hasm, very interesting find with your keyphrase x16 – although I’m still not sure exactly what it means…

I tried it and my sandboxed site is in the top 5! Just like on MSN/Yahoo!

I just about fell off my seat in shock!

There is my sandboxed site staring back at me.

I normally can’t find my sandboxed site in the Google SERPS unless I include the brandname/url in the search phrase.

This post is WebmasterWorld working at its best, as many people are sharing helpful little bits of the puzzle.

2by4

10:07 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



nzmatt, I agree. I think this may be the most information-filled sandbox thread I've read in the last year. People are dropping some high-quality knowledge here. I thought it had ended a bit earlier, but the observations and factoids keep rolling in.

lammert

10:25 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Scarecrow:

Two good articles you came up with. The first is, however, IMHO not relevant to the Google index discussion, because in the last paragraph it says: "[...]at Compaq Computer Corporation's Systems Research Center, which is where the research described here was done." So I see no connection with Google. Also, all testing was done on AltaVista and Yahoo data and on Alpha hardware, not on Google data and PCs.

The second document, however, is without doubt related to the current Google search algorithm. You are right that the 27 bits used for the docID only give some 100 million documents to store. So they must have changed this approach well before 2003. If it was possible to move from a 27-bit-wide docID to, let's say, a 32-bit-wide one first (necessary for 4G documents) and reformat the data records for the new field size, there is no doubt they could also expand it further (64 would be the most logical step, looking at their x86-based architecture). And such a size increase of just one field can be done overnight by taking one cluster off line, reformatting the records, and then bringing the cluster back on line. The main search algorithm is not severely affected by a mere field size increase.

As queries can display a combination of sandboxed and non-sandboxed items, they have increased the docID to more than 32 bits without doubt. Even if they use separate indexes with a maximum of 4G docs each (which is possible because of the distributed design), there still has to be an ID for the index. If they didn't increase the docID but instead added an indexID, they effectively created a globaldocID represented by the pair (docID, indexID). And this globaldocID is still a unique representation of each document.

For searching there is no implicit docID field size problem involved. The docID is just an ID, a number which could be anything that references a record. The search algorithm itself is not dependent on the size of the docID, because the algorithm searches on words and returns the docID. Only for PageRank calculations must every document be indexed separately, but PageRank is a value assigned to each page, not to a search query, and can therefore be calculated off-line in batch runs.
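The (docID, indexID) pairing described above can be made concrete with a bit-packing sketch. The function names and widths here are assumptions for illustration, not anything Google has documented:

```python
# Pack an index number and a 32-bit docID into one wider "globaldocID",
# and unpack it again. Widths and names are hypothetical.
DOCID_BITS = 32

def make_globaldocid(index_id: int, doc_id: int) -> int:
    """Combine (indexID, docID) into a single unique identifier."""
    assert 0 <= doc_id < 2 ** DOCID_BITS
    return (index_id << DOCID_BITS) | doc_id

def split_globaldocid(gid: int) -> tuple[int, int]:
    """Recover the original (indexID, docID) pair."""
    return gid >> DOCID_BITS, gid & (2 ** DOCID_BITS - 1)

gid = make_globaldocid(1, 42)
print(split_globaldocid(gid))  # -> (1, 42)
```

Because the pair maps one-to-one onto a wider integer, two 32-bit indexes with overlapping docIDs can still yield a globally unique identifier per document.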

2by4:

First of all, I don't believe the counter on the homepage of Google. They just give it a value for political reasons, like starting discussions here :)

I have recently added many pages and they all showed up in the primary index within 24 to 48 hours. They are in the top 10 for competitive keywords, but yes, I have an old site, existing even before Google existed...

Don't get me wrong. I do not say there is no sandbox. Seeing all the people having problems getting new sites to a reasonable position in Google is a clear indication that the sandbox exists. What I wanted to say is that this is not because of a technical problem; Google has reasons to do this. Either to discourage scraper sites, or to boost information sites in favor of commercial sites, or, as some people here have suggested, to boost the use of AdWords. And IMHO they have the right to do this. It is their index and we don't pay for the use of it. We only complain when our own site doesn't rank in the top 10 listing.

But please see everything in the right perspective. Google started with a 27-bit index, not 32. So if they were capable of upgrading from 27 to 32, they are capable of upgrading to any ID size they want. How many PhDs are working there? Do you really think it will take 50+ PhDs more than one year to increase the size of an identifier? LOL

If you are thinking that every piece of information coming from the Googleplex should not be taken seriously, I think we have come to the point of religion, not science, and I don't want to go into a religious debate. You are free to think what you want. Because all your assumptions are based on this one thing - "Google doesn't tell the truth" - it is very difficult for me to find arguments that you are interested in hearing.

2by4

10:47 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"they have increased the docID to more than 32 bits without doubt."

Hmm. Why? I mean, why without a doubt? Why was 32-bit chosen? What type of systems are running this? The choice of 27 bits makes sense for a smaller data set, and moving to 32 on 32-bit machines makes sense. I wish I had a better background in hard-core DB number crunching; there used to be guys here who did, and they'd usually fill in the details, sigh...

In the original thread on the capacity issue, one or two guys who obviously knew how to work with very large systems, and who understand these questions much better than me, did not agree with the statement that doing that type of upgrade is easy or trivial. I'd suggest that the text of the prospectus really admits this very clearly: it's not easy, and they haven't done it yet.

Re believing or not believing what google says: I don't believe or disbelieve them, their words have about the same value as the words of any other large business enterprise. Sometimes they will be true, sometimes they will be half true, and sometimes they will be flat out lies. However, you'll note, I do believe some of the things they say, for example the text on the prospectus, which has to be true, for legal reasons, I believe. Personally, I think the index page count was a vestige of Google really not wanting to be evil, I see it as sort of the victory of honesty for that time internally. They would have been better off simply removing that IMO, but they didn't. I think there are googlers who really took the whole don't be evil thing seriously, I suspect now the ones who got good stock option packages are willing to retreat from that position a bit, like most people would if offered the same level of wealth.

"I have recently added many pages and they all showed up in the primary index within 24 to 48 hours. They are in the top 10 for competitive keywords, but yes, I have an old site, existing even before Google existed..."

To say the primary index is full is like saying a glass is full on a sunny day. Water evaporates, sites and pages come and go; it's not totally full, it's just pretty much full. But even there, I've seen radically longer lags than I used to for adding large blocks of new pages, also the subject of some past threads here. Why? To me it looked for all the world as if the spider was just waiting for permission to grab them. PR obviously affects a site's priorities; WebmasterWorld, for example, will get all its pages in fast. But smaller sites, lower-PR sites, also used to get all their pages in fast, and that's not the case anymore.

"So if they were capable to upgrade from 27 to 32 they are capable to upgrade to every ID they want. How many PhD's are working there? Do you really think it will take 50+ PhD's more than one year to increase the size of an identifier? LOL"

I'll admit something: I grew up around PhDs, and I don't have the faith you seem to have in that label. Upgrading to 32 bits on a 32-bit system, yes, that seems quite simple. Upgrading to a 40-bit scheme within a 32-bit system is not the same game. If I knew processing a bit better I'd be happier, I'll readily admit. Someone a while ago pointed out that the newer Intels can address more than 32 bits of memory space, but I don't think that's the actual roadblock; I think it's more basic than that.

Microsoft fails to do things routinely. They have lots of PhDs, and a research budget that I think is usually higher than Google's gross income, though they just cut it in half. Their PhD collection failed to create their new filesystem, due in NT 4, then in 2000, now in Longhorn. Many PhDs do not equal success. In fact, sometimes they equal failure.

But a thoughtful post nevertheless, much appreciated.

2oddSox

11:21 pm on Feb 1, 2005 (gmt 0)

10+ Year Member



Hasm, very interesting find with your keyphrase x16 – although I’m still not sure exactly what it means…

I think you were referring to Mahoogle's post, matt.

I tried this out and, sure enough, up popped my site on the first page. Interestingly the number of returned results increased by over a million also.

Scarecrow

11:21 pm on Feb 1, 2005 (gmt 0)

10+ Year Member



Do you really think it will take 50+ PhDs more than one year to increase the size of an identifier? LOL

Not if they had the green light to get it done. You're forgetting that corporations have management. If management decided to "make do" with an extra index instead of redoing all the software that used docIDs, then any number of PhDs at Google would be helpless. Maybe the problem didn't look that serious when the Google update first crashed in April 2003. Maybe they had other priorities (ads making money beyond all expectations, for example, and the marketing suits screaming for more resources to make even more money). Maybe they decided that concentrating on profits prior to the IPO was the smart thing to do. Maybe they thought they could finesse the multi-index approach so that no one would notice. ("Send in GoogleGuy -- he can head off any doubters with a couple of pithy phrases at WebmasterWorld!")

The docID is not a minor part of the overall scheme of things. The inverted index consumes a lot of space. Moreover, it's the first index that has to be consulted for every incoming query (well, they probably cache the results for Britney Spears, but you get the idea). Given that it's consulted so frequently, you need multiple copies of the inverted index in a distributed system to maximize throughput. The docID is used, on average, twice per word per web page. That's because they use two inverted indexes, the fancy and the plain. The fancy is much smaller, but they also use the docID elsewhere in the system, so let's figure two docIDs per word per unique web page. Figuring the average page is 300 words, here are the space requirements for various lengths of docIDs, given 4 billion web pages, for a single fancy+plain pair of inverted indexes.

4-byte docID: 300 × 4 billion × 8 = 9.6 × 10^12 bytes (about 10 terabytes)
12-byte docID: 300 × 4 billion × 24 = 2.9 × 10^13 bytes (about 29 terabytes)
20-byte docID: 300 × 4 billion × 40 = 4.8 × 10^13 bytes (about 48 terabytes)

My point is that when Larry and Sergey decided to expand the docID from 27 bits to 32 bits, they didn't say, "Heck, let's throw in an extra 32 bits in case the web increases beyond 4 billion pages." Remember, this was at a time when one billion pages on the web seemed like a wild overestimation.

It's not trivial to expand the docID -- that's all I'm trying to say.
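The storage figures above can be reproduced directly. The constants (300 words per page, two docID occurrences per word, 4 billion pages) are taken from the post itself:

```python
# Reproduce the inverted-index size estimates from the post.
WORDS_PER_PAGE = 300
PAGES = 4 * 10 ** 9
OCCURRENCES_PER_WORD = 2  # fancy + plain inverted index

for docid_bytes in (4, 12, 20):
    total = WORDS_PER_PAGE * PAGES * OCCURRENCES_PER_WORD * docid_bytes
    print(f"{docid_bytes}-byte docID: {total / 10 ** 12:.1f} terabytes")
# -> 9.6, 28.8 and 48.0 terabytes respectively
```

Note how a wider docID multiplies the size of the single most heavily replicated structure in the system, which is the core of the "expanding the docID is not trivial" argument.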

TaylorAtCTS

11:26 pm on Feb 1, 2005 (gmt 0)

10+ Year Member



I tried the keyword x16 and it works for my sandboxed site...

It's crazy... I never imagined it, but there it was, in the top 3 instead of not found at all.

Elixir

11:27 pm on Feb 1, 2005 (gmt 0)

10+ Year Member



2by4, I am not technical enough to debate the technical issue with you, but I totally agree that some people are living in a dream - or else they bought a lot of Google stock and have to believe that nice Google is just going to fix this. I lived through Florida, as many here did, and not too long after that fiasco it was clear they knew they had messed up and it would get fixed. Now there are no comments, no announcements; there is nothing to say, as they don't know what to do.

Scarecrow

11:44 pm on Feb 1, 2005 (gmt 0)

10+ Year Member



As queries can display a combination of sandboxed and non-sandboxed items, they have increased the docID to more than 32 bits without doubt.

Not true. You can have an on-the-fly meta controller that responds to queries, and melds from index1 and index2. Both index1 and index2 use the same software, each limited by their 32-bit docIDs. The thing is, you must assign new documents into one index or the other index, because the docIDs overlap. They overlap because you have not rewritten the core software to expand beyond 32 bits for the docID. It's a lot easier to start a second index using old software than it is to rewrite (and debug) the entire system.

The meta controller says:

IF query is This or That, THEN use index1 only.

IF query is The Other Thing, THEN use index1 up to point A, and to fill out the query, use index2.

If 95 percent of your queries can be handled by index1, then this wouldn't even be that inefficient. Remember, not all searchers are SEOs looking deep for their keywords.

This way you can use two indexes without expanding the docID, as long as you take measures to try to make sure the new documents go in only one or the other. Heck, you can purge duplicates on the fly too, if it becomes a problem.
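The meta-controller logic sketched above might look something like this in code. The routing rule and "money term" list are invented here purely to make the control flow concrete:

```python
# Toy two-index meta controller: serve "money" queries from index1
# only; for other queries, fill out the results from index2.
MONEY_TERMS = {"loans", "insurance", "hotels"}  # hypothetical

def meta_search(words, index1, index2, limit=10):
    """index1/index2 are functions mapping a query to a ranked doc list."""
    results = list(index1(words))[:limit]
    if not MONEY_TERMS & set(words):
        # Non-money query: docIDs may overlap between the old-software
        # indexes, so skip duplicates while filling from index2.
        for doc in index2(words):
            if len(results) >= limit:
                break
            if doc not in results:
                results.append(doc)
    return results
```

For example, with `index1 = lambda q: ["a", "b"]` and `index2 = lambda q: ["b", "c", "d"]`, a query on `["hotels"]` returns only `["a", "b"]`, while `["travel"]` with `limit=3` fills out to `["a", "b", "c"]` - matching the post's point that most queries never need the second index at all.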

If the filters since November 2003 have shown anything, it's that they operate on the search terms in real time. And there's no longer any dispute that 1) PageRank is not calculated the way it was prior to April 2003; 2) the Supplemental Index that started in August 2003 is an entirely separate index, and the reasons for it given by Google make zero sense; 3) there's a growing URL-only problem that is very, very suspicious, and again the reasons given by Google make zero sense; and 4) now we have the sandbox.

RichTC

11:57 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That keyword x16 is a great find. Amazing.

A number of our keywords would position us between 10 and 20 in the SERPs (big money words), under some specialist sites, which I would expect, but above the dross - which I guess would be a fair placing.

Currently our site is in the sandbox, outside the top 1000, behind non-relevant dross.

What I do find amazing is that some of the sites currently ranking high under these search terms - sites which I genuinely believe shouldn't list above us - fall out of the SERPs on the keyword x16 search.

The index is due one mother of a major update here; the current SERPs are out of date and it's clear to see.

The keyword x16 results are what we should be seeing. This manipulation of results will kill Google off, IMO. How long can it sustain out-of-date search results?

TaylorAtCTS

12:31 am on Feb 2, 2005 (gmt 0)

10+ Year Member



All I can say: hurray Yahoo and MSN, because I'm ranking quite well there.

2by4

12:32 am on Feb 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"2by4 has dominated this thread with posting about (t)his theory nearly every three messages"

You made me wonder, man, did I post that much? Too many, I'll admit, but it's only like 1 out of 9; we have to watch the facts here, even though it's not a popular thing to do for some really strange reason. I used a strange method to come up with this result, it's sort of empirical: I counted.

TaylorAtCTS

12:34 am on Feb 2, 2005 (gmt 0)

10+ Year Member



This is so weird: with my keyword x17 I'm not on the front page, and with my keyword x15 I'm not on the front page.

BUT with keyword x16 I'm #1. I just don't understand why this is.

I'm going to try a few more things...

----

edit:

my competition doesn't show up on the first page with keyword x16... odd

lammert

12:40 am on Feb 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Scarecrow: I have found that parts of your post were actually copied from Daniel Brandt's site. I hope to hear your personal view on this soon, or maybe you are Daniel himself; please forgive me in that case.

2by4: I agree that there is something involved in increasing the size of the docID beyond the 32-bit limit. But it is not impossible. Even Daniel Brandt admits this, and he first came up with the 2^32 problem in June 2003, based on a single anonymous post on WW on June 7, 2003.

You asked me to be open to the facts, and yes, I believe the sandbox exists; yes, the counter on Google's homepage is irregularly updated; and yes, Google has admitted they split indexes.

But also: queries give results from both sandboxed and non-sandboxed documents, so the indexes are connected, at least at the webserver level. site:www.example.com gives results from both (all) indexes.

Now let's see it from the other side. Suppose you and Daniel Brandt have hit on the right thing and I am terribly wrong.

First experiment: search for the word and. This is possible at www.google.nl; I don't know if this word is filtered in other languages. Google returns 8,000,000,000 pages. The same for a and the: also 8,000,000,000 pages. Interesting, because these words are only present in English-language documents, and not in all of them. Google didn't count, just returned some high value to impress us :)

Second experiment: search for is. Now we get 3,830,000,000 pages. I believe this one. It seems to be calculated, not a fake value.

Now let's go to a "real" search engine, Yahoo. The same words:

and: 1,810,000,000
a: 2,060,000,000
the: 1,900,000,000
is: 1,380,000,000

These seem to be real-world values. If the count for "is" can be used to estimate the relative size of the two indexes, Google's index is larger by a factor of 2.78.

Now back to normal search words.

"mesothelioma lawyer money" without doubt triggers a sandbox filter, if one exists. Now we have 157,000 results on Google and 111,000 on Yahoo: a factor of 1.414 between the two engines.

"information for travel to europe": 17,000,000 results on Google, 12,000,000 on Yahoo. Factor 1.417. BTW, who is ranking first on Yahoo :)

"mortgage problem solving": 219,000 on Google, 145,000 on Yahoo. Factor 1.510.

So for three money-word queries in three niches, the indexes of Yahoo and Google have an average relative size difference of approximately 1.447.

On some other queries with other words:

widget: 4,230,000 to 931,000: factor 4.544
Shakespeare: 14,900,000 to 7,990,000: factor 1.865
Oak tree: 6,710,000 to 2,820,000: factor 2.379
restaurant: 77,100,000 to 51,000,000: factor 1.512

I tried some other keywords too. The interesting thing is that the index factor is close to 1.5 for some words and higher for others, averaging a factor of 2.5 to 3.0, the estimated size difference between the two search engines.
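As a sanity check on the arithmetic, here is the whole computation redone from the counts quoted above (these are the figures from this post, not live query results):

```python
# Google count / Yahoo count for each query, using the figures quoted
# in this post. Pure arithmetic check, not live data.
counts = {
    "mesothelioma lawyer money":        (157_000,       111_000),
    "information for travel to europe": (17_000_000, 12_000_000),
    "mortgage problem solving":         (219_000,       145_000),
    "widget":                           (4_230_000,     931_000),
    "Shakespeare":                      (14_900_000,  7_990_000),
    "Oak tree":                         (6_710_000,   2_820_000),
    "restaurant":                       (77_100_000, 51_000_000),
}
factors = {q: round(g / y, 3) for q, (g, y) in counts.items()}

money = ["mesothelioma lawyer money",
         "information for travel to europe",
         "mortgage problem solving"]
avg_money = sum(factors[q] for q in money) / len(money)  # near 1.447

# the "is" experiment: overall relative index size estimate
size_ratio = round(3_830_000_000 / 1_380_000_000, 2)  # -> 2.78
```

The three money-word factors average out near 1.45, while the broader keywords land well above that, which is the gap the post's two-index reading rests on.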

"Shakespeare" doesn't display ads in Google's SERPs; "Oak tree" doesn't display ads in Yahoo's SERPs, but ads are present in Google's SERPs, at least when viewing from the Netherlands.

So which conclusions are possible?

  1. The Google index is split into two indexes. The primary index is about 1.5 times the size of Yahoo's. Both indexes together are about 2.8 times larger than Yahoo's.
  2. The secondary index seems to be totally discarded for a specific set of words. In those cases the index difference factor is near 1.5.
  3. The decision to use one or two indexes is not directly based on ads displayed on the SERPs, as I see use of the secondary index where Google shows ads (oak tree) and use of the primary index only where there are no ads on Google but there are ads on Yahoo (Shakespeare). So the selection of one or two indexes is not directly triggered by displaying ads.
  4. Google also uses the secondary index for words where it would not be necessary. For example, "widget" returns millions of results, so the secondary index adds no value at all. But it is still used.

What do I conclude from this:

  • There is a sandbox (secondary index) which is not used for a certain group of search words.
  • The use of the secondary index is not directly triggered by ads displayed on the SERPs, but rather by a separate fixed word list. This would indicate that promotion of AdWords is not directly connected with the sandbox, although there might be a connection in the background.
  • Google uses the secondary index even when it is not necessary, with millions of results already coming from the primary index. If the secondary index cost a lot of computing power, it would only swing in when absolutely necessary. So I don't think the secondary index is there to overcome a capacity problem alone.

And now the final one, the -asdf*13 trick. I tested it with a few words:

restaurant: 32,900,000
shakespeare: 8,530,000
mesothelioma lawyer money: 229,000
mortgage problem solving: 249,000
information for travel to europe: 19,400,000

Totally different results. You would expect that, if this trick caused the search to run over both the primary and secondary index, these figures would be much higher. Actually the single-keyword counts drop to roughly half, and the money-phrase counts are only slightly higher.
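The same arithmetic check for the -asdf*13 counts against the plain counts quoted earlier in this post (again just ratios on the quoted figures, not live queries):

```python
# Counts without the trick (from earlier in this post)
baseline = {
    "restaurant":                       77_100_000,
    "shakespeare":                      14_900_000,
    "mesothelioma lawyer money":           157_000,
    "mortgage problem solving":            219_000,
    "information for travel to europe": 17_000_000,
}
# Counts with -asdf*13 appended (the list just above)
with_trick = {
    "restaurant":                       32_900_000,
    "shakespeare":                       8_530_000,
    "mesothelioma lawyer money":           229_000,
    "mortgage problem solving":            249_000,
    "information for travel to europe": 19_400_000,
}
ratios = {q: round(with_trick[q] / baseline[q], 2) for q in baseline}
# single keywords come back with roughly half their counts (~0.43-0.57),
# the money phrases slightly more (~1.14-1.46)
```

If the trick really unioned both indexes, every ratio would be at least 1.0; two of them being well under 1.0 is the "totally different results" point above.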

So, in my opinion, the Google guys just added -asdf*13 to their index as a gadget to fool SEOs, and they are reading this thread and laughing at all the people who think -asdf*13 shows the real results.

Sandbox? Yes.
Secondary index only used when technically necessary? No.
Sandbox intentional? Yes.
-asdf*13 is reality? Doubtful.

This 367 message thread spans 13 pages.