Forum Moderators: Robert Charlton & goodroi
If your site is less than a year old you are likely sandboxed.
I can't believe most sites under a year's age are in some sort of penalty box. Google would be useless. So, I want to know:
1. Are all sites sandboxed, or do certain traits (like affiliate links, low content) trigger it?
2. How long does it last?
3. How variable is the duration?
4. How do you know your site is being sandboxed?
5. Does the effect taper off or is it a binary thing?
6. What gets you out of the sandbox? Is it merely time or do good links or whatever speed it up?
Thanks.
I am telling you, when I search using keywords + trademark (which is also the domain name), my site is still nowhere. I get tons of descriptions of our product on "old" sites instead.
---if you don't like Google, build your new sites, or focus your "sandboxed" sites, for traffic from Yahoo or MSN.
Thank god I did. And I'm getting some traffic.
When I do site:mysandboxedsite.com - it's there! It's there in the Google databanks! The problem is that they choose not to use this data in constructing SERPs.
Please correct me if I'm missing something important here - but surely this is not indicative of a capacity problem...
So Google has a query system which is capable of searching both indexes at once and also capable of giving one index more priority than the other, depending on keywords. From a software point of view this is actually more difficult than a query system that treats both indexes equally all the time.
There may be problems storing all webpages in one index, but there is no query problem with more than 2^32 pages spread over more than one index. They handle this situation transparently for the searcher: normal and supplemental results can be present in one SERP.
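As an illustration of that claim, here is a minimal sketch of such a front-end: two in-memory indexes are queried together and merged into one result list, with the primary index weighted above the secondary one. The index contents, scores, and weights are all made up for illustration; this is not a description of how Google's query system is actually built.

```python
# Minimal sketch of a query front-end that searches two indexes at once and
# weights the primary ("normal") index above the secondary ("supplemental")
# one. All data and weights below are hypothetical.

PRIMARY = {"widget": [("doc-a", 0.90), ("doc-b", 0.70)]}
SUPPLEMENTAL = {"widget": [("doc-x", 0.95), ("doc-y", 0.40)]}

def search(term, primary_weight=1.0, supplemental_weight=0.5):
    """Return one merged result list drawn from both indexes."""
    merged = []
    for doc_id, score in PRIMARY.get(term, []):
        merged.append((doc_id, score * primary_weight, "normal"))
    for doc_id, score in SUPPLEMENTAL.get(term, []):
        merged.append((doc_id, score * supplemental_weight, "supplemental"))
    # One SERP, both indexes represented, primary results favored.
    return sorted(merged, key=lambda item: item[1], reverse=True)

print(search("widget"))
```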
Which leads IMHO to just one conclusion: the sandbox is there intentionally, not by accident or because of index capacity problems.
Doesn't this make the sandbox story just a wee bit more worthy of media coverage?
No. It's perfectly normal for a company to include worst-case scenarios in its prospectus. If a cruise line included a statement that one of its ships could sink, would you take that to mean they were worried about the seaworthiness of their vessels?
You have to look at the bigger picture. If G wanted to increase revenue via AdWords channels, why not sandbox ALL sites, including OLD established sites with deep pockets?
I believe that this is very obvious. If they applied it to all sites it would have been noticed instantly and it would have created really bad publicity. Doing it with new sites lets it happen gradually and eliminates risk.
If Google really wanted to put the squeeze on potential AdWords advertisers (an allegation that gets repeated here several times a day), they could simply reweight their algorithm to favor information pages over boilerplate affiliate pages, e-commerce catalog or order pages, and "scraper" directories. Such a change would be true to Google's corporate mission statement, it would improve the quality of search results in the eyes of most users, and any boost to AdWords revenues could be defended as a happy coincidence.
Google inverts the document index to a word index, i.e. for every possible search word they maintain a list of document entries that contain that word. If there were a 2^32 index limit, it would not be a 2^32 document limit but a 2^32 keyword limit. There is also no sign that their index has a 2^32 (4 gigabyte) size limit. I quote:
The raw documents comprise several tens of terabytes of uncompressed data, and the inverted index resulting from this raw data is itself many terabytes of data.
If you read this document - it is very interesting - you will see that they had just one thing in mind: scalability, not a fixed index with 2^32 entries. In 2003 they had "a few thousand machines per cluster". If you need to serve twice as many webpages, simply double the number of machines and you get the same query time.
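To make the inversion described above concrete, here is a minimal sketch of turning a document index into a word index, using a hypothetical two-document collection. Real postings lists also store positions, formatting hints and so on; this only shows the basic per-word docID list.

```python
# Minimal sketch of inverting a document index into a word index: for every
# word, keep the list of document IDs containing that word. The documents
# here are made up for illustration.

documents = {
    1: "new site about widgets",
    2: "old site about widgets and gadgets",
}

inverted_index = {}
for doc_id, text in documents.items():
    for word in set(text.split()):
        inverted_index.setdefault(word, []).append(doc_id)

print(sorted(inverted_index["widgets"]))   # [1, 2] -- every document containing the word
```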
if (sandbox != index capacity problem) then sandbox = intentional; endif
Google's goal is to make all information available through the web. If they don't see e-commerce sites as being as valuable as editorial sites, that might be a reason to give some sites better rankings than others. Older sites get the benefit of the doubt. If this is the case, then no one who is in the sandbox with a site type which Google doesn't like will ever come out, unless the site content changes. In this case the sandbox itself is intentional.
On the other hand, if the sandbox is there to identify and eliminate spam sites, then genuine sites - both e-commerce and editorial - would see their rankings increase after some months in the sandbox. This would be a situation where the sandbox is only a side effect of spam elimination.
We see that many sites have been in the sandbox for quite some time now, longer than I would expect to be necessary for deciding whether a site is genuine or spam. This makes me think the sandbox is indeed intentional.
... clouding the waters...
Lammert,
Google's goal is to make all information available through the web.
False. Google's goal is to make as much money as it can through AdSense/AdWords to ensure its survival.
But its reckless management may be gaming things too aggressively, and it seems well on the way to ruining its fundamentals as a search engine.
BTW:
There are a lot of once-big players out there crying about having lost developer support...
What I do like to discuss is the index theory and its relation to the sandbox. 2by4 has dominated this thread by posting about (t)his theory nearly every three messages, and I am really looking forward to reading his comments on my posts.
This is a thread where the last 150 posts were hijacked by the idea that Google has already been facing massive capacity problems for more than a year because of some sort of 2^32 issue. Unfortunately the only reactions I get are about minor issues in my posts, not about the real thing.
Sorry that I posted real world information to this thread. I guess most of you liked speculations and conspiracy theories more than the truth.
- Lammert
For those who are still convinced that the sandbox is the result of index problems because the amount of stored webpages went above 2^32, ... blah, blah
Here's another fun paper [www9.org], co-authored by a Google researcher:
"Rather than deal directly with URLs, the Connectivity Server uses a set of densely-packed integers to identify pages."
"Recall that page identifiers are a dense set of integers."
"To avoid wasting space, we pack vector records densely."
"Functions in the Connectivity Server convert between these integers and text URLs. In our work with the Connectivity Server, these identifiers have proven more convenient to handle in code than text URLs."
"Notice the use of integers to represent terms; as with page IDs in the Connectivity Server, we find these to be more convenient to manipulate than text strings."
"The page ID of the vector is stored in the first 4-bytes of the vector's record."
Now look at The Anatomy of a Large-Scale Hypertextual Web Search Engine [www-db.stanford.edu] that describes the original Google architecture.
Look about halfway into the paper, at section 4.2.6, Forward Index. There is a diagram on the right side of the page. They show the docID as 27 bits. In the text they explain that the docID is combined with the length of the hit list. In other words, in the beginning they only had a 27-bit docID. Obviously, they had to break the 5 bits for the hit list length out into a separate byte fairly soon after that, because 27 bits gives them only 134,217,728 unique docIDs instead of the 4.2 billion we all assume.
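As a rough illustration of that packing, here is a small sketch of a 27-bit docID sharing a 32-bit word with a 5-bit hit-list length, and the capacity arithmetic that follows from it. Only the bit counts come from the paper; the exact field layout and packing order Google used are not public, so this is purely illustrative.

```python
# Illustrative sketch of the forward-index packing described above:
# a 27-bit docID sharing a 32-bit word with a 5-bit hit-list length.

DOCID_BITS = 27
HITLEN_BITS = 5

def pack(doc_id, hit_count):
    assert doc_id < (1 << DOCID_BITS) and hit_count < (1 << HITLEN_BITS)
    return (doc_id << HITLEN_BITS) | hit_count

def unpack(word):
    return word >> HITLEN_BITS, word & ((1 << HITLEN_BITS) - 1)

print(unpack(pack(100_000_000, 17)))   # (100000000, 17)
print(1 << DOCID_BITS)                 # 134,217,728 unique docIDs with 27 bits
print(1 << 32)                         # 4,294,967,296 with a full 32-bit docID
```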
There is no question whatsoever that for years Google used a 32-bit docID. The question is when did they change it, if they did?
Sure, you can have two, three, or many indexes. The problem is that you run into difficulties merging them into a coherent ranking formula unless every unique web page has a unique docID. That is exactly what we're seeing today.
Happy? You got two. The threads weren't 'hijacked', we're not talking about 'conspiracies', what is up with this kind of talk? The subject is the sandbox. All of the issues brought up are directly related to the sandbox. The sandbox is of great interest to people. That's why these threads get so long. The subject has never veered from the sandbox. Different causes and explanations for the sandbox have been offered, as well as a possible way out, useful if you are in possession of an authority-type link farm like optiplex is. Some posters insist on maintaining that a system that is increasingly starting to look very broken is not broken. Why? No idea. I think some people need to believe in something infallible, and have chosen google to occupy this position. God works better, or Buddha, but if google does it for you, go for it.
lammert, that's a good article. But it doesn't really say anything that new, although it does make clear one thing:
google will not use hardware that is not cheap, which explains the lag in fully upgrading the systems, I think. Price/performance/heat is critical to them. The prospectus quote given shows extremely clearly that not only is google planning on updating their systems this coming year, or next at the latest, they do not have total faith that the process will succeed. Just how much concrete evidence is required to get this point home? How much failure will it take until people in general will admit it? Obviously far more than I thought. That is not a standard disclaimer, that's a very carefully written passage that is covering google's as$ against the lawsuits people suggested. Buyer beware.
it's the docIDs that are the restricting factor, if I remember correctly. If you read back through that you'll see it. And it's the docIDs that were written with 2^32 limitations. That's why, if you were watching, google's home page count sat stuck at that number all last year. Since this fact was self-evident I find it hard to argue it with anyone unless they can offer a reasonable explanation of why that number was stuck there. Many people have offered such explanations, but for some reason they just won't stick.
The terabytes of data are the actual content that is retrieved using the primary index; someone correct me if I'm wrong here, I'm sure you will. There's no restriction on how much data they can store; there's a restriction on how many documents can be indexed within a single index.
So let's see where we are now: google's prospectus has no meaning. The stuck pages-indexed count has no meaning. The overnight doubling of the index has no meaning. Google representatives telling someone directly that they run 2 main indexes has no meaning. I have to conclude that there is nothing anyone can say to some of the posters on WebmasterWorld to get them to comprehend the reality that is sitting in front of their faces. Maybe google is doing more PR work than I first thought, either that or some people just didn't learn any critical thinking skills in school.
<added>Scarecrow, LOL, we posted at the same time.
[edited by: 2by4 at 9:54 pm (utc) on Feb. 1, 2005]
Well put. 1. Hard to tell; they say the insane think that they are sane, so you'd have to ask someone else.
2. About the opposite of what optirex recommends.
I had to rebrand it anyway, so I figured, hey, what better time than now, I want to really see the sandbox first hand, LOL, and I am seeing it. Why not, traffic was jumping up really fast, everything I put on it was hitting top 10 in a few days, what better time to do a rebranding... Actually, I really thought that the 301's would cut down the wait as google realized it was the same site, but no such luck.
Again, when I did that I didn't expect that Google was as seriously broken as I'm now starting to realize it is. I was looking at maybe 4-6 months, but now I realize that it's not a question of time; it's a question of when google upgrades to the new version, which I'll call google II. I've sacrificed this site before; oddly enough, both times google has been directly at fault for that, but in two totally different ways.
[edited by: 2by4 at 10:09 pm (utc) on Feb. 1, 2005]
I tried it and my sandboxed site is in the top5! Just like MSN/Yahoo!
I just about fell off my seat in shock!
There is my sandboxed site staring back at me.
I normally can’t find my sandboxed site in the Google SERPS unless I include the brandname/url in the search phrase.
This post is WebmasterWorld working at its best, as many people are sharing helpful little bits of the puzzle.
Two good articles you came up with. The first is however IMHO not relevant for the Google index discussion, because in the last paragraph it says: "[...]at Compaq Computer Corporation's Systems Research Center, which is where the research described here was done." So I see no connection with Google. Also, all testing was done on AltaVista and Yahoo data and on Alpha hardware, not on Google data and PCs.
The second document however is without doubt related to the current Google search algorithm. You are right that the 27 bits used for the docID only give room for some 100 million documents. So they must have changed this approach well before 2003. If it was possible to move from a 27-bit-wide docID to, let's say, a 32-bit-wide one first (necessary for 4G documents) and reformat the data records for this new field size, there is no doubt they could also expand it further (64 would be the most logical step looking at their x86-based architecture). And such a size increase of just one field can be done overnight by taking one cluster off line, reformatting the records and then bringing the cluster online again. The main search algorithm is not affected severely by a field size increase alone.
As queries can display a combination of sandboxed and non-sandboxed items, they have without doubt increased the docID to more than 32 bits. Even if they use separate indexes with a maximum of 4G docs each (which is possible because of the distributed design), there still has to be an ID for the index. In the case that they didn't increase the docID but instead added an indexID, they effectively created a global docID represented by the pair (docID, indexID). And this global docID is still a unique representation of each document.
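A small sketch of that pair idea, purely to illustrate the post's reasoning: if every index keeps its own local docID space, the combination (indexID, docID) still identifies each document uniquely. Nothing here is a known Google structure; the names and values are made up.

```python
# Sketch of the "global docID" idea: the pair (index_id, doc_id) uniquely
# identifies a document even if each index reuses the same 32-bit docID space.

from typing import NamedTuple

class GlobalDocID(NamedTuple):
    index_id: int   # which physical index the document lives in
    doc_id: int     # 32-bit ID local to that index

# Two documents with the same local docID in different indexes stay distinct.
a = GlobalDocID(index_id=0, doc_id=12345)
b = GlobalDocID(index_id=1, doc_id=12345)
print(a == b)   # False
```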
For searching there is no implicit docID field size problem involved. The docID is just an ID, a number of any size used to reference a record. The search algorithm itself is not dependent on the size of the docID because the algorithm searches on words and returns docIDs. Only for PageRank calculations must every document be indexed separately, but PageRank is a value assigned to each page, not to a search query, and can therefore be calculated off-line in batch runs.
2by4:
First of all I don't believe the counter at the homepage of Google. They just give it a value for political reasons, like starting discussions here :)
I have recently added many pages and they all showed up in the primary index within 24 to 48 hours. They are in the top 10 for competitive keywords, but yes, I have an old site, existing even before Google existed...
Don't get me wrong. I do not say there is no sandbox. Seeing all the people having problems getting new sites to a reasonable position in Google is a clear indication that the sandbox exists. What I wanted to say is that this is not because of a technical problem; Google has reasons to do this. Either to discourage scraper sites, or to boost information sites over commercial sites, or, as some people here have suggested, to boost the use of AdWords. And IMHO they have the right to do this. It is their index and we don't pay for the use of the index. We only complain when our own site doesn't rank in the top 10 listings.
But please see everything in the right perspective. Google started with a 27-bit index, not 32. So if they were capable of upgrading from 27 to 32, they are capable of upgrading to any ID size they want. How many PhDs are working there? Do you really think it will take 50+ PhDs more than one year to increase the size of an identifier? LOL
If you think that no piece of information coming from the Googleplex should be taken seriously, I think we have come to the point of religion, not science, and I don't want to go into a religious debate. You are free to think what you want. Because all your assumptions are based on this one thing - "Google doesn't tell the truth" - it is very difficult for me to find arguments you are interested in hearing.
Hmm. Why? I mean, why without a doubt? Why was 32 bits chosen? What type of systems are running this? The choice of 27 makes sense for a smaller data set, and moving to 32 on 32-bit machines makes sense. I wish I had a better background in hard-core DB number crunching; there used to be guys here who did, and they'd usually fill in the details, sigh...
In the original thread on the capacity issue, one or two guys who obviously knew how to work with very large systems, and who understand these questions much better than I do, did not agree that doing that type of upgrade is easy or trivial. I'd suggest that the text of the prospectus really admits this very clearly: it's not easy, and they haven't done it yet.
Re believing or not believing what google says: I don't believe or disbelieve them; their words have about the same value as the words of any other large business enterprise. Sometimes they will be true, sometimes half true, and sometimes flat-out lies. However, you'll note, I do believe some of the things they say, for example the text of the prospectus, which has to be true for legal reasons, I believe. Personally, I think the index page count was a vestige of Google really not wanting to be evil; I see it as sort of a victory of internal honesty for that time. They would have been better off simply removing it IMO, but they didn't. I think there are googlers who really took the whole "don't be evil" thing seriously; I suspect now the ones who got good stock option packages are willing to retreat from that position a bit, like most people would if offered the same level of wealth.
"I have recently added many pages and they all showed up in the primary index within 24 to 48 hours. They are in the top 10 for competitive keywords, but yes, I have an old site, existing even before Google existed..."
To say the primary index is full is like saying a glass is full on a sunny day. Water evaporates, sites and pages come and go; it's not totally full, it's just pretty much full. But even there, I've seen radically longer lags than I used to when adding large blocks of new pages, also the subject of some past threads here. Why? To me it looked for all the world as if the spider was just waiting for permission to grab them. PR obviously would affect a site's priority; WebmasterWorld, for example, will get all its pages in fast. But smaller sites, lower-PR sites, also used to get all their pages in fast, and that's not the case anymore.
"So if they were capable to upgrade from 27 to 32 they are capable to upgrade to every ID they want. How many PhD's are working there? Do you really think it will take 50+ PhD's more than one year to increase the size of an identifier? LOL"
I'll admit something: I grew up around PhDs, and I don't have the faith you seem to have in that label. Upgrading to 32 bits in a 32-bit system, yes, that seems quite simple. Upgrading to a 40-bit ID within a 32-bit system is not the same game. If I knew the processing side a bit better I'd be happier, I'll readily admit. Someone a while ago pointed out that the newer Intels can address more than 32 bits of memory space, but I don't think that's the actual roadblock; I think it's more basic than that.
Microsoft routinely fails to do things. They have lots of PhDs, and a research budget that I think is usually higher than Google's gross income, though they just cut it in half. Their PhD collection failed to deliver their new filesystem, due in NT 4, then in 2000, now in Longhorn. Many PhDs do not equal success. In fact, sometimes they equal failure.
But a thoughtful post nevertheless, much appreciated.
Hasm, very interesting find with your keyphrase x16 – although I’m still not sure exactly what it means…
I think you were referring to Mahoogle's post, matt.
I tried this out and, sure enough, up popped my site on the first page. Interestingly the number of returned results increased by over a million also.
Do you really think it will take 50+ PhD's more than one year to increase the size of an identifier? LOL
The docID is not a minor part of the overall scheme of things. The inverted index consumes a lot of space. Moreover, it's the first index that has to be consulted for every incoming query (well, they probably cache the results for Britney Spears, but you get the idea). Given that it's consulted so frequently, this means you need multiple copies of the inverted index in a distributed system to maximize your throughput. The docID is used, on average, twice per word per web page. That's because they use two inverted indexes, the fancy and the plain. The fancy is much smaller but they also use the docID elsewhere in the system, so let's figure two docIDs per word per unique web page. Figuring the average page is 300 words, here are the space requirements for various lengths of docIDs, given 4 billion web pages, for a single fancy+plain pair of inverted indexes.
4 bytes: 300 * 4 billion * 8 = 9.6 * 10^12 bytes (about 10 terabytes)
12 bytes: 300 * 4 billion * 24 = 2.9 * 10^13 bytes (about 29 terabytes)
20 bytes: 300 * 4 billion * 40 = 4.8 * 10^13 bytes (about 48 terabytes)
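For anyone who wants to check the arithmetic, here is a back-of-envelope reproduction of those figures using only the assumptions stated above (4 billion pages, 300 words per page, two docID occurrences per word).

```python
# Back-of-envelope reproduction of the space figures above. All inputs are
# the assumptions stated in the post, not measured values.

PAGES = 4_000_000_000
WORDS_PER_PAGE = 300
DOCIDS_PER_WORD = 2          # the "fancy" and "plain" inverted indexes

for docid_bytes in (4, 12, 20):
    total = PAGES * WORDS_PER_PAGE * DOCIDS_PER_WORD * docid_bytes
    print(f"{docid_bytes}-byte docID: {total / 1e12:.1f} terabytes")
# 4-byte docID: 9.6 terabytes
# 12-byte docID: 28.8 terabytes
# 20-byte docID: 48.0 terabytes
```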
My point is that when Larry and Sergey decided to expand the docID from 27 bits to 32 bits, they didn't say, "Heck, let's throw in an extra 32 bits in case the web increases beyond 4 billion pages." Remember, this was at a time when one billion pages on the web seemed like a wild overestimation.
It's not trivial to expand the docID -- that's all I'm trying to say.
As queries can display a combination of sandboxed, and non-sandboxed items, they have increased the docID to more than 32 bits without doubt.
The meta controller says:
IF query is This or That, THEN use index1 only.
IF query is The Other Thing, THEN use index1 up to point A, and to fill out the query, use index2.
If 95 percent of your queries can be handled by index1, then this wouldn't even be that inefficient. Remember, not all searchers are SEOs looking deep for their keywords.
This way you can use two indexes without expanding the docID, as long as you take measures to try to make sure the new documents go in only one or the other. Heck, you can purge duplicates on the fly too, if it becomes a problem.
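Translated into a rough sketch, that meta controller might look something like this. The routing rule, the fill threshold and the index contents are all hypothetical; it only illustrates how most queries could be served from index1 alone, with index2 used to fill out thin results, without ever expanding the docID.

```python
# Sketch of the "meta controller" idea: serve most queries from index1 alone,
# and only top results up from index2 when index1 comes back thin.
# The classification rule and the fill threshold are made up.

def meta_controller(query, index1, index2, wanted=10):
    results = index1.get(query, [])[:wanted]
    if len(results) < wanted:                      # "fill out the query"
        results += index2.get(query, [])[:wanted - len(results)]
    return results

index1 = {"britney spears": [f"doc{i}" for i in range(50)]}
index2 = {"obscure long tail phrase": ["doc-x", "doc-y"]}

print(len(meta_controller("britney spears", index1, index2)))       # 10, index1 only
print(meta_controller("obscure long tail phrase", index1, index2))  # filled from index2
```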
If the filters since November 2003 have shown anything, it's that they operate on the search terms in real time. And there's no longer any dispute that 1) PageRank is not calculated the way it was prior to April 2003; 2) the Supplemental Index that started in August 2003 is an entirely separate index, and the reasons for it given by Google make zero sense; 3) there's a growing URL-only problem that is very, very suspicious and, again, the reasons given by Google make zero sense; and 4) now we have the sandbox.
A number of our keywords (big money words) would position us between 10-20 in the SERPS, under some specialist sites, which I would expect, but above the dross, which I guess would be a fair placing.
Currently our site is in the sandbox, outside the top 1000, behind non-relevant dross.
What I do find amazing is that some of the sites currently ranking high under the search terms, which I genuinely believe shouldn't list above us, fall out of the SERPS on the keyword X16 search.
The index is due one mother of a major update here; the current SERPS results are out of date and it's clear to see.
The keyword X16 results are what we should be seeing. This manipulation of results will kill google off IMO. How long can it sustain out-of-date search results?
You made me wonder, man, did I post that much? Too many, I'll admit, but it's only like 1 out of 9; we have to watch the facts here, even though for some really strange reason that's not a popular thing to do around here. I used a strange method to come up with this result, it's sort of empirical: I counted.
2by4: I agree that there is real work involved in increasing the size of the docID beyond the 32-bit limit. But it is not impossible. This is even admitted by Daniel Brandt, who first came up with the 2^32 problem in June 2003, based on a single anonymous post on WW on June 7, 2003.
You asked me to be open to the facts, and yes, I believe the sandbox exists; yes, the counter on Google's homepage is irregularly updated; and yes, Google has admitted they split indexes.
But also: queries give results from both sandboxed and non-sandboxed documents, so the indexes are connected, at least at the webserver level. site:www.example.com gives results from both (all) indexes.
Now let's see it from the other side. Suppose you and Daniel Brandt are onto the right thing and I am terribly wrong.
First experiment: search for the word and. This is possible at www.google.nl; I don't know if this word is filtered in other languages. Google returns 8,000,000,000 pages. The same for a and the: also 8,000,000,000 pages. Interesting, because these words are only present in English-language documents, and not in all of them. Google didn't count, it just returned some high value to impress us :)
Second experiment: search for is. Now we get 3,830,000,000 pages. I believe this one. It seems to be calculated, not a fake value.
Now we go to a "real" search engine Yahoo. The same words:
and: 1,810,000,000
a: 2,060,000,000
the: 1,900,000,000
is: 1,380,000,000
These seem to be real-world values. If the count for "is" can be used to calculate the relative size of both indexes, Google's index is larger by a factor of 2.78.
Now back to normal search words.
"mesothelioma lawyer money" triggers without doubt a sandbox filter if it exists. Now we have 157,000 results on Google and 111,000 on Yahoo. factor 1.414 between the two engines.
"information for travel to europe" 17,000,000 results on Google, 12,000,000 on Yahoo. Factor 1.417. BTW, who is ranking first in Yahoo :)
"mortgage problem solving" 219,000 on Google, 145,000 on Yahoo. Factor 1.510
So for three money-word queries in three niches, the indexes of Google and Yahoo differ in size by an average factor of approximately 1.447.
On some other queries with other words:
widget: 4,230,000 to 931,000 : factor 4.544
Shakespeare: 14,900,000 to 7,990,000 : factor 1.865
Oak tree: 6,710,000 to 2,820,000 : factor 2.379
restaurant: 77,100,000 to 51,000,000 : factor 1.512
I tried some other keywords also. The interesting thing is, the index factor is close to 1.5 for some words and higher for others, averaging a factor of 2.5 to 3.0, the estimated size difference between the two search engines.
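The factor arithmetic in these posts is simple enough to reproduce. Here is a short sketch using the result counts quoted above (Google count divided by Yahoo count per query), including the average over the three money-word queries; the counts themselves are just the figures reported in this thread.

```python
# Recomputing the Google/Yahoo size factors from the counts quoted above.

counts = {  # query: (Google result count, Yahoo result count)
    "mesothelioma lawyer money":        (157_000, 111_000),
    "information for travel to europe": (17_000_000, 12_000_000),
    "mortgage problem solving":         (219_000, 145_000),
    "widget":                           (4_230_000, 931_000),
    "Shakespeare":                      (14_900_000, 7_990_000),
    "oak tree":                         (6_710_000, 2_820_000),
    "restaurant":                       (77_100_000, 51_000_000),
}

factors = {query: g / y for query, (g, y) in counts.items()}
for query, factor in factors.items():
    print(f"{query}: {factor:.3f}")

money_words = ["mesothelioma lawyer money", "information for travel to europe",
               "mortgage problem solving"]
print(sum(factors[q] for q in money_words) / len(money_words))   # ~1.447
```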
"Shakespeare" doesn't display ads in Google SERPs, "Oak tree" doesn't display ads in Yahoo SERPs, but they are present in Google SERPS, at least when viewing from the Netherlands.
So which conclusions are possible, and what do I conclude from this?
And now the final one, the -asdf*13 trick. I tested it with a few words:
restaurant: 32,900,000
shakespeare: 8,530,000
mesothelioma lawyer money: 229,000
mortgage problem solving: 249,000
information for travel to europe: 19,400,000
Totally different results. You would expect that, if this trick caused the search to run over both the primary and secondary index, these figures would be much higher. Actually, single keywords show only about 50% of the normal counts, and high-value keyword phrases are only slightly higher than their normal counts.
So in my opinion, the Google guys just added the -asdf*13 as a gadget to fool SEOs, and they are reading this thread and laughing at all those people who think the -asdf*13 shows the real results.
Sandbox? Yes.
Secondary index only used when technically necessary? No.
Sandbox intentional? Yes.
-asdf*13 is reality? Doubtful.