This 354 message thread spans 12 pages.
|Why does the 'Google Lag' exist?|
Trying to understand its purpose.
I had some in-depth discussion this weekend with some friends about the sandbox. Every theory on how to beat it kept coming back to one central problem - no one is sure why it exists.
I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.
I don't buy the explanation that it's intended to be a method of stopping spam. Why? One, it's doing too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that those spammers already know multiple ways of beating the sandbox, the explanation doesn't hold up.
So, why does the sandbox exist?
The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?
|Also I have not seen any site come out of it without a dmoz link. |
Untrue, I had one come out in May without a DMOZ link.
>Suppose you had a 1200 page website, but for some technological reason only 1000 pages could appear on the Internet at one time. What would you do? Not publish the newest 200 pages, or not publish what you consider the 200 worst/weakest pages? If Google has a capacity problem, why not just remove all PR0 pages from the index?
Because if it was a capacity issue, Google would want to omit the pages most likely to be problematic from the index. If they did what you suggest above, new pages on established news sites wouldn't be indexed, because pages start out at PR0 when they are first found. This would make the index look really stale. A spammer can easily get his site above PR0 with a few decent PR links from other sites he controls. The spammer modus operandi is to keep tossing up new domains, and as the older ones are zapped, the new ones take their place. Thus, to deal with these spammers, Google decided that the worst/weakest sites were likely the newest ones.
Just had a site (not listed in DMOZ yet) that popped out of the blue for some really serious design searches - not regional. The site is PR6, and the thing that seemed to do the trick was tweaking really old (pre-sandbox) links together with on-page content optimization.
The same thing seems to work, but a lot slower, with a PR5 site. It also looks to me that links from May through early June are just beginning to be factored in... with lower PR5 and PR4 sites it's all quiet yet...
Which leads to thinking - can it be that the length of the lag is somehow inversely proportional to your absolute PR? It also looks like pre-sandbox links, or links that have stayed unchanged at the same URLs for over a certain time, are treated a lot better (this has been mentioned quite a few times here before), and their degree of 'reputability' and of being factored in is proportional, in a variable way, to the PR of the pages they lead to. From what I saw, this might also depend on multiple other factors: the PR of the page linking to you, its domain's 'reputability', surrounding text relevance, etc.
Has anyone had experience with older established sites creeping out of the sandbox by similar means?
>Just had a site (not listed in DMOZ yet) that popped out of the blue for some really serious design searches - not regional. The site is PR6, and the thing that seemed to do the trick was tweaking really old (pre-sandbox) links together with on-page content optimization.
There has been a lot of speculation, and some evidence, that it isn't new sites that are sandboxed, but in fact new inbound links. If you think about it, in the case of genuinely new domains, these will just have newer links, as people don't link to domains that aren't even registered yet. (And presumably Google would just ignore links to non-existent domains.) Thus, if it is new links that are sandboxed, to a casual observer it might appear to be based on the newness of the domain.
Your experience would be consistent with it being new inbound links that are sandboxed. Because you did have some really old links, your site was never fully sandboxed. By tweaking those links plus on-page content you were able to get your site to rank well. PR6 is pretty solid. With old links good enough to get this site to PR6, tweaking both the links and the on-page content could be enough to get decent rankings.
Whatever the sandbox is, the theory of starting out a new domain with minimal content and some solid inbound links has appeal. Don't do this, and when real content is added to the site it won't go anywhere because of the sandbox. However, just let the site age until it is "ripe", then slap up the content and tweak the links, and it will soon rank well.
Thanks rfgdxm1 - my thoughts entirely regarding letting the new site 'ripen up' before slapping the content up the pages.
|However, just let the site age until it is "ripe", and then slap up the content and tweak the links, it will soon rank well. |
It would depend on your definition of "soon". One site that I created for a small consultancy has lots of original and interesting content. It is as good a resource as any on the subject matter but it is nowhere to be seen after six months.
> Do others agree that there was generally only one period in which sites were allowed in from the sandbox? If so, were all sites allowed in or just a few? Have there been other times when sites were allowed in?
I recall FTPing some newly created single pages, not whole sites, and indeed that must have been mid-June or so. They did get PR, while all pages created since then did not (the index.html page has been in Google's index for four years now). I can't tell about Feb-Apr because I was off then, but if you say the sandbox phenomenon dates back to March, then yes, there must have been a short exceptional period.
<<Do others agree that there was generally only one period in which sites were allowed in from the sandbox?>>
Yah, sometime in May, sites that were being boxed got loose which led many of us to believe there was simply a 2-3 month holding period. Since then it is as if time has stood still!
One quick point. Many here keep stating that G is fighting commercial spam. Besides the obvious fact that this measure has ZERO impact on the existing SERPs and the really professional SEO/spammers have found ways around it anyway, countless non-profit sites are in the box as well...
Is there any reason to believe the supplemental index is a 32-bit process based index, or is it possible the supplemental index is a 64-bit based index? Because if the supplemental index is 32-bit, then it too would be full at 4,285,199,774 web pages (or 99.8% of maximum capacity), and it would be full in less than 2 years*. So, wouldn't it make sense to go to a 64-bit system, even if they stay with the 5-byte index for the sake of reducing the size of the iterated matrix?
* Google had 1B pages indexed in 2000 and 4B last year; that's a growth rate of about 2^(1/2) per year. The number of indexed pages will double every two years - perhaps the Moore's law of indexable information.
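Treating that theory as arithmetic, here's a quick back-of-the-envelope sketch. The 1B starting figure and the doubling-every-two-years rate are just the assumptions from the posts above, not anything Google has confirmed:

```python
# Speculative check of the 32-bit index theory: if document IDs are
# 32-bit integers, the index tops out at 2**32 entries.
capacity = 2**32  # 4,294,967,296 possible document IDs

# Assumed: ~1B pages indexed in 2000, doubling every two years,
# i.e. growth by a factor of 2**(1/2) per year.
pages = 1_000_000_000
year = 2000
while pages < capacity:
    pages *= 2**0.5
    year += 1

print(year)  # → 2005: the year a 32-bit index would fill, under these assumptions
```

Which would put the crunch right around now, consistent with the timeline people in this thread keep pointing at.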
When I mentioned DMOZ in my last post, I did not mean to imply that DMOZ carried any extra weight or was a way out of the sandbox. I was instead trying to say that the site had some decent incoming links, one of them being DMOZ.
From the latest posts I have read, there generally seems to be a consensus that the sandbox started sometime early this year, maybe around February or March. There was also only one period, around May (or June), when all sites were allowed in. If others do not see it this way, please let us know what you think and why.
If all sites were allowed in, this would appear to have little to do with spam fighting.
Perhaps what we are seeing is something like this. Google has an index that got full. They have a second index that newly crawled sites go into since the first index is full. Their search calculations cannot work across both indexes. At certain intervals, they allow sites from the second index "into" the first index thereby making those sites available to appear (in earnest) in their search results.
Obviously this is just a guess. I don't have reasons to give as to why their search calculations cannot work across both indexes and how they let sites from the second index into the first. I know it has been proposed that sites in the first index could be removed if pages were no longer available or if they were found to be of low quality but I am not sure if I buy into that theory.
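That two-index guess can be sketched in a few lines. To be clear, everything here (the capacity, the merge policy, the names) is invented purely to illustrate the theory, not Google's actual design:

```python
# Speculative model: a full "main" index plus an overflow index for
# newly crawled sites. Ranking only runs over the main index, so
# overflow docs can't appear in results until a periodic merge
# promotes them when space frees up.

MAIN_CAPACITY = 5    # toy capacity standing in for the 2**32 limit

main_index = {}      # url -> page text, fully searchable
overflow_index = {}  # url -> page text, crawled but effectively "sandboxed"

def add_page(url, text):
    # New pages spill into the overflow index once the main one is full.
    target = main_index if len(main_index) < MAIN_CAPACITY else overflow_index
    target[url] = text

def search(term):
    # Searches consult only the main index; overflow pages never rank.
    return [url for url, text in main_index.items() if term in text]

def periodic_merge():
    # When slots open (e.g. dead pages pruned), promote overflow docs.
    while overflow_index and len(main_index) < MAIN_CAPACITY:
        url, text = overflow_index.popitem()
        main_index[url] = text
```

Under this model a site "comes out of the sandbox" only when a merge runs and space happens to be available - which would fit the one-time May/June release people are describing.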
|They have not beat the sandbox! |
We have a phrase for comments like this one...we call 'em "the world is flat comments". :-)
|This knocked out the single largest short term threat to G's future quality...not a small thing with an IPO and attendant scrutiny on the horizon. |
3 months ago I'd believe you. Now, the IPO seems like a cozy wave-away excuse much like the "it reduces spam" line. ... I'd believe "capacity issues" before "spam fighting". ... But, if capacity issues are the real reason, I seriously doubt G would take 8 months to fix it. Capacity issues would be, I would consider, a major "drop everything now" type thing. Also, they would see something like that coming - the growth of the web is fairly linear.
The thing is, not all major decisions or events at companies are linear, or even planned. What if the capacity issue intersected with resource calls they had to make? Being mindful of the spam issue, they found that:
--the lag had the interesting side effect of discouraging newbies from blasting out spam sites and embarrassing G
--the public didn't see or care about any differences they were seeing in the SERP's.
--their SOM stayed constant.
Suddenly, and especially with the IPO coming, any incentive to move quickly to upgrade systems was largely nullified. And BTW, none of this precludes them from still working on algo changes to continue to fight spam algorithmically.
Plus, if they are planning an entirely new approach to managing their SERP's, based on some of the new areas they've delved into, then this gives them needed time to get it all right before launch.
|founders' and management comments on info versus commercial sites; |
Naw, caveman. The Google "nice guy" line worked two years ago. I don't believe it anymore. There are a lot of good people working at Vendor G now, but let's face it; the minute they went public, their management ceased to be a bunch of guys concerned with changing the world. The "new" management is the American economy, and the American economy demands profits.
Actually, I never believed that the founders' bias towards info sites over commercial sites was a 'nice guy' thing. I thought it lacked an understanding of the real world, in which people actually do search for information on commercial goods and services.
<<< founders' bias towards info sites
caveman, that last post is about as sensible a thing as I have read here in the last year.
I know from doing info sites that the reason it's so darned easy to get stuff into the SERPs is that the content is chock full of info. As soon as you start writing 'about' something, you have to use different words, phrases, etc; the more there are, the more there is to find, and what the Google algo excels at is extracting, you guessed it, relevant 2-3 word info phrases. So it's not so much that an info site is favored; it's that it has more real information on it, which, as you note, is what most searchers are looking for.
Sort of going along with that 'what' thing, Google is good at pulling out 'what', not 'why'. Once I learned this, it became very easy to write FOR Google. Now when I do a posting on WebmasterWorld, especially in the HTML and CSS forums, I often consciously decide whether I will include good-SERP, 'what'-filled content, or go very vague with zero-SERP-result content... a very sad comment on how information is gotten now, but it's the way it is.
Gee thanks isitreal. :-)
FWIW, WRT links, I don't know what role "newness of links" plays, but I can tell you that we have several new sites launched in March that are doing OK. New domains, new links. New links may play a role in this, but they do not necessarily *hurt* a site.
It's also worth remembering that while PR is all about links, G's algo is all about patterns. And then there are those pesky filters. And perhaps most important, ultimately, these things all exist to help determine measures of *quality* as seen through the G lens. ;-)
IMHO, there have been at least 20 or so posts in here about what it takes to get past the sandbox. I think people intuitively know, but just aren't getting it done. This work is not getting any easier. Certainly, the sandbox has contributed to that, which can't be a bad thing from G's POV. As has already been noted, there aren't many webmasters anymore boasting about how easy it is to game G. :-/
They have not beat the sandbox!
|We have a phrase for comments like this one...we call 'em "the world is flat comments". :-) |
I have a phrase for people who say things like, "I beat the sandbox". Prove it or don't slag me off.
|the lag had the interesting side effect of discouraging newbies from blasting out spam sites and embarrassing G |
|--the public didn't see or care about any differences they were seeing in the SERP's. |
After eight months (and counting) of virtually no new content on Google's version of the Internet it's only a matter of time until the public (and hence the press) catch on.
|Actually, I never believed that the founders' bias towards info sites over commercial sites was a 'nice guy' thing. I thought it lacked an understanding of the real world, in which people actually do search for information on commercial goods and services. |
This was not a 'nice guy' thing. Remember that commercialism was a side effect of the Internet, which was not designed to be a commercial entity. The Internet, you may remember, used to be referred to (still is?) as 'The Information Superhighway'. Not the 'Yellow Pages Superhighway'. I am involved in a business, I even have an AdSense account, but I still believe that information sites should be preferred to commercial ones.
>Prove it or don't slag me off.
And as Jake suggested, just because you can't see it doesn't mean it's not happening around you. :-)
Yours was the kind of comment that tends to put off those who, while not willing to share trade secrets, might be willing to at least offer some information or point people in the right direction.
When people try to help, if you can't be nice, perhaps you shouldn't say anything at all. :o Or, at least try to be contributory.
My partner was right...should just keep my mouth shut...
|these things all exist to help determine measures of *quality* as seen through the G lens. |
Although I believe this is part of what they are trying to do, my research indicates that this is not part of the sandbox phenomenon.
|we have several new sites launched in March that are doing OK |
Who woulda thunk that a 7-month-old site's ranking in Google would become newsworthy?! :)
<<<< just because you can't see it doesn't mean it's not happening around you.
Yep, seeing the same thing on other topics: claims that something can't be done, then we do it, and it works fine. Just to be clear, however, you are talking about keyword searches with multiple thousands [or more] of results, correct?
Earlier 'lag' threads said ALL sites; this was easily disproven by putting up a site with niche-type keywords and not having it sandboxed. Which meant that the 'lag' was not a generic event applied to all new domains, but the result of a filtering process of some type that determines which sites get placed in it. And a filter has holes; that's why I don't tend to disbelieve the claim that it can be gotten around. Hackers always laugh when someone says: my system cannot be hacked.
>Earlier 'lag' threads said ALL sites; this was easily disproven by putting up a site with niche-type keywords and not having it sandboxed. Which meant that the 'lag' was not a generic event applied to all new domains, but the result of a filtering process of some type that determines which sites get placed in it.
Yeah. I don't quite get why that has not been noted more often. Clearly the sandbox is not universal; that is easily seen. So it should not be a hard leap that it is either algorithmic or filter based or both. So it should not be a hard leap that not all new sites (or searches related to those sites' pages) are sandboxed. But that last leap seems to be hard for some. :-/
I've said before, we find it useful to think of this as a tough algo with tightened filters, for which certain hurdles need to be met or exceeded. Also, less than a third of our new sites passed muster, so far, and we can't exactly say why, though we have theories. We can only say that some have, and some have not. I think mfishy said something about it being perplexing.
And this: on a sandboxed, rebranded, 301'ed site, the original material from the original domain name is sandboxed heavily (by URL, I am assuming), but new material is ranking fine.
I experimented with this deliberately by adding content pages that were so specific that they would show in serps after the rebranding if new pages were escaping the sandbox, and they do. They started showing after about 1 month. So it's not an absolute site / domain type thing either in all cases.
However, a rebranded site is a different thing from a brand new site, but I think it does demonstrate some of the processes going on behind the scenes; 'perplexing' would be the keyword here :-¦
My suspicion is similar to something said earlier: what I'm seeing is a duplicate-content type thing happening, except that the original source is gone - it doesn't exist anymore except as a 301 directive - but the system is so slow to update itself fully that things like this are falling between the cracks?
This ties in to what I said earlier about 3 or 4 pages, several of which have not existed for about 1 year now, showing up in a site:originaldomain query. There does not appear to be a single index working, and if there is more than one, the integration between them seems to be flawed.
Could it be that there are just too many hacks being applied?
<<<<<But I still don't believe the capacity issues, guys. I just think Google is smart enough to see something like that coming.
BakedJake, MS has been working on their new file system since NT 4. They have a lot of smart people. It was supposed to ship with NT 5. Then it was supposed to be available in Longhorn. That's a very long time. And they still can't get it working, with a $5 billion or so a year research budget. It's not a matter of being smart enough; it's a matter of the problem being very hard to solve, I think, and events moving faster than they thought they would. I'm running Yoper with Reiser4; certain things have more freedom to move fast than other things, depending on how stable you need the processes to be. Google can't have a system-wide failure - it's out of the question.
|My partner was right...should just keep my mouth shut... |
Yes - perhaps she recognised that you started the slagging :)
Look, after about 350 posts I don't think we are really any closer to determining why the 'Google Lag' exists. It has been said before and I will say it again. It is highly unlikely that this is intentional. All the indicators are that this is a defect in Google.
The Internet in its present form is only what - perhaps six or seven years old? Why would any right-minded SE think that denying their clients access to up to 10% of available sites was a valid action, especially when these sites are the newest and most current?
|Look, after about 350 posts I don't think we are really any closer to determining why the 'Google Lag' exists |
I felt the same as you back around post 200 or so, but this bit in the last few days with re5earcher has been rather interesting. I still believe the IPO played a role, just not as much as I previously thought.
<<< I don't think we are really any closer to determining why the 'Google Lag' exists
I feel like it's closer, can't speak for anyone else, other elements too have been educational.
I agree, there has been a lot of talk and good ideas thrown out about something that is clearly very significant. I can say I have benefited quite a bit by the thread.
Although all my new sites are still at the beach.
|The 32 bit limit is immutable in the hardware. |
Sorry, 32 bits is the size of the address bus for Pentium and older ia32 chips, and only limits the amount of memory addressable. Pentium Pro and newer chips have 36-bit address busses supporting up to 64GB of addressable physical memory.
The data bus reached 64 bits with the Pentium processor:
The main registers are still 32 bits, but internal data paths of 128 and 256-bits have been added to speed internal data transfers, and the burstable external data bus has been increased to 64 bits.
The Pentium Pro is able to pull 64-bit chunks of data straight off of its caches:
The power of the Pentium Pro processor is further enhanced by its caches: [...] a 256KByte L2 cache that's in the same package as, and closely coupled to, the CPU, using a dedicated 64-bit ("backside") full clock speed bus.
All of Google's ia32 equipment can handily shuffle about 64-bit values with no problem. By now they're probably retiring the last of their Pentium III chips. (Note that the MMX instructions use 64-bit registers.)
You'll also want to check out UFS2, which is a 64-bit file system used on FreeBSD, OpenBSD, NetBSD, and other BSD-derivative operating systems. The 64-bitness of the filesystem has no noticeable impact on speed.
UFS2 was developed and polished in about the same amount of time since I first heard of the '32-bit problem' theory. Since UFS2 was developed in part by Marshall Kirk McKusick, the author of UFS and FFS (a really smart guy who knows his filesystems), and GFS was developed by a team of equally (or maybe even not-quite-as) smart guys, I doubt that this is a serious constraint for Google. In fact, I doubt that 32-bitness was ever a problem in GFS, since they knew it was going to be big all along.
PS: UFS2 probably had larger problems with UFS filesystem compatibility than expansion to 64 bits.
PPS: You don't need to jump to 64 bits from 32 bits. 36 bits is probably more than enough, and gives you a large on-disk savings.
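To illustrate that on-disk savings point: two 36-bit IDs pack into 9 bytes, versus 16 bytes for two 64-bit IDs - a 44% reduction. A minimal sketch of such packing, assuming nothing about GFS's actual format:

```python
# Hypothetical 36-bit document ID packing: two IDs per 9 bytes,
# versus 16 bytes for two 64-bit IDs.

def pack_pair(a, b):
    """Pack two 36-bit integers into 9 bytes (72 bits)."""
    assert 0 <= a < 2**36 and 0 <= b < 2**36
    combined = (a << 36) | b       # concatenate the two 36-bit values
    return combined.to_bytes(9, "big")

def unpack_pair(blob):
    """Recover the two 36-bit integers from a 9-byte blob."""
    combined = int.from_bytes(blob, "big")
    return combined >> 36, combined & (2**36 - 1)

# Round-trip the largest 36-bit value alongside a small ID.
blob = pack_pair(2**36 - 1, 12345)
assert len(blob) == 9
assert unpack_pair(blob) == (2**36 - 1, 12345)
```

The price is the shift-and-mask work on every read, but that's cheap next to disk I/O, which is presumably the poster's point.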