Flattening Effect of PageRank Iterations - explains the "sandbox"?
I have had my new sites rank well initially, then drop.
Here is what I think is happening, which is what I call the flattening effect of PageRank iterations.
Note the PageRank equation (sans filters) is:
PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
The first observation about this equation is that it can only be calculated after a statistically significant number of iterations.
If you analyze a site with 5 pages that all link to each other (the homepage having an initial PageRank of roughly 3.5), what you see in the first iteration of PageRank is that the homepage is PR 3.5, and all other pages are PR 0.365 – the largest PR gap that will ever exist through multiple iterations in this example.
This homepage PR represents a surge because Google has not yet calculated the PR distribution; the homepage therefore has an artificial and temporary inflation of PR (which explains the sudden and transient PR surge, and hence the surge in the SERPs).
In the second iteration, the homepage goes down to PR 1.4 (a drop of over 50%!), and the secondary pages get lifted to 0.9, explaining the disappearing effect of “new” sites. Dramatic fluctuations continue until about the 12th iteration, when the homepage equilibrates at a lowly 2.2 or so, with other pages at about 0.7.
I believe that the duration of the “sandbox” is the same amount of time it takes Google to iterate through its PageRank calculations.
Therefore, I think that the “sandbox” is nothing other than the time it takes Google to iterate through the number of calculations uniquely needed to equilibrate the volume of links for a given site.
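The flattening effect described above is easy to simulate. Below is a minimal sketch (illustrative only, not Google's implementation) of the iterative formula applied to a fully interlinked 5-page site, seeded with an inflated homepage value. The exact figures won't match the ones quoted above, but the shrinking gap between the homepage and the inner pages shows up clearly:

```python
# Iterative PageRank on a 5-page site where every page links to every
# other page. Seed values and link structure are illustrative only.

D = 0.85          # damping factor from the original PageRank paper
N = 5             # page 0 is the homepage

# links[i] = pages that page i links to (full interlinking)
links = [[j for j in range(N) if j != i] for i in range(N)]

# Seed the homepage with an inflated value, as in the example above.
pr = [3.5] + [0.365] * (N - 1)

for iteration in range(1, 13):
    # PR(A) = (1-d) + d * sum of PR(T)/C(T) over pages T linking to A
    pr = [(1 - D) + D * sum(pr[t] / len(links[t])
                            for t in range(N) if page in links[t])
          for page in range(N)]
    print(f"iteration {iteration:2d}: homepage={pr[0]:.3f} others={pr[1]:.3f}")
```

In this fully symmetric toy graph every page equilibrates at PR 1.0; the interesting part is watching how many iterations the early oscillation takes to die down.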
The SEO cynic will ask, “But my site withstood the ‘sandbox’, so it can’t exist!”
Revisiting the equation, sites CAN withstand the flattening effect of the PR iteration with optimized internal link structures (that don’t bleed PR but rather conserve it) OR with an active inbound PR feed to central distributions of PR.
[ The first observation about this equation is that it can only be calculated after a statistically significant number of iterations. ]
On the head:
That is why some sector players burn through tens of thousands of domain names in a year.
>The first observation about this equation is that it can only be calculated after a statistically significant number of iterations.
There ya go.....spend $10K to $20K on throw-a-way domains every year, while looking for those few that stick ;)
On day one of any marketing course you should be taught this......"It is all about numbers.....the more irons you put in the fire, the better chance that one will get hot"!
The PR calculation is a good theory, grant. I like that and it makes sense.
My thoughts on the sandbox run like this.
A new site enters the index near the top of the serps to see how it does. Is the site sticky? Does a visitor click on your site only to immediately hit the back button?
I think that the longer a visitor stays on your site, google sees it has value. If it has value, the site stays in the serps, if it has no value...bye bye
I also think timing plays a part in relation to traffic flow. For instance if you put up a new site about Easter, right before Easter (causing massive traffic to it) it will not get sandboxed. If however at the same time you put up a site about snowshoes, and it gets no traffic, it gets dumped.
Good observations on PR, Grant. It's refreshing to read an intelligent analysis of the Google algorithm.
Google has been using a different approach to PR calculation for a while now -- one that allows a nearly continual re-calculation of Page Rank on the back end (not shown on the toolbar). With the advent of the BigDaddy infrastructure, it looks like this newer approach was not possible until they rebuilt everything they used to have going, hence all the weird PR observations we see right now.
While PR calculation may be a part of the sandbox effect, it's not the whole thing, in my view.
So, according to your formula, is PR directly related to the sandbox? That is, sites showing a "normal" PR of 3 and above (index page) should not be considered sandboxed?
Grant, what do you think would be the effect on the index page of only having one link out to a site map of say 100 pages but every one of those pages linking back to the index page?
Grant, good post. I also read an explanation in a research paper published recently which says that fresh sites with very few incoming links fare well with Google PR.
This idea of grant's has been circulating in my brain for a bit, now, and I keep thinking about the effect of getting DEEP links, not just home page links.
Seems to me that deep links could help to short circuit the flattening effect that PR iterations might produce, especially if they were added at decent intervals. As the PR calculation cycled around the web, it would keep banging into these new links to the domain's deeper pages, and that would act like an infusion rather than a flattening.
I don't have the bandwidth to attempt a simulated calculation, but does this idea sound possible? And does anyone have the experience of deep IBLs keeping a domain from suffering the sandbox effect? I do have one example where deep links seemed to end the effect for a client.
Without any explanation of what the various symbols are in the proposed PR equation (T1, Tn, d, C), it's hard to absorb this idea. Also, I thought it was pretty well established that PR is a logarithmic scale? I don't see any logs in the equation, so I'm not sure how valid it is.
Your example is pretty specific, so may not be that representative - i.e it doesn't take into account any links from external sites.
And maybe I misunderstood, but you seem to go from saying there is a large initial gap in PR between homepage and other pages on the site, to saying the home page has a surge in PR. I don't quite follow this - an initial PR value of 3.5 doesn't sound like a very big value. Can this really explain any observed surge in rankings?
For the record, I personally don't believe any "sandbox" exists - at least, I haven't yet seen consistent, detailed explanations of what the "sandbox" is (i.e. how you can tell, definitively, whether a site is "in" or "out"), and I've yet to see any hard evidence to support the sandbox ideas.
Very interesting hypothesis Grant.
If what you say is correct, all else being equal, the effect of what we see as a "sandbox" is likely to increase in terms of the time that the effect lasts.
Also, let's not forget that this is an effect which was born pretty much overnight. In other words, assuming you're right for the moment and this is technically a PR iteration "wait", something has happened which has quite radically slowed down Google's process of iterating this PR calculation.
I can't help but wonder what that might be - it's certainly not a lack of processing power. You have to begin to consider whether actually the entire PageRank system of old has been replaced with something entirely different.....
Howard - I can't comment on your post, although I read it with interest and curiosity (my maths isn't up to scratch to fully understand the details of the formula ;)). But there is no question (the sheer volume of discussion tells you this) that some effect does exist and happens in certain circumstances. I would agree that calling it a "sandbox" is incorrect - unfortunately, we're stuck with that term now.
I think you're looking for hard evidence where none exists. All speculation about "what it is" is just that - speculation.
Sometimes when you get a bunch of people discussing what something might be, speculating, and people relaying their own experiences, you can collate enough data to make it worthy of discussion.
That's why I love threads like this - even if we never get to the bottom of the original question of "what is the sandbox", we're bound to learn something.
> Therefore, I think that the “sandbox” is nothing other than the time it takes Google to iterate through the number of calculations uniquely needed to equilibrate the volume of links for a given site.
I doubt this, because it would mean that Google runs one iteration loop every 48 hours or so, since many people reported a duration of several weeks or months for "their" sandbox. And that would tremendously violate the very nature of the iteration algo, which assumes a fixed and limited set of links as a starting point and thus has to be run all in one go. Or did I completely misunderstand what you meant?
> ...the entire PageRank system of old has been replaced with something entirely different.....
this, however, seems very likely. The question is: with what? I also believe what tedster said, namely that some continuous calculation takes place, and such an algo definitely has to be completely different from what we knew. I guess the xml-sitemap-priority-keys play an important role for that new algo, but I absolutely have no idea how that could be tested.
Interesting, however, if I understand the theory correctly, it requires the existence of a designed "sandbox" policy (squash sites until PR stabilises). I believe GG has said that it's an effect not a policy. Now, I'm not so naive as to believe everything that GG has to say, but I am inclined to believe him on this one.
Good to be talking about such things. In the old days, I was under the impression that the monthly update cycle was due to it taking three weeks to go through all of the iterations to resolve PageRank. Certainly PR is calculated differently these days, but as TrillianJedi notes, why would it slow down so much? I'm on the side of the complete replacement of PR as the basis of ranking (although I think it does play a part later).
My thoughts on the Sandbox centre around the fact that Google is no longer based on the page but on the Search Phrase. In the Florida update we could compare the two radically different sets of results using the -asdf string. The key seemed to be that the new results were not a filtered subset of the old PR based results but an entirely different set of results (obviously with some results in common but not in terms of ordering). These search phrases were termed 'money phrases' because they were mainly centred on phrases targeted by commercial sites. The number of phrases covered expanded massively in Aug 2004 and took in a lot of non-money phrases. In addition, there was a 'site filter' applied that would suddenly reduce all of a site's pages rankings massively - for all of the search phrases that Google recognised.
Thinking along these lines, you can see a possible explanation for how the Sandbox works. You create a site, at first the pages are 'folded in' to the results. At some point between a few days and a few weeks, they are given a more permanent ranking within the search phrase. It is then that I see your ideas being relevant. That calculation of 'do we trust this site' based maybe on Trustrank, nature of links, age of links, power of links, similarity of pattern of links to spam networks, or whatever is done and only then is the traditional Algo including PageRank applied. The idea that this 'trusting of sites' relates to a single iteration of some set of factors is very interesting. It used to be thought that 'search phrase' ranking (such as Latent Semantic Indexing and other similar) was beyond current computing power for a large number of phrases. But some of the people who wrote papers on such subjects ended up working at Google.
Thanks for stretching my brain a bit, this morning!
|if I understand the theory correctly, it requires the existence of a designed "sandbox" policy |
I don't think that's what Grant is suggesting (although I may be wrong). He hasn't said that there's a "filter" in place for a given set of circumstances (lack of accurate PR), but that the effect of too few iterations of the PR formula results in a much lower PR value than should exist for a given page.
So not a deliberate filter, just a consequence of the "early days" of a calculation that's not yet finished.
Interesting comments Iguana - I think you might be suggesting the existence of two indexes? We've had that discussion previously I think - back in supporters. I'll try and dig the thread up.
|Revisiting the equation, sites CAN withstand the flattening effect of the PR iteration with optimized internal link structures (that don’t bleed PR but rather conserve them) OR have an active inbound PR feed to central distributions of PR. |
Internal linking which conserves the PR sitewide gets kind of complicated if you take 5000 instead of 5 pages!
The only practical way (which I went and succeeded with) seems to be a serious amount of deeplinks plus a serious amount of authority links, which are very hard to get, even if your site is worth it!
I really don't think that the PageRank iterations could explain the sandbox effect. A couple of reasons:
- Why would this suddenly occur now? Google has been using PageRank since day one, but the sandbox is quite a recent phenomenon.
- The beauty of PageRank is that it's really simple mathematically - calculating each iteration does not take much processing power, and can be done quite easily on quite low end PCs. The PageRank is only output once the calculation has been iterated such that the PRs no longer change, so there really shouldn't be a "flattening effect". Instead, the bottleneck of search engines like Google is on the crawling side - the web is massive.
Instead, I think the SandBox effect relates to the use of domain name age and link maturation in ranking pages for competitive key terms. I encouraged two of my friends to recently start reselling my product. They have both set up websites, and are ranking reasonably well in Google for uncompetitive keywords. However, ranking for competitive keywords requires time, patience, and a lot of links from relevant authority sites.
I think tedster is right. From my understanding, PR can be calculated by the iterative technique you describe, but it can also be calculated by taking the eigenvalues of the global link matrix. Diagonalizing this matrix gives you the exact PR of every page in one step. I'm pretty sure G has been using this matrix technique for a long time.
That's a heck of a big matrix to diagonalize, but fortunately it's quite sparse.
I have long suspected that the entity originally known as "freshbot" uses some sort of iterative technique to guess PR between big diagonalizations, which could explain the initial jump effect, but this is pure speculation.
[Edit: I should note that some of the algorithms for guessing the eigenvalues of a large matrix are indeed iterative, but there are ways of doing the exact calculation and I think G has found a way of doing it efficiently with very large sparse matrices, possibly even a novel method most don't know about. In any case I strongly suspect they are using an algo much more subtle than the simple iterative formula given here.]
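The eigenvector framing above can be checked numerically on a toy graph. A sketch (NumPy; the graph is made up, and this is certainly not Google's actual method): power iteration on the "Google matrix" G = d·M + (1-d)/N converges to the same vector as an explicit eigendecomposition:

```python
import numpy as np

# Toy 4-page link graph: A[i][j] = 1 if page i links to page j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d, n = 0.85, A.shape[0]
M = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
G = d * M + (1 - d) / n                # the "Google matrix"

# Power iteration: repeatedly apply G^T to an initial distribution.
p = np.full(n, 1.0 / n)
for _ in range(200):
    p = G.T @ p

# Exact answer: the principal eigenvector of G^T (eigenvalue 1).
vals, vecs = np.linalg.eig(G.T)
v = np.real(vecs[:, np.argmax(np.real(vals))])
v = v / v.sum()                        # normalize to sum 1

print(np.allclose(p, v, atol=1e-6))
```

The two agree to numerical precision, which is the point: the iterative formula and the matrix view are two routes to the same fixed vector.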
"Also, let's not forget that this is an effect which was born pretty much overnight. In other words, assuming you're right for the moment and this is technically a PR iteration "wait", something has happened which has quite radically slowed down googles process of iterating this PR calculation. I can't help but wonder what that might be - it's certainly not a lack of processing power. You have to begin to consider whether actually the entire PageRank system of old has been replaced with something entirely different....."
I thought the same thing. Why would the sandbox take months for some? Subsequent iterations should not be slowed down. I have no clue but I had this fanciful thought that maybe the iteration gears are halted while some trustrank quotient is determined and then injected into the process.
There must be at least one other component to the sandbox effect, because we see it on some keyword searches and not others -- and PR is not related to content.
|Howard Wright: Without any explanation of what the various symbols are |
in the proposed PR equation (T1, Tn, d, C), it's hard to absorb this idea.
That is the actual published PR equation, so it isn't just a proposed equation. T1 through Tn denotes the pages that link to the page being rated.
C(Tn) is the total number of links on the page Tn
d is a constant, a "damping factor" that ensures the total of all PageRank is bounded and does not go to infinity as the number of pages grows. In the original PageRank paper it was set to 0.85.
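To make the symbols concrete, here is a hypothetical one-step evaluation of the formula (the inbound pages and their values are invented for illustration): page A is linked from T1 (PR 4.0, 10 outbound links) and T2 (PR 2.0, 4 outbound links):

```python
d = 0.85
# Hypothetical inbound pages for page A: (PageRank, outbound link count)
inlinks = [(4.0, 10), (2.0, 4)]

# PR(A) = (1-d) + d * (PR(T1)/C(T1) + PR(T2)/C(T2))
pr_a = (1 - d) + d * sum(pr / c for pr, c in inlinks)
print(pr_a)   # 0.15 + 0.85 * (0.4 + 0.5) = 0.915
```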
grant/others who think this is a viable theory: What do you suppose changed in the way that G calculates PR, around the time of spring 2004, that changed the environment such that new site "A" went from ranking well in a matter of weeks and staying there, to a situation where very similar new site "B" did not rank decently in the SERPs for roughly 12 months, give or take three months?
|That is the actual published PR equation |
This is the formula published, yes, and it is _one_ way of calculating the eigenvalues of the matrix, but it is the simplest way, and G certainly uses a more efficient formula to do the actual calculation. Also, I should correct an inaccuracy in my earlier statement: I looked it up, and all the major algos for computing the eigenvalues of a large sparse matrix are iterative; there are several major competing techniques.
Good ones converge fairly rapidly, and one of the cool things about a good algo is that you can estimate the error at each iteration, so you can make sure you run it until it converges sufficiently. At a given step you don't know exactly what the real value is, but you can give a pretty good estimate of how close you are to it.
In other words, I really doubt that G would use any value for PR, even if it is only an estimate based on fresh crawl data, that isn't close enough that you will ever notice the difference.
I don't think this accounts for the sandbox, but it's an interesting idea nonetheless.
Good discussion. Grant’s observations make sense to me, however when speaking of the ‘sandbox’ phenomenon, as some other members already mentioned, I think that PageRank is only one ingredient in the recipe. Here are some interesting reads on PageRank and TrustRank:
Combating Web Spam with TrustRank [dbpubs.stanford.edu]
Extrapolation Methods for Accelerating PageRank
Deeper Inside PageRank [meyer.math.ncsu.edu]
For even further reading I recommend following up on the work cited in the references and reading up on those as well.
This theory is easy to disprove, because only new domains are affected, not new pages on existing domains. Also, not all new websites are affected in the same way. And if this theory were true, Google would perform less than one iteration a month, whereas the calculation of one iteration should take less than a day.
|From my understanding, PR can be calculated by the iterative technique you describe, but it can also be calculated by taking the eigenvalues of the global link matrix. Diagonalizing this matrix gives you the exact PR of every page in one step. I'm pretty sure G has been using this matrix technique for a long time. |
There are better and faster methods than diagonalizing the transition matrix. Also, I prefer seeing the PR calculation (for d < 1) as a linear system of equations, and therefore a matrix inversion, not an eigenvalue problem. In this case the usual formula (given in the first post) is nothing other than the Jacobi iteration scheme for inverting the matrix. Better methods (iteration schemes for matrix inversion) are minimal residual, Gauss-Seidel, blocking techniques, and so on. Of course, all methods lead to intermediate results (while the final results are the same).
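The linear-system view is easy to demonstrate: for d < 1 the PR vector solves (I - d·W)p = (1-d)·1, where W[i][j] = 1/C(j) when page j links to page i, and the familiar PR update is exactly Jacobi iteration on that system. A sketch on a made-up 4-page graph (NumPy; toy data only):

```python
import numpy as np

d = 0.85
# Toy adjacency: A[i][j] = 1 if page i links to page j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]
W = (A / A.sum(axis=1, keepdims=True)).T   # W[i][j] = 1/C(j) if j links to i

# Direct route: solve the linear system (I - d*W) p = (1-d) * ones.
exact = np.linalg.solve(np.eye(n) - d * W, (1 - d) * np.ones(n))

# Iterative route: the PR(A) = (1-d) + d*sum(...) update, i.e. Jacobi
# iteration on the same system.
p = np.ones(n)
for _ in range(200):
    p = (1 - d) * np.ones(n) + d * W @ p

print(np.allclose(p, exact, atol=1e-10))
```

Both routes give the same vector; the iterative one just takes its time getting there, which is the whole premise of the flattening-effect idea.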
Nice notes Grant
I have a one page 'new' site that ranks really well for several very competitive terms. It's just 800 words, a phone number and a location map. It ranked well immediately and it still does 6 months later. I'm scared to touch it by adding pages, and this theory would explain its behaviour.
|Therefore, I think that the “sandbox” is nothing other than the time it takes Google to iterate through the number of calculations uniquely needed to equilibrate the volume of links for a given site. |
Bonus question: How many iterations (and how much time) does it take for Pagerank to converge?
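A rough answer to the bonus question: the iteration contracts the error by at most a factor of d per step, so in the worst case reaching a tolerance of epsilon takes about log(epsilon)/log(d) iterations (roughly 100 for epsilon = 1e-8 at d = 0.85), though real graphs often converge faster. A hypothetical count on a small random graph (illustrative only):

```python
import random

random.seed(0)
d, n = 0.85, 50

# Toy random graph: each page links to 3 distinct other pages.
links = [random.sample([j for j in range(n) if j != i], 3) for i in range(n)]

pr = [1.0] * n
iterations = 0
while True:
    new = [(1 - d) + d * sum(pr[t] / len(links[t])
                             for t in range(n) if page in links[t])
           for page in range(n)]
    change = max(abs(a - b) for a, b in zip(new, pr))
    pr, iterations = new, iterations + 1
    if change < 1e-8:
        break

print(iterations)  # geometric convergence; the exact count depends on the graph
```

The number of iterations is small; the practical bottleneck for a web-scale engine is moving the link data around, not the arithmetic, which is one reason the "iteration wait" reading of the sandbox is hard to sustain.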
|grant/others who think this is viable theory: What do you suppose changed in the way that G calcuates PR, around the time of spring 2004, that changed the environment such that new site "A" went from ranking well in a matter of weeks and staying there, to a situation where very similar new site "B" did not rank decently in the SERP's for roughly 12 months, give or take three months? |
Also, looking at the theory being presented, how is sitewide Pagerank being affected by near-dups in sub-directories now being knocked out of the index?
At this point in time, I don't think any "site-wide big picture" is totally divorced from what's happening with duplicate or near-duplicate detection within sites. It may or may not be an issue with "sandboxing" but if we're talking about sitewide PR to any degree whatsoever, then it isn't unrelated as far as overall calculations are concerned.
caveman, remember this thread and the other that you referenced? And the paper referenced?
That is an interesting situation JudgeJeffries. A few questions regarding the one page site if you don't mind answering them:
Does the one page site have outbound links? What kind of inbound links does the one page site have? You said the terms it was ranking for were competitive, how competitive is that in terms of the number of pages google shows for the queries?
I love tedster's comments above on linking.
Marcia, you read my mind. ;-)
IMO while it's always fun and interesting to have a fresh look at an old issue (or in this case two issues - PR, and the algo elements collectively referred to as sandboxing) ... and I love that grant's theory is stirring things up ... it doesn't really work for me personally. If you calculate PR once and then do it again 12 months later, you're doing it on two different Webs; a lot changes in 12 months.
Besides, the sandbox and its introduction (spring '04) and evolution coincide very well with what we know about certain papers regarding BR/LR and when they might have been able to be implemented in some fashion, and also dovetails nicely with the abundant evidence around the Web for the last 18 months or so that dup and near-dup issues play an important role at various levels of assessment. (That dang butterfly site is still giving me fits.)
Then, throw in some domain and link aging filters, and you got a thoroughly modern algo! ;-)
Anyway, as for the PR and sandboxing stuff, what the heck do I know, since AFAIK, there is no sandbox; just a series of algo and filter elements that prevent most but not all sites from ranking well for the first 3, 6, 9 or 12 months of their existence. :P