Forum Moderators: open
Are there any reactions to this article?
It is supposed to detect whether there is heavy interlinking between two sites, and if such interlinking is found, the value of those pages is discounted.
[webmasterworld.com...]
I think the maths level is beyond the members here (speaking for myself)
What it is looking for, within the web graph (the mapping of pages and links), is a subgraph that has incoming links but no outgoing links to other components of the web graph. The paper identifies this as "rank hoarding", i.e. getting the benefit of the incoming links without sharing any of that benefit through outgoing links.
Essentially, the web graph is "site"-blind -- it does not see the difference between one domain and another. What it sees is a set of pages linked together. If those pages are heavily linked among themselves with little to no outgoing links, the algorithm detects this and discounts the value of those pages.
The definition of a well-linked page is one whose outgoing links [to pages outside its subgraph] equal or exceed the incoming links arriving at its subgraph.
So we have 24 pages in one site and 24 pages in another, and each page links 15 times to the other site, creating 720 links between the two sites. Coming into this subgraph we have 100 incoming links and 0 outgoing links. This subgraph will be demoted.
Take the same setup but with 100 incoming links from 50 different subgraphs and 200 outgoing links to 100 subgraphs, and these two sites will not be demoted. The outgoing links and the incoming links save these two sites, as they map properly to other locations in spite of being heavily linked together. However, they gain no benefit from interlinking with each other so heavily. They would do better to reduce the interlinking and thus not form a single subgraph, so that both sites could benefit from the links received from each other. For this to happen they would need to reduce their 720 interlinks to fewer than their other incoming links -- to about 80 shared links, 40 each way -- and then they would form two smaller subgraphs and receive benefit from each other.
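To make that example concrete, here is a rough sketch in Python (purely my own illustration, not the paper's actual algorithm; the counts and the "well-linked" rule just mirror the reading given above):

```python
# Rough sketch of the check described above (my own illustration, not the
# paper's algorithm verbatim): does a candidate group of pages form a
# "leaf" subgraph - links coming in, none going out?

def classify_subgraph(graph, pages):
    """graph: dict mapping each page to the set of pages it links to.
       pages: the candidate subgraph, e.g. all pages of the two sites."""
    pages = set(pages)
    internal = incoming = outgoing = 0
    for src, targets in graph.items():
        for dst in targets:
            if src in pages and dst in pages:
                internal += 1      # e.g. the 720 cross-links in the example
            elif dst in pages:
                incoming += 1      # links arriving from the rest of the web
            elif src in pages:
                outgoing += 1      # links leaving toward the rest of the web
    if outgoing == 0:
        return "leaf subgraph - rank-hoarding candidate, likely demoted"
    if outgoing >= incoming:
        return "well linked - no demotion"
    return "somewhere in between"
```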
What I am asking is: is this complete, and are we missing something? When is this algorithm being implemented, or has it already started? For me it is a step in the right direction.
> When is this algorithm being implemented, or has it already started?
Difficult to say if "it" has started - and if so, to what level/effect. It could be only at the detection level - for manual intervention? They say it could lead to spam detection; that does not necessarily mean Google automatically neutralises any effects.
Subdomains, for example, are separate domains but seemingly will not get penalised for extensive cross-linking:
[webmasterworld.com...]
Which makes sense; I suppose they are treated as "site-internal links", or are discounted through a web-graph check similar to the one in the cited paper. If the latter is the case, there could already be neutralisation effects in place for heavily interlinked domain farms and whatever ranking benefits they might otherwise gain.
If you follow the number of threads here complaining about separate interlinked domains, I think a lot of this is still theory.
related: [webmasterworld.com...]
Also: possible discounting effects for "hoard-back-looping links":
[webmasterworld.com...]
Stepping back and looking at it from a practical point of view, I have wondered what the penalty would be if this type of spam were detected. Let's say you have a model train site. There are only so many other model train sites you can link to, and after a while they all form a closed loop of links to each other. This can happen easily for any number of subjects on the web. Is Google going to ban all these sites? I don't think so. Would they penalize all the sites equally so that, in essence, the SERPs for these sites remain the same? Maybe.
I also find it totally mind-blowing that Google would have the processing power to perform this type of calculation for the entire web. Almost as mind-boggling that they would attempt to detect the various hidden-text tricks that involve text over GIFs, CSS, etc.
Google does a fairly good job at spam detection, and they appear to go after the low-hanging fruit on the spam tree. This makes sense from a business standpoint. And the fact is that for the vast majority of searches the results are relevant, and that's all that Google is really trying to do.
Thanks for the post.. I'll certainly have a bash at reading it, but - vitaplease - I'm with you, it looks like too much maths ;)
I mentioned a while ago that our sites and our clients' sites form a sort of PageRank cul-de-sac, and I was wondering how Google would look on this.. (Not for PR reasons, just because we link to our clients in our portfolio and they have a 'designed by' link on their root pages.)
> When is this algorithm being implemented, or has it already started?
I would be surprised if Google don't do something along these lines already.. Since PDFs don't link out, there must be a solution to dissipate their PR, and pages that don't link out (even small groups of pages) have a pretty profound effect on the distribution of PR after 20 iterations..
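To illustrate that last point, here is a toy sketch (standard textbook PageRank with an assumed damping factor of 0.85, not anything from Google): if the rank flowing into a dangling page is simply dropped, the total PR in the system shrinks on every pass, which is presumably why some redistribution rule is needed.

```python
# Toy illustration (my own sketch, textbook PageRank, assumed damping of 0.85):
# rank flowing into a page with no outgoing links is never passed on,
# so the total drifts well below 1.0 over the iterations.

damping = 0.85

graph = {               # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["pdf"],
    "pdf": [],          # a PDF with no outgoing links: a dangling node
}

pr = {p: 1.0 / len(graph) for p in graph}
for _ in range(20):
    new = {p: (1 - damping) / len(graph) for p in graph}
    for src, targets in graph.items():
        for dst in targets:
            new[dst] += damping * pr[src] / len(targets)
    pr = new
    print(round(sum(pr.values()), 4))   # total PR leaks away each iteration
```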
<snip/>
I'm gonna read the paper before I type a'thing else! :)
This is maths?!
In simple terms, does this mean:
1) It does not pay to have a number of cross-linked sites, cross-linking multiple times from multiple pages.
2) It does not pay to restrict outbound links from your site to the extent of excluding all others (outside your family of cross-linked sites).
?
:-)
There are countless important sites on the net which have vastly more incoming links than outgoing ones.
But those sites don't interlink heavily. If MS, eBay, and Drudge Report linked to each other thousands of times, then we would be looking at a subgraph that this algorithm would spot. But *just* receiving a lot more incoming links than the number of outgoing ones doesn't do it.
Do you really care if there is duplicate content after page 3? A link farm that only gets to page 6?
I think they are already able to catch lots of stuff, but they don't have the processing power to catch it everywhere. If they catch the big bad spammers in the top 1% of search terms, they will also be cleaning up a lot of the lesser-searched terms. If it is 5% of the search terms, it will have an even bigger trickle-down effect.
It just won't catch the spammers that are specifically targeting keyphrases well below the magic line. "[business] [city] [country]" is not going to make it onto the radar yet.
Looked Greek to me!
lol! :)
Do they need to "penalize" per se? Could they not simply give those sites or groups of sites an imaginary link for every cross-link..
So if every page on www.aaa.com links to every page on www.bbb.com, then every page on www.aaa.com would also be given a link to zzz.com ... zzz.com could be some magic site that shares its PR with every page in the index once per iteration.. you could maybe have some allowance of cross-links per domain.. that way, over-cross-linking would be a disadvantage, but unwitting people wouldn't get unfairly or overly punished..
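Just to make that idea concrete (this is purely a sketch of the hypothetical scheme above -- zzz, the allowance, and the numbers are all made up, and nothing suggests Google actually does this):

```python
# Sketch of the hypothetical "zzz.com" idea above (entirely speculative):
# if two domains cross-link beyond some allowance, every page on the
# over-linking domain gets an extra phantom link to a virtual page,
# diluting what its real links can pass on.

from collections import defaultdict

CROSSLINK_ALLOWANCE = 40   # made-up allowance per domain pair
VIRTUAL_PAGE = "zzz"       # the "magic site" from the post above

def add_phantom_links(graph, domain_of):
    """graph: page -> list of pages it links to; domain_of: page -> its domain."""
    cross = defaultdict(int)
    for src, targets in graph.items():
        for dst in targets:
            a, b = domain_of.get(src), domain_of.get(dst)
            if a and b and a != b:
                cross[(a, b)] += 1
    offenders = {a for (a, b), n in cross.items() if n > CROSSLINK_ALLOWANCE}
    for page, targets in graph.items():
        if domain_of.get(page) in offenders:
            graph[page] = list(targets) + [VIRTUAL_PAGE]
    graph.setdefault(VIRTUAL_PAGE, [])   # its rank would later be spread over the whole index
    return graph
```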
If anything characterises their approach, it is extensive interlinking between multiple URLs and their pages (some with over 50,000 incoming links). One of them is still present in ATW; the other was already deleted from ATW a while ago. But they were both in the Google database until a couple of days ago.
All of their domains have a grey PR bar. Their DMOZ entries still remain in dmoz.org, but are gone in the Google version of the directory (in the cache of the Google DMOZ version they are not gone).
Because I thought that Google just didn't count those links, and hence those pages wouldn't come up in the SERPs.
> They were in the SERPs just a couple of days ago and now they're gone.
Another question is what happens if you link only one way extensively?
> Haven't seen something like that. In their projects everything is cross-linking. Have, however, seen something like the pic you posted; but site 2 in the example contains outbound links, yet doesn't link back to site 1. The pages it links to still remain in the SERPs.
Apart from this, on the first read I couldn't understand much of the paper; after 5 years out of engineering college I have lost my maths touch :) ...
But since Google came along, people try to minimize outbound external links, thinking that will reduce the PR passed to their other internal pages ... Maybe after this new stuff is implemented, that attitude will change again :)
"PageRank is defined as the stationary distribution of the Markov chain corresponding to the nXn stochastic transition matrix A^T" ;-)
LOL!
I will have to read it a few more times before I try to say what I think it means, but I think mil2k probably has it right. I think it's still in the theoretical stages, and has yet to be implemented as a spam-detection technique.
The important part is the implications, particularly the definition of PageRank and the spam-detection applications.
The first important implication, as gopi pointed out, is that this definition of PageRank almost certainly means that outbound links matter. If information about both inbound and outbound linking can be obtained from the eigenvectors of the Google matrix, then that information was used to construct the matrix in the first place. It is reasonable to speculate that the information will propagate through the calculation and show up in the PR. It is possible that this is not the case, and I doubt you can know for sure without knowing how they construct the matrix, but it seems likely.
Unfortunately, "nobody can be told what the (Google) matrix really is, they just have to (calculate) it for themselves." ;-)
The second implication is that the days of link spammers and PR hogs are numbered. I don't think they've implemented this, but I think they will. Note however that they just say that they can identify "leaf nodes", meaning subgraphs which do not link out. Mil2k, how did you derive the interpretation that you should link out at least as much as you are linked in? I don't see that from this paper, but it may be stated elsewhere, I don't know. Anyway, unless you have a leaf node, you don't have to worry about this. This is presumably what GG meant when he said it is easy to spot a PR hog, although he seems to be using a definition of "easy" which is a little different from mine. However, when you have the kind of computing power that Google has, your definition of easy might be a little different from most people's.
The broader implication of this is that Google wants the web to be well connected, and will implement algorithms to encourage this. This makes sense, and is yet another reason why you should listen to Brett.
The fact that PR is the result of an eigenvalue calculation has all sorts of interesting implications. Among other things, it explains how they can calculate your PR and that of the pages linked to you at the same time, without knowing the PR of the other pages beforehand. The PR of the whole web is calculated in one swell foop. Pretty awesome. No wonder the update is so long in coming!
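For anyone curious what "one swell foop" looks like in practice, here is a toy power-iteration sketch (textbook PageRank, my own illustration with an assumed damping factor; how Google actually builds its matrix is, of course, not public). Every page's rank is updated from every other page's rank on each pass, so the whole graph's PR converges together rather than page by page.

```python
# Toy power iteration (textbook PageRank, my own sketch - not Google's code).
# Returns an approximation of the ranks of ALL pages at once.

DAMPING = 0.85   # assumed value

def pagerank(graph, iterations=50):
    """graph: page -> list of pages it links to; every linked page must be a key."""
    n = len(graph)
    pr = {p: 1.0 / n for p in graph}
    for _ in range(iterations):
        new = {p: (1 - DAMPING) / n for p in graph}
        # redistribute the rank of dangling pages (no outgoing links) evenly
        dangling = sum(pr[p] for p, out in graph.items() if not out)
        for p in new:
            new[p] += DAMPING * dangling / n
        for src, out in graph.items():
            for dst in out:
                new[dst] += DAMPING * pr[src] / len(out)
        pr = new
    return pr

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": []}))
```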
So what does this mean about how you should build your site? It means bugger-all about how you should build your site. You should build your site for the users. That means, among other things, that it should be well connected to the rest of the web. All this paper means is that eventually Google is going to be able to detect certain kinds of rank-hoarding techniques and compensate for them. If you link out when appropriate, this doesn't apply to you, and if you don't, you should do so regardless of this paper.
It seems to frighten some of you, but this is really not difficult, and it is probably the best way to understand it; forget the transition formulas.
An excellent explanation of the meaning of this sentence is to be found in:
look especially at section 2-5
(This is one of the references of the article covered by this thread, reference number 11. I suppose this article is well known to most of you, but I am new to SEO as an amateur interest, and it was very instructive reading for me.)
I shall try to sum up how the PR algorithm works, as I have understood it from this article.
First, one (grossly false) assumption: there is no web page with no outgoing links (if you look at the transition formula, this means that the integer Nu appearing as a denominator is never zero).
Imagine one surfs the web with the following rule: begin at a page chosen at random, then click on random links. The proportion of visits you make to page widgets.com/blue_widgets is simply the PageRank of that page. The meaning of "Markov chain" and "stochastic transition matrix" is nothing more than that.
There are two obvious drawbacks to this algorithm. The first is that there are indeed pages with no outgoing links, and you have to define a rule for your surfer when he lands on one; the second is a bit more subtle: there are sites with no outgoing links. Imagine that there is a link
from dmoz.org/Widgets to widgets.com/blue_widgets, one from widgets.com/blue_widgets to widgets.com/red_widgets, and one from widgets.com/red_widgets to widgets.com/blue_widgets.
The consequence would be that a random surfer clicking on the link on dmoz would be trapped for his entire life on widgets.com. Indeed, any surfer would sooner or later be trapped on some such selfish website, and would spend all but a finite number of days of his life on such sites; consequently the PageRank of any philanthropic site linking to somebody else would be zero, and PageRank would be shared only between selfish sites. Not a very reasonable algorithm!
That is where the vector E comes into consideration: the surfer rolls a die and, one time in six, does not follow a link but goes to a random page anywhere on the world wide web. This prevents selfish sites from soaking up all the PageRank.
The PageRank of any page is then the proportion of time spent on that page.
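Taking the random-surfer description literally, here is a tiny simulation (my own sketch; the one-in-six teleport chance is just the dice metaphor above -- the published formula uses a somewhat different teleport probability):

```python
# Tiny random-surfer simulation of the model described above (my own sketch).
# The fraction of steps the surfer spends on each page approximates its PageRank.

import random

TELEPORT = 1 / 6   # "one time in six" from the dice metaphor

def simulate(graph, steps=200_000):
    pages = list(graph)
    visits = {p: 0 for p in pages}
    page = random.choice(pages)                # begin at a page chosen at random
    for _ in range(steps):
        visits[page] += 1
        if random.random() < TELEPORT or not graph[page]:
            page = random.choice(pages)        # teleport (or escape a dead end)
        else:
            page = random.choice(graph[page])  # click on a random link
    return {p: visits[p] / steps for p in pages}

toy = {"dmoz": ["blue", "red"], "blue": ["red"], "red": ["blue"]}
print(simulate(toy))   # blue/red trap most of the time, but teleporting keeps dmoz alive
```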
Since the computation of such probabilities is very straightforward with standard matrix iteration techniques, it seems quite unlikely that any "deep" modifications to this algorithm have been introduced since then. Of course, some details are not given in Google's brief description of the algorithm: for instance, if I put two links to page B and one link to page C on a page A, does my surfer go to B twice as often as to C? Are "dead-end" pages still treated as hinted in the aforementioned article (section 2-7)?
Now I would be interested to know more (preferably not from "experimental speculation", but from some published article) about what E really is - but it might be the main secret of Google.
The intuitive meaning of E is easily explained with the random surfer model: if E is a constant vector - a column with 3,083,324,652 entries, each equal to 1/3,083,324,652 - it would mean that the surfer is equally likely to go to any page on the web when the die roll says "teleport".
At the other extreme, the entries of E could be all zero except the one for, say, www.dmoz.org, which would be 1; it would mean that every time the die tells the surfer not to follow a link, he goes to the homepage of DMOZ. The most quoted formula about PageRank assumes that E is constant - the first model. Now you should read section 6 of the article carefully: a uniform E is not without drawbacks. Other models are discussed in the article, with their respective pluses and minuses, and probably the real E used by Google is something more complex. As a pure guess, it would seem reasonable, for instance, to give some extra weight in E to pages containing many outgoing links (this would be a way to compensate for PR hoarding).
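To make the two extremes concrete (again only a sketch; as said above, the real E is not public):

```python
# Sketch of the two extremes for E discussed above (the real E is not public).
# E gives the probability of landing on each page when the surfer teleports.

pages = ["dmoz", "blue", "red"]

# Uniform E: teleport equally to any page (the most quoted formula).
E_uniform = {p: 1.0 / len(pages) for p in pages}

# Concentrated E: every teleport goes to the DMOZ homepage.
E_dmoz = {p: (1.0 if p == "dmoz" else 0.0) for p in pages}

# In the iteration, E replaces the uniform (1 - damping)/n term:
#   new[p] = (1 - damping) * E[p] + damping * (rank flowing in via links)
```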
Does somebody here know more about the contents of the E column?