Forum Moderators: open
Are there any reactions to this article?
It is supposed to detect whether there is heavy interlinking between two sites, and if such interlinking is found, the value of those pages is discounted.
[webmasterworld.com...]
I think the maths level is beyond the members here (speaking for myself)
What it is looking for, within the web graph (the mapping of pages and links), is a subgraph that has incoming links but no outgoing links to other components of the web graph. The paper identifies this as "rank hoarding", i.e. getting the benefit of the incoming links without sharing any of that benefit through outgoing links.
Essentially, the web graph is "site"-blind -- it does not see the difference between one domain and another. What it sees is a set of pages linked together. If those pages are heavily linked among themselves with little to no outgoing links, the algorithm detects this and discounts the value of those pages.
The definition of a well-linked page is one whose outgoing links [to pages outside its subgraph] equal or exceed the incoming links arriving at its subgraph.
So we have 24 pages in one site and 24 pages in another, and each page links 15 times to the other site, creating 720 links between the two sites. Coming into this subgraph we have 100 incoming links and 0 outgoing links. This subgraph will be demoted.
Take the same setup but with 100 incoming links from 50 different subgraphs and 200 outgoing links to 100 subgraphs, and these two sites will not be demoted. The outgoing links and the incoming links save these two sites, as they map properly to other locations in spite of being heavily linked together. However, they gain no benefit from interlinking with each other so heavily. They would do better to reduce the interlinking and thus not form a single subgraph, so that both sites could benefit from the links received from each other. For this to happen they would need to reduce their 720 interlinks to fewer than their other incoming links -- to about 80 shared links, 40 each way -- and then they would form two smaller subgraphs and receive benefit from each other.
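To make that example concrete, here is a rough sketch in Python (purely my own illustration, not the paper's actual algorithm; the counts and the "well-linked" rule just mirror the reading given above):

```python
# Rough sketch of the check described above (my own illustration, not the
# paper's algorithm verbatim): does a candidate group of pages form a
# "leaf" subgraph - links coming in, none going out?

def classify_subgraph(graph, pages):
    """graph: dict mapping each page to the set of pages it links to.
       pages: the candidate subgraph, e.g. all pages of the two sites."""
    pages = set(pages)
    internal = incoming = outgoing = 0
    for src, targets in graph.items():
        for dst in targets:
            if src in pages and dst in pages:
                internal += 1      # e.g. the 720 cross-links in the example
            elif dst in pages:
                incoming += 1      # links arriving from the rest of the web
            elif src in pages:
                outgoing += 1      # links leaving toward the rest of the web
    if outgoing == 0:
        return "leaf subgraph - rank-hoarding candidate, likely demoted"
    if outgoing >= incoming:
        return "well linked - no demotion"
    return "somewhere in between"
```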
What I am asking is: is this complete, and are we missing something? When is this algorithm being implemented, or has it already started? For me it is a step in the right direction.
> When is this algorithm being implemented, or has it already started?
Difficult to say if "it" has started - and if so, to what level/effect. It could be only at the detection level - for manual intervention? They say it could lead to spam detection; that does not necessarily mean Google automatically neutralises any effects.
Subdomains, for example, are separate domains but seemingly will not get penalised for extensive cross-linking:
[webmasterworld.com...]
Which makes sense; I suppose they are treated as "site-internal links", or are discounted through a web-graph check similar to the one in the cited paper. If the latter is the case, there could already be neutralisation effects in place for heavily interlinked domain farms and whatever ranking benefits they might otherwise gain.
If you follow the number of threads here complaining about separate interlinked domains, I think a lot of this is still theory.
related: [webmasterworld.com...]
Also: possible discounting effects for "hoard-back-looping links":
[webmasterworld.com...]
Stepping back and looking at it from a practical point of view, I have wondered what the penalty would be if this type of spam were detected. Let's say you have a model train site. There are only so many other model train sites you can link to, and after a while they all form a closed loop of links to each other. This can happen easily for any number of subjects on the web. Is Google going to ban all these sites? I don't think so. Would they penalize all the sites equally so that, in essence, the SERPs for these sites remain the same? Maybe.
I also find it totally mind-blowing that Google would have the processing power to perform this type of calculation for the entire web. Almost as mind-boggling that they would attempt to detect the various hidden-text tricks that involve text over GIFs, CSS, etc.
Google does a fairly good job at spam detection, and they appear to go after the low-hanging fruit on the spam tree. This makes sense from a business standpoint. And the fact is that for the vast majority of searches the results are relevant, and that's all that Google is really trying to do.
Thanks for the post.. I'll certainly have a bash at reading it, but - vitaplease - I'm with you, it looks like too much maths ;)
I mentioned a while ago that our sites and our clients' sites form a sort of PageRank cul-de-sac, and I was wondering how Google would look on this.. (Not for PR reasons, just because we link to our clients in our portfolio and they have a 'designed by' link on their root pages.)
> When is this algorithm being implemented, or has it already started?
I would be surprised if Google don't do something along these lines already.. Since PDFs don't link out, there must be a solution to dissipate their PR, and pages that don't link out (even small groups of pages) have a pretty profound effect on the distribution of PR after 20 iterations..
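To illustrate that last point, here is a toy sketch (standard textbook PageRank with an assumed damping factor of 0.85, not anything from Google): if the rank flowing into a dangling page is simply dropped, the total PR in the system shrinks on every pass, which is presumably why some redistribution rule is needed.

```python
# Toy illustration (my own sketch, textbook PageRank, assumed damping of 0.85):
# rank flowing into a page with no outgoing links is never passed on,
# so the total drifts well below 1.0 over the iterations.

damping = 0.85

graph = {               # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["pdf"],
    "pdf": [],          # a PDF with no outgoing links: a dangling node
}

pr = {p: 1.0 / len(graph) for p in graph}
for _ in range(20):
    new = {p: (1 - damping) / len(graph) for p in graph}
    for src, targets in graph.items():
        for dst in targets:
            new[dst] += damping * pr[src] / len(targets)
    pr = new
    print(round(sum(pr.values()), 4))   # total PR leaks away each iteration
```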
<snip/>
I'm gonna read the paper before I type a'thing else! :)
This is maths?!
In simple terms, does this mean:
1) It does not pay to have a number of cross-linked sites, cross-linking multiple times from multiple pages.
2) It does not pay to restrict outbound links from your site to the extent of excluding all others (outside your family of cross-linked sites).
?
:-)
There are countless important sites on the net which have vastly more incoming links than outgoing ones.
But those sites don't interlink heavily. If MS, eBay, and Drudge Report linked to each other thousands of times, then we would be looking at a subgraph that this algorithm would spot. But *just* receiving a lot more incoming links than the number of outgoing ones doesn't do it.
Do you really care if there is duplicate content after page 3? A link farm that only gets to page 6?
I think they are already able to catch lots of stuff, but they don't have the processing power to catch it everywhere. If they catch the big bad spammers in the top 1% of search terms, they will also be cleaning up a lot of the lesser-searched terms. If it is 5% of the search terms, it will have an even bigger trickle-down effect.
It just won't catch the spammers that are specifically targeting keyphrases well below the magic line. "[business] [city] [country]" is not going to make it onto the radar yet.
Looked Greek to me!
lol! :)
Do they need to "penalize" per se? Could they not simply give those sites or groups of sites an imaginary link for every cross-link..
So if every page on www.aaa.com links to every page on www.bbb.com, then every page on www.aaa.com would also be given a link to zzz.com ... zzz.com could be some magic site that shares its PR with every page in the index once per iteration.. you could maybe have some allowance of cross-links per domain.. that way, over-cross-linking would be a disadvantage, but unwitting people wouldn't get unfairly or overly punished..
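Just to make that idea concrete (this is purely a sketch of the hypothetical scheme above -- zzz, the allowance, and the numbers are all made up, and nothing suggests Google actually does this):

```python
# Sketch of the hypothetical "zzz.com" idea above (entirely speculative):
# if two domains cross-link beyond some allowance, every page on the
# over-linking domain gets an extra phantom link to a virtual page,
# diluting what its real links can pass on.

from collections import defaultdict

CROSSLINK_ALLOWANCE = 40   # made-up allowance per domain pair
VIRTUAL_PAGE = "zzz"       # the "magic site" from the post above

def add_phantom_links(graph, domain_of):
    """graph: page -> list of pages it links to; domain_of: page -> its domain."""
    cross = defaultdict(int)
    for src, targets in graph.items():
        for dst in targets:
            a, b = domain_of.get(src), domain_of.get(dst)
            if a and b and a != b:
                cross[(a, b)] += 1
    offenders = {a for (a, b), n in cross.items() if n > CROSSLINK_ALLOWANCE}
    for page, targets in graph.items():
        if domain_of.get(page) in offenders:
            graph[page] = list(targets) + [VIRTUAL_PAGE]
    graph.setdefault(VIRTUAL_PAGE, [])   # its rank would later be spread over the whole index
    return graph
```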
If anything characterises their approach, it is extensive interlinking between multiple URLs and their pages (some with over 50,000 incoming links). One of them is still present in ATW; the other was already deleted from ATW a while ago. But they were both in the Google database until a couple of days ago.
All of their domains have a grey PR bar. Their DMOZ entries still remain in dmoz.org, but are gone in the Google version of the directory (in the cache of the Google DMOZ version they are not gone).
Because I thought that Google just didn't count those links, and hence those pages wouldn't come up in the SERPs.
> They were in the SERPs just a couple of days ago and now they're gone.
Another question is what happens if you link only one way extensively?
> Haven't seen something like that. In their projects everything is cross-linking. Have, however, seen something like the pic you posted; but site 2 in the example contains outbound links, yet doesn't link back to site 1. The pages it links to still remain in the SERPs.
Apart from this, on the first read I couldn't understand much of the paper; after 5 years out of engineering college I have lost my maths touch :) ...
But since Google came along, people try to minimize outbound external links, thinking that will reduce the PR passed to their other internal pages ... Maybe after this new stuff is implemented, that attitude will change again :)
"PageRank is defined as the stationary distribution of the Markov chain corresponding to the nXn stochastic transition matrix A^T" ;-)
LOL!
I will have to read it a few more times before I try to say what I think it means, but I think mil2k probably has it right. I think it's still in the theoretical stages, and has yet to be implemented as a spam-detection technique.
The important part is the implications, particularly the definition of PageRank and the spam-detection applications.
The first important implication, as gopi pointed out, is that this definition of PageRank almost certainly means that outbound links matter. If information about both inbound and outbound linking can be obtained from the eigenvectors of the Google matrix, then that information was used to construct the matrix in the first place. It is reasonable to speculate that the information will propagate through the calculation and show up in the PR. It is possible that this is not the case, and I doubt you can know for sure without knowing how they construct the matrix, but it seems likely.
Unfortunately, "nobody can be told what the (Google) matrix really is, they just have to (calculate) it for themselves." ;-)
The second implication is that the days of link spammers and PR hogs are numbered. I don't think they've implemented this, but I think they will. Note however that they just say that they can identify "leaf nodes", meaning subgraphs which do not link out. Mil2k, how did you derive the interpretation that you should link out at least as much as you are linked in? I don't see that from this paper, but it may be stated elsewhere, I don't know. Anyway, unless you have a leaf node, you don't have to worry about this. This is presumably what GG meant when he said it is easy to spot a PR hog, although he seems to be using a definition of "easy" which is a little different from mine. However, when you have the kind of computing power that Google has, your definition of easy might be a little different from most people's.
The broader implication of this is that Google wants the web to be well connected, and will implement algorithms to encourage this. This makes sense, and is yet another reason why you should listen to Brett.
The fact that PR is the result of an eigenvalue calculation has all sorts of interesting implications. Among other things, it explains how they can calculate your PR and that of the pages linked to you at the same time, without knowing the PR of the other pages beforehand. The PR of the whole web is calculated in one swell foop. Pretty awesome. No wonder the update is so long in coming!
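For anyone curious what "one swell foop" looks like in practice, here is a toy power-iteration sketch (textbook PageRank, my own illustration with an assumed damping factor; how Google actually builds its matrix is, of course, not public). Every page's rank is updated from every other page's rank on each pass, so the whole graph's PR converges together rather than page by page.

```python
# Toy power iteration (textbook PageRank, my own sketch - not Google's code).
# Returns an approximation of the ranks of ALL pages at once.

DAMPING = 0.85   # assumed value

def pagerank(graph, iterations=50):
    """graph: page -> list of pages it links to; every linked page must be a key."""
    n = len(graph)
    pr = {p: 1.0 / n for p in graph}
    for _ in range(iterations):
        new = {p: (1 - DAMPING) / n for p in graph}
        # redistribute the rank of dangling pages (no outgoing links) evenly
        dangling = sum(pr[p] for p, out in graph.items() if not out)
        for p in new:
            new[p] += DAMPING * dangling / n
        for src, out in graph.items():
            for dst in out:
                new[dst] += DAMPING * pr[src] / len(out)
        pr = new
    return pr

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": []}))
```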
So what does this mean about how you should build your site? It means bugger-all about how you should build your site. You should build your site for the users. That means, among other things, that it should be well connected to the rest of the web. All this paper means is that eventually Google is going to be able to detect certain kinds of rank-hoarding techniques and compensate for them. If you link out when appropriate, this doesn't apply to you, and if you don't, you should do so regardless of this paper.
It seems to frighten some of you, but this is really not difficult, and it is probably the best way to understand it; forget the transition formulas.
An excellent explanation of the meaning of this sentence is to be found in:
look especially at section 2-5
(This is one of the references of the article covered by this thread, reference number 11. I suppose this article is well known to most of you, but I am new to SEO as an amateur interest, and it was very instructive reading for me.)
I shall try to sum up how the PR algorithm works, as I have understood it from this article.
First, one (grossly false) assumption: there is no web page with no outgoing links (if you look at the transition formula, this means that the integer Nu appearing as a denominator is never zero).
Imagine one surfs the web with the following rule: begin at a page chosen at random, then click on random links. The proportion of visits you make to page widgets.com/blue_widgets is simply the PageRank of that page. The meaning of "Markov chain" and "stochastic transition matrix" is nothing more than that.
There are two obvious drawbacks to this algorithm. The first is that there are indeed pages with no outgoing links, and you have to define a rule for your surfer when he lands on one; the second is a bit more subtle: there are sites with no outgoing links. Imagine that there is a link
from dmoz.org/Widgets to widgets.com/blue_widgets, one from widgets.com/blue_widgets to widgets.com/red_widgets, and one from widgets.com/red_widgets to widgets.com/blue_widgets.
The consequence would be that a random surfer clicking on the link on dmoz would be trapped for his entire life on widgets.com. Indeed, any surfer would sooner or later be trapped on some such selfish website, and would spend all but a finite number of days of his life on such sites; consequently the PageRank of any philanthropic site linking to somebody else would be zero, and PageRank would be shared only between selfish sites. Not a very reasonable algorithm!
That is where the vector E comes into consideration: the surfer rolls a die and, one time in six, does not follow a link but goes to a random page anywhere on the world wide web. This prevents selfish sites from soaking up all the PageRank.
The PageRank of any page is then the proportion of time spent on that page.
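Taking the random-surfer description literally, here is a tiny simulation (my own sketch; the one-in-six teleport chance is just the dice metaphor above -- the published formula uses a somewhat different teleport probability):

```python
# Tiny random-surfer simulation of the model described above (my own sketch).
# The fraction of steps the surfer spends on each page approximates its PageRank.

import random

TELEPORT = 1 / 6   # "one time in six" from the dice metaphor

def simulate(graph, steps=200_000):
    pages = list(graph)
    visits = {p: 0 for p in pages}
    page = random.choice(pages)                # begin at a page chosen at random
    for _ in range(steps):
        visits[page] += 1
        if random.random() < TELEPORT or not graph[page]:
            page = random.choice(pages)        # teleport (or escape a dead end)
        else:
            page = random.choice(graph[page])  # click on a random link
    return {p: visits[p] / steps for p in pages}

toy = {"dmoz": ["blue", "red"], "blue": ["red"], "red": ["blue"]}
print(simulate(toy))   # blue/red trap most of the time, but teleporting keeps dmoz alive
```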
Since the computation of such probabilities is very straightforward with standard matrix iteration techniques, it seems quite unlikely that any "deep" modifications to this algorithm have been introduced since then. Of course, some details are not given in Google's brief description of the algorithm: for instance, if I put two links to page B and one link to page C on a page A, does my surfer go to B twice as often as to C? Are "dead-end" pages still treated as hinted in the aforementioned article (section 2-7)?
Now I would be interested to know more (preferably not from "experimental speculation", but from some published article) about what E really is - but it might be the main secret of Google.
The intuitive meaning of E is easily explained with the random surfer model: if E is a constant vector - a column with 3,083,324,652 entries, each equal to 1/3,083,324,652 - it would mean that the surfer is equally likely to go to any page on the web when the die roll says "teleport".
At the other extreme, the entries of E could be all zero except the one for, say, www.dmoz.org, which would be 1; it would mean that every time the die tells the surfer not to follow a link, he goes to the homepage of DMOZ. The most quoted formula about PageRank assumes that E is constant - the first model. Now you should read section 6 of the article carefully: a uniform E is not without drawbacks. Other models are discussed in the article, with their respective pluses and minuses, and probably the real E used by Google is something more complex. As a pure guess, it would seem reasonable, for instance, to give some extra weight in E to pages containing many outgoing links (this would be a way to compensate for PR hoarding).
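To make the two extremes concrete (again only a sketch; as said above, the real E is not public):

```python
# Sketch of the two extremes for E discussed above (the real E is not public).
# E gives the probability of landing on each page when the surfer teleports.

pages = ["dmoz", "blue", "red"]

# Uniform E: teleport equally to any page (the most quoted formula).
E_uniform = {p: 1.0 / len(pages) for p in pages}

# Concentrated E: every teleport goes to the DMOZ homepage.
E_dmoz = {p: (1.0 if p == "dmoz" else 0.0) for p in pages}

# In the iteration, E replaces the uniform (1 - damping)/n term:
#   new[p] = (1 - damping) * E[p] + damping * (rank flowing in via links)
```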
Does somebody here know more about the contents of the E column?