PageRank Leakage and Large Directory Sites

Why doesn't PR leakage seem to make much difference?

Advenlo

9:53 am on Apr 22, 2004 (gmt 0)

OK, still struggling with PR. My understanding is:

Each page leaks PR when linked to another Page
PR can be recycled in a closed system
Where system is open PR leaks to external page and bye bye.
PR retained is (1-d) where d is the damping factor, often assumed to be 0.85
Hence about 15% is retained on each iterative cycle

However, there appear to be a large number of directory sites with huge PR (7+) and thousands of external links. I’m not thinking of the Yahoos of this world, where the PR benefit of having sites linked to them might make up for this loss. But what about the smaller directories? How do they manage to have PR7s on so many pages with what in theory should be huge PR losses from the external links?

Sorry if I’ve missed an obvious point, as I’m still on quite a steep learning curve. Thanks in advance for any pointers in the right direction

ciml

12:04 pm on Apr 22, 2004 (gmt 0)

The slope of the Toolbar scale is steep. There have been discussions before on the logarithmic base of the scale (or of the coefficient at base x).

Effectively, it comes down to this question: A PR(n) link from a page with 1 link is the same as a link from a PR(n+1) page each with how many links?

Imagine a Web directory with PR given only to its home page and no links back up (other than to the home page). If the number of subcategories per category matches the answer to the question I ask above, then we can expect PageRank to cascade through the levels, reducing by exactly one notch per level.
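To make the question above concrete, here is a minimal sketch that assumes the Toolbar value is simply the logarithm of the real PR to some base. The base of 6 is purely a placeholder (nobody in this thread gives a number), damping is ignored, and each category is assumed to have exactly BASE subcategory links.

```python
import math

BASE = 6  # hypothetical base; the real Toolbar scale is not published

def toolbar(real_pr, base=BASE):
    """Assumed mapping from real PR to a Toolbar-style notch (log to `base`)."""
    return math.log(real_pr) / math.log(base)

# A home page sitting at notch 7 on this assumed scale.
home_pr = float(BASE ** 7)

# Each level splits its PR evenly over BASE subcategory links (damping ignored),
# so a page at depth n receives home_pr / BASE**n.
levels = [home_pr / BASE ** depth for depth in range(4)]
notches = [round(toolbar(pr), 6) for pr in levels]
print(notches)
```

Under these assumptions the answer to the question is the base itself: a link from a PR(n+1) page with BASE links is worth the same as a PR(n) link from a page with one link, and each directory level drops exactly one notch.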

If this was a site with a pyramid structure and no external links, where each bottom level page links to the home page, then each page on the site would have more PageRank than in the directory case. The additional amount of PageRank given to each page would be increased by the same amount, so the distribution of PR through the site would be the same as the imaginary Web directory.

So, why so many PR7s in a directory site? Because the site has a lot of PR and it takes a lot of dilution to show on the Toolbar scale.

Advenlo

12:20 pm on Apr 22, 2004 (gmt 0)

Thanks for the reply but I think I've missed a step. Each page starts with a low PR when it comes into existence does it not?

The accumulated PR on any site can only come from (i) internal links or (ii) external links. As these directories don't appear to have the quantity of external links to justify such huge PR, and have diluted their own internal links' PR, I don't understand how they can obtain so many high PR pages.

But if my assumptions in the first question were valid I guess there must be more external links than expected, even if they don't request link backs.

ciml

1:12 pm on Apr 22, 2004 (gmt 0)

For the same reason as the description above, a steep Toolbar PR scale, the accumulated PR from the rank source is negligible in almost all imaginable cases.

So I would expect the PR to come from links; not necessarily to the home page though.

WebFisher

3:17 pm on Apr 22, 2004 (gmt 0)

A cleanly interlinked medium-scale directory which is deemed *reputable* by Gbot (which is reflected in the frequency of site updates, preferential treatment in SERPS and other things) can easily gain good PR because it receives links back from a wide chain of deep level internal pages. The entire idea of multi-iterative PR calculation suggests the idea of cross-linking within the site because the PR that you get to home from external links is then distributed and multiplied as it passes through the site.

doc_z

8:50 pm on Apr 22, 2004 (gmt 0)

PR retained is (1-d) where d is the damping factor, often assumed to be 0.85
Hence about 15% is retained ...

In the original algorithm there is indeed a 'self-contribution' of (1-d) for each page. However, this is an absolute value (for the contribution to the real PR) and not a relative factor. Also, the fact that the target pages benefit from your links (i.e. they receive d*PR) doesn't mean that you're losing PR (compared to the situation where the page is a dead end).

... is retained on each iterative cycle

The entire idea of multi-iterative PR calculation ...

The whole PR system has nothing to do with iteration. The originally described iteration scheme is just a technique to solve the equation system, i.e. a method to get numerical values close to the exact solution in a computationally inexpensive way.
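A small sketch of that point, using the original formula on a made-up three-page graph: the iteration is started from three very different seed values and ends up at the same numbers each time. The graph and the function name are illustrative assumptions, not anything Google has published.

```python
def pagerank(links, seed, d=0.85, rounds=200):
    """Repeatedly apply PR(A) = (1-d) + d * sum(PR(T)/C(T)), starting from `seed`."""
    pr = {page: seed for page in links}
    for _ in range(rounds):
        nxt = {page: 1 - d for page in links}
        for src, out in links.items():
            for dst in out:
                # src passes an equal share of its PR along each of its links
                nxt[dst] += d * pr[src] / len(out)
        pr = nxt
    return pr

# Hypothetical graph: page -> list of pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

results = {seed: pagerank(links, seed) for seed in (0.0, 1.0, 50.0)}
for seed, pr in results.items():
    print(seed, {page: round(value, 6) for page, value in pr.items()})
```

The seed only affects how many rounds are needed; the values the loop settles on are fixed by the link structure and d.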

Ciml explained Google's current implementation of the PR algorithm very well. Indeed, the self-contribution is negligible. For practical purposes this means that you need high PR to get a large site spidered. However, this is different from the original algorithm. In the case of the original algorithm you would produce a significant amount of PR if you had a large site.

ciml

12:09 pm on Apr 23, 2004 (gmt 0)

doc_z brings up one of the Big Questions I have about PageRank: Why is the (1-d) factor so small?

In a Web of fifty pages, you'd expect rank source to be noticeable, yet in the 4 billion page Web it's negligible as doc_z points out.

In a 4 billion page Web with small communities of interest (say averaging fifty pages each), the rank source would be noticeable. Only because the Web is so heavily connected does PageRank concentrate on so few high PR sites.

Well linked pages among a Web of obscurity could be compared to dense planets in a solar system of near vacuum. I think.

BigDave

5:21 pm on Apr 23, 2004 (gmt 0)

Pages do not leak PR. When people talk about PR leak, they are referring to a *site* leaking PR.

Or more precisely, the page does not lose any PR by linking to other pages, but if there is a link to someone else's page, the PR that is passed to that page is not added back into your own site.

Even large directories feed a huge amount of PR back into their own system. As an example, I will go with my own DMOZ category.

It is PR7, and there are only 8 sites listed. WooHoo! That should be a real mother lode!

Except that there are 66 links on that page. Including all the listings and other copies of the directory and search engines that they point to, there are 24 links off the site, and 42 that are recirculating PR back into the site.
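As a quick back-of-the-envelope check of those numbers, assuming (as in the original formula) that a page's outgoing PR is split equally over all of its links:

```python
total_links = 66
external_links = 24
internal_links = total_links - external_links  # 42, as stated in the post

# Equal split per link is an assumption taken from the original PR formula.
share_internal = internal_links / total_links
share_external = external_links / total_links

print(f"{share_internal:.1%} of the passed-on PR stays in the site")
print(f"{share_external:.1%} goes to other sites")
```

So even on a page whose whole purpose is to list other sites, roughly two thirds of what it passes on is recirculated.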

And you may notice that most of the top levels of directories have very few external links, so they only "leak" a small amount of PR.

The links coming into these directories more than make up for what they leak.

doc_z

7:49 pm on Apr 23, 2004 (gmt 0)

Why is the (1-d) factor so small?

I don't think that the large number of pages which are spidered by Google is the reason that the (1-d) factor doesn't play a role. Independent of the number of pages, the average PR is lower than or equal to 1 (equal in the case of no dead ends) according to the original algorithm. Therefore, one would expect that for most of the pages this contribution plays a role. However, measurements show that the factor is negligible.

Also, assuming that Google still uses the original algorithm results in more problems. A page which should have a PR of one (according to the original algorithm) shows a toolbarPR of zero. Combining these results with those for measurements of the logarithmic scale leads to a PR for the highest page which is greater than the total PR of the system. Another problem results from the fact that it seems that you need some PR to get a page spidered. Examinations show that pages which should have a PR of one are not spidered. Of course, for a system where the average PR is lower than or equal to one this won't work.

The conclusion is that Google has changed the algorithm - to prevent generating PR by generating billions of pages and to get better results for the SERPS.

Of course, setting the self contribution from (1-d) to zero or a smaller value won't explain the results. The first one leads to PR=0 for all pages, the latter one is just a rescale of the whole system.
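A sketch of why that is, using the original formula with the constant (1-d) term replaced by an adjustable `source` value. The three-page graph and the parameter name are made up for illustration.

```python
def pagerank(links, source, d=0.85, rounds=200):
    """Iterate PR(A) = source + d * sum(PR(T)/C(T)); `source` replaces the (1-d) term."""
    pr = {page: source for page in links}
    for _ in range(rounds):
        nxt = {page: source for page in links}
        for src, out in links.items():
            for dst in out:
                nxt[dst] += d * pr[src] / len(out)
        pr = nxt
    return pr

# Hypothetical graph: page -> list of pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

full = pagerank(links, source=0.15)    # the original self contribution
tenth = pagerank(links, source=0.015)  # ten times smaller
zero = pagerank(links, source=0.0)     # no self contribution at all

for page in links:
    print(page, round(full[page], 6), round(tenth[page], 6), zero[page])
```

Every value in `tenth` is one tenth of the matching value in `full`, so the relative distribution is unchanged, and with a source of zero nothing ever enters the system.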

rharri

8:39 pm on Apr 23, 2004 (gmt 0)

Was recently asked to exchange links with an on-topic directory. Google toolbar showed good PR. Links looked clean in IE and Firefox. Happened to have Opera open and looked and saw this:
javascript:location='ht'+'tp://www.snoggledrip.htm'.

I'll bet 90% of this directory's recips don't know that they are getting no benefit from their link.

Powdork

9:12 pm on Apr 23, 2004 (gmt 0)

Also, sometimes you will see they are just reloading a full page frame into the frameset of the home page, which holds the high PR. And sometimes the inner pages will be the home page plus a query string.

Powdork

12:04 am on Apr 24, 2004 (gmt 0)

I'll bet 90% of this directory's recips don't know that they are getting no benefit from their link.

or they are getting traffic, and don't care.

seomike2003

3:53 am on Apr 24, 2004 (gmt 0)

Wow I come in here and see all these equations you'd think we're in the presence of all the google engineers :)

Anyway to ignore what everyone said he he I'll just answer what the main question was.

>>Why doesn't PR leakage seem to make much difference?

Well after finishing the structure of a rather large directory as a side project and populating the thing with some pretty hefty sites I was amazed that after a few months it only went to a page rank of 2. Ha! LOL. So I double checked the crosslinking making sure that I dotted all my i's and crossed all my t's. Still nothing major. I added more sites and it still didn't build any PR. So I decided to drop some hefty links into it to see what it did.

To describe what happened: it was like shooting a laser into a hall of mirrors LOL. I put a heavy PR link into a subcategory 4 levels deep and PR bounced up to 5 and started to go up and down through the categories. I put a powerhouse link into the index and PR shot through the first couple of categories like wildfire.

Now technically a directory is nothing more than a link farm, so what keeps it from being fodder for a filter? Crosslinking everything correctly. More outbound than inbound linking. That makes it a hub, which is a good thing :) Now we can quote algorithms and the original Stanford papers by G's founding fathers but those are years old and have probably changed a hundred times.

So "real world results" are what I like and if a link from a PR 5 home page to the PR 2 directory home page didn't make the PR 5 drop to 4 or 3 then I wouldn't worry about PR leakage too much. Like you said, it doesn't seem to make a difference. In fact link popularity was more geared to this type of linking. 2-3 links going outbound in articles to other sites were the heart of the link pop theory, not a dedicated page with fabricated reciprocal links that is named "links.html" etc etc. That is just an exploit of a pretty dang good idea.

If ya really wanna read up on hubs and authorities I'd recommend searching for literature by Mike Grehan.

But anyways, do these outbound links affect PR in the sites that are linked to? Yup, I saw the same effect on a new site that was linked to from my directory.

When people submit to Y! or ODP they think they're gonna get a huge bounce in PR. Not if you're buried lol. Some of Y!'s directory pages are 1 and 2. Not much of a help there. Great way to get spidered though, but costly, but that's a discussion for a different forum.

Hope my jabbering helped. I'm sure what I said will be thoroughly ripped apart but the proof is in the pudding.

doc_z

10:33 am on Apr 24, 2004 (gmt 0)

So "real world results" are what I like and if a link from a PR 5 home page to the PR 2 directory home page didn't make the PR 5 drop to 4 or 3 ...

I didn't see anyone who mentioned such a drastic effect. The question was rather why the leakage is (in practice) relatively small. Also, adding an external link to a PR5 home page primarily affects the PR of the inner pages, while the effect for the home page is of second order. Finally, you can't compare different situations in time (PR of the home page before and after adding a link) because the PR of the home page would change even if the page itself is unchanged, due to changes in the PR of the pages linking to you, the number of links on those pages, a rescale of the ToolbarPR, a change in the damping factor and so on.

WebFisher

10:58 am on Apr 24, 2004 (gmt 0)

Doc_z - by *multi-iterative* PR calculation I meant that PR that you give to internal pages then comes back to your home in case there's a link to home on those pages which increases home page value which gives back increased PR to internals which increases the cumulative mathematical PR value of home which gives it back to internals again - and on and on it goes - and how many times it goes in circles depends on how cleanly your site is interlinked. Of course it doesn't go on forever and there are limitations on *increased PR value* iterations for sites that in Gbot's opinion are overdoing it.
There is a great article about this by Chris Ridings of searchenginesystems.net (although in Russian, you can still make a lot out of the graphs there - [digits.ru])
The TRUTH for now (although this policy is obviously known to google) - a brandnew site with no external links can roll itself out to some high PR provided it is a monster site and it's cleanly interlinked. It will take ages of course but it's still true as any new page has minuscule PR - the lowest it can possibly be in google's estimation which is then increased by proper cross linking.

doc_z

3:53 pm on Apr 24, 2004 (gmt 0)

... by *multi-iterative* PR calculation I meant that PR that you give to internal pages then comes back to your home in case there's a link to home on those pages which increases home page value which gives back increased PR to internals ...

Of course, you can think in such a model. However, this isn't part of the equations - it's just an interpretation.

There is a great article about this by Chris Ridings of searchenginesystems.net (although in Russian ...

Although my Russian isn't good, I guess that this is a short version of "PageRank Uncovered" by Ridings and Shishigin.

The TRUTH for now (although this policy is obviously known to google) - a brandnew site with no external links can roll itself out to some high PR ...

No. That's the important point: this would work if Google used the original algorithm and had just changed the damping factor. However, they made significant changes which prevent this.

claus

4:29 pm on Apr 24, 2004 (gmt 0)

uhm... don't really know how to break this, but i'll try with an analogy...

There's this article on, say, fishing, that gets compared to other articles on fishing and rated using a voting system. Two people voting for this article are a professor and a student. A second article gets voted for by two students so that one does not get the same rating as the first one, as the professor's vote is perceived to have higher value than the student's.

So far so good. What i'm reading above implies that you guys seriously believe that:

- by voting for an article the professor becomes less of a professor than he was before voting? So, after voting for, say, 10 papers, the professor is back to being a student!?

...c'mon guys, you can't be serious about this.

AthlonInside

4:38 pm on Apr 24, 2004 (gmt 0)

If you have a candle which is lighted up.

You light up other candles.

What will happen to your candle?

It is as bright as before.

doc_z

5:45 pm on Apr 24, 2004 (gmt 0)

... an analogy...

... candle ...

This reminds me of a discussion about a gravitational wave experiment some years ago. Someone was arguing in a clever and logical way that the whole experiment couldn't work for several reasons and was asking how to explain these discrepancies. The answer was very simple: you have to take the equations and calculate these things.

There is nothing to say against analogies. It is a good way to explain complicated things with simple pictures. However, before you do this you have to go the hard way and verify these things in a scientific way.

I have explained this effect several times (the last time in this thread [webmasterworld.com]) and I won't start this discussion again.

claus

12:25 pm on Apr 25, 2004 (gmt 0)

>> I have explained this effect several times (the last time in this thread)

doc_z, i understand that you don't want to double post, especially if the explanation is long or complicated, but that thread you pointed to does not contain support for your arguments, neither as algebra nor computation. Still, an argument like that deserves to be taken seriously.

I found one post here (Jan 28,04): [webmasterworld.com...]

The formula in that post is not identical to the original formula however. Here's the original formula:

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
 PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
(source: [www-db.stanford.edu...] - bold added)

Please note the part in bold. The expression C(A) is not found on the right side of the equal sign in the equation, only C(T1-n).

The links found on the target page (Page A - ie. C(A)) are not part of the PR calculation for that page - only the links on pages that point to it.
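For reference, the quoted formula for a single page can be written down directly. This is only a transcription of that one equation: the PR(Ti) values are treated as given inputs, which holds for one equation in isolation but says nothing about how they react when links elsewhere change. The function name and the example numbers are made up.

```python
def pr_of_page(inbound, d=0.85):
    """PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).

    `inbound` is a list of (PR(Ti), C(Ti)) pairs, one per page Ti linking to A.
    C(A), the number of links on A itself, is not an input.
    """
    return (1 - d) + d * sum(pr_t / c_t for pr_t, c_t in inbound)

# Hypothetical inputs: T1 has PR 1.0 and 2 links, T2 has PR 0.5 and 1 link.
print(pr_of_page([(1.0, 2), (0.5, 1)]))

# A page with no inbound links gets only the (1-d) term.
print(pr_of_page([]))
```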

doc_z

1:41 pm on Apr 25, 2004 (gmt 0)

The expression C(A) is not found on the right side of the equal sign in the equation, only C(T1-n)

Claus, please note that you have taken out a single equation from a whole set. Of course, C(A) is not part of this equation, but the (implicit) assumption that PR(Ti) is constant when changing C(A) is incorrect - at least in the case that your site consists of more than one page. (If your site consists of just one page then it doesn't make any difference if you're adding links or not. However, in this case you already wasted a large amount of PR.)

Consider the simple case of having a website with m pages. Each of those m pages links to all other pages. All k incoming links go to the home page. Obviously, if you add an external link to your home page (no kind of reciprocal link) the PR of your internal pages is decreased due to the dilution of transferred PR. This means
PR(Home) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
is reduced due to the decrease of PR of those (m-1) internal pages caused by the change of C(Home):
PR(Internal_i) = (1-d) + d (PR(Home)/C(Home) + ... + PR(Internal_m)/C(Internal_m))

While this was just a simple example to explain the effect, the statement is even valid in more complex systems.

By the way, I was referring to these threads:
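The m-page example can be tried numerically. The sketch below assumes the original formula, a single incoming link (k = 1) from a page E, and an outside pair X1/X2 that never links back; all of these names and choices are illustrative, not part of the original post.

```python
def pagerank(links, d=0.85, rounds=200):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A."""
    pr = {page: 1.0 for page in links}
    for _ in range(rounds):
        nxt = {page: 1 - d for page in links}
        for src, out in links.items():
            for dst in out:
                nxt[dst] += d * pr[src] / len(out)
        pr = nxt
    return pr

def build(m, home_links_out):
    """Site of m fully interlinked pages (S0 is the home page) plus a few outside pages."""
    site = [f"S{i}" for i in range(m)]
    links = {page: [other for other in site if other != page] for page in site}
    links["E"] = ["S0"]    # assumed: one incoming link to the home page
    links["X1"] = ["X2"]   # assumed: outside pages that never link back
    links["X2"] = ["X1"]
    if home_links_out:
        links["S0"].append("X1")  # the added external link on the home page
    return links

def relative_loss(m):
    """Relative PR drop of the home page and one internal page after linking out."""
    before, after = pagerank(build(m, False)), pagerank(build(m, True))
    return {page: (before[page] - after[page]) / before[page] for page in ("S0", "S1")}

for m in (3, 10, 50):
    loss = relative_loss(m)
    print(m, f"home {loss['S0']:.2%}", f"internal {loss['S1']:.2%}")
```

In this toy model both the home page and the internal pages lose PR when the home page links out, and the relative loss shrinks as the site gets larger, because the new link is one among more and more links on the home page.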
Adding links on home page and PR? [webmasterworld.com]
Do outward links increase pagerank? [webmasterworld.com]
The (dis)advantage of exchanging links [webmasterworld.com]
External Linking help or hurt? [webmasterworld.com]
The Value of Outbound links now... [webmasterworld.com]

Just for clarification: I'm considering the case of adding a link to a page/site which doesn't link back to your page/site and which won't do that even if you link to them (i.e. Google). Also, I'm not saying anything about the order of the effect (an estimate was given here [webmasterworld.com]). Of course, I know that PR is just one parameter and that even if PR is decreased the ranking can be improved. And I know that I was ignoring some non-significant higher order effect just for simplicity.

claus

3:44 pm on Apr 25, 2004 (gmt 0)

Perhaps this is all semantics, dunno.

Consider the issue of what PR really is: The total amount of links pointing to any page in the system is 100%. By adding one link or subtracting one the whole system will still be 100%. The PR of any page expresses the likelihood that by taking a random link from the total pool of links, it will lead to your page. So, if you add a link pointing out of your site you would expect to decrease the probability that your site is found when selecting a random link from the pool - but is that fear real, and will you lose PR or not?

You don't really have to compute the whole system to find out, as the case of zero and the case of one should be enough to establish the logic.

-------------------------------------------
Assumptions:
- two pages in total; (A) and (B)
- d=0.85
- seed PR(A)=0.5

Case 1: Neither page links to each other (ie. no links in the system)

(a)

PR(A) = (1-d) + d (PR(B)/C(B))

-> 0.15 + 0.85 * (0.5/0) = 0.15 = 50%

(b)

PR(B) = (1-d) + d (PR(A)/C(A))

-> 0.15 + 0.85 * (0.15/0) = 0.15 = 50%

Case 2: Page A links to Page B

(a)

PR(A) = (1-d) + d (PR(B)/C(B))

-> 0.15 + 0.85 * (0.5/0) = 0.15 = 35%

(b)

PR(B) = (1-d) + d (PR(A)/C(A))

-> 0.15 + 0.85 * (0.15/1) = 0.28 = 65%

-------------------------------------------

In both cases, PR(A) = 0.15, so page A does not lose any pagerank due to linking out. On the other hand, the total amount of PR in the system is increased, and that increase goes to site B.
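The two cases can be recomputed with a short script, as a sketch under the same assumptions (two pages, d = 0.85, original formula). A page with no inbound links simply gets the (1-d) term, so no division by a link count of zero is needed. The function name is made up.

```python
def pagerank(links, d=0.85, rounds=200):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A."""
    pr = {page: 0.5 for page in links}  # seed value; it does not affect the result
    for _ in range(rounds):
        nxt = {page: 1 - d for page in links}
        for src, out in links.items():
            for dst in out:
                nxt[dst] += d * pr[src] / len(out)
        pr = nxt
    return pr

no_links = pagerank({"A": [], "B": []})   # Case 1
a_to_b = pagerank({"A": ["B"], "B": []})  # Case 2

for name, pr in (("Case 1", no_links), ("Case 2", a_to_b)):
    total = sum(pr.values())
    shares = {page: f"{value:.4f} ({value / total:.0%})" for page, value in pr.items()}
    print(name, shares)
```

This gives 0.15 for both pages in Case 1, and 0.15 for A and 0.2775 for B in Case 2, which is where the 35% / 65% split comes from.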

In this example the effect seems dramatic, but IRL the effect of one link over 4.3 billion links means close to nothing, ie something like this:

1 / (4,285,199,774 + 1) = 0.0000000002

The effect of this number of course depends on the PR on the linking page, as well as the "d" factor, but i dare say that for all practical purposes there's no real reason to fear that you disadvantage your own site by linking out, not even by having hundreds of links on a page.

In other words: 50.00000002% is extremely close to 50% - add a hundred links and even that is not a very significant change.

Also, that 4.3 billion number is pages, not links, so the real number is higher, which means that the effect of feeding one extra link into the system is even more marginal.

-----------
And, of course, you can rank well even if you don't have high PR.

doc_z

6:33 pm on Apr 25, 2004 (gmt 0)

... so page A does not lose any pagerank due to linking out ...

Yes, you're right. But if you look at msg #21 you'll see that I already mentioned that the case that a site consists of just one page is a special case because you already wasted PR. In this case there is indeed no disadvantage in adding a link - as already said.

You have to consider a more realistic example, i.e. two websites (A) and (B) with 2 pages, i.e.

- A1 links to A2
- A2 links to A1
- B1 links to B2
- B2 links to B1

You'll find that PR(A1)=PR(A2)=PR(B1)=PR(B2)=1. The average PR is one, as always if there are no dead ends. If you add a link from A1 to B1 you get

- PR(A1) = 1 - d^2 / (2-d^2)
- PR(A2) = 1 - d / (2-d^2)
- PR(B1) = 1 + d / (2-d^2)
- PR(B2) = 1 + d^2 / (2-d^2)

This means the PR for A1 and A2 is decreased while the PR for B1 and B2 is increased. The total amount of PR in the system is unchanged.
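These closed-form values can be checked against a plain iteration of the original formula. A minimal sketch, assuming d = 0.85; the function name is made up.

```python
def pagerank(links, d=0.85, rounds=200):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A."""
    pr = {page: 1.0 for page in links}
    for _ in range(rounds):
        nxt = {page: 1 - d for page in links}
        for src, out in links.items():
            for dst in out:
                nxt[dst] += d * pr[src] / len(out)
        pr = nxt
    return pr

d = 0.85
# Two two-page sites, plus the added link from A1 to B1.
links = {"A1": ["A2", "B1"], "A2": ["A1"], "B1": ["B2"], "B2": ["B1"]}

closed_form = {
    "A1": 1 - d**2 / (2 - d**2),
    "A2": 1 - d / (2 - d**2),
    "B1": 1 + d / (2 - d**2),
    "B2": 1 + d**2 / (2 - d**2),
}

numeric = pagerank(links, d)
for page in links:
    print(page, round(numeric[page], 6), round(closed_form[page], 6))
print("total", round(sum(numeric.values()), 6))
```

With d = 0.85 this works out to roughly 0.434 and 0.335 for A1 and A2, and 1.665 and 1.566 for B1 and B2, with the total staying at 4.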

...the total amount of PR in the system is increased, and that increase goes to site B.

As already mentioned, this is caused by the special situation of dead ends.

claus

10:57 pm on Apr 25, 2004 (gmt 0)

There are some numerical errors in my calculations above, but those do not change the conclusion.

I wish i could just agree with you, but i'm still not convinced. Perhaps i just need to spend more time with the equations, that usually helps when in doubt. Anyway, i didn't speculate about websites, A and B are just two pages. I've tried adding a third page and calculated:

0) A , B , no links
1) A -> B
2) A -> B -> A
3) A -> B -> C
4) A -> B , A -> C

I didn't do A -> B -> C -> A as that one would give the same result as (2) and i didn't do (4) with an extra B -> C as i found it pointless, but perhaps i was wrong.

In all cases PR(A) does not go below the initial 0.15 (= "d"), and it's fairly easy to see why. For PR(A) to decrease, the result of this part of the PR equation has to be negative:

d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

This will not happen unless:

a) PR for at least one page is negative.
b) Number of links on at least one page is negative.

None of these seem to be credible, but perhaps i'm wrong. Also, it might happen if:

c) factor "d" is bigger than 1 or negative

Still, in that case it would be so for all cases (0) to (4) so that wouldn't really change my mind, i guess.

I don't know where i'm going in the wrong direction, as i can't follow your point, but perhaps my starting point is wrong as i assume that Page A is the first page on the web. I don't know if i'll change my mind if i give Page A initial inbounds, it might be so. Also, it might still be a matter of words, eg. if you talk about the percentage value for PR (the probability of a click) and i talk about the numerical one (the raw score). In that case we can both be right although we reach different conclusions.

doc_z

12:39 pm on Apr 26, 2004 (gmt 0)

In all cases PR(A) does not go below the initial 0.15 (= "d"), ...

Of course, that's correct. The problem with your examples is that you start with pages which have no outgoing links, i.e. they're dead ends. In this case there is indeed no disadvantage in adding a link, but you already wasted PR in creating a page with no link on it. (Especially when you have a page with no incoming or outgoing link, it has the minimal PR = (1-d), which cannot be reduced.) Dead ends reduce the total PR of the system. For a linking structure without dead ends the total PR is equal to the number of pages. Adding a link won't change the total PR, only the distribution. Therefore, increasing the PR somewhere in the system (the page which gets an additional incoming link) leads to a decrease somewhere else. If you consider normal pages/sites, there are no dead ends and the site has more than one page. That's the case I'm talking about.

The easiest way to calculate the PR for complex structures is numerical calculation. You can either use one of the PR calculators on the web or write your own.
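A home-made calculator can be very small. The sketch below runs the original formula over the five small cases listed earlier in the thread and reports PR(A), the total PR, and which pages are dead ends. It is an illustration of the original algorithm only, not of whatever Google currently runs, and the names are made up.

```python
def pagerank(links, d=0.85, rounds=200):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A."""
    pr = {page: 1.0 for page in links}
    for _ in range(rounds):
        nxt = {page: 1 - d for page in links}
        for src, out in links.items():
            for dst in out:
                nxt[dst] += d * pr[src] / len(out)
        pr = nxt
    return pr

# page -> list of pages it links to; an empty list marks a dead end
cases = {
    "0) no links":       {"A": [], "B": []},
    "1) A -> B":         {"A": ["B"], "B": []},
    "2) A -> B -> A":    {"A": ["B"], "B": ["A"]},
    "3) A -> B -> C":    {"A": ["B"], "B": ["C"], "C": []},
    "4) A -> B, A -> C": {"A": ["B", "C"], "B": [], "C": []},
}

for name, links in cases.items():
    pr = pagerank(links)
    dead_ends = [page for page, out in links.items() if not out]
    print(f"{name}: PR(A)={pr['A']:.4f} total={sum(pr.values()):.4f} "
          f"pages={len(links)} dead ends={dead_ends}")
```

Only case 2 has no dead ends, and it is the only one where the total PR equals the number of pages; in every other case part of the PR is lost at the dead ends.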

Also, it might still be a matter of words, e.g. if you talk about the percentage value for PR (the probability of a click) and i talk about the numerical one (the raw score).

No, I was always talking about real PR (e.g. msg#23), i.e. the PR obtained by the formula given in msg #20.