|Keep PR within the site by robots.txt|
Use robots.txt to transfer PR only to notional pages
| 5:53 pm on Nov 13, 2003 (gmt 0)|
I would like to keep PR within notional pages of my site only, not transfer it to garnish like disclaimer, sitemap, call me back etc.
If I put those pages in the robots.txt file, this would apparently keep robots away from them. But if there are still plain-text href links to the garnish pages left on the page, will some PageRank be transferred to them anyway?
| 7:15 pm on Nov 13, 2003 (gmt 0)|
Google is a polite robot and will not fetch URLs excluded by /robots.txt.
Those URLs may still be listed in Google (just the URL; no title, snippet or cache), so you do give them PageRank. Because they're not fetched, they won't give you any back, so you lose a little overall.
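As a rough illustration of the exclusion being discussed, here is a minimal sketch using Python's standard urllib.robotparser. The file names and the example.com host are made up for the example; it only shows which URLs a polite robot would refuse to fetch and says nothing about how PageRank is handled for them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt excluding two "garnish" pages (names are assumptions).
ROBOTS_TXT = """\
User-agent: *
Disallow: /disclaimer.html
Disallow: /callback.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path, agent="Googlebot"):
    """Return True if a robot honouring robots.txt may fetch the given path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)

print(allowed("/disclaimer.html"))  # excluded page
print(allowed("/products.html"))    # ordinary content page
```

The links to the excluded pages are still visible to the robot on the pages it does fetch, which is why the question of where their PR goes comes up at all.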
| 7:44 pm on Nov 13, 2003 (gmt 0)|
The idea is to transfer the actual total PR only between 4-5 notional pages. Going by Brin and Page's original PR formula, if the home page has a PR of 200 and there are only 4 pages to transfer PR to, it will give each of them:
(1-0.85)+(0.85*200/4) = 42.65
If there are 9 pages to transfer PR to, the figure becomes:
(1-0.85)+(0.85*200/9) = 19.04
I obviously want to give 42 rather than 19 to the notional pages within the site.
The question is HOW to shoo Gbot away from those pages - will just disallowing them in robots.txt do?
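The arithmetic above can be checked with a few lines of Python. This is only a sketch of the simplified formula as used in this post (one linking page, damping factor 0.85); the function name pr_share is made up for the example.

```python
def pr_share(source_pr, outlinks, d=0.85):
    """PR one target page receives from a single linking page with `outlinks` links."""
    return (1 - d) + d * source_pr / outlinks

print(round(pr_share(200, 4), 2))  # 42.65
print(round(pr_share(200, 9), 2))  # 19.04
```

Every extra link on the home page that is counted in the divisor lowers the share each of the other pages receives.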
| 7:51 pm on Nov 13, 2003 (gmt 0)|
I have a theory about this. If Google finds links to pages that it isn't allowed to index, then it should treat those links as 'dangling', and the links won't affect the PR distribution very much at all. That's my theory, anyway. You can read about dangling links in Brin and Page's original PageRank paper.
| 8:10 pm on Nov 13, 2003 (gmt 0)|
What I'll do is set up JS navigation to the garnish pages and only put plain hrefs to them on the site map.
This will most likely take some PR only from the sitemap, not from the notional pages.
| 8:10 pm on Nov 13, 2003 (gmt 0)|
URLs disallowed from fetching do count for the number of links on a page in the PageRank calculation, certainly since July and I assume previously as well. Since the August update 404s have been sucking PR too, but they didn't in July. (I've not checked this month)
| 8:15 pm on Nov 13, 2003 (gmt 0)|
| 9:59 pm on Nov 13, 2003 (gmt 0)|
Yes, all links on a page count as links, but when Google hasn't spidered the remote page they are called dangling links, and dangling links are dropped from the PR calculations within the first few iterations and put back again a few iterations before the end. In that way, they have minimal effect on the resulting PRs of other pages. That's according to Brin and Page's original document.
Google doesn't spider pages that the robots.txt file says it can't index. So they are the same as pages that they haven't even found yet but have links to. That's why I believe those 'Contact' type pages will be treated as danglings and, if they are, they won't suck up any PR - or a very minimal amount.
| 10:02 pm on Nov 13, 2003 (gmt 0)|
| 2:02 am on Nov 14, 2003 (gmt 0)|
You seriously want to keep Googlebot away from your sitemap page?
| 3:07 am on Nov 14, 2003 (gmt 0)|
Phil, I suggest that we run a thought experiment, using a massive rank source into PageA (i.e. anything noticeable on the Toolbar scale), where PageA has 19 links. One of those 19 is to PageB, the others are to /robots.txt excluded URLs. For simplicity, PageB has links to the Web, but doesn't link back to PageA or its neighbourhood.
Case I: Dangling links are counted at each iteration.
After enough iterations, we expect PageA to converge to a steady PR. PageB converges on much less; we'll call it 1 less than PageA on the Toolbar. There was a reason for choosing 19 links.
Case II: Dangling links are not counted until near the end.
After enough iterations, we expect PageA to converge to a steady PR. This time, PageB converges on very slightly less; we would probably call it something like 1/30 less than PageA on the Toolbar (not that we get to see it, of course). I think this is where the 'dangling links don't suck PR' ideas came from, but there's a problem. When the links are put back, it should take only _one_ iteration for PageB to snap down to a low PR as in Case I. If PageB links to PageC, which links to PageD etc., then it will take a few iterations for the PR sucking to trickle down.
I haven't really been able to test this as it looks like there are quite a few iterations after the dangling links are put back, if they're taken away at all.
Remember that PageB doesn't link back? Even if it links to PageA, and only to PageA, we can add maybe three or four iterations I think.
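For anyone who wants to play with the thought experiment, here is a minimal Python sketch. It assumes a plain Jacobi iteration, a rank source held at an arbitrary PR of 1000, and 18 excluded URLs named X1..X18; none of that is known to match what Google actually does. PageB's own outlinks go to the wider Web and are left out of the model.

```python
D = 0.85  # damping factor from the original formula

def jacobi_step(pr, links, fixed, d=D):
    """One Jacobi step: each new PR is computed from the previous step's values."""
    new = {}
    for page in pr:
        if page in fixed:
            new[page] = pr[page]  # the rank source is held constant
            continue
        inbound = sum(pr[src] / len(outs)
                      for src, outs in links.items() if page in outs)
        new[page] = (1 - d) + d * inbound
    return new

EXCLUDED = [f"X{i}" for i in range(1, 19)]       # 18 robots.txt-excluded URLs
FULL = {"SOURCE": ["A"], "A": ["B"] + EXCLUDED}  # PageA with all 19 links counted
PRUNED = {"SOURCE": ["A"], "A": ["B"]}           # dangling links taken out

def run(schedule, source_pr=1000.0):
    """Apply one link map per iteration and record PageB's PR after each step."""
    pr = {page: 1.0 for page in ["SOURCE", "A", "B"] + EXCLUDED}
    pr["SOURCE"] = source_pr
    history = []
    for links in schedule:
        pr = jacobi_step(pr, links, fixed={"SOURCE"})
        history.append(pr["B"])
    return history

case_1 = run([FULL] * 10)             # Case I: dangling links counted throughout
case_2 = run([PRUNED] * 10 + [FULL])  # Case II: put back for the last iteration

print(round(case_1[-1], 2))   # PageB in Case I
print(round(case_2[-2], 2))   # PageB in Case II, just before the links return
print(round(case_2[-1], 2))   # PageB in Case II, one iteration after they return
```

Under these assumptions PageB sits at about 38 in Case I, at about 723 while the dangling links are removed, and drops back to the Case I value in the single iteration after they are restored.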
| 9:25 am on Nov 14, 2003 (gmt 0)|
Currently Gbot cannot follow this link:
document.write('<a href=\"http://' + n1 + n2 + '\">');</script>
Although it does try.
| 9:53 am on Nov 14, 2003 (gmt 0)|
To test your experiment would require a pencil, some paper, and a fair amount of time - or a PR calculator that can be set up to remove and re-insert dangling links on various iterations. So....
I've never made any attempt to calculate the effect of a dangling link because B&P said at the start what happens to them:-
"...Because dangling links do not affect the ranking of any other page directly, we simply remove them from the system until all the PageRanks are calculated. After all the PageRanks are calculated they can be added back in without affecting things significantly."
I've always taken that to mean that dangling links have no effect on the PRs of other pages. But in another place they said that it would take only a few iterations for them all to be removed, meaning that they are removed during the first two or three(?) iterations and each of them is in the calculations for a short time, so there would be a small effect.
| 11:55 am on Nov 14, 2003 (gmt 0)|
Although there are numerous papers about numerical PR calculation dealing with Case II, this isn't the correct way. Only Case I yields the solution of the set of underlying linear equations (for d!=0). (The reason that Case II was considered is that the original paper is dealing with a damping factor of d=0, which is mathematically a different case, i.e. the computation of eigenvectors.)
However, this doesn't mean that Google is using Case I for PR calculation.
By the way, the number of iterations strongly depends on the iteration scheme.
| 1:05 pm on Nov 14, 2003 (gmt 0)|
| 8:16 pm on Nov 15, 2003 (gmt 0)|
Phil, I think the key aspect of "...each of them is in the calculations for a short time, so there would be a small effect" is that if they're put back near the end, there could be a huge effect on the PR of some pages.
Much as I would love to pretend I play with PR using pencil and paper, I tend to use a spreadsheet (or just stare at a blank wall if I'm feeling brave).
Doc, although I take your point about resolving the equations, in practice the difference in results between case I and II is zero for the immediate neighbourhood (i.e. for URLs not too far in the link map from the dangling links). But if, for example, you have a very deep site with dangling links on your home page, then you should see a large difference between case I and case II. On a very deep site of mine with dangling links near the top, the results seem to match case I.
| 8:46 pm on Nov 15, 2003 (gmt 0)|
To be honest, if I do not want Google to index such a page, I'd just add
<meta name="robots" content="noindex,follow">
in the header.
Maybe, there's a small PR loss for the other pages - but then, in the time it takes to sort out the alternative link options, I could acquire a good, relevant link that makes up for this loss - and adds value for my visitors.
| 9:00 pm on Nov 17, 2003 (gmt 0)|
The question about the difference in PR between case I and II is quite complicated. It strongly depends on the iteration scheme as well as on the number of iterations that are performed to compute the PR of the dangling pages (in case II). Consider, for example, a chain of pages (X1, X2, X3, ...), where the first page links to the second page, which links to the third page, and so on. The last page is a dead end. In case II, all these pages have to be taken out of the calculation. Thus, it takes n iterations with the dangling pages included (in case II) until page Xn gets a non-zero PR if the simple Jacobi iteration is used. (I never had any problems with such chains of pages. Thus I would conclude that Google is either using a different iteration scheme or computes PR according to case I. I would guess they are doing both.)
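To make the dependence on the iteration scheme concrete, here is a small Python sketch of the chain example. It assumes X1 keeps a fixed PR (10 here, an arbitrary value), the re-inserted pages X2..Xn start at 0, and each page's only inbound link is from its predecessor. The function names are made up, and the count is of sweeps until every page in the chain has reached its final value.

```python
D = 0.85  # damping factor

def jacobi_sweeps(n, head_pr=10.0, d=D):
    """Sweeps needed when every update uses only the previous sweep's values."""
    pr = [head_pr] + [0.0] * (n - 1)
    sweeps = 0
    while True:
        new = [pr[0]] + [(1 - d) + d * pr[i - 1] for i in range(1, n)]
        if new == pr:          # nothing changed: the chain has settled
            return sweeps
        pr = new
        sweeps += 1

def gauss_seidel_sweeps(n, head_pr=10.0, d=D):
    """Sweeps needed when updates are applied in place, in chain order."""
    pr = [head_pr] + [0.0] * (n - 1)
    sweeps = 0
    while True:
        changed = False
        for i in range(1, n):
            value = (1 - d) + d * pr[i - 1]   # uses the already-updated predecessor
            if value != pr[i]:
                pr[i] = value
                changed = True
        if not changed:
            return sweeps
        sweeps += 1

print(jacobi_sweeps(10))        # 9
print(gauss_seidel_sweeps(10))  # 1
```

With Jacobi, X1's contribution moves one link down the chain per sweep, so the cost grows with the chain length; an in-place scheme that happens to visit the pages in chain order settles in a single sweep.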
Also, the difference between case I and II depends on whether the PR of the non-dangling pages is held fixed during the final PR computation or not. Holding it fixed is much faster, but less accurate.
Of course, from a global view the difference between case I and II might not be important. However, for your own page/site the difference can be significant. Also, pages can be affected even if they are not in the neighbourhood of dangling pages.
The reason that Kamvar et al. still remove dangling pages is that they still consider the PR calculation as the determination of eigenvectors. This requires a non-zero determinant for the transition matrix, i.e. pages which have at least one outgoing link. They claim that this technique accelerates the computation. However, there are well-known algorithms for sparse matrices which are faster.