
Google News Archive Forum

    
Keep PR within the site by robots.txt
Use robots.txt to transfer PR only to notional pages
WebmasterFisherman
msg:171936 - 5:53 pm on Nov 13, 2003 (gmt 0)

I would like to keep PR within the notional (main) pages of my site only, and not transfer it to garnish pages like the disclaimer, sitemap, 'call me back', etc.

If I put those pages in the robots.txt file, that would apparently keep robots away from them, but if plain-text href links to the garnish pages are still left on the page, will it transfer some PageRank to them anyway?

WF

 

ciml
msg:171937 - 7:15 pm on Nov 13, 2003 (gmt 0)

Google is a polite robot and will not fetch URLs excluded by /robots.txt.

Those URLs may be listed in Google (just the URL; no title, snippet or cache), so you do give them PageRank. Because they're not fetched, they won't give you any back, so you lose a little overall.

WebmasterFisherman
msg:171938 - 7:44 pm on Nov 13, 2003 (gmt 0)

The idea is to transfer the actual total PR only between 4-5 notional pages. Taking note of Brin's and Page's initial PR formula, if the home page has a PR of 200 and there are only 4 pages to transfer PR to, it will give each of them:

(1 - 0.85) + (0.85 * 200 / 4) = 42.65

If there are 9 pages to transfer PR to, the number becomes:

(1 - 0.85) + (0.85 * 200 / 9) = 19.04 (approx.)

I obviously would rather give 42 than 19 to the notional pages within the site.
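To put numbers on it, here's a tiny script version of the same sum (just an illustration using the figures from this post - the PR-200 home page and the 0.85 damping factor are the example values above, not real data):

// Per-page share under the original PageRank formula:
// PR(page) = (1 - d) + d * PR(home) / numLinks
function prShare(prHome, numLinks, d) {
  return (1 - d) + d * prHome / numLinks;
}
console.log(prShare(200, 4, 0.85)); // 42.65
console.log(prShare(200, 9, 0.85)); // ~19.04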

The question is HOW to shoo Gbot away from those pages - will just disallowing them in robots.txt do?
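For reference, the exclusion itself would just be a few Disallow lines in robots.txt; the paths below are made-up examples, not anyone's real URLs:

User-agent: *
Disallow: /disclaimer.html
Disallow: /sitemap.html
Disallow: /call-me-back.html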

PhilC
msg:171939 - 7:51 pm on Nov 13, 2003 (gmt 0)

I have a theory about this. If Google finds links to pages that it isn't allowed to index, then it should treat those links as 'dangling', and the links won't affect the PR distribution very much at all. That's my theory, anyway. You can read about dangling links in Brin and Page's original PageRank paper.

But to make sure that PR isn't attributed to those pages, use JavaScript links to them. Googlebot doesn't see those.

WebmasterFisherman
msg:171940 - 8:10 pm on Nov 13, 2003 (gmt 0)

Thanks PhilC

What I'll do is set up JS navigation to the garnish pages and only put plain hrefs to them on the sitemap.
That should, apparently, take PR only from the sitemap, not from the notional pages.

ciml
msg:171941 - 8:10 pm on Nov 13, 2003 (gmt 0)

URLs disallowed from fetching do count towards the number of links on a page in the PageRank calculation, certainly since July and I assume previously as well. Since the August update, 404s have been sucking PR too, but they didn't in July. (I've not checked this month.)

Googlebot is able to follow full URIs (i.e. including http:// ), even in JavaScript.

WebmasterFisherman
msg:171942 - 8:15 pm on Nov 13, 2003 (gmt 0)

Googlebot is able to follow full URIs (i.e. including http:// ), even in JavaScript.

Yeah, but what about "incomplete" (relative) URLs like en/content/somepage.html in a JavaScript MenuItem?

PhilC
msg:171943 - 9:59 pm on Nov 13, 2003 (gmt 0)

Hi ciml,

Yes, all links on a page count as links, but when Google hasn't spidered the remote page they are called dangling links, and dangling links are dropped from the PR calculations within the first few iterations and put back again a few iterations before the end. In that way, they have minimal effect on the resulting PRs of other pages. That's according to Brin and Page's original document.

Google doesn't spider pages that the robots.txt file says it can't fetch, so they are the same as pages that it has found links to but hasn't fetched yet. That's why I believe those 'Contact'-type pages will be treated as danglings and, if they are, they won't suck up any PR - or only a very minimal amount.

PhilC
msg:171944 - 10:02 pm on Nov 13, 2003 (gmt 0)

I forgot to ask - do you have any evidence that Googlebot actually crawls absolute URLs that it finds in JavaScript links?

oodlum
msg:171945 - 2:02 am on Nov 14, 2003 (gmt 0)

You seriously want to keep Googlebot away from your sitemap page?

ciml
msg:171946 - 3:07 am on Nov 14, 2003 (gmt 0)

Phil, I suggest that we run a thought experiment, using a massive rank source into PageA (i.e. anything noticeable on the Toolbar scale), where PageA has 19 links. One of those 19 is to PageB, the others are to /robots.txt excluded URLs. For simplicity, PageB has links to the Web, but doesn't link back to PageA or its neighbourhood.

Case I: Dangling links are counted at each iteration.

After enough iterations, we expect PageA to converge to a steady PR. PageB converges on much less; we'll call it 1 less than PageA on the Toolbar. There was a reason for choosing 19 links.

Case II: Dangling links are not counted until near the end.

After enough iterations, we expect PageA to converge to a steady PR. This time, PageB converges on very slightly less; we'd probably call it something like 1/30 less than PageA on the Toolbar (not that we get to see it, of course). I think this is where the 'dangling links don't suck PR' ideas came from, but there's a problem. When the link is put back, it should take only _one_ iteration for PageB to snap down to a low PR as in Case I. If PageB links to PageC, which links to PageD, etc., then it will take a few iterations for the PR sucking to trickle down.

I haven't really been able to test this as it looks like there are quite a few iterations after the dangling links are put back, if they're taken away at all.

Remember that PageB doesn't link back? Even if it links to PageA, and only to PageA, we can add maybe three or four iterations I think.
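Here's a rough numeric sketch of that experiment (a toy model of my own, not a claim about Google's actual scheme: d = 0.85, PageA's PR pinned at 100 by the massive external rank source, 18 of its 19 links pointing at excluded URLs):

var d = 0.85;
var prA = 100;    // held roughly constant by the massive rank source
var links = 19;   // 18 robots.txt-excluded URLs plus PageB

// Case I: dangling links counted at every iteration.
var caseI = (1 - d) + d * prA / links;          // ~4.62

// Case II: dangling links removed, so PageA appears to have one link (to PageB).
var caseII = (1 - d) + d * prA / 1;             // 85.15

// One iteration after the dangling links are put back, PageB snaps
// straight down to the Case I value:
var afterReinsert = (1 - d) + d * prA / links;  // ~4.62

console.log(caseI, caseII, afterReinsert);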

I think we'd need to dig deep into the forum3 archive to find the first mentions of Google following URIs in Javascript. Matt Cutts suggested in Boston that Google would find more hypertext references, so maybe URLs soon too?

tantalus
msg:171947 - 9:25 am on Nov 14, 2003 (gmt 0)

Currently Gbot cannot follow this link:

<!-- The URL is split into two string fragments and assembled by
     document.write at load time, so a parser that doesn't execute
     the JavaScript never sees the complete href. -->
<script type="text/javascript"> var n1='www.examp'; var n2='le.com';
document.write('<a href="http://' + n1 + n2 + '">');</script>
www.example.com
<script type="text/javascript"> document.write('<\/a>');</script>

Although it does try.

PhilC
msg:171948 - 9:53 am on Nov 14, 2003 (gmt 0)

Hi ciml,

To test your experiment would require a pencil, some paper, and a fair amount of time - or a PR calculator that can be set up to remove and re-insert dangling links on various iterations. So....

I've never made any attempt to calculate the effect of a dangling link because B&P said at the start what happens to them:-

"...Because dangling links do not affect the ranking of any other page directly, we simply remove them from the system until all the PageRanks are calculated. After all the PageRanks are calculated they can be added back in without affecting things significantly."

I've always taken that to mean that dangling links have no effect on the PRs of other pages. But in another place they said that it would take only a few iterations for them all to be removed, meaning that they are removed during the first two or three(?) iterations and each of them is in the calculations for a short time, so there would be a small effect.

Because of what you said about absolute JavaScript links, I've got a couple on test. One thing that I am certain of is that Google's parsing program doesn't yet interpret JavaScript, so it cannot see a link that has been broken up into pieces and then recompiled as required.

For whatever reason, Googlebot doesn't even see iframe sources, so I'll be surprised if it sees absolute URLs in JavaScript links.

doc_z
msg:171949 - 11:55 am on Nov 14, 2003 (gmt 0)

Although there are numerous papers about numerical PR calculation dealing with Case II, this isn't the correct way. Only Case I yields the solution of the set of underlying linear equations (for d != 0). (The reason that Case II was considered is that the original paper deals with a damping factor of d = 0, which is mathematically a different case, i.e. the computation of eigenvectors.)

However, this doesn't mean that Google is using Case I for PR calculation.

By the way, the number of iterations strongly depends on the iteration scheme.

PhilC
msg:171950 - 1:05 pm on Nov 14, 2003 (gmt 0)

oodlum, I don't think that anyone would want to keep Googlebot away from a sitemap page, but there is a good PageRank reason to limit the number of spider-viewable links to it, because it isn't usually a page that you want to rank well. It makes sense to have one spider-viewable link to it and the rest of the links hidden. The same applies to pages like 'contact', 'tos', 'privacy policy', etc., although there is no reason to have any spider-viewable links to those.

ciml
msg:171951 - 8:16 pm on Nov 15, 2003 (gmt 0)

Phil, I think the key aspect of "...each of them is in the calculations for a short time, so there would be a small effect" is that if they're put back near the end, there could be a huge effect on the PR of some pages.

Much as I would love to pretend I play with PR using pencil and paper, I tend to use a spreadsheet (or just stare at a blank wall if I'm feeling brave).

Doc, although I take your point about resolving the equations, in practice the difference in results between Case I and II is zero for the immediate neighbourhood (i.e. for URLs not too far in the link map from the dangling links). But if, for example, you have a very deep site with dangling links on your home page, then you should see a large difference between Case I and Case II. On a very deep site of mine with dangling links near the top, the results seem to match Case I.

lbobke
msg:171952 - 8:46 pm on Nov 15, 2003 (gmt 0)

Hmm,

even if there is something like a "dangling link" effect on PR - would it outweigh the effect that JavaScript links have on humanoids who have disabled JavaScript in their browsers?

To be honest, if I do not want Google to index such a page, I'd just add
<meta name="robots" content="noindex,follow">
in the header.
Maybe there's a small PR loss for the other pages - but then, in the time it takes to sort out the alternative link options, I could acquire a good, relevant link that makes up for this loss - and adds value for my visitors.

Laurenz

doc_z
msg:171953 - 9:00 pm on Nov 17, 2003 (gmt 0)

ciml,

the question about the difference in PR between Case I and II is quite complicated. It strongly depends on the iteration scheme as well as on the number of iterations which are performed to compute the PR of the dangling pages (in Case II). Consider, for example, a chain of pages (X1, X2, X3, ...), where the first page links to the second page, which links to the third page, and so on. The last page is a dead end. In Case II, all these pages have to be taken out of the calculation. Thus, it takes n iterations in which the dangling pages are included (in Case II) until page Xn gets a non-zero PR, if the simple Jacobi iteration is used. (I never had any problems with such chains of pages. Thus I would conclude that Google is either using a different iteration scheme or computes PR according to Case I. I would guess they are doing both.)
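To make the chain example concrete, here is a small sketch (my own illustration, using the eigenvector-style update doc_z refers to, i.e. no (1-d) term, with a hypothetical chain of 10 pages re-inserted at PR 0 and a single outside page of constant PR linking to X1):

var d = 0.85, n = 10;
var source = 50;               // outside page linking only to X1, PR held constant
var pr = new Array(n).fill(0); // re-inserted chain X1 -> X2 -> ... -> Xn (Xn is a dead end)

for (var sweep = 1; sweep <= n; sweep++) {
  var old = pr.slice();        // simple Jacobi: each sweep reads only the previous sweep's values
  pr[0] = d * source;          // X1's only inbound link is the outside page
  for (var i = 1; i < n; i++) {
    pr[i] = d * old[i - 1];    // Xi's only inbound link is X(i-1)
  }
  console.log('sweep ' + sweep + ': PR(Xn) = ' + pr[n - 1]);
}
// PR(Xn) stays at exactly 0 until sweep n - the chain needs n Jacobi sweeps.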

Also, the difference between Case I and II depends on whether the PR of the non-dangling pages is held fixed during the final PR computation or not. The first case is much faster, but less accurate.

Of course, from a global point of view the difference between Case I and II might not be important. However, for one's own page/site the difference can be significant. Also, pages can be affected even if they are not in the neighbourhood of dangling pages.

The reason that Kamvar et al. still remove dangling pages is that they still consider the PR calculation as the determination of eigenvectors. This requires a non-zero determinant for the transition matrix, i.e. pages which have at least one outgoing link. They claim that this technique accelerates the computation. However, there are well-known algorithms for sparse matrices which are faster.
