In other words, can I build a PR 6 site without help from backlinks?
PS: a PR 6 is certainly possible with enough pages. I have one such site myself, albeit with more than 1 incoming link. If it has enough pages and keyword rankings, the incoming links will happen on their own.
You can design your site structure to determine where you want your pagerank to go, but you can't produce pagerank.
So for example, say you have a site indexed by Google. Now all external links are removed, but Google still indexes you because it already knows where to find you. You can still have very high pagerank. In fact the total pagerank for your site is very close to directly proportional to the number of pages in your site. There have been studies showing that a page's rank is often closely related to the number of pages in the site.
Granted, you will still get visited by Googlebot, but you will lose your pagerank. Pagerank going to your site is directly controlled by the sites that link to you. You can only control the rank of pages within your site by using the pagerank that is available to you.
My way of thinking about it is that each site has a fixed total amount of pagerank available. You cannot do anything to increase this because it is determined by external factors. If you have 10000 pages with no site pagerank, then you have nothing to distribute around your site.
If mack's views were true, then no pagerank greater than 0 could exist.
So - moving onwards, now that we are 100% certain that volume of pages has a proportional bearing on PR, what would people say is the best linking structure to force the maximum amount of PR onto one page within the site (the index page)?
In my view, the maximum bottom line is that irrespective of multiple levels of pages within the site, irrespective of messing around and linking pages to each and every other page etc, the absolute bottom line is that maximum PR is given to the index page when EVERY SINGLE PAGE in the site links to it - period.
Each page (assuming that there are no dead ends on this site) "produces" 1 PR. Therefore, even if there are no significant inbound links you can have a high PR for some pages. However, since the average PR on this site is one and ToolbarPR has a log scale, this would only work for a few pages, and many pages are necessary. Of course, PR distribution depends on the site structure. To have some high PR pages without significant inbound links you must have (or better: it is much easier with) a hierarchical structure, while for 'flat' structures PR is distributed more equally.
Of course, it is much easier to get some high PR incoming links than to create so many pages.
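To make the point about structure concrete, here is a minimal sketch that compares two closed sites of the same size: one where every page links only to the home page (and the home page links to every page), and one 'flat' ring where each page links to the next. It assumes the formula from the original paper, PR(p) = (1 - d) + d * sum(PR(q) / outlinks(q)), with d = 0.85. The function names and the two toy structures are made up for illustration and say nothing about how Google actually computes anything.

```python
def pagerank(links, d=0.85, iterations=200):
    """Iterate PR(p) = (1 - d) + d * sum(PR(q) / outlinks(q)).

    `links` maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link (no dead ends).
    """
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new = {page: 1.0 - d for page in links}
        for page, targets in links.items():
            share = d * pr[page] / len(targets)
            for target in targets:
                new[target] += share
        pr = new
    return pr


def hub_site(n):
    """Hypothetical site: page 0 is the home page, every other page links only to it."""
    links = {0: list(range(1, n))}
    for page in range(1, n):
        links[page] = [0]
    return links


def ring_site(n):
    """Hypothetical 'flat' site: each page links only to the next one."""
    return {page: [(page + 1) % n] for page in range(n)}


n = 1001
hub = pagerank(hub_site(n))
ring = pagerank(ring_site(n))

print("hub home page raw PR: ", round(hub[0], 1))
print("ring home page raw PR:", round(ring[0], 1))
print("total raw PR on each site:", round(sum(hub.values())), round(sum(ring.values())))
```

In both cases the total raw PR on the site equals the number of pages; only the distribution differs. With damping, the hub home page collects a bit less than half a point of raw PR per page on the site.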
> You have a 1000 page site with 400 inbound on topic links giving you a page rank of 6 on your home page. One day all the sites that link to you remove their links and you are left with no inbound links. Do you remain a pagerank 6? I don't think so.
Of course, this will reduce PR for your pages. External links contribute to your PR. However, the question was if a site that 'has only one backlink from a PR 4 site can get a value of PR 6 from its own internal linking structure'. And killroy gave the correct answer.
An absolute best-case scenario, assuming no damping factor, would have, as I understand it, the main page accumulating about 1/2 point of raw PR for every page on the site. The logarithm base used for toolbar PR was adjusted at the last dance, but is believed to be approximately 6. So to get to PR1, a site would need 12 pages (6 * 2). To get to PR2 it would need 6*6*2 = 72. To get to PR10 it would need 6^10 * 2 = 120,932,352 pages. According to these figures, 100,000 pages would be enough to get a modest PR6.
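The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate as stated in the post: the log base of 6 and the half point of raw PR per page are the poster's assumptions, not known Google values, and the function names are made up.

```python
BASE = 6                 # assumed toolbar log base (from the post)
PAGES_PER_RAW_POINT = 2  # assumed best case: home page gets 1/2 raw PR per page


def pages_needed(toolbar_pr):
    """Pages needed for the home page to reach a given toolbar PR."""
    return PAGES_PER_RAW_POINT * BASE ** toolbar_pr


def toolbar_pr_reached(pages):
    """Highest toolbar PR whose page requirement is met by `pages`."""
    pr = 0
    while pages_needed(pr + 1) <= pages:
        pr += 1
    return pr


for pr in (1, 2, 6, 10):
    print(f"PR{pr}: {pages_needed(pr):,} pages")

print("100,000 pages reach PR", toolbar_pr_reached(100_000))
```

PR6 comes out at 93,312 pages under these assumptions, which is why 100,000 pages is just enough.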
Figure in a damping factor, and more pages are needed.
If I'm off on my math somewhere, please let me know. But I think I'm close.
This all says that high PR w/o decent external links is theoretically possible but outside the reach of most webmasters. Let's see, 1 new page per day, and I'll have PR 6 in ... just 255 years!
I think you need to start with at least one inbound before you can have ANY PR.
Also, I can't prove it but I believe the idea of creating a site based on the assumption that the more pages you have (each presumably linking to at least the home page), the greater your home page's PR will be, is purely false. I have one site with a couple hundred thousand pages, another with about 50K, and another with about 100K. The couple hundred thousand page site has equal PR with the 100K one. And the 50K one isn't far behind. The difference between the three is purely inbounds.
I have heard it said that there is a mechanism within the algo to begin discounting inbounds when they are all in from the same domain. I can't prove this - I suppose it is pure speculation. But I have a stable of PR6s pointing to my home page and the home page is only a PR6. It has been a 7 half of the months over the past year but I have seen no evidence that simply by creating more pages I can move it.
[edited by: killroy at 6:02 pm (utc) on April 9, 2003]
The question was 'Is it Possible?' and not if such a site exists. Of course, if you have a site with millions of pages you will get significant incoming links, even if you don't do anything.
> As I understand things, the whole iterative process begins with some assumptions including PR for the home page of certain sites such as Google itself's home page, DMOZ, Yahoo and perhaps others.
Calculating PR is nothing other than solving a linear system of equations, or equivalently calculating an eigenvector. For a damping factor 0 < d < 1 there is a unique solution. This solution depends neither on the algorithm used to solve the problem nor on the initial values. And you can simply show that - if there are no dead ends - the average PR is one, i.e. each page 'produces' a PR of one.
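As a small check of the two claims (the result does not depend on the initial values, and the average PR is one when there are no dead ends), here is a sketch on a hypothetical four-page web. It assumes the published formula PR(p) = (1 - d) + d * sum(PR(q) / outlinks(q)) with d = 0.85; the graph and the names are invented for illustration.

```python
def solve_pagerank(links, start, d=0.85, iterations=200):
    """Fixed-point iteration of PR(p) = (1 - d) + d * sum(PR(q) / outlinks(q)).

    `start` holds the initial PR guess for every page.
    Assumes every page has at least one outgoing link (no dead ends).
    """
    pr = dict(start)
    for _ in range(iterations):
        new = {page: 1.0 - d for page in links}
        for page, targets in links.items():
            for target in targets:
                new[target] += d * pr[page] / len(targets)
        pr = new
    return pr


# Hypothetical toy web: D links out but has no incoming link at all.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

from_ones = solve_pagerank(links, {page: 1.0 for page in links})
from_skewed = solve_pagerank(links, {"A": 100.0, "B": 0.0, "C": 0.0, "D": 0.0})
average = sum(from_ones.values()) / len(links)

for page in links:
    print(page, round(from_ones[page], 4), round(from_skewed[page], 4))
print("average PR:", round(average, 4))
```

Both starting guesses end up at the same values, and page D, which nobody links to, still ends up with 1 - d = 0.15.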
> I think you need to start with at least one inbound before you can have ANY PR.
Yes, but the reason is that Google will not consider sites as long as they have no incoming link. However, for d > 0 you can calculate PR even if there is no incoming link (only for d=0 a problem exists).
> Also, I can't prove it but I believe the idea of creating a site based on the assumption that the more pages you have (each presumably linking to at least the home page), the greater your home page's PR will be, is purely false.
I didn't say this. It strongly depends on the linking structure as already mentioned. However, normally the PR increases if the number of pages increases. The reason that you normally won't see an effect is that the Toolbar just shows an integer on a log scale.
Let's call PR the most familiar PR, the one you read on GoogleBar, and call EPR (E for "exponential") the true PageRank, which should be something like (c times (6 power PR)) if I believe what I read in this forum (where c is a constant : I normalize EPR so that its total sum on the whole web is 1).
EPR of a page is proportional to the time a random surfer spends on this page.
I shall suppose that average.html is an "average" page. I mean by that a page which the random surfer reads on average 1 time in every 3,083,324,652 page views of the web. Incidentally, just after writing the previous sentence, I wondered what the PR of such an average page would be, probably something like 1, but I saw no way to estimate it precisely.
Let's compare it with justExisting.html. I suppose this page is linked from outside -otherwise it would be unknown to Google- and contains at least one outgoing link -otherwise its PR computation would be outside the main algorithm. But this unlucky page is only linked by one link out of an awful mess of one hundred, and from the crappiest page of the whole www ; this will bring nearly no visit of the random surfer, so the page can only rely on its own existence to bring some PageRank.
EPR of average.html is by definition of "average" 1/3,083,324,652. What is EPR of justExisting.html?
We cannot know it if we don't know the vector E ruling the behaviour of the random surfer when he chooses not to follow a link ; without any knowledge of this vector, let's suppose it models a uniform distribution. As is known from Google's public info (which might have changed since, let's assume it has not), the random surfer follows a link on 85 % of his moves, and uses vector E on 15 % of his moves. It means that he visits justExisting.html on 1/3,083,324,652 * 15 % of his moves, or if you like it better, that justExisting.html's EPR is exactly 15 % of average.html's EPR ; or if you like it better, that justExisting.html's PR is about 1 less than average.html's PR.
You can also say it this way : just by existing, a page receives the same PageRank as from a link leaving an "average" page containing about six links.
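The two "about" figures can be checked numerically. This sketch assumes the values used in the post (85 % link following, 15 % teleportation, uniform vector E, toolbar log base 6); none of them are confirmed Google parameters, and the variable names are made up.

```python
import math

FOLLOW = 0.85    # assumed share of moves where the surfer follows a link
TELEPORT = 0.15  # assumed share of moves where the surfer uses vector E
BASE = 6         # assumed toolbar log base

# EPR of justExisting.html relative to average.html (uniform teleportation only).
epr_ratio = TELEPORT

# Difference in toolbar-scale PR between the two pages.
pr_gap = math.log(epr_ratio, BASE)

# Number of outgoing links k on an average page such that one of its links
# passes on the same amount: FOLLOW / k == TELEPORT.
equivalent_links = FOLLOW / TELEPORT

print("PR gap:", round(pr_gap, 2))
print("equivalent number of links:", round(equivalent_links, 2))
```

The gap comes out slightly more than one toolbar point, and the equivalent page has between five and six links, which matches the rounded figures above.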
>Calculating PR is nothing other than solving a linear system of equations, or equivalently calculating an eigenvector.
I admit that this is way over my head. I'm too many years removed from any sort of formal mathematics. What I disagreed with was the following statement:
>Each page (assuming that there are no dead ends on this site) "produces" 1 PR.
I have disproven this empirically for myself. I wanted to see if this was true, so I created a new site, linked to it from a PR6 page on an older, more established site, and then built an ASP-generated site based on a single variable. The new site had 53,000 pages, all of which linked to the home page. There are no dead links. Initially I had just the home page and a PR4 for two months. Then I dropped in the ASP code and generated 50,000 pages created from plain hyperlinks to the ASP template. For the next several months all pages were spidered, a few were listed, but the home page remained at PR4. Something like 52,800 pages continued to show PR0 via the toolbar. Next links were listed on a PR4 DMOZ category page and maybe another 13 links from a combination of 4s, 5s, and 6s. The PR went to 5, about 34,000 pages were listed, and most pages now show something in the toolbar.
As I say, you may be right in your understanding of the mathematics. I'm not that bright. But I do know what I see and I was unsuccessful at "creating PR" with no external links.
Also 1 PR raw <<< 1 toolbar PR. That is also why a single link from a PR 3 or even PR 4 site is worth more than 10000 links from internally linked sites.
I have a site with over 230,000 indexed, crawled and ranked pages. But the links they have to the home page probably count less than the 200 odd PR 2-PR 5 incoming links this site has. Simply a matter of scale.
On the other hand a truly huge site of >10,000,000 pages will have an inherent PR of probably 7, 8 or perhaps even 9 without any effect of incoming links.
It's just a question of running through the officially published formula, which clearly states the inherent value of a site's vote. If no site would "produce" PR, as you call it, then there would simply be NO PR to go around the web at all... The formula is iterative, and does not need any prior assumption such as a "given" reference PR of some site.
PS: also refusing to believe something doesn't make the maths go away ;)
If additional pages didn't produce PR as described, this would be a discrepancy with the original algorithm. That is unlikely, but no one (except GoogleGuy) can rule this out. However, I don't see evidence for this.
> For the next several months all pages were spidered, a few were listed, but the home page remained at PR4.
If there were no dead ends and all pages were spidered and indexed (this is necessary, otherwise they wouldn't exist for Google and couldn't increase PR) then there are two reasons why your ToolbarPR for the HP doesn't change:
- PR is increased, but it is still within the PR4 range (since it is a log scale).
- PR is not increased or decreased: when adding pages to your site you not only increase the total amount of PR, but you also change the distribution. Therefore, your homepage may benefit from the additional PR, but at the same time PR is distributed in a different way, which leads to a decrease. (Also, the total sum on your site increases.)
I expect that for a realistic site structure the second effect will not occur and the explanation will be the first point. (However, in any case the structure of your site is the important point.)
> Something like 52,800 pages continued to show PR0 via the toolbar.
As said most of the pages will have a PR < 1 which corresponds to ToolbarPR=0.
> Next links were listed on a PR4 DMOZ category page and maybe another 13 links from a combination of 4s, 5s, and 6s. The PR went to 5, about 34,000 pages were listed, and most pages now show something in the toolbar.
I think we all agree that incoming links will increase the PR of all pages on your site.
The only reason PR is high for some pages is the way the net is lumped up. Sites are usually well-connected but intersite links are relatively sparse. So PR of a site is usually bunched into a few pages, therefore PR transmitted from one of those bunches to another site is MUCH GREATER than the PR of a single added internal page with potential PR of just 1 raw PR.
What I am trying to say is, you will only find a page without incoming links with a high PR if its site has a SIGNIFICANT number of pages relative to the rest of the net. My guess would be in the region of millions or tens of millions.
Maybe someday I'll sacrifice a domain to a penalty and mush up 10,000,000 pages, link to them, get them indexed, remove the incoming link, wait for a reindex and hope nobody else links in. Then I'll wait for another reindex and check the PR.... It would be interesting to see, but also harsh on server resources and Google's resources.
PS: GG, don't hit me if somebody else does this now, please.
[edited by: killroy at 6:00 pm (utc) on April 9, 2003]
My guess is that Google doesn't allow the seed PR assigned to each page to be transferred to other pages on the site. There is no such thing as a PR perpetual motion machine. If it were possible for a site to generate decent PR by itself, I'd have figured someone, either by design or accident, would have done so, and an actual example in the wild would have been spotted by now.
I read with the greatest interest your test of big-site-making. I checked what it could mean on the maths level, and I now post what I think it implies, so that readers can point out possible mistakes, or deepen my conclusions.
In what follows, I shall abbreviate N=3,083,324,652 (the size of the web as known by Google).
First we shall list the assumptions I make :
(1) Your story is true, and your experimental site was not absurdly linked ; (very likely to be true)
(2) PageRank algorithm is still what was described in Google's original articles, in the following meaning : there exists a function EPR(page) which is proportional to the number of visits of a random surfer to this page ; in this post I shall normalize EPR by :
Sum on the web of every EPR(page) = N.
That means that if the random surfer reads N=3,083,324,652 pages of the web, EPR(specific_page) is the average number of times he reads the specific page ; (very likely to be true)
(3) When the surfer teleports, he falls on any page on the Worldwide web with uniform likelihood ; (quite dubious)
(4) PR as seen on GoogleBar is logarithmic, that is there are constants a and b such that :
PR = a ln(EPR) + b ; (rather likely)
(5) a = 1/ln 6 ; (often asserted there, I have no opinion on the matter)
(6) b >= 1. (I shall comment on it further.)
And now the conclusions I obtain :
(using (2) all along) send the random surfer on a long trip visiting N pages ; I use the letter d for the damping factor (d=0.15 according to Google's first articles, but the value is not needed in the computations). Then, on his N visits :
(a) the surfer teleports dN times ;
(b) the size of your site is not very different from (6 power 6), so the site makes a proportion of about :
[(6 power 6)/N] of the web (uses assumption (1)) ;
(c) hence the surfer teleports
[d times (6 power 6)] to your site (uses assumption (3));
(d) every time the surfer enters the site by teleportation, he sees (1/d) pages on average before teleporting again (elementary maths about the geometric distribution - uses no assumption except that your site does not leak PR, a reasonable part of (1)) ;
(e) only accounting on his teleportation visits, the surfer sees (6 power 6) pages on your site (uses no assumption) ;
(f) among these (6 power 6) visits, at least (6 power 5) reach your home page (uses assumption (1)) ;
(g) hence PR(your home) >= 5a ln(6) + b (uses assumption (4)) ;
(h) hence PR(your home) >= 6 (uses assumptions (5) and (6)) ;
(i) but PR(your home) = 4 (uses assumption (1)).
Hence there is something wrong, one of the assumptions must be wrong.
I would bet on assumption (3) : teleporting really at random would not give a sufficient penalty to selfish webmasters giving no outgoing links. This computation gives me a supplementary reason to believe there is some subtlety in the teleportation vector.
It might also be assumption (6) which was quite personal. I chose b=1 because it would mean that an "average" page, one visited exactly once on N visits of the surfer, would get a PR of 1, which seems reasonable. If you want to resolve the contradiction by lowering the value of b, you have to lower at least to b=-1, even a bit less (I was over-generous in my (f) estimate). Which would mean that loads of pages in the world have PR < -1. Not impossible (the GoogleBar would show 0 for these pages, but their "real" PR would be negative) but it would make loads of 0 PR pages...
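The chain of steps above can be run through numerically. This is only a sketch of the argument as posted, with the same assumptions (uniform teleportation, site of about 6^6 pages, at least one visit in six landing on the home page, a = 1/ln 6, b = 1). The function and parameter names are made up, and t stands for the teleport probability that the post calls d.

```python
import math


def home_pr_lower_bound(site_pages, home_share, a, b, t=0.15):
    """Lower bound on the toolbar PR of the home page, following the posted steps.

    Works with EPR normalised so that the whole web sums to N; N cancels out.
    """
    teleports_into_site = t * site_pages   # uniform teleportation, out of N moves
    pages_per_teleport = 1 / t             # mean of the geometric distribution
    site_visits = teleports_into_site * pages_per_teleport
    home_visits = home_share * site_visits
    return a * math.log(home_visits) + b   # PR = a ln(EPR) + b


a = 1 / math.log(6)
bound = home_pr_lower_bound(site_pages=6 ** 6, home_share=1 / 6, a=a, b=1.0)

observed_pr = 4
# Largest b that would still be compatible with the observed toolbar PR.
max_b = observed_pr - home_pr_lower_bound(6 ** 6, 1 / 6, a, b=0.0)

print("lower bound with b = 1:", round(bound, 3))
print("largest compatible b:  ", round(max_b, 3))
```

The teleport probability drops out of the result, so the contradiction does not depend on whether it is 0.15 or something else; only the home page share, a and b matter.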
Just in case anyone cares, right after I read Larry and Sergey's "Anatomy..." paper for the first time, I had nothing better to do than create a couple of thousand pages just for the purpose of boosting our home page's PR. I crawled our site and calculated the effects. Without taking inbound links into consideration, the real PR of the home page was about 300 or so. The toolbar PR went from 3 to 4 and there were no measurable differences regarding inbound links.
I thought I was a genius and dreamt about creating gazillions of pages ... until a few months later we had a PR0.
Of course, there may have been other reasons for the PR0, but who knows...
However, I've cleaned up the site, the PR came back, I wrote a little content, the home page is PR6 now and I sleep better.
I mostly agree, but a few comments:
> (4) PR = a ln(EPR) + b
Not important, but I would replace EPR -> EPR - offset, where the offset is probably 0 or (1-d)
> a = 1/ln 6
Of course, the value of a influences your calculation and I would say (so far I don't know the exact value):
1/ln 8 < a < 1/ln 6
> (6) b >= 1
No. b < 1 (Try to measure b and you will see this.)
You have to replace d by (1-d) in your calculation and indeed this value was originally 0.15. (Of course, the replacement doesn't influence your calculation.)
However, Google currently uses a different (smaller) value and this changes your estimate.
> (f) among these (6 power 6) visits, at least (6 power 5) reach your home page (uses assumption (1)) ;
I would say that this strongly depends on the linking structure of the site.
Thus, there are a number of parameters which, taken together, strongly influence your result and therefore your conclusions (even if the calculation is in principle valid).
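To give a feel for how much the parameters matter, here is a rough sketch that recomputes the estimated home page PR for the two ends of the range 1/ln 8 < a < 1/ln 6. The home page EPR of 6^5 is taken from the estimate under discussion; b = 0 is a purely hypothetical value picked only because it satisfies b < 1, not a measured one.

```python
import math

HOME_EPR = 6 ** 5  # home page EPR from the estimate being discussed


def estimated_pr(log_base, b):
    """PR = a ln(EPR) + b with a = 1 / ln(log_base)."""
    a = 1 / math.log(log_base)
    return a * math.log(HOME_EPR) + b


original = estimated_pr(log_base=6, b=1.0)  # assumptions (5) and (6) as posted
adjusted = estimated_pr(log_base=8, b=0.0)  # other end of the range, hypothetical b

print("a = 1/ln 6, b = 1:", round(original, 2))
print("a = 1/ln 8, b = 0:", round(adjusted, 2))
```

With the second set of values the estimate lands between 4 and 5, which the toolbar would display as 4, so the contradiction disappears without touching the teleportation assumption.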
Each page crawled generates a very small amount of PR. But that page has to be crawled to get it.
I do not believe that Google crawls according to the previous month's PR. I believe that they crawl from certain seed locations. If you have no incoming links, it cannot find you.
Even if it does put submitted URLs into the deep crawl, it will likely only crawl a few pages if it never finds an incoming link. It will never crawl all your billion pages to crank up your PR.