Forum Moderators: open
To my understanding, prior to the switch, G spidered once a month, took that month's data, and calculated PR on stable, unchanging data. IOW, "batch" data.
It's much easier to work with batch data than doing this on the fly, because the data doesn't change while you are making calculations. So let's pretend to be G and see if we can figure out what has to happen now that it is switching to on-the-fly calculations.
Table-wise, in simplified form, how would G calculate PR? Let's start with table "PR" and assume a clean URL that was just submitted: it has no backlinks and was never in the index. Let's assume that G starts with a PR of 1 for new sites and that anything less than 1 is due to penalties. Let's further assume, for simplicity, that any page linked to from another page gets a gift of half the linking page's PR value.
-----------------------------------
TABLE PR
-----------------------------------
PRID ¦ PAGE ¦ PR
------------------------------
1 ¦ [page1.com...] ¦ 1
-----------------------------------
So now Google will spider that page, pull out all links for further spidering, and assign a PR to those new pages, even before spidering them... Let's assume there are 2 links on Page 1: Page 2 is not in the index, Page 3 is.
-----------------------------------
TABLE BACKLINKS
-----------------------------------
BacklinkID ¦ BackLinkPAGE ¦ URL
-----------------------------------
1 ¦ [page1.com...] ¦ [page2.com...]
2 ¦ [page1.com...] ¦ [page3.com...]
-----------------------------------
As a new page, Page 2 gets assigned a PR of 1, and let's assume Page 3 already had a PR of 2. Both Page 2 and Page 3 will get an extra .5 PR from Page 1, but this only happens at update time, so for now TABLE PR looks like this:
-----------------------------------
TABLE PR
-----------------------------------
PRID ¦ PAGE ¦ PR
------------------------------
1 ¦ [page1.com...] ¦ 1
2 ¦ [page2.com...] ¦ 1 (1.5 Pending)
3 ¦ [page3.com...] ¦ 2 (2.5 Pending)
-----------------------------------
Now let's assume that Page 4 has a PR of 3, that Page 2 and Page 3 each link to only one page, and that page is Page 4. And one more twist: Page 4 in turn links back to Page 1.
So after Page 2 and Page 3 are spidered, we know that Page 4 can expect a boost of .75 from Page 2, and either a boost of 1 from Page 3, or a boost of 1.25 if Page 3's own pending boost gets counted before Page 4's. (I know, I know, it's getting complicated...) So now our tables look like this:
-----------------------------------
TABLE PR
-----------------------------------
PRID ¦ PAGE ¦ PR
------------------------------
1 ¦ [page1.com...] ¦ 1
2 ¦ [page2.com...] ¦ 1 (1.5 Pending)
3 ¦ [page3.com...] ¦ 2 (2.5 Pending)
4 ¦ [page4.com...] ¦ 3 (+ .75 from p2 & 1.25 from p3 Pending)
-----------------------------------
-----------------------------------
TABLE BACKLINKS
-----------------------------------
BacklinkID ¦ BackLinkPAGE ¦ URL
-----------------------------------
1 ¦ [page1.com...] ¦ [page2.com...]
2 ¦ [page1.com...] ¦ [page3.com...]
3 ¦ [page2.com...] ¦ [page4.com...]
4 ¦ [page3.com...] ¦ [page4.com...]
5 ¦ [page4.com...] ¦ [page1.com...]
-----------------------------------
If you're still following, you may have already figured out that since all these URLs link to one another, you can make unlimited passes through the data to refine PR for each URL. And I didn't even get to the update yet. So without further ado, let's assume the update has started and we have made one pass through the database, with PR passed along in order of PRID. (Remember, we have a pending PR increase for Pages 2, 3 and 4.) These will be the new PRs:
-----------------------------------
TABLE PR
-----------------------------------
PRID ¦ PAGE ¦ PR
------------------------------
1 ¦ [page1.com...] ¦ 1
2 ¦ [page2.com...] ¦ 1.5
3 ¦ [page3.com...] ¦ 2.5
4 ¦ [page4.com...] ¦ 5 (next pass 2.5 goes to P1)
-----------------------------------
As you can see, once Page 1 gets a boost from Page 4, it will in turn give a boost to Pages 2 and 3, and unlimited passes will inflate everyone. Google has ways to handle this, but my point is that PR is always somehow an estimate, not perfect at all.
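To make the inflation concrete, here's a toy sketch of our half-PR "gift" rule applied to the four-page loop above. Remember, this rule is our simplification, not real PageRank; the real formula divides by outlink count and applies a damping factor, which keeps the totals bounded.

```python
# Toy model of the "half the linker's PR" gift rule on the example graph.
# This is NOT real PageRank; it just shows why a link cycle inflates
# scores when every pass keeps adding half the linker's current PR.

links = {  # page -> pages it links to (the BACKLINKS table, inverted)
    1: [2, 3],
    2: [4],
    3: [4],
    4: [1],
}

pr = {1: 1.0, 2: 1.0, 3: 2.0, 4: 3.0}  # starting values from the example

for n in range(3):  # three passes through the "database"
    boosts = {p: 0.0 for p in pr}
    for src, targets in links.items():
        for dst in targets:
            boosts[dst] += pr[src] / 2  # the half-PR gift
    for p in pr:
        pr[p] += boosts[p]
    print(f"pass {n + 1}: {pr}")
```

Every page's PR grows on every pass, with no ceiling in sight: that's the feedback loop the real algorithm has to damp out.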
Now if you remove a URL from the mix, it affects the other pages too. In the past, I'm told, G would start each update with the PR values from the previous update. As it makes more passes through the database, it eventually adjusts downward the PR of sites whose backlinks have been removed.
So when you think about what G has to go through with batch data, can you imagine what happens if it starts to make these calculations on the fly? Sheer chaos.
What basically has to happen, then, is that each time a new URL gets added to the index, the BACKLINKS table gets evaluated for PR. When a page is removed from the index, all the pages it pointed to need to be adjusted downward. And each time a page is refreshed, its PR has to be recalculated by analyzing the BACKLINKS table. (These tables are simplified assumptions, but something LIKE them must be there.)
Now the trick is, when you had monthly batch data, you could start with last month's old PR values and, after enough iterations, obtain a reasonable PR value for all URLs. But now that we have switched to rolling data, G would get royally screwed if it started with anything but virgin PR values.
My theory is that the current database is a big mishmash. They took (I think) May's index as a starting point. Each page got its old PR value. Meanwhile, the new freshdeepbot starts crawling, gathering brand new backlink data, and PR is calculated on the FLY with virgin data.
When you query the datacenters, your PR looks like May's, but backlinks are severely restricted: the PR value only reflects May's number, while the backlinks are live data from the new crawl.
As time goes on, G is testing the new PR values against the old ones. At some point they swap in some of the newly spidered data for the old data and drop the May PR values. SERPs are all over the place. PRs are confusing. All the while the freshdeepbot is out recrawling old pages, because it is calculating VIRGIN PR while we are scratching our heads. We expect new data to be crawled, but G only cares about making the new virgin PR stable, and doesn't want to go after much new content until they are sure the virgin PRs are stable.
Time passes and the new data gets better; now some new pages get crawled. Some more old data gets dropped. We're all scratching our heads, but it's just a reflection of G transitioning to the new PR system.
If anyone is still reading this (90% probably aren't) and understood what I was getting at (9% more drop out), what do you think? Make sense?
"As time goes on G is testing the new PR values against the old ones. At some point they switch over some of the newly spidered data for the old data and drop out the May PR values. SERPS are all over the place. PRs are confusing."
I haven't seen too much PR fluctuation. Certainly nothing that would explain the changes in the SERPs, either in magnitude of effect or in the correlation between rising and falling.
However, I wouldn't see a problem if they changed the frequency. There is no fundamental difference between a monthly and a daily PR calculation (some pages are added, others are removed). In both cases they will probably start with the old PR values (or a PR of one for new pages). Of course, the final (stable) values are independent of the initial guess, but a good guess accelerates the calculation. In the case of a daily PR calculation, there is only time for a few iterations. Therefore, PR for new large sites (with a lot of levels) might keep changing over the following days (toward the final value). Even so, this would be an improvement compared to a monthly update. Also, there are calculation techniques which accelerate the PR propagation.
(For those who are interested in this topic: How long does one iteration of PR calculation take? [webmasterworld.com])
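To illustrate the warm-start point, here is a small sketch of the standard damped PR iteration on a toy four-page graph, comparing a cold start against a warm start from previously computed values. The graph and numbers are made up; the only point is that seeding with "last month's" values needs far fewer iterations.

```python
# Standard damped PageRank via power iteration on a toy graph,
# comparing a cold (uniform) start against a warm start from the
# previous result. Damping factor d = 0.85 as in the original paper.

def pagerank(links, start, d=0.85, tol=1e-10):
    """Iterate PR from `start` until the largest change is below `tol`.
    Returns (final values, number of iterations taken)."""
    pages = list(start)
    pr = dict(start)
    n = len(pages)
    iterations = 0
    while True:
        new = {}
        for p in pages:
            inbound = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * inbound
        iterations += 1
        if max(abs(new[p] - pr[p]) for p in pages) < tol:
            return new, iterations
        pr = new

links = {1: [2, 3], 2: [4], 3: [4], 4: [1]}

cold, n_cold = pagerank(links, {p: 0.25 for p in links})  # virgin start
warm, n_warm = pagerank(links, cold)  # "last month's" values as the seed
print(n_cold, n_warm)  # the warm start converges in far fewer passes
```

The final values are identical either way; only the number of passes differs, which is exactly why the choice of starting values matters once you only have time for a few iterations per day.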
Speculative? I agree...
Google News? I guess the mods were ok with it here. That's where this kind of stuff has always gone.
[edited by: Marcia at 6:37 pm (utc) on July 7, 2003]
I should think that speculation on how Google is doing things is very much on topic here. I believe the idea behind the "Google News" name for the forum was to emphasize this forum was for things about Google itself, and not "how do I get my site into the top 5 for all my keywords on Google?"
It's pretty much the same as the way banks expand the money supply by borrowing $1 from guy A and lending 90 cents to guy B (retaining 10% as a reserve). Guy B then deposits the 90 cents at the same bank, and the bank lends out 81 cents to guy C. And so on...
The bank ends up with total deposits of $1 / 10% = $10 on its books (and a loan book of $9). And all from a dollar ;) - and they charge interest on all of it too.
Anyway, my point is that the money supply is generated on the fly using this principle (among others), and it is notoriously hard to predict; politicians tweaking the knobs can have all sorts of unexpected results.
In this example the money is analogous to PageRank. So I think you are right. If economists haven't pinned down the factors causing problems in the economy in 200 years, I don't think dynamic PageRank will be much more predictable.
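For what it's worth, the multiplier arithmetic is just a geometric series, easy to sketch (10% reserve ratio as above):

```python
# The fractional-reserve analogy as a geometric series: each deposit is
# re-lent at 90% and comes back as a new deposit, so total deposits
# approach initial / reserve_ratio = $10 (loan book: that minus the $1).

reserve_ratio = 0.10
deposit = 1.0
total_deposits = 0.0
while deposit > 1e-9:               # iterate until the re-deposits vanish
    total_deposits += deposit
    deposit *= (1 - reserve_ratio)  # the bank lends out 90%, it comes back

print(round(total_deposits, 2))  # approaches 1 / 0.10 = 10 dollars
```

Like PR flowing around a link cycle, the same dollar keeps feeding back into the system; the reserve ratio plays the role of the damping that stops it from inflating forever.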
Perhaps take your batch idea and think local vs. global PR. You take all sites in category X and produce a local PR of 1-10 for each site within the category. You then rank the category as a unit against all other categories on the net and rate it 1-10. You then combine the two.
So site Q has a local PR of 6 in category X.
Category X has a cumulative PR of 8 as compared to all other categories on the web.
Therefore site Q has a global PR of (6/10) * 8 = 4.8
As a starting point, you use DMOZ/Google Groups to produce each category. You then do some shtuff with checking backlinks to figure out which categories sites not listed in the directory probably fall under.
You can do this on multiple levels, i.e. if DMOZ has widgets > blue widgets > round blue widgets, you get a local PR within each level.
Something like that would speed up the process, and, working under the assumption that a given search term is closely tied to a given category, the results would still be highly relevant.
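The combination step itself is trivial to sketch (the 1-10 scales and the sample numbers are just the assumptions above):

```python
# Sketch of the proposed local/global PR combination: a site's global PR
# is its local PR within its category, scaled by the category's own
# 1-10 rating relative to all other categories on the web.

def global_pr(local_pr, category_pr):
    """local_pr: the site's 1-10 rank inside its category.
    category_pr: the category's 1-10 rank against other categories."""
    return (local_pr / 10) * category_pr

print(global_pr(6, 8))  # site Q (local 6) in category X (rated 8) -> 4.8
```

The speedup comes from the decomposition, not the formula: each category's local PR can be iterated over a small subgraph instead of the whole web.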
Don't miss this thread [webmasterworld.com...] for how G is probably doing something related to local "PR".
Also, the idea of calculating 'PR per theme' is well known (Topic-Sensitive PageRank was discussed in the Personalizing Pagerank [webmasterworld.com] thread).