Forum Moderators: open
To my understanding, prior to the switch, G spidered once a month, took that month's data, and calculated PR on stable, unchanging data. IOW, "batch" data.
It's much easier to work with batch data than doing this on the fly, because the data doesn't change while you are making calculations. So let's pretend to be G and see if we can figure out what has to happen now that it is switching to on-the-fly calculations.
Table-wise, in simplified form, how would G calculate PR? Let's start with table "PR" and assume a clean URL that was just submitted: it has no backlinks and was never in the index. Let's assume that G starts with a PR of 1 for new sites and that anything less than 1 is due to penalties. Let's further assume, for simplicity, that any page linked to from another page gets a gift of half the linking page's PR value.
-----------------------------------
TABLE PR
-----------------------------------
PRID ¦ PAGE ¦ PR
------------------------------
1 ¦ [page1.com...] ¦ 1
-----------------------------------
So now Google will spider that page, pull out all links for further spidering, and assign a PR to those new pages, even before spidering them... Let's assume there are 2 links on Page 1: Page 2 is not in the index, Page 3 is.
-----------------------------------
TABLE BACKLINKS
-----------------------------------
BacklinkID ¦ BackLinkPAGE ¦ URL
-----------------------------------
1 ¦ [page1.com...] ¦ [page2.com...]
2 ¦ [page1.com...] ¦ [page3.com...]
-----------------------------------
As a new page, Page 2 gets assigned a PR of 1, and let's assume Page 3 already had a PR of 2. Both Page 2 and Page 3 will get an extra .5 PR from Page 1, but this only happens at update time, so for now TABLE PR looks like this:
-----------------------------------
TABLE PR
-----------------------------------
PRID ¦ PAGE ¦ PR
------------------------------
1 ¦ [page1.com...] ¦ 1
2 ¦ [page2.com...] ¦ 1 (1.5 Pending)
3 ¦ [page3.com...] ¦ 2 (2.5 Pending)
-----------------------------------
Now let's assume that Page 4 has a PR of 3, that Page 2 and Page 3 each link to only one page, and that page is Page 4. And one more twist: Page 4 in turn links back to Page 1.
So after Page 2 and Page 3 are spidered, we know that Page 4 can expect a boost of .75 from Page 2, and either a boost of 1 from Page 3, or a boost of 1.25 if Page 3's own pending boost gets counted before Page 4's. (I know, I know, it's getting complicated...) So now our tables look like this:
-----------------------------------
TABLE PR
-----------------------------------
PRID ¦ PAGE ¦ PR
------------------------------
1 ¦ [page1.com...] ¦ 1
2 ¦ [page2.com...] ¦ 1 (1.5 Pending)
3 ¦ [page3.com...] ¦ 2 (2.5 Pending)
4 ¦ [page4.com...] ¦ 3 (+ .75 from p2 & 1.25 from p3 Pending)
-----------------------------------
-----------------------------------
TABLE BACKLINKS
-----------------------------------
BacklinkID ¦ BackLinkPAGE ¦ URL
-----------------------------------
1 ¦ [page1.com...] ¦ [page2.com...]
2 ¦ [page1.com...] ¦ [page3.com...]
3 ¦ [page2.com...] ¦ [page4.com...]
4 ¦ [page3.com...] ¦ [page4.com...]
5 ¦ [page4.com...] ¦ [page1.com...]
-----------------------------------
If you're still following, you may have already figured out that since all these URLs link to one another, you can make unlimited passes through the data to refine PR for each URL. And I didn't even get to the update yet. So without further ado, let's assume the update has started and we have made one pass through the database, with PR passed along in order of PRID. (Remember, we have a pending PR increase for Pages 2, 3 and 4.) These will be the new PRs:
-----------------------------------
TABLE PR
-----------------------------------
PRID ¦ PAGE ¦ PR
------------------------------
1 ¦ [page1.com...] ¦ 1
2 ¦ [page2.com...] ¦ 1.5
3 ¦ [page3.com...] ¦ 2.5
4 ¦ [page4.com...] ¦ 5 (next pass 2.5 goes to P1)
-----------------------------------
As you can see, once Page 1 gets a boost from Page 4, it will in turn give a boost to Pages 2 and 3, and unlimited passes will inflate everyone. Google has ways to handle this, but my point is that PR is always somehow an estimate, not perfect at all.
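To make the inflation concrete, here's a toy sketch of our half-PR "gift" rule applied to the four-page loop above. Remember, this rule is our simplification, not real PageRank; the real formula divides by outlink count and applies a damping factor, which keeps the totals bounded.

```python
# Toy model of the "half the linker's PR" gift rule on the example graph.
# This is NOT real PageRank; it just shows why a link cycle inflates
# scores when every pass keeps adding half the linker's current PR.

links = {  # page -> pages it links to (the BACKLINKS table, inverted)
    1: [2, 3],
    2: [4],
    3: [4],
    4: [1],
}

pr = {1: 1.0, 2: 1.0, 3: 2.0, 4: 3.0}  # starting values from the example

for n in range(3):  # three passes through the "database"
    boosts = {p: 0.0 for p in pr}
    for src, targets in links.items():
        for dst in targets:
            boosts[dst] += pr[src] / 2  # the half-PR gift
    for p in pr:
        pr[p] += boosts[p]
    print(f"pass {n + 1}: {pr}")
```

Every page's PR grows on every pass, with no ceiling in sight: that's the feedback loop the real algorithm has to damp out.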
Now if you remove a URL from the mix, it affects the other pages too. In the past, I'm told, G would start each update with the PR values from the previous update. As it makes more passes through the database, it eventually adjusts downward the PR of sites whose backlinks have been removed.
So when you think about what G has to go through with batch data, can you imagine what happens if it starts to make these calculations on the fly? Sheer chaos.
What basically has to happen, then, is that each time a new URL gets added to the index, the BACKLINKS table gets evaluated for PR. When a page is removed from the index, all the pages it pointed to need to be adjusted downward. And each time a page is refreshed, its PR has to be recalculated by analyzing the BACKLINKS table. (These tables are simplified assumptions, but something LIKE them must be there.)
Now the trick is, when you had monthly batch data, you could start with last month's old PR values and, after enough iterations, obtain a reasonable PR value for all URLs. But now that we have switched to rolling data, G would get royally screwed if it started with anything but virgin PR values.
My theory is that the current database is a big mishmash. They took (I think) May's index as a starting point. Each page got its old PR value. Meanwhile, the new freshdeepbot starts crawling, gathering brand new backlink data, and PR is calculated on the FLY with virgin data.
When you query the datacenters, your PR looks like May's, but backlinks are severely restricted: the PR value only reflects May's number, while the backlinks are live data from the new crawl.
As time goes on, G is testing the new PR values against the old ones. At some point they swap in some of the newly spidered data for the old data and drop the May PR values. SERPs are all over the place. PRs are confusing. All the while the freshdeepbot is out recrawling old pages, because it is calculating VIRGIN PR while we are scratching our heads. We expect new data to be crawled, but G only cares about making the new virgin PR stable, and doesn't want to go after much new content until they are sure the virgin PRs are stable.
Time passes and the new data gets better; now some new pages get crawled. Some more old data gets dropped. We're all scratching our heads, but it's just a reflection of G transitioning to the new PR system.
If anyone is still reading this (90% probably aren't) and understood what I was getting at (9% more drop out), what do you think? Make sense?
"As time goes on G is testing the new PR values against the old ones. At some point they switch over some of the newly spidered data for the old data and drop out the May PR values. SERPS are all over the place. PRs are confusing."
I haven't seen too much PR fluctuation. Certainly nothing that would explain the changes in the SERPs, either in magnitude of effect or in the correlation between rising and falling.
However, I wouldn't see a problem if they changed the frequency. There is no fundamental difference between a monthly and a daily PR calculation (some pages are added, others are removed). In both cases they will probably start with the old PR values (or a PR of one for new pages). Of course, the final (stable) values are independent of the initial guess, but a good guess accelerates the calculation. In the case of a daily PR calculation, there is only time for a few iterations. Therefore, PR for new large sites (with a lot of levels) might keep changing over the following days (toward the final value). Even so, this would be an improvement compared to a monthly update. Also, there are calculation techniques which accelerate the PR propagation.
(For those who are interested in this topic: How long does one iteration of PR calculation take? [webmasterworld.com])
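To illustrate the warm-start point, here is a small sketch of the standard damped PR iteration on a toy four-page graph, comparing a cold start against a warm start from previously computed values. The graph and numbers are made up; the only point is that seeding with "last month's" values needs far fewer iterations.

```python
# Standard damped PageRank via power iteration on a toy graph,
# comparing a cold (uniform) start against a warm start from the
# previous result. Damping factor d = 0.85 as in the original paper.

def pagerank(links, start, d=0.85, tol=1e-10):
    """Iterate PR from `start` until the largest change is below `tol`.
    Returns (final values, number of iterations taken)."""
    pages = list(start)
    pr = dict(start)
    n = len(pages)
    iterations = 0
    while True:
        new = {}
        for p in pages:
            inbound = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * inbound
        iterations += 1
        if max(abs(new[p] - pr[p]) for p in pages) < tol:
            return new, iterations
        pr = new

links = {1: [2, 3], 2: [4], 3: [4], 4: [1]}

cold, n_cold = pagerank(links, {p: 0.25 for p in links})  # virgin start
warm, n_warm = pagerank(links, cold)  # "last month's" values as the seed
print(n_cold, n_warm)  # the warm start converges in far fewer passes
```

The final values are identical either way; only the number of passes differs, which is exactly why the choice of starting values matters once you only have time for a few iterations per day.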
Speculative? I agree...
Google News? I guess the mods were ok with it here. That's where this kind of stuff has always gone.
[edited by: Marcia at 6:37 pm (utc) on July 7, 2003]
I should think that speculation on how Google is doing things is very much on topic here. I believe the idea behind the "Google News" name for the forum was to emphasize this forum was for things about Google itself, and not "how do I get my site into the top 5 for all my keywords on Google?"
It's pretty much the same as the way banks expand the money supply by borrowing $1 from guy A and lending 90 cents to guy B (retaining 10% as a reserve). Guy B then deposits the 90 cents at the same bank, and the bank lends out 81 cents to guy C. And so on...
The bank ends up with total deposits of $1 / 10% = $10 on its books (and a loan book of $9). And all from a dollar ;) - and they charge interest on all of it too.
Anyway, my point is that the money supply is generated on the fly using this principle (among others), and it is notoriously hard to predict; politicians tweaking the knobs can have all sorts of unexpected results.
In this example the money is analogous to PageRank. So I think you are right. If economists haven't pinned down the factors causing problems in the economy in 200 years, I don't think dynamic PageRank will be much more predictable.
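For what it's worth, the multiplier arithmetic is just a geometric series, easy to sketch (10% reserve ratio as above):

```python
# The fractional-reserve analogy as a geometric series: each deposit is
# re-lent at 90% and comes back as a new deposit, so total deposits
# approach initial / reserve_ratio = $10 (loan book: that minus the $1).

reserve_ratio = 0.10
deposit = 1.0
total_deposits = 0.0
while deposit > 1e-9:               # iterate until the re-deposits vanish
    total_deposits += deposit
    deposit *= (1 - reserve_ratio)  # the bank lends out 90%, it comes back

print(round(total_deposits, 2))  # approaches 1 / 0.10 = 10 dollars
```

Like PR flowing around a link cycle, the same dollar keeps feeding back into the system; the reserve ratio plays the role of the damping that stops it from inflating forever.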
Perhaps take your batch idea and think local vs. global PR. You take all sites in category X and produce a local PR of 1-10 for each site within the category. You then rank the category as a unit against all other categories on the net and rate it 1-10. You then combine the two.
So site Q has a local PR of 6 in category X.
Category X has a cumulative PR of 8 as compared to all other categories on the web.
Therefore site Q has a global PR of (6/10) * 8 = 4.8
As a starting point, you use DMOZ/Google Groups to produce each category. You then do some shtuff with checking backlinks to figure out which categories sites not listed in the directory probably fall under.
You can do this on multiple levels, i.e. if DMOZ has widgets > blue widgets > round blue widgets, you get a local PR within each level.
Something like that would speed up the process, and, working under the assumption that a given search term is closely tied to a given category, the results would still be highly relevant.
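The combination step itself is trivial to sketch (the 1-10 scales and the sample numbers are just the assumptions above):

```python
# Sketch of the proposed local/global PR combination: a site's global PR
# is its local PR within its category, scaled by the category's own
# 1-10 rating relative to all other categories on the web.

def global_pr(local_pr, category_pr):
    """local_pr: the site's 1-10 rank inside its category.
    category_pr: the category's 1-10 rank against other categories."""
    return (local_pr / 10) * category_pr

print(global_pr(6, 8))  # site Q (local 6) in category X (rated 8) -> 4.8
```

The speedup comes from the decomposition, not the formula: each category's local PR can be iterated over a small subgraph instead of the whole web.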
Don't miss this thread [webmasterworld.com...] for how G is probably doing something related to local "PR".
Also, the idea of calculating 'PR per theme' is well known (Topic-Sensitive PageRank was discussed in the Personalizing Pagerank [webmasterworld.com] thread).