

Deconstructing G's switch from "batch" to "on the fly" PR calculation

working towards a theory to explain what's been going on


Clark

3:27 am on Jul 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm figuring this out while writing it so I don't even know what the conclusion will be. Sometimes it helps to put things on "paper".

To my understanding, prior to the switch, G spidered once a month, took that month's data, and calculated PR on stable, unchanging data. IOW, "batch" data.

It's much easier to work with batch data than doing this on the fly because the data doesn't change while you are making calculations. So let's try to pretend to be G and see if we can figure out what has to happen now that it is switching to on the fly calculations.

Table-wise, in a simplified manner, how might G calculate PR? Let's start with table "PR" and assume a clean URL that was just submitted: it has no backlinks and was never in the index. Let's assume that G starts new sites with a PR of 1 and that anything less than 1 is due to penalties. Let's further assume, for simplicity, that any page linked to from another page gets a gift of half the linking page's PR value.

-----------------------------------
TABLE PR
-----------------------------------
PRID ¦ PAGE ¦ PR
-----------------------------------
1 ¦ [page1.com...] ¦ 1
-----------------------------------

So now Google will spider that page, pull out all the links for further spidering, and assign a PR to those new pages, even before spidering them... Let's assume there are 2 links on Page 1. Page 2 is not in the index; Page 3 is.

-----------------------------------
TABLE BACKLINKS
-----------------------------------
BacklinkID ¦ BackLinkPAGE ¦ URL
-----------------------------------
1 ¦ [page1.com...] ¦ [page2.com...]
2 ¦ [page1.com...] ¦ [page3.com...]
-----------------------------------

As a new page, Page 2 gets assigned a PR of 1, and let's assume Page 3 already had a PR of 2. Both Page 2 & 3 will get an extra .5 PR from Page 1, but this only happens at update time, so for now TABLE PR looks like this:

-----------------------------------
TABLE PR
-----------------------------------
PRID ¦ PAGE ¦ PR
-----------------------------------
1 ¦ [page1.com...] ¦ 1
2 ¦ [page2.com...] ¦ 1 (1.5 Pending)
3 ¦ [page3.com...] ¦ 2 (2.5 Pending)
-----------------------------------

Now let's assume that Page 4 has a PR of 3 and both Page 2 and Page 3 link to only one page and that page is Page 4. And one more twist, Page 4 in turn links back to Page 1.

So after Page 2 and Page 3 are spidered, we know that Page 4 can expect a boost of .75 from Page 2 and either a boost of 1 from Page 3 or a boost of 1.25 if Page 3's boost gets counted before Page 4's boost gets counted. (I know, I know, it's getting complicated...) So now our table looks like this:

-----------------------------------
TABLE PR
-----------------------------------
PRID ¦ PAGE ¦ PR
-----------------------------------
1 ¦ [page1.com...] ¦ 1
2 ¦ [page2.com...] ¦ 1 (1.5 Pending)
3 ¦ [page3.com...] ¦ 2 (2.5 Pending)
4 ¦ [page4.com...] ¦ 3 (+ .75 from p2 & 1.25 from p3 Pending)
-----------------------------------

-----------------------------------
TABLE BACKLINKS
-----------------------------------
BacklinkID ¦ BackLinkPAGE ¦ URL
-----------------------------------
1 ¦ [page1.com...] ¦ [page2.com...]
2 ¦ [page1.com...] ¦ [page3.com...]
3 ¦ [page2.com...] ¦ [page4.com...]
4 ¦ [page3.com...] ¦ [page4.com...]
5 ¦ [page4.com...] ¦ [page1.com...]
-----------------------------------

If you're still following, you may have already figured out that since all these urls are linked to one another, you can do unlimited passes through the data to figure out PR for each url. And I didn't even get to the update yet. So w/o further ado, let's assume the update started and we have done 1 pass through the database. And PR is passed through in order of PRID. (remember now, we have a pending PR increase for Pages 2, 3 and 4) This will be the new PRs:

-----------------------------------
TABLE PR
-----------------------------------
PRID ¦ PAGE ¦ PR
-----------------------------------
1 ¦ [page1.com...] ¦ 1
2 ¦ [page2.com...] ¦ 1.5
3 ¦ [page3.com...] ¦ 2.5
4 ¦ [page4.com...] ¦ 5 (next pass 2.5 goes to P1)
-----------------------------------

As you can see, once Page 1 gets a boost from P4, it will in turn give a boost to Pages 2 & 3, and unlimited passes will inflate everyone. Google has ways to handle this, but my point is that PR is always somehow an estimate, never exact.
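As an aside, here's why the passes don't actually run away forever in the textbook version: the published formula has a damping factor that shrinks each hop's contribution. Below is a minimal sketch of the four-page graph above, using the published formula PR(p) = (1 - d) + d * sum(PR(q)/outlinks(q)) instead of my simplified "half the PR" gift rule, so the numbers won't match my tables:

```python
# Toy power iteration over the four-page graph above, using the published
# formula PR(p) = (1 - d) + d * sum(PR(q) / outlinks(q)) with damping
# factor d = 0.85 instead of the simplified "half the PR" gift rule.

links = {
    "page1": ["page2", "page3"],
    "page2": ["page4"],
    "page3": ["page4"],
    "page4": ["page1"],
}
d = 0.85
pr = {page: 1.0 for page in links}  # every page starts at PR 1

for _ in range(500):  # keep passing PR around until the values settle
    new_pr = {}
    for page in links:
        incoming = sum(pr[src] / len(links[src])
                       for src in links if page in links[src])
        new_pr[page] = (1 - d) + d * incoming
    delta = max(abs(new_pr[p] - pr[p]) for p in links)
    pr = new_pr
    if delta < 1e-9:  # successive passes agree, so we're done
        break

for page in sorted(pr):
    print(page, round(pr[page], 3))
```

After a hundred or so passes the values settle; Page 4 ends up highest because two pages feed it. The damping factor is what makes the loop converge instead of inflating forever.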

Now if you remove a url from the mix, it affects the other pages too. I'm told that in the past, G would start each update with the PR values from the previous update. As it makes more passes through the database, it will eventually adjust downward the PR of the pages whose backlinks have since been removed.

So when you think about what G has to go through with batch data, can you imagine what happens if it starts to make these calculations on the fly? Sheer chaos.

What basically should then happen is that each time a new url gets added to the index, the BACKLINKS table gets evaluated for PR. When a page is removed from the index, all the pages it pointed to need to be adjusted downward. And each time a page is refreshed, its PR has to be recalculated by analyzing the BACKLINKS table (these tables are simplified assumptions, but something LIKE them must be there).
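For what it's worth, the removal bookkeeping could look something like this in toy form (the table layout and page names are made up for illustration, not anything we know about Google's actual schema):

```python
# Toy BACKLINKS table as a list of (source, target) rows, mirroring the
# tables in the example above. Names are illustrative only.
backlinks = [
    ("page1.com", "page2.com"),
    ("page1.com", "page3.com"),
    ("page2.com", "page4.com"),
    ("page3.com", "page4.com"),
    ("page4.com", "page1.com"),
]

def pages_to_adjust_after_removal(removed_page, backlinks):
    """Pages the removed page linked to lose its PR contribution, so they
    (and, transitively, their own targets) need recomputing."""
    return {target for source, target in backlinks if source == removed_page}

print(pages_to_adjust_after_removal("page1.com", backlinks))
```

Dropping page1 means page2 and page3 lose a contribution, and on the next pass their targets shrink too, which is exactly the cascade that makes on-the-fly recalculation messy.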

Now the trick is, when you had monthly batch data, you could start with last month's old PR values and after enough iterations obtain a reasonable PR value for all urls. But now that we have switched to rolling data, G would get royally screwed if they started with anything but virgin PR values.

My theory is that the current database is a big mishmash. They took (I think) May's index as a starter. Each page got the old PR value. Meanwhile, the new freshdeepbot starts crawling and gathering brand new backlink data, and PR is calculated on the FLY with virgin data.

When you query the datacenters, your PR looks like May but backlinks are severely restricted because the PR value is only reflecting May's number, not backlinks at all. Backlinks are live data based on the new crawl.

As time goes on G is testing the new PR values against the old ones. At some point they switch over some of the newly spidered data for the old data and drop out the May PR values. SERPS are all over the place. PRs are confusing. All the while the freshdeepbot is out recrawling old pages because it is calculating VIRGIN PR while we are scratching our heads. We expect new data to be crawled but G only cares about making the new Virgin PR stable and doesn't want to go after much new content until they are sure the virgin PRs are stable.

Time passes, and the new data gets better, now some new pages get crawled. Some more old data gets dropped. We're all scratching our heads, but it's just a reflection of G transitioning to the new PR system.

If anyone is still reading this (90% probably aren't) and understood what I was getting at (9% more drop out), what do you think? Make sense?

Dolemite

4:48 am on Jul 7, 2003 (gmt 0)

10+ Year Member



I hate to be in the "too early to analyze" camp but I'm not sure we can say what's happening with PR until we see new pages and new sites start to have real PR values.

>As time goes on G is testing the new PR values against the old ones. At some point they switch over some of the newly spidered data for the old data and drop out the May PR values. SERPS are all over the place. PRs are confusing.

I haven't seen much PR fluctuation. Certainly nothing to explain the changes in SERPs, either in magnitude or in the correlation between what's rising and falling.

rfgdxm1

5:27 pm on Jul 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We have no way of knowing that Google, while updating continuously, isn't still doing PR as a batch once a month. Remember, PR is only one part of the algo. Google could also estimate PR for new pages. For example, if it finds a new page from a link on a PR6 page with few links, it could just assume the page is a PR5 until the real PR calculations are done.

doc_z

5:40 pm on Jul 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So far, I haven't seen any hint that Google is already calculating PR on the fly.

However, I wouldn't see a problem if they changed the frequency. There is no difference in principle between a monthly and a daily PR calculation (some pages are added, others are removed). In both cases they will probably start with the old PR values (or a real PR of one for new pages). Of course, the final (stable) values are independent of the initial guess, but a good guess accelerates the calculation. In the case of a daily PR calculation, there is only time for a few iterations. Therefore, PR for new large sites (with a lot of levels) might keep changing over the following days (towards the final value). However, even this would be an improvement over a monthly update. Also, there are calculation techniques which accelerate the PR propagation.
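A quick toy illustration of that point: the converged values don't depend on the initial guess, but seeding the iteration with the previous update's values (a warm start) takes far fewer passes than starting from scratch. Everything here is a sketch, not Google's implementation:

```python
# Toy demonstration: same fixed point from any start, but a warm start
# (yesterday's converged values) settles in far fewer iterations.

def pagerank(links, start, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(p) = (1 - d) + d * sum(PR(q)/outlinks(q)); return the
    converged values and the number of passes it took."""
    pr = dict(start)
    for i in range(1, max_iter + 1):
        new_pr = {}
        for page in links:
            incoming = sum(pr[src] / len(links[src])
                           for src in links if page in links[src])
            new_pr[page] = (1 - d) + d * incoming
        delta = max(abs(new_pr[p] - pr[p]) for p in links)
        pr = new_pr
        if delta < tol:
            return pr, i
    return pr, max_iter

links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}

cold, n_cold = pagerank(links, {p: 1.0 for p in links})  # from scratch
warm, n_warm = pagerank(links, cold)                     # from old values
print(n_cold, n_warm)  # the warm start settles almost immediately
```

The warm start ends at the same values but in a handful of passes, which is why reusing last update's PR makes a short daily window workable.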

(For those, who are interested in this topic: How long does one iteration of PR calculation take? [webmasterworld.com])

bolitto

6:03 pm on Jul 7, 2003 (gmt 0)

10+ Year Member



Speculative.

<snip>

[edited by: Marcia at 6:38 pm (utc) on July 7, 2003]

SlyOldDog

6:15 pm on Jul 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



bolitto - which theory wasn't speculation at one point?

Clark

6:25 pm on Jul 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just a theory. No idea whatsoever if there's any truth to it. That's why I put it out here, to see if experts thought it made sense or was totally out in left field. I was worried that my poor way of explaining things wouldn't get through, but was happy it did. Maybe that came off pretentious, but I meant it more in a self-mocking way, because I often express myself so poorly.

Speculative? I agree...

Google News? I guess the mods were ok with it here. That's where this kind of stuff has always gone.

[edited by: Marcia at 6:37 pm (utc) on July 7, 2003]

rfgdxm1

6:37 pm on Jul 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>Google News? I guess the mods were ok with it here. That's where this kind of stuff has always gone.

I should think that speculation on how Google is doing things is very much on topic here. I believe the idea behind the "Google News" name for the forum was to emphasize this forum was for things about Google itself, and not "how do I get my site into the top 5 for all my keywords on Google?"

JudgeJeffries

7:00 pm on Jul 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't believe G is calculating PR on the fly, because 3 weeks ago I had a site that was PR3 all round. I added one high-value backlink, and then chaos that has not recovered since. The index page changed to PR5 and all other 40+ pages went to a white bar, where they have remained. If PR were calculated on the fly, it surely would have been recalculated by now and back to normal, unless anyone has a better explanation for the weirdness on my site.

Clark

7:07 pm on Jul 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I should have added that I don't trust the toolbar PR at all. They may be applying one (or even two) PR(s) in the algo and another PR for the toolbar for all we know.

SlyOldDog

11:40 pm on Jul 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your theory is a bit of an oversimplification because it doesn't take into account the exponential nature of PageRank or the damping factor, but in principle what you are saying makes sense.

It's pretty much the same as the way banks expand the money supply: borrow $1 from guy A and lend 90 cents to guy B (retaining 10% as a reserve). Guy B then deposits the 90 cents at the same bank, and the bank lends out 81 cents to guy C. And so on...

The bank ends up with a total loan book of $1 / 10% = $10. And all from a dollar ;) - and they charge interest on all 10 of them too.

Anyway, my point is that the money supply is generated on the fly using this principle (among others), it is notoriously hard to predict, and politicians tweaking the knobs can cause all sorts of unexpected results.

In this example the money is analogous to PageRank. So I think you are right. If economists haven't managed to pin down the factors driving the economy in 200 years, I don't think dynamic PageRank will be much more predictable.
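The analogy can be made concrete: both the loan book and a chain of links are geometric series, and PageRank's damping factor plays the reserve ratio's role. A toy sketch (figures illustrative):

```python
# Money-multiplier analogy: re-depositing 90 cents on the dollar gives
# total deposits of 1 / reserve_ratio. PageRank's damping factor d plays
# the same role: each hop passes on a fraction d of the rank, so one unit
# of rank pushed along a chain of links totals at most 1 / (1 - d).

def multiplier_total(pass_on_fraction, rounds=500):
    """Sum the geometric series 1 + f + f^2 + ... for f < 1."""
    total, amount = 0.0, 1.0
    for _ in range(rounds):
        total += amount
        amount *= pass_on_fraction
    return total

print(round(multiplier_total(0.90), 2))  # bank: 1 / (1 - 0.90) = 10.0
print(round(multiplier_total(0.85), 2))  # PageRank with d = 0.85: ~6.67
```

Because both fractions are below 1, the series converges instead of exploding, which is why neither the money supply nor PR inflates without bound.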

Clark

4:17 am on Jul 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Nice analogy :)

coolasafanman

4:36 am on Jul 8, 2003 (gmt 0)

10+ Year Member



You might be onto something, who knows.

Perhaps take your batch idea and think local vs. global PR. You take all sites in category X and produce a local PR of 1-10 for each site within the category. You then rank the category as a unit against all other categories on the net and rate it 1-10. Then you combine the two.

So site Q has a local PR of 6 in category X.

Category X has a cumulative PR of 8 as compared to all other categories on the web.

Therefore site Q has a global PR of (6/10) * 8 = 4.8

As a starting point, you use DMOZ/Google Groups to produce each category. You then do some stuff with checking backlinks to figure out which categories sites not listed in the directory probably fall under.

You can do this on multiple levels, i.e. if DMOZ has widgets > blue widgets > round blue widgets, you get a local PR within each level.

Something like that would speed up the process, and working under the assumption that a given search term is closely tied to a given category, the results would still be highly relevant.
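The arithmetic above can be sketched in a couple of lines (all names and numbers are hypothetical, just restating the site Q example):

```python
# Hypothetical local-vs-global combination: scale a site's within-category
# rank (1-10) by its category's overall weight (1-10). Illustrative only.

def global_pr(local_pr, category_pr):
    """Scale a site's local rank by its category's weight."""
    return (local_pr / 10.0) * category_pr

# Site Q: local PR 6 inside category X, which itself rates 8 overall.
print(global_pr(6, 8))  # -> 4.8
```

For the multi-level DMOZ case you'd apply the same scaling once per level of the path.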

Clark

9:00 am on Jul 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Funny, while writing this whole thing I was also thinking about PR per theme. I couldn't think of a way to do it... your DMOZ idea is a good start.

Don't miss this thread [webmasterworld.com...] for how G is probably doing something related to local "PR".

doc_z

10:23 am on Jul 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Techniques to accelerate the PR calculation by calculating local PR values and combining them already exist (they are called block techniques; see the Blockrank - Extrapolation - Adaptive PageRank [webmasterworld.com] thread). However, the calculation is more complicated than in the example given above (e.g. because of cross linking).

Also, the idea of calculating 'PR per theme' is well known (Topic-Sensitive PageRank was discussed in the Personalizing Pagerank [webmasterworld.com] thread).

coolasafanman

1:07 pm on Jul 8, 2003 (gmt 0)

10+ Year Member



Yep, you're right. Nice threads.