Blockrank - Extrapolation - Adaptive PageRank - (deprecated) SEM Research Topics forum at WebmasterWorld - WebmasterWorld

Forum Moderators: open

Message Too Old, No Replies

Blockrank - Extrapolation - Adaptive PageRank

World Wide Web Conference 2003

palace4all

5:28 am on May 28, 2003 (gmt 0)

10+ Year Member

Some interesting stuff about the WWW conference 2003.

The 12th World Wide Web conference has ended. As reported earlier, a major theme was Google becoming up to 5 times faster and how a team from Stanford University were to achieve it. To help in that feat there will be three new main ways or groups which wil help with regards to the way PageRank is to be newly calculated.

They are...

Blockrank:
On average, internal links take up to 80% of all links per page. Currently each of these internal links is calculated individually, the new method will group the internal links together and calculate them just the once, thereby gaining an increase of approximately 3 times the current calculation speed.

Extrapolation:
This deals with some pretty heavy mathematics that I’m not even going to begin getinto. Your best bet, if you are not mathematically challenged like myself lol, is to click on the link Extrapolation. Very roughly, it is said to act in a similar way as Alexa by factoring in number of visits / traffic.

Adaptive PageRank:
The final main group is Adaptive Pagerank. Quote: "PageRank of pages that have converged are not recomputed at each iteration after convergence." This suggests that pages that have a high PageRank and have been in the system for sometime may well not need to be recalculated in certain circumstances.

More informations about: http://www.stanford.edu/~sdkamvar/research.html

News sources:
http://asia.cnet.com/newstech/systems/0,39001153,39133747,00.htm

jeremy goodrich

5:32 am on May 28, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

These papers have been discussed here a few weeks ago (read them about a month ago, myself) the conferences are a great way to keep up on some stuff, but most of the goods related to SE's & PageRank tends to show up here first. :)

Quadratic extrapolation and in that paper, there are references to 'teleportation'.

Did you play dungeons & dragons as a kid, too? I know when I read a 'teleportation' into a paper on algorithms, and information retrieval, I keep thinking my wizard isn't high enough level to cast a 4th level spell yet. :) lol

Back on topic though - thanks for the link! Perhaps there is some new stuff that I haven't read yet.

[edited by: jeremy_goodrich at 5:32 am (utc) on May 28, 2003]

Dolemite

9:29 am on May 28, 2003 (gmt 0)

10+ Year Member

Extrapolation:
This deals with some pretty heavy mathematics that I’m not even going to begin getinto. Your best bet, if you are not mathematically challenged like myself lol, is to click on the link Extrapolation. Very roughly, it is said to act in a similar way as Alexa by factoring in number of visits / traffic.

Sorry, but I don't see that in the paper at all. The extrapolation method relies on no new data, just new calculations.

palace4all

10:40 am on May 28, 2003 (gmt 0)

10+ Year Member

Very roughly, it is said to act in a similar way as Alexa by factoring in number of visits / traffic.

thats means, that simply used a known factor to extrapolate the other sites... alexa do the same to calculate the traffic rankings...they computes all alexa hits to get the "known factor" to computes the rankings.

As a first step, Alexa computes the reach and number of page views for all sites on the Web on a daily basis. Source [pages.alexa.com]

The extrapolation paper explain how to computes the quatric extrapolation rank by using a known factor.

killroy

11:14 am on May 28, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Extrapolation is very simple, and the maths in that paper isn't much more complex either.

PageRank is iterative, startign at 1, each iteration brings it closer to the real value.

so if you start at 1, the first iteration yields 3 and the second 5, you can make a linear extrapolation to guess the real value to be at 7. That is the first part of the paper. Then it goes on to quadratic extrapolation, basically taking 3 consecutive deltas and extrapolating on a quadratic curve. If you've ever programmed graphical quadratic and cubic splines you'll find it very easy to follow.

I wondered why, at least the linear extrapolation, it wasn't in teh original pagerank computation, as it'S a common sense method for iterative algorithms.

SN.

ncsuk

12:05 pm on May 28, 2003 (gmt 0)

10+ Year Member

Can someone please explain all this in Enlgish so us numbnuts can understand..!

dmorison

12:06 pm on May 28, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

All this heavy maths could be the reason behind Google Compute.

[toolbar.google.com...]

Add a critical mass of Advanced Toolbar users into the mix (so that browsing data is fed back to Google) and you can start to see where Google is going with all this.

killroy

12:19 pm on May 28, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

To be honest, unless youre implementing your own IR systems, or are doing research in large scale textual information systems, these papers will be of little interest to you.

In case you're wodnering, they DO NOT include some sort of magic SEO pearls ;)

SN

ncsuk

12:24 pm on May 28, 2003 (gmt 0)

10+ Year Member

Still interesting though. I'm one of these wacky kids who must know everything.

You know the annoying 5 year old who says "whats that, whats this, what does that do, how does that work".

Well thats me so explain away :)

killroy

12:36 pm on May 28, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

ok, let me try to do a "ABC of PageRank".

Very roughly speaking, page rank of a page is the number of incoming links divided by the respective numbers o links on the page linking to your page, multiplied by the PR of the source page. In other words, links from high PR pages count more, and links from apges with lots of other links on them count less.

Now the crux of the matter is that to know the PR of a page you need to know the PR of all the pages linking in, and ditto for those. It's th eclassical chicken and egg problem.

So what google does is to assume (pretend) that all pages have PR1. Then it gos to each page, and recalculates it's PR using the asumed Pr of 1 from all te other pages. Now it has a sligtly better approximation of the true PR for this page.

This is repeated about 50 to 100 times (50 in the original paper, 100 in more recent papers) for which empirical evidence (read: trial and error) has shown to yield accurate enough results (never perfect mind you.)

Now running this formula on 3 billion pages 100 times takes it's time as you can imagine.

These papers mostly deal with how to use information to skip parts of this process. The extrapolation for examply uses the fact that each iteration has a better approximation (rather than equally bad and finally, suddenly correct) to skip a few steps and get to a good anseer quicker.

The block algorithm uses the fact that a website has many internal links and few external links. In theory you could pretend each site is one page, calculate PR for each "site" as a whole, then take each site in turn and distribute te site Pr to each page. In praxis it's somewhere halfway between the traditional one and what I jsut described.

If you want it more clarified, and others are interested, I might write this up neatly and the Mods could do a locked thread or something...

Hope this helps.

SN

doc_z

4:39 pm on May 28, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

killroy,

nice summary.

I just want to add some comments:
The solution of the iteration is indepent from the initial values. (You can easily show that for a damping factor d<1 an unique solution exists. For d=1 the situation changes because you will have a number of degenerated eigen vectors.) Also, in practice one would use the PR values of the last PR calculation as initial values to speed up the calculation.
One main advantage of the block algorithm is that you can easily introduce a parallized version.

Some comments on the standford papers (mentioned above):
1) While the calculation for d=1 corresponds to the determination of an eigen vector, it corresponds to solving a set of linear eqautions for d<1. However, they (Kamvar et. al.) always consider the calculation as determining eigen vectors. This is unnecessary complicated (and from a principle point of view not correct). They also get problems with dangling pages (dead ends) with wouldn't appear if they simply solve the equation system.
2) If you look at the improvement in CPU time for realistic scenarios (a realistic value of d and a realistic number of pages) you will find that in most of the cases they are small.
3) There are well-known (different) algorithms which are probably faster.

Clovis

5:02 pm on May 28, 2003 (gmt 0)

10+ Year Member

thanks killroy and doc...

im far from a math whiz above adding and subtracting, but that your explanation makes close enough sense.

my question is why make the process faster? other than the obvious reason (faster serps) which are already lightning fast, what is the big picture? more pages? larger index? making room for additional points in the algo?

in our world of "bigger, better, faster, more"... i find it hard to believe the "faster" is the only point of address here.

any other ideas? or am i alone here?

Dolemite

5:14 pm on May 28, 2003 (gmt 0)

10+ Year Member

Extrapolation:
This deals with some pretty heavy mathematics that I’m not even going to begin getinto. Your best bet, if you are not mathematically challenged like myself lol, is to click on the link Extrapolation. Very roughly, it is said to act in a similar way as Alexa by factoring in number of visits / traffic.

thats means, that simply used a known factor to extrapolate the other sites... alexa do the same to calculate the traffic rankings...they computes all alexa hits to get the "known factor" to computes the rankings.

OK, as long as you understand that they are not using any new data. From your initial post you made it sound like google was using traffic data..."by factoring in number of visits / traffic." This is not the case.

palace4all

5:53 pm on May 28, 2003 (gmt 0)

10+ Year Member

From your initial post you made it sound like google was using traffic data..."by factoring in number of visits / traffic." This is not the case.

you are right.. that´s what i wanted to say :) Google dont use any traffic/visit data to computes the known factor...but some thoughts.. they can use gg toolbar values...

killroy

6:05 pm on May 28, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

The point of making it faster is following. Currently the calculation, done over whatever hardware google uses for it (probably a large cluster of PCs) takes many days, over a week. That is part of the additional update process... spider, archive, calcualte pagerank, display new results.

Making it faster, from the obvious effect of cutting several days off the update cycle, and/or lettign them use less hardware, will mean by using the same tiem they could calculate several differently weighted PageRanks. The idea is to have topical pageranks, for example one that weights commercial sites heaveier, or informational sites and so on, so One could search different versions of the PR with a particular type of pages emphasized.

Ultimately, the ideal would be to have a pagerank for each user, emphasizing their favourite pages and pages benefitial to their general interests. But that, of course would require a speed up of a about 8 or 9 magnitudes (about 100 million times faster). This doesn'T seem feasable with the current system.

jeremy goodrich

6:19 pm on May 28, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

>>>speed up of a about 8 or 9 magnitudes (about 100 million times faster).

The BlockRank paper outlines the speed tests they did, which has a theoretical speed increase of a factor of 5 with the same dampening factor as the original Google papers (anatomy of a large scale hypertextual search engine etc).

So this new technique, BlockRank, can also be used in conjunction with the other adaptive method or the Quadratic Extrapolation method.

If you read through the papers closely, you'll noticed a few parallels between the math techniques that speed things up with adaptive control systems, which have been used in industrial manufacturing for years.

Given a set of unkowns, the system has to approximate a better value of the end result, even though it could figure out precisely what the values are, it's faster to simply approximate. By using fuzzy sets, they can arrive at the same or close enough to the same final values that makes the end results similar enough to not lessen the index quality.

killroy

7:13 pm on May 28, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Exatly. In fact as I pointed out, some of the ideas (linear extrapolation for example) are very simple and should've been in the original algo. when I mentioned 100 million times speed up, I was refering to what would be neccessary for a personalised PR for each searcher. Only because this is something jsut about every "pop"-news site out there mentions and which is not really feasible in teh currenty setup.

SN

Dolemite

11:32 pm on May 28, 2003 (gmt 0)

10+ Year Member

but some thoughts.. they can use gg toolbar values...

Indeed they could, a scary thought IMO. I imagine they at least collect this data even if they aren't acting on it.

The idea is to have topical pageranks, for example one that weights commercial sites heaveier, or informational sites and so on, so One could search different versions of the PR with a particular type of pages emphasized.

Another benefit is the potential for quicker pagerank updates (google dances) under the current system. Combine this with what many theorize to be the purpose behind current freshbot crawling and you could have much faster complete updates. I imagine we'll see this sort of continuous update system before topical/contextual pageranks, simply because it would be much easier to implement.

jeremy goodrich

11:46 pm on May 28, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

>>>I imagine we'll see this sort of continuous update system before topical/contextual pageranks, simply because it would be much easier to implement

Exactly - first, speed up the process. Then personalize it.

doc_z

12:09 am on May 29, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

my question is why make the process faster?

As already said, to run a continuous update (and to provide personalized PageRank respectively results).
See also this thread. [webmasterworld.com]

Clovis

1:40 pm on May 29, 2003 (gmt 0)

10+ Year Member

sorry to be so dense, i wasnt seeing the big picture. thanks for the insight, i can now see the benefits and expanded possibilities.