Topic-Sensitive PageRank

To appear in WWW-2002

         

msgraph

3:44 pm on Feb 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To appear in Proceedings of the Eleventh International World Wide Web Conference, 2002.

Topic-Sensitive PageRank [dbpubs.stanford.edu]

msgraph

3:54 pm on Feb 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just in case you were wondering, Taher was an intern at Google during the summer of 2000. He currently works on data mining at Stanford with Jeff Ullman, who is a top tech advisor to Google.

Brett_Tabke

4:00 pm on Feb 12, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I'd leave an actual reply here - but I'm too busy reading the algorithmic explanation behind "Themes" ;)

Killer find MS.

Chris_R

4:06 pm on Feb 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Great paper - need to sit down and read it. I do note that they excluded Adult from the top level categories in the ODP.

msgraph

4:08 pm on Feb 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No problem. I love it when these researchers reveal juicy info.

I would like to thank Glen Jeh and Professor Jennifer Widom for several useful discussions.

I also suggest we go back to this....

[webmasterworld.com...]

click watcher

4:10 pm on Feb 12, 2002 (gmt 0)



fabulous

thanks for sharing this.

agerhart

4:13 pm on Feb 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Another great find....need to dedicate time to read it.

msgraph

4:17 pm on Feb 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I also like how they are taking on Kleinberg's HITS. Query dependency was one of HITS's main features and many saw it as a way to beat out PageRank. The only problem was that it ate up too many resources. I'm going to have to read further into this to see how they want to handle it.

Brett_Tabke

4:54 pm on Feb 12, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Notice they never mention the U of T's TOPIC (patent pending) ;)

amoore

6:10 pm on Feb 12, 2002 (gmt 0)

10+ Year Member



The method described in that paper uses the top levels of the Open Directory Project to define the topics in which different PageRanks are computed. I wonder what your reactions to that are in a few different areas.

With the complaints about the ODP, are you concerned to see the ODP used in this way? Granted, it was just a choice of data for the research, but it is highly likely that any major modern search engine that wanted to implement a method similar to this would use the ODP in this way. It is a natural choice, and doesn't seem to have much competition in this kind of decision. Does this concern you, or do you think that the ODP is suitable for this job? Do you see an alternative?

Although the ODP is built using thousands of volunteers, which reduces the effect of any small number of malicious or rogue editors, the choice of the top level topics is not actively maintained by a large group. In fact, it was chosen once and does not seem to change. Do you think this artificial sectioning of the web significantly affects the topic-sensitive pagerank calculation?

If this type of calculation begins to be used by a major search engine, what changes do you anticipate making in your page and site optimizations? For instance, do you anticipate being able to affect the rankings of other pages by changes you make on your sites listed in the ODP? Will you attempt to rank somewhat highly in each of the topic areas, or very highly in one or two?

I'm genuinely interested in the reactions to seeing the ODP used in this way, and I hope that the way that I've asked some of these questions doesn't make my personal opinions too obvious, or guide your reactions.

Slud

7:25 pm on Feb 12, 2002 (gmt 0)

10+ Year Member



Hopefully, this will result in more sites wanting to list themselves in the most relevant ODP category instead of just trying to get as close to the top level as they can.

bird

7:47 pm on Feb 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Do you think this artificial sectioning of the web significantly affects the topic-sensitive pagerank calculation?

Every possible sectioning of the web will necessarily be artificial. The only thing you can do to reduce the potentially negative effects of this artificiality is to choose a reference source with the largest possible selection of sites. The more sites you take into account, the smaller the distortion caused by any misplaced listings will become. In that respect, the ODP is the best choice, and not only for research purposes, independently of what you or I may think about its quality in detail.

what changes do you anticipate making in your page and site optimisations?

None. As Slud points out, the only important thing will be to get your site listed in the correct branch of the ODP, or to get links from sites that are listed there. The calculation of the topic specific PageRank values is completely independent of the actual contents of your site. This is really no different from the current situation. If you have e.g. a gaming site, you probably should not worry about (or even optimize for) ranking high under health related keywords.
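For anyone who wants to see the idea in code, here is a minimal sketch of a topic-biased PageRank. It is not taken from the paper. It assumes the common formulation where, with probability `alpha`, the random surfer jumps to a page from the topic's seed set (standing in for the sites listed under one ODP category) instead of following a link. The page names, the `topic_pagerank` function, and the handling of pages without outgoing links are all made up for illustration.

```python
def topic_pagerank(links, pages, topic_pages, alpha=0.25, iterations=50):
    """Iterate a PageRank whose random jumps land only on topic_pages.

    links       -- dict mapping a page to the list of pages it links to
    pages       -- list of all page names
    topic_pages -- hypothetical seed set, e.g. sites in one ODP category
    alpha       -- bias factor: probability of jumping to the seed set
    """
    topic = set(topic_pages)
    bias = {p: (1.0 / len(topic) if p in topic else 0.0) for p in pages}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page starts with its share of the biased jump.
        new_rank = {p: alpha * bias[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split the remaining rank evenly over the outgoing links.
                share = (1.0 - alpha) * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without links hands its rank to the seeds.
                for p in pages:
                    new_rank[p] += (1.0 - alpha) * rank[page] * bias[p]
        rank = new_rank
    return rank


# Toy web: two gaming pages, one health page, and a general hub.
pages = ["games1", "games2", "health1", "hub"]
links = {
    "games1": ["games2", "hub"],
    "games2": ["games1"],
    "health1": ["hub"],
    "hub": ["games1", "health1"],
}
games_rank = topic_pagerank(links, pages, ["games1", "games2"])
health_rank = topic_pagerank(links, pages, ["health1"])
```

Note that the text on each page never enters the calculation: only the link structure and the choice of seed pages do. The same page ends up with a different score for each topic, which is why the listing (or links from listed sites) matters more than the page contents here.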

greektomi

10:49 pm on Feb 12, 2002 (gmt 0)



Ok I downloaded the pdf and I'm going through it, but I'm not quite sure I can decipher this puppy i.e. what the heck is a stochastic transition matrix? A new Keanu Reeves movie :) ? Would somebody mind breaking this down into terms I can understand? How could this impact the future of search? What will its impact be on SEO for Google?

P.S. Couldn't find stochastic transition matrix in the webmaster world glossary.

Greektomi

grnidone

11:00 pm on Feb 12, 2002 (gmt 0)



Here's an interesting link from this site...All sortsa stuff..

[almaden.ibm.com...]

grnidone

12:59 am on Feb 13, 2002 (gmt 0)



Can someone please tell me what 'IR' stands for?

chiyo

1:18 am on Feb 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok I downloaded the pdf and I'm going through it, but I'm not quite sure I can decipher this puppy i.e. what the heck is a stochastic transition matrix?

Can someone please tell me what 'IR' stands for?

I also will wait for a smarter more savvy WMW guy to translate this all for me or else my whole Wednesday will disappear in a puff of smoke!

bird

1:50 am on Feb 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



stochastic transition matrix

A stochastic matrix is the transition matrix for a finite Markov chain, also called a Markov matrix. Elements of the matrix must be real numbers in the closed interval [0, 1].

[google.com...]
Take your pick (I just looked at the first hit) and try to ignore Mr. Markov.

In this context, it's simply the mathematical representation of the linking relations between all crawled pages (a links to b, yes=1 or no=0).

IR

From the introduction on the first page of that paper, I'd assume they mean "Importance Ranking". It's not really explicit though, maybe one of the referenced papers introduces the acronym more formally.
<added>They seem to use this for on-page factors, as opposed to PageRank, which is determined by off-page factors</added>

Everyman

1:56 am on Feb 13, 2002 (gmt 0)



There are lots of ways that theme-orientation can be used to drive a search. One way is the on-the-fly clustering of Vivisimo. Google says it is working on some sort of clustering algo.

Another way is to do an initial SERP based on your search terms, but then every time you click on a link to see a page, the keywords from that description of the link are added via a loose vector to your search terms. When you return to the SERP page after checking out the link, the listing has already rearranged itself to take advantage of this new information and rerank the results based on this new specificity.

In fact, the theme-oriented approach is so rich with ranking possibilities that it makes PageRank itself look like a straitjacket.

It's no secret that Google is deficient in theme-recognition, and I'd be surprised if Open Directory hierarchies are the only tool being considered.

msgraph

1:56 am on Feb 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



IR scores = Information Retrieval scores

Think of it as how some search engines rank your pages depending on the weight of your keywords in relation to the document.

If I'm wrong then please correct me

stochastic transition matrix

I don't know. I just skipped over that part. :)

bird

2:36 am on Feb 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



(a links to b, yes=1 or no=0)

Oops, just noticed that we're talking about real numbers. So the value of the field a,b in the matrix is 1 only if a links to b, and page a contains no other links. If page a links to two pages b and c, then both the fields a,b and a,c have a value of 1/2.

Imagine that matrix as a huge table, which contains the information about how much of the PageRank of each page will get transferred to each other page (modulo some damping factor).
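If the table is easier to picture in code, here is a minimal sketch that builds it for a three-page toy web. The page names and the `transition_matrix` helper are made up for illustration, and the damping factor is left out entirely.

```python
def transition_matrix(links, pages):
    """Return a table where row a, column b is 1/outdegree(a) if a links to b."""
    index = {page: i for i, page in enumerate(pages)}
    size = len(pages)
    matrix = [[0.0] * size for _ in range(size)]
    for page, targets in links.items():
        unique_targets = sorted(set(targets))  # count each linked page once
        for target in unique_targets:
            matrix[index[page]][index[target]] = 1.0 / len(unique_targets)
    return matrix


pages = ["a", "b", "c"]
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
M = transition_matrix(links, pages)

for name, row in zip(pages, M):
    print(name, row)
```

Page a links to two pages, so the fields a,b and a,c each hold 1/2, while b and c have a single link each and pass everything along it. Every row adds up to 1, which is what makes the matrix "stochastic".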

diddlydazz

3:04 am on Feb 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Imagine that matrix as a huge table, which contains the information about how much of the PageRank of each page will get transferred to each other page (modulo some damping factor).

Imagine going through all those results by hand to find a spam technique :)

Great Find !!!!!

Dazz

grnidone

4:14 am on Feb 13, 2002 (gmt 0)



OK...for those of you struggling, here is some help:

[mathforum.org...]

Hey, Calc class was a *long* time ago for me...

In particular:
The upside down U (unions and intersections) [mathforum.org...]
The upside down A, backward E and forward E [mathforum.org...]
Matrix: [mathforum.org...]
Matrix2: [mathforum.org...]
Matrix3: [mathforum.org...]

grnidone

4:45 am on Feb 13, 2002 (gmt 0)



OK...one more...The | in between two letters:


The conditional probability of an event B in relationship to an event A is the probability that event B occurs given that event A has already occurred. The notation for conditional probability is:
P(B|A)

[mathgoodies.com...]
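A small worked example may help here. This is only a sketch that assumes every outcome is equally likely, so the probability can be found by counting; the `conditional_probability` helper and the die example are made up for illustration and do not come from the paper.

```python
from fractions import Fraction


def conditional_probability(outcomes, event_a, event_b):
    """Probability of B given A, by counting equally likely outcomes."""
    in_a = [o for o in outcomes if event_a(o)]
    in_both = [o for o in in_a if event_b(o)]
    # Only the outcomes where A happened are considered at all.
    return Fraction(len(in_both), len(in_a))


die = range(1, 7)
# A: the roll is even. B: the roll is greater than 3.
p_b_given_a = conditional_probability(
    die, lambda roll: roll % 2 == 0, lambda roll: roll > 3
)
print(p_b_given_a)
```

Of the three even rolls (2, 4, 6), two are greater than 3, so the probability of B given A comes out as 2/3, compared with 1/2 for B on its own.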

lazerzubb

4:02 pm on Feb 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Smells like a winner of the Google contest prize.

alex_h

5:56 pm on Feb 13, 2002 (gmt 0)

10+ Year Member



More Pagerank...

Got a friend from school, working on her Masters at MIT in Computer Science with some side work on OR (operations research)...

She's always been better at math than me... Looks like I'm taking her out for dinner again with a print out of the PDF... :)

vitaplease

11:06 am on Feb 14, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Nicest article I have seen in months MSGRAPH.

It seems the taste of the day is "offline rank computation".

Interestingly, the bias factor "alpha" for the topic-sensitive PageRank can be very low (0.05 to 0.25) and still give satisfactory results. It looks like links from directories that just list companies in non-topic groupings will be worthless in the future.

Chapter 6, on ongoing work, states that they would like to refine to lower levels of the ODP for better topic sensitivity. All very fine for the English language. But how will they do this for other languages which have limited representation within the ODP?

Example: A scientific article in French citing articles in German and English will need clever trans-lingual topic clustering. One option would be to use the technology that a word translation company is using:
[euroglotonline.nl...]

Every meaning of a word (and every form of a verb's morphology) has a unique ID number that relates it to its counterparts in all other languages. This means you can translate "I search" from English to German to Dutch and back to English and still end up with "I search", because they all share the same ID number (you need the purchased version to check that easily).

(I have no affiliation with Euroglot)

msgraph

4:53 pm on Feb 15, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here is another one to appear in WWW-2002 by the same author.

Evaluating Strategies for Similarity Search on the Web [dbpubs.stanford.edu]

It's a new version of what was published back in Feb. 2001.

Similarity Search on the Web: Evaluation and Scalability Considerations [dbpubs.stanford.edu]