Forum Library, Charter, Moderators: goodroi

Google Finance, Govt, Policy and Business Issues Forum

This 42 message thread spans 2 pages.
New Google Patent - About reranking results
Another WebmasterWorld Exclusive!
msgraph




msg:1232430
 1:40 pm on Feb 25, 2003 (gmt 0)

Ranking search results by reranking the results based on local inter-connectivity [patft.uspto.gov]

Inventors: Krishna Bharat [searchwell.com]
Assignee: Google, Inc.

A re-ranking component in the search engine then refines the initially returned document rankings so that documents that are frequently cited in the initial set of relevant documents are preferred over documents that are less frequently cited within the initial set.

 

swerve




msg:1232431
 1:59 pm on Feb 25, 2003 (gmt 0)

I interpret it as something like the following:

Assumption: "Initial set of documents" = top search results

If a search returns 100 results, sites 11-100 will be re-ranked. Sites that contain inbound links from sites 1-10 will receive a higher ranking than those sites which do not, all else equal.

Markus




msg:1232432
 2:14 pm on Feb 25, 2003 (gmt 0)

Interesting find. It appears that it covers some aspects of Hilltop but, somehow, it seems that it is simply an improvement for the Kleinberg algo. Google uses neither of them.

swerve




msg:1232433
 2:44 pm on Feb 25, 2003 (gmt 0)

swerve writes:
Assumption: "Initial set of documents" = top search results

Upon further reading of the patent, the above assumption is not valid.

I believe this patent describes "Local PageRank", where 'local' means 'local to a particular search query'. Let's say that a search for "widgets" returns 1,000 results, based on today's algo. Now let's pretend that those 1,000 sites represent a "mini-index", so backlinks only count from sites within the 1,000 site mini-index. So a search-query-specific (local) PageRank is calculated, and the entire list is re-ranked.

The implication of this is that it would really matter who your backlinks come from. If the pages linking to you are returned for the same search terms as your site, it will improve your ranking. This provides an incentive for seeking quality, relevant links. Links from completely irrelevant sites would be weighted lower, and so they should.

Mike12345




msg:1232434
 3:43 pm on Feb 25, 2003 (gmt 0)

Slightly off topic, but on the page about Krishna Bharat, why does it say SM as opposed to TM next to the Google logo? Am I missing something or just plain thick?

mfishy




msg:1232435
 3:44 pm on Feb 25, 2003 (gmt 0)

This is very interesting. This is the theming of links that people have talked about for some time.

The only problem I have with this is that there are many relevant links from pages that do not appear in the exact same search query.

mfishy




msg:1232436
 3:46 pm on Feb 25, 2003 (gmt 0)

Also, this only encourages webmasters to create multiple sites for the purposes of linking. It did say, however, that it would not count links from the same servers so people will have to get different hosts.

born2drv




msg:1232437
 3:56 pm on Feb 25, 2003 (gmt 0)

>>>it would not count links from the same servers so people will have to get different hosts.

Don't you mean different IP ranges (classes or whatever)? Can a bot tell two sites are from different servers if the IPs are completely different? Like instead of 211.111.111.001 and 211.111.111.002, to have 211.111.111.001 and 363.103.230.502 or something, if you know what I mean?

ciml




msg:1232438
 3:59 pm on Feb 25, 2003 (gmt 0)

Very interesting, msgraph. This document, like the Hilltop and Kleinberg papers before it, looks for a cheaper alternative to contextually sensitive PageRank by selecting a subset of the Web on which to perform link analysis. I agree with Markus that Google appear not to do this yet.

This is a quick comparison of my understanding of these methods; I post it in the hope of having any errors corrected:

PageRank calculates the importance (but not context) of pages by looking at who links to whom, recursively. Google uses PageRank with on page factors and link text to order documents.
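For readers who want to see the recursion concretely, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Google's production code; the damping factor of 0.85, the iteration count, and the toy link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults, not known settings.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked from three pages, "d" from none.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
toy_rank = pagerank(toy_links)
```

In this toy graph "c" ends up with the highest score and "d", which nobody links to, with the lowest. Note that nothing in the calculation looks at the query, which is the "importance but not context" point made above.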

Hilltop gets the initial search results and selects those with at least some threshold of external links as 'experts'. It cleverly selects links from the expert pages that are qualified by on-page factors and uses those links (and maybe other information) to order the results.

Kleinberg gets the initial search results and adds pages linked to and from those results. These are used to find hubs and authorities, respectively, by starting with some initial set of values and iterating the flow of forward and back links to and from authorities and hubs, respectively (a principal eigenvector like PageRank).
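A minimal sketch of the hubs-and-authorities iteration, assuming the expanded set of pages (initial results plus pages linked to and from them) has already been assembled into a link graph. The graph, the iteration count, and the normalisation choice are illustrative assumptions.

```python
import math


def hits(links, iterations=50):
    """Hubs-and-authorities iteration over a small link graph.

    links maps each page to the list of pages it links to.
    Returns (hub_scores, authority_scores).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise both vectors so the values do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: three link pages pointing at two content pages.
toy_graph = {"h1": ["x", "y"], "h2": ["x", "y"], "h3": ["x"]}
hub_scores, auth_scores = hits(toy_graph)
```

Here "x" comes out as the strongest authority, and "h1" and "h2" as stronger hubs than "h3" because they point at both authorities.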

If I read this patent correctly (I agree with swerve's description), then it gets the initial search results and calculates the simple link popularity of each member of the set, from other members of the set. This 'local' link popularity can be used to re-order the initial results.
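That reading can be sketched in a few lines. This is only an illustration of the interpretation in this thread, not the patent's actual method: the function names, the toy URLs, and the decision to sort purely by local score are assumptions.

```python
def local_scores(results, links):
    """Count, for each result, how many OTHER results link to it.

    results is the initial result list; links maps a URL to the set of
    URLs it links to. Links from pages outside the result set are ignored.
    """
    result_set = set(results)
    scores = {url: 0 for url in results}
    for src in results:
        for target in links.get(src, ()):
            if target in result_set and target != src:
                scores[target] += 1
    return scores


def rerank(results, links):
    """Re-order results by local score; ties keep their initial order."""
    scores = local_scores(results, links)
    return sorted(results, key=lambda url: -scores[url])


# Hypothetical result set and link graph.
initial = ["a", "b", "c", "d"]
toy_links = {
    "a": {"c"},
    "b": {"c", "outside"},
    "d": {"b"},
    "outside": {"d"},  # not in the result set, so this link does not count
}
reranked = rerank(initial, toy_links)  # ["c", "b", "a", "d"]
```

The link from "outside" to "d" carries no weight here, which is exactly why it would matter who your backlinks come from under this scheme.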

mfishy, people who create multiple sites for the purposes of linking are more likely to have ensured separate IP addresses (or class C ranges) than experts who happen to use the same provider (which is likely for some academic and geographical topics). The thing that worries me about these methods is the requirement for non-affiliated sources; smaller initial document sets would be easier to dent IMO.
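The class C idea mentioned above can be sketched as follows. How affiliation would really be detected is not settled in this thread, so treat this purely as an illustration of the "same first three octets" heuristic; the function names and addresses are made up.

```python
import ipaddress


def same_class_c(ip_a, ip_b):
    """True if two IPv4 addresses share their first three octets (/24)."""
    net_a = ipaddress.ip_network(f"{ip_a}/24", strict=False)
    net_b = ipaddress.ip_network(f"{ip_b}/24", strict=False)
    return net_a == net_b


def unaffiliated_inlinks(target_ip, linking_ips):
    """Keep at most one linking IP per /24 block.

    Linking IPs in the target's own block are dropped entirely.
    """
    seen_blocks = set()
    kept = []
    for ip in linking_ips:
        block = ipaddress.ip_network(f"{ip}/24", strict=False)
        if same_class_c(ip, target_ip) or block in seen_blocks:
            continue
        seen_blocks.add(block)
        kept.append(ip)
    return kept


# Hypothetical addresses: one neighbour of the target, two from one block.
kept_ips = unaffiliated_inlinks(
    "211.111.111.1",
    ["211.111.111.2", "198.51.100.7", "198.51.100.9", "203.0.113.5"],
)
```

Under this heuristic only "198.51.100.7" and "203.0.113.5" would count, which also shows the worry raised above: genuine experts on a shared provider get filtered while someone who deliberately spreads sites across ranges does not.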

gopi




msg:1232439
 4:03 pm on Feb 25, 2003 (gmt 0)


So a search-query-specific (local) PageRank is calculated, and the entire list is re-ranked.

Normally Google calculates PR once a month during the index update... This patent requires calculating a mini, query-specific PR dynamically... I don't know how feasible that is computationally...

Even if implemented, this will greatly reduce the speed at which Google spits out results (please note that speed is one of Google's strong points, apart from search relevance).

Markus




msg:1232440
 4:19 pm on Feb 25, 2003 (gmt 0)

Google is definitely not going to implement any kind of query specific on-the-fly PR calculations. IMO, the patent is nothing we should bother about. Maybe Google can sell it to Ask/Teoma...

vitaplease




msg:1232441
 4:24 pm on Feb 25, 2003 (gmt 0)

Google is definitely not going to implement any kind of query specific on-the-fly PR calculations. IMO, the patent is nothing we should bother about..

I have not read the whole patent yet, but it does not have to be on the fly.

Google can take the top 20.000 (Zeitgeist) single and double word search queries and do it beforehand.

This stuff gets uninteresting once search queries become three word plus and more specific IMO.

msgraph




msg:1232442
 4:29 pm on Feb 25, 2003 (gmt 0)

>>Google is definitely not going to implement any kind of query specific on-the-fly PR calculations.

Why not? Researchers at both Stanford and Google have been looking for easy ways to apply this method for the past couple years. It is definitely something they show interest in.

msgraph




msg:1232443
 4:31 pm on Feb 25, 2003 (gmt 0)

>>>>Google can take the top 20.000 (Zeitgeist) single and double word search queries and do it beforehand.

Unlike HITS, [11] suggested that importance scores be precomputed offline for every possible text query, but the enormous number of possibilities makes this approach difficult to scale.

[citeseer.nj.nec.com...]

binki




msg:1232444
 4:35 pm on Feb 25, 2003 (gmt 0)

Slightly off topic, but on the page about Krishna Bharat, why does it say SM as opposed to TM next to the Google logo? Am I missing something or just plain thick?

IANAL, but I think SM just means service mark; it's like a trademark but used to distinguish a service (as opposed to a product) from its competitors.

Grumpus




msg:1232445
 4:42 pm on Feb 25, 2003 (gmt 0)

The patent calls several times for "a predetermined number of pages" to have the calculations run on. If there are 10,000 results, it'll see which of those 10,000 results link to the first 100 pages (or whatever they pick) and those first 100 pages will be shuffled on the fly. If you are result 101 for that term, you'll stay right there at 101.

At least that's how I see it based upon what's there.

G.
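Grumpus's reading differs from the whole-list interpretation in one respect: every result may "vote", but only the first N positions are reshuffled. A minimal sketch under that assumption follows; the value of N, the names, and the toy data are hypothetical.

```python
def rerank_top_n(results, links, top_n):
    """Reshuffle only the first top_n results by in-links from ALL results.

    results is the initial ordering; links maps a URL to the set of URLs
    it links to. Results below top_n keep their positions.
    """
    head, tail = results[:top_n], results[top_n:]
    head_set = set(head)
    votes = {url: 0 for url in head}
    for src in results:  # any result in the full list may vote
        for target in links.get(src, ()):
            if target in head_set and target != src:
                votes[target] += 1
    # sorted() is stable, so ties keep their initial relative order.
    return sorted(head, key=lambda url: -votes[url]) + tail


# Hypothetical five results with a cut-off of three.
initial = ["p1", "p2", "p3", "p4", "p5"]
toy_links = {"p4": {"p3"}, "p5": {"p3", "p2"}, "p1": {"p4"}}
reshuffled = rerank_top_n(initial, toy_links, top_n=3)
```

Here "p4" stays in fourth place even though "p1" links to it, matching the "if you are result 101, you'll stay right there at 101" remark.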

gopi




msg:1232446
 4:42 pm on Feb 25, 2003 (gmt 0)

As vita suggested, maybe they can mix and match...

They can take a list of, say, the top 100,000 kws from their logs plus another 10,000 hyper-competitive (which normally means spammy) kws (e.g. diet pills/credit cards/casinos) and precalculate query-specific PR for those kws offline.

Finding ultra-competitive kws is not a problem at all; they can just use their AdWords data... just grab the 10,000 highest-priced kws :) ... this also scales well.

And for the rest they can just use the traditional algo... A beneficial side effect of this strategy for Google is that it will also confuse the SEOs :) ... you don't know which algo Google uses for which kws :)
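The mix-and-match idea amounts to a lookup table with a fallback. A minimal sketch, in which the table contents, the query normalisation, and both function names are hypothetical:

```python
def make_search(precomputed, baseline_search):
    """Build a search function that prefers precomputed re-ranked results.

    precomputed maps a normalised query to an already re-ranked list;
    baseline_search is the traditional ranking, used for everything else.
    """
    def search(query):
        # Normalise case and whitespace so lookups match the offline table.
        key = " ".join(query.lower().split())
        if key in precomputed:
            return precomputed[key]
        return baseline_search(query)

    return search


# Hypothetical offline table and stand-in baseline ranking.
precomputed_results = {"credit cards": ["local-a", "local-b"]}
search = make_search(precomputed_results, lambda q: ["baseline:" + q])
```

A query for "Credit Cards" would hit the offline table, while "blue widgets" would fall through to the traditional ranking.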

Grumpus




msg:1232447
 4:48 pm on Feb 25, 2003 (gmt 0)

If you look at the bottom as they outline how the technology is delivered, that LR (Local Rank) IS calculated on the fly.

12. A system comprising:

a server connected to a network, the server receiving search queries from users via the network, the server including:

at least one processor;

a database of a corpus; and

a memory operatively coupled to the processor, the memory storing program instructions that when executed by the processor, cause the processor to: generate an initial list of relevant documents from the corpus based on a matching of terms in the search query to the corpus, rank the generated list of documents to obtain a relevance score value for each document in the generated list of documents, calculate a local score value for the documents in the generated list, the local score value quantifying an amount that the documents are referenced by other documents in the generated list of documents, and refine the relevance score values for the documents in the generated list based on the local score values.

G.

kapow




msg:1232448
 4:50 pm on Feb 25, 2003 (gmt 0)

> it would not count links from the same servers so people will have to get different hosts.

Can someone clear this up?
- Are different IPs (in different class C ranges) enough?
- Can Google, would Google know/care if sites are on the same server?
- Would Whois data play a role here? e.g. if sites have (or had) same ownership/administration.

ciml




msg:1232449
 4:58 pm on Feb 25, 2003 (gmt 0)

I completely agree with msgraph, the point of these schemes is mostly to make it cheaper to put some context into link analysis (that means making it quicker). PageRank runs on more than three billion URLs; these methods need to require relatively few initial results before they can become viable at runtime.

kapow, the answer to those questions would be Google's choice at the time of implementing such a system.

Grumpus




msg:1232450
 5:28 pm on Feb 25, 2003 (gmt 0)

I think the system is in effect now - and has been since Christmas or slightly earlier.

it would not count links from the same servers so people will have to get different hosts.

Clearing it up: This whole thing devalues doorways in general. In order for your doorway to pass LR (Local Rank) it has to rank on its own. If it ranks on its own, it must have PR and have other pages linking to it. The only way to get PR to the page quickly (without going on a linking campaign) is to link to it from your own sites. If you link to it from your sites in several different areas, you trigger the "affiliate host" flag (see sections 2 and 3 of the patent). IP is irrelevant. I'd imagine that flag is pretty broad: any hint of cross referencing within that term is going to trigger it. After all, if the page IS really good, then there will be plenty of other pages to give it its LR bonus.

G.

xerxes




msg:1232451
 6:15 pm on Feb 25, 2003 (gmt 0)

The patent application can be found at the US Patent Office site: search by Inventor Name (the application number is incorrect or not found), enter Bharat, Krishna, and you get all the patent information. Now I have to study this, though I won't pretend to understand it until you senior forum members explain it to us newer members.

Liane




msg:1232452
 6:38 pm on Feb 25, 2003 (gmt 0)

In my opinion, section 12 is the crux of the thing and sounds like a winning concept for "refine my search" to me! Go Google! :)

mbennie




msg:1232453
 6:41 pm on Feb 25, 2003 (gmt 0)

So in a nutshell, relevancy of the backlinks will now give those links more weight.

Is this correct?

Grumpus




msg:1232454
 6:49 pm on Feb 25, 2003 (gmt 0)

Yes. In a nutshell.

Step 1: Google ranks like it always has.
Step 2: Google calculates LR from within those results
Step 3: Google re-sorts and gives you your results.
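The three steps can be sketched as below. The blend used in the last step is purely an assumption made for illustration: both scores are scaled to 0..1 and mixed with a hypothetical `weight`, which is not taken from the patent.

```python
def refine(scored_results, links, weight=0.5):
    """Blend an initial relevance score with a local in-link score.

    scored_results is a list of (url, relevance) pairs from step 1;
    links maps a URL to the set of URLs it links to. weight is a
    hypothetical knob: 0 ignores the local score, 1 ignores relevance.
    """
    urls = [url for url, _ in scored_results]
    url_set = set(urls)
    # Step 2: count in-links coming from other members of the result set.
    local = {url: 0 for url in urls}
    for src in urls:
        for target in links.get(src, ()):
            if target in url_set and target != src:
                local[target] += 1
    # Scale both scores to 0..1 so neither dominates by magnitude alone.
    max_local = max(local.values(), default=0) or 1
    max_rel = max((rel for _, rel in scored_results), default=0) or 1
    combined = {
        url: (1 - weight) * (rel / max_rel) + weight * (local[url] / max_local)
        for url, rel in scored_results
    }
    # Step 3: re-sort by the blended score.
    return sorted(urls, key=lambda url: -combined[url])


# Hypothetical scores: "c" ranks last initially but both others link to it.
scored = [("a", 1.0), ("b", 0.9), ("c", 0.8)]
toy_links = {"a": {"c"}, "b": {"c"}}
refined = refine(scored, toy_links)            # local score counts for half
unchanged = refine(scored, toy_links, weight=0.0)
```

With `weight=0.5` the page "c" moves to the top; with `weight=0.0` the original order is returned untouched.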

Based upon what I can see (there IS a formula missing in that patent page right now, so you can't tell for sure), the LR calculation carries as much weight in ranking (at least in the top X results) as all the other factors combined, but you still need all those other factors to get into the results in the first place.

G.

xerxes




msg:1232455
 7:07 pm on Feb 25, 2003 (gmt 0)

Studying paragraph 0036 of the patent application, it looks to me (I am the village idiot) as though I need an average of 13 unique IP visitors a day to the site as well as a daily average of having the site accessed 64.5 times. We are covered on the 13 unique IPs daily, not sure about the 64.5 daily accesses to main page or other page of site. Paragraph 0020 centers on usage statistics. Are these coming from Alexa, do you think? Or can Google actually enter our raw logs? (I told you I am the village idiot!)

Markus




msg:1232456
 7:21 pm on Feb 25, 2003 (gmt 0)

Once again, don't bother so much about this patent. It appears to deal with a specific problem of the Kleinberg algo. The Kleinberg algo (used by, e.g., Teoma) is based on small subsets of the web. There, you have the problem that mirrors or pages on the same host which contain the same keywords can extremely inflate the rank of certain pages. Bharat has published a lot on the Kleinberg algo and possible modifications in the past (topic drift etc.). Most of his work is in no way related to PageRank. Indeed, the problem which the patent solves hardly exists when you base search results on PageRank.

IMO, the patent is in no way related to Google's ranking techniques. Bharat is the inventor. He has been working on a lot of different things. Google is simply assignee because Bharat works for Google and in his contract with Google there probably is a paragraph which states that all his inventions belong to Google. (I don't know how all this works in the US.) Google being assignee doesn't mean anything. Let's not create any new myths...

BTW:
Filed: January 30, 2001

Isn't this pretty old stuff?

Mohamed_E




msg:1232457
 7:33 pm on Feb 25, 2003 (gmt 0)

> Filed: January 30, 2001

Small detail, Markus :)

More seriously, thanks for pointing it out.

Namaste




msg:1232458
 7:41 pm on Feb 25, 2003 (gmt 0)

This technique appears to have been in use since Nov. 02.

Grumpus




msg:1232459
 7:46 pm on Feb 25, 2003 (gmt 0)

Yup. It often takes a while for a patent to go through the application process. I started noticing this type of ranking (or something close to it; I was thinking it was link text, not the whole page that links to it) around Christmas time. November is likely when it started and no one really noticed it.

G.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved