

Google News Archive Forum

Google's 2 rankings & you
New patent means new way of ranking
claus - msg:48605 - 12:15 am on Jul 8, 2003 (gmt 0)

Continued from earlier thread: [webmasterworld.com...]

The new Google patent

In this thread of a similar name [webmasterworld.com] zafile pointed me towards a news.com article on a new Google patent. The link in the news.com article turned out to be wrong, but i found the patent anyway. Specifically, it is this one:

6,526,440 Ranking search results by reranking the results based on local inter-connectivity

You can find the details at the US Patent & Trademark Office, Patent Full-Text and Image Database [patft.uspto.gov] - search for 6,526,440. Text from this patent - published to the public domain by USPTO - is also quoted in this post, where necessary to make a point. There are no copyrights on patent texts, but i have tried to keep quotes to a minimum anyway.

As the title suggests, this is a patent that deals with ranking and re-ranking results.

The patent was granted on February 25, 2003, and filed January 30, 2001 - so Google researchers have known about it for at least two years already. Still, a patent grant means that the source description is published. This is the reason for (as well as the "Google News" of) this post.

I have spent a few hours studying it, and it clearly has implications for users of this forum. I'll get to the nitty-gritty of it, but let me point out the major points first.

It's not an easy read. And there are 7 unknowns as well as some scope for flexibility and judgement (either by trial-and-error or by manual or automated processes). It's really interesting though.


What is it?

It's a patent. Nothing more and nothing less. A description of some procedure for doing something. This does not mean that it will ever be put to use, as lots of patents are granted and never used. Patents don't come with a release date, but some elements of the confusion we are seeing now could be explained by this.

Chances are, however, that this one will be put to use. Having spent a few hours on it, i must say that it makes some sense. It is intended to provide better and more relevant results for users of the Google SE, and at the same time (i quote the patent text here) :

... to prevent any single author of web content from having too much of an impact on the ranking value.

Sounds serious, especially for the SEO community. And it probably is, too. But don't panic. Notice that it says "too much of an" and not "any". It's still a ranking derived from links, not a random Google rank.


What does it do?

We know about the Page Rank algorithm. This is the tool that Google uses to make sure that the pages it has indexed are displayed to the user with the most important pages first. Without being too specific, it simply means that, for each and every page, Google calculates some value that ranks this page relative to the other pages in the index.

This is something else. Rephrase: This is the same thing plus something else. It is, essentially, a new way to order the top results for any query.


The ultra-brief three-step version:

What the new patent implies is a ranking, then a reranking, then a weighting, and then a display. It goes something like this:

1) The usual pagerank algo (or another suitable method) finds the top ranking (eg.) 1000 pages. The term for this is: the OldScore.

2) Each page in this set then goes through a new ranking procedure, resulting in the LocalScore for this page.

3) Finally, for each page, the LocalScore and the OldScore are normalized, assigned a weight, and then multiplied in order to yield the NewScore for that page.

In this process there will actually be "two ranks", or rather, there will be three: The initially ranked documents (OldScore ~ Page Rank), and the reranked documents (LocalScore ~ Local Rank). The serps will still show only one set of documents, but this set will be ranked according to the "NewScore ~ New Rank" which is (sort of) a weighted average of PR and LR.


Confused?

Don't be confused by the fancy words. It's more straightforward than it seems. In other words, this is what happens:

a) you search for a keyword or phrase - as usual
b) pagerank finds the top 1000 results (or so) - as usual
c) localrank calculates a new rank for each page - this, and the rest is new

d) each page now has two ranks (PR and LR)

e) the two sets of ranks are multiplied using some weights.
f) the multiplication gives a third rank.

g) each page now has one rank; the NewRank (sort of equal to PR times LR)

h) pages are sorted according to the NewRank
i) and finally displayed with the best "NewRanking" ones on top.

- better?
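
Just to make the flow concrete, here's a toy sketch of steps b) to i) in Python. The page names and scores are invented, and the plain "weight and multiply" combination is only a stand-in - the exact formula from the patent is quoted further down the thread:

```python
# b) pagerank finds the top results (toy numbers, three pages instead of 1000)
old_scores = {"page-a": 6.0, "page-b": 5.0, "page-c": 3.0}
# c) localrank calculates a new rank for each page (also toy numbers)
local_scores = {"page-a": 46.0, "page-b": 108.0, "page-c": 10.0}

# e) + f) normalize both ranks, apply some weights, and multiply
w_pr, w_lr = 1.0, 1.0
max_os = max(old_scores.values())
max_ls = max(local_scores.values())
new_rank = {
    page: (w_pr * old_scores[page] / max_os) * (w_lr * local_scores[page] / max_ls)
    for page in old_scores
}

# h) + i) sort by the NewRank and display the best ones on top
for page in sorted(new_rank, key=new_rank.get, reverse=True):
    print(page, round(new_rank[page], 3))
```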


What does it mean to me then?

Well, if you are one of the few who knows all about chea.. hrm... optimizing for Google, then it means that the world just got a bit tougher. And then again, perhaps not, there are still tricks in the bag for the very, say, experienced. Nuff said, just a feeling, i will not elaborate on that.

It will become harder, it seems. If for no other reason, then because you now have to pass not one, but two independent ranking filters. Instead of optimizing for PR you will now have to optimize for both PR and LR.

Let's assume, as a very simple example only, that 0, 1 and 2 are the only possible values for both PR and LR: If you get a PR of 2 and an LR of 0, then the NewRank will be 0. If you get a PR of 0 you will not even make it into the top set of 1000 that ever gets an LR calculated. On the other hand, if you get a PR and an LR of 1, then you're better off than the guy having a top PR but no LR.


Got it - what's that LR thing then?

It's a device constructed to yield better results and (repeat quote):

... prevent any single author of web content from having too much of an impact on the ranking value.

I have been looking at the patent for a while and this intention could very well be enforced by it. That is, if "authorship" is equal to "domain ownership" or "some unspecified network of affiliated authors".

Here goes:

The LocalScore, or Local Rank, is both a filter and a ranking mechanism. It only considers pages among the 1000 or so selected by the PR.

a) The first step in calculating Local Rank for a page is to locate all pages that have outbound links to this page. All pages among the top 1000 that is.

b) Next, all pages that are from the same host as this page, or from "similar or affiliated hosts", get thrown away. Yes. Comparing any two such documents within the set, the one having the smaller PR will always be thrown away, until there is only one document left from the (quote) "same host or similar" as the document that is currently being ranked.

Here, "same host" refers to three octets of the IP. That means the first three quarters of it. In other words, these IPs are the same host:

111.111.111.0
111.111.111.255

"Similar or affiliated hosts" refers to mirrors or other kinds of pages that (quote) "contain the same or nearly the same documents". This could be (quote) "determined through a manual search or by an automated web search that compares the contents at different hosts". Here's another patent number for the curious: 5,913,208 (June 15, 1999)

That is: Your on-site link structure means zero to LR. Linking to and from others on the same IP means zero. Near-duplicate pages mean zero. Only one page from your "neighborhood", the single most relevant page, will be taken into account.
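
A tiny sketch of the IP part of that filter in Python - just the three-octet comparison, nothing else; the "similar or affiliated hosts" part can't be reduced to a one-liner:

```python
def same_host(ip_a: str, ip_b: str) -> bool:
    """True if two IP addresses share their first three octets."""
    return ip_a.split(".")[:3] == ip_b.split(".")[:3]

print(same_host("111.111.111.0", "111.111.111.255"))  # True  - same host
print(same_host("111.111.111.0", "111.111.112.0"))    # False - different host
```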

c) Now, the exact same procedure is repeated for each "host" in the set, until each "host/neighborhood" has only one page left in the set.

d) After this (tough) filtering comes another. Each of the remaining pages has a PR value, and the pages are sorted according to it. The top k pages pass, the rest get thrown away. Here "k" is (quote) "a predetermined number (e.g., 20)."

So, although you positively know that you have 1,234 inbound links, only the top "k" of those that are not from "your neighborhood" - and not from the same neighborhood as each other - will count.

e) The remaining pages are called the "BackSet". Only at this stage can the LR be calculated. It's pretty straightforward, but then again, the filtering is tough (quote, although not verbatim, but deviations keep current context):

LocalRank = SUM(i=1..k) PR(BackSet(i))^m

Again, the m is one of those annoying unknowns (quote): "the appropriate value at which m should be set varies based on the nature of the OldScore values" (OldScore being PR). It is stated, however, that (quote) "Typical values for m are, for example, one through three".

That's it. Really, it is. There's nothing more to the Local Rank than this.
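
To tie steps a) to e) together, here's a small sketch in Python. The neighborhood test is reduced to the three-octet host check for simplicity (the real filter would also catch mirrors and affiliated hosts), k and m are just set to the example values from the patent, and all page data is invented:

```python
def local_score(target, pages, links, k=20, m=2):
    """Sketch of steps a) to e) for one page in the initial set.

    pages: {name: {"pr": float, "host": "first three octets of its IP"}}
    links: {name: set of page names that this page links to}
    """
    # a) pages in the initial set that link to the target page
    backlinkers = [p for p in pages if p != target and target in links.get(p, set())]

    # b) + c) throw away the target's own neighborhood, and keep only the
    #         highest-PR page per remaining "neighborhood" (here: per host)
    best_per_host = {}
    for p in backlinkers:
        host = pages[p]["host"]
        if host == pages[target]["host"]:
            continue
        if host not in best_per_host or pages[p]["pr"] > pages[best_per_host[host]]["pr"]:
            best_per_host[host] = p

    # d) sort the survivors by PR and keep only the top k (the "BackSet")
    backset = sorted(best_per_host.values(), key=lambda p: pages[p]["pr"], reverse=True)[:k]

    # e) LocalScore = SUM(i=1..k) PR(BackSet(i))^m
    return sum(pages[p]["pr"] ** m for p in backset)
```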


What about the New Rank then?

This is getting a very long post you know... Well, luckily it's simple, the formula is here, it's as public as the rest, you can't have a patent that's also a secret (quote):

NewScore(x) = (a+LocalScore(x)/MaxLS)(b+OldScore(x)/MaxOS)

x being your page
a being some weight *
b being some weight *
MaxLS being the maximum of the LocalScore values, or some threshold value if this is too small
MaxOS being maximum PR for the original set (the PR set)

* Isn't this just beautiful (quote): "The a and b values are constants, and, may be, for example, each equal to one"
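
Read as code, the whole formula is a couple of lines. A minimal Python sketch, with a and b defaulting to 1 as in the quote above, and made-up example numbers:

```python
def new_score(local_score, old_score, max_ls, max_os, a=1.0, b=1.0):
    """NewScore(x) = (a + LocalScore(x)/MaxLS) * (b + OldScore(x)/MaxOS)."""
    return (a + local_score / max_ls) * (b + old_score / max_os)

# made-up example: LocalScore 30 out of a MaxLS of 100, OldScore 5 out of a MaxOS of 10
print(new_score(local_score=30, old_score=5, max_ls=100, max_os=10))  # 1.95
```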


Wrap up

Inbound links are still important. Very much so. But not just any inbound links, rather: It is important to have inbound links spread across a variety of unrelated sources.

It could be that blogger sites on blogger.com, tripod sites, web rings, and the like will see less impact in serps from crosslinking. Mirror sites, and some other types of affiliate programs, eg. the SearchKing type, will probably also suffer.

My primary advice from the algebra right now is to seek incoming links from "quality unrelated sites" that are still within the same subject area. Unrelated means: sites that do not share the first three quarters of an IP and are not in other ways affiliated or from the "same neighborhood" (at the very least not affiliated in a structured manner). Quality means what it says.

Links from direct competitors will suddenly have great value, as odd as it sounds.

Note: Spam. Cloaking. Shadow domains. Need i say more? I'm sure they will fix that anti spam filter sometime, though.

Candidate for longest post ever... here goes, let's see if the mods like it.

/claus

[edited by: Brett_Tabke at 12:01 pm (utc) on July 8, 2003]

 

Chief - msg:48635 - 3:25 pm on Jul 8, 2003 (gmt 0)

Just wanted to say thanks for the post claus! Very interesting and helpful.

You're the man!

creative craig - msg:48636 - 3:46 pm on Jul 8, 2003 (gmt 0)

Claus, I think that I understood that... just don't ask me any questions ;)

Craig

manilla - msg:48637 - 4:12 pm on Jul 8, 2003 (gmt 0)

Thanks for that Merlin - please don't tell them ...

Actually better call it the J Club - K's too obvious :-)

ciml - msg:48638 - 6:09 pm on Jul 8, 2003 (gmt 0)

That's a super summary, claus.

You mentioned my view on host affiliation analysis; I don't see how it would be a great leap forward but it ought to be easy and cheap to implement so we should be prepared for it. It would be a good way of discounting links between researchers at the same university, but I'm not sure that would help quality much.

The other main aspect of the paper follows received wisdom; if a search engine can give links a contextual weighting without sacrificing efficiency, then it must be good. GoogleGuy's comment on theming [webmasterworld.com] didn't seem to agree, so maybe we're too eager to see this holy grail in practice?

Conard - msg:48639 - 6:22 pm on Jul 8, 2003 (gmt 0)

ciml,
I was surprised to see GoogleGuy's reply to Marcia's question on that thread.
I had mentioned a tip I got about this thread to her not even a week before her question.
My tip came from someone from Google as a heads-up to things that are in the works for change.
I personally won't make any major changes to my linking structure just because it will make one search engine happy or increase my ranking.
The web was a web of links to and from related and non-related sites way before Google became the only game in town and I believe it will still be that long after they are gone.

allanp73 - msg:48640 - 7:04 pm on Jul 8, 2003 (gmt 0)

I see a major problem with this. There are many people who use virtual hosting. Companies can host 1000s of sites owned by 1000s of different companies all on the same IP. Any linking between these sites would have zero benefit. Google would see them as all being from the same author. I can imagine the effect on these virtually hosted companies and the hosting company itself would be devastating. Probably the hosting companies would have to start buying more IPs, and then we are back to square one.

The idea sounds like a good way of controlling SEOs but the effect on the unaware general web business person might be devastating.

Tropical Island - msg:48641 - 7:31 pm on Jul 8, 2003 (gmt 0)

Very interesting thread.

This whole issue of linking becomes a real problem for some, including us.

Here is a situation where there are three web sites:

Site one - main regional or state site with broad content about the regional or state area with advertisers who either have small sub sites on site one or have direct links to their own sites.

Site two - a local information site about a popular tourist area within the area of site one.

Site three - local business site located in the area of site two and site one.

Site one has links to both site two and site three (who is an advertiser on site one).

Site two has links to both the larger general area site one and to local business site three.

Site three has links to both site one and two in the logical sense of providing additional information to its potential clients.

This is all natural and appropriate linking. The problem begins when all three sites are hosted by one of the USA's largest hosting companies and the webmaster contact is the same for all three. What could possibly be wrong with this? There must be thousands? of situations like this. In order to penalize abusers we may end up throwing the dishes out with the dish water. This is a normal and natural way for the web to operate.

Am I missing something?

SlyOldDog - msg:48642 - 7:57 pm on Jul 8, 2003 (gmt 0)

So this means I should only get external links to one page on my site?

Any pagerank through other parts of my site will most likely be discarded?

dragonlady7 - msg:48643 - 7:58 pm on Jul 8, 2003 (gmt 0)

>These are all natural and appropriate linking. The problem begins when all three sites are hosted by one of the USA's largest hosting companies and the webmaster contact is the same for all three. >

That sort of worries me, too. How is that supposed to be addressed? What if I own several related but distinct sites? Do I have to try to disguise that they're all mine? Can I not link from one to the other? How aggressive is that filtering? Will Google be able to tell the difference between some webmaster's personal site linking to the ones he made for his business ("here are examples of my work", for example) and some spam-farmer's company's link farm? Is there something in place to deal with that, or do we sink or swim?

Well. I suppose that's enough worrying. We'll just see, won't we?

Namaste - msg:48644 - 8:32 pm on Jul 8, 2003 (gmt 0)

great work claus.

will have to check out the formula to see if the sites in the previous cycle were indeed benefiting from a better spread of inward links, or if it was something dumb like they were using H1 tags better!

Clark - msg:48645 - 8:49 pm on Jul 8, 2003 (gmt 0)

A big question I have on this is best described by taking an example. Search for "anything".

1000 results. Let's say that each of these results is an inner page, not an index page. Would G search for backlinks from all pages within all 1000 domains for ANY backlink to ANY page within the other 999 DOMAINS, or is it looking for backlinks DIRECTLY FROM the inner page in the SERP TO the inner page of the SERPS?

The difference in results on how this is handled is huge.

claus - msg:48646 - 9:05 pm on Jul 8, 2003 (gmt 0)

Trying to catch up:

vitaplease:
>> big question was if such an additional reranking could be done on the fly

Yes i believe so. Not as in "easy" but as in "possible". Especially if you do some of the work in advance, eg. identify the "neighborhoods" during the normal update process and set some kind of "relatedness-ID" somewhere

>> try to avoid referencing Oldscore with Pagerank

Obviously you get some of the more subtle points as well as the core matter. I really do try to be brief, but this requires some words (Some readers might want to skip the next two paragraphs) :

The patent mentions the PR algorithm explicitly (yes, as an example). I have chosen to use the term because the PR algo is fully qualified to supply the OldScore as well as the initial set, and it's Google's long-term favorite. Plus, a general concern: In posting, i had a choice; should i go for "Exact" or "Understandable"? Meaning: I could write more accurately - in the extreme it would be something in-between quoting the patent text and pure algebra - but it would be harder for the reader to make any sense of. Plus, all the subtleties can be very confusing (here comes that "there is always more than one way..." thing again).

I stuck to the terms people know about and can relate to, ie: PageRank being equal to OldScore, as this could be the case if Google would choose to use the PR algo to find the initial set. I see no reason that they should choose another method than the Page Rank algo, as they are calculating this already, and it fully qualifies. But, of course, there are...

>> get motivated links from pages ranking in the top of the search engine results for that search query

- yes. That was the "direct competitor" thing, only better phrased in your post ;)

dragonlady7
>> this new ranking system will cancel a lot of blog noise and so on?

Imho, not entirely cancel, rather reduce. Blogs on blogger.com IPs will be treated differently than blogs on their own separate IPs

mipapage
>> that the more I learn, the more Brett's advice rings true

- i agree, haven't read all (far from it), but what i have read is second to none

>> of information etc. will do better than your typical commercial sites, no?

On first impression it seems so, but commercial sites can be far more than a basket. Customers want info too...

swerve - important point:
>> One of the most important aspects of this patent is that it is very important to have inbound links that are highly-related, provided they are from external sources

Thanks a lot :) :) It was late at night when i wrote the sentence you corrected there, and what i wrote was wrong. This is the essence of what i wrote (ie the wrong part):

It is important to have inbound links spread across a variety of unrelated sources.

The term "unrelated" is 100% wrong. No doubt about it. "External sources" is clearly better. Personally, i prefer to use the term "neighborhood". The pages must be related, but they must be so without being from the same neighborhood. Why? Because this is not entirely correct either:

>> it is correct only if "unrelated" is very narrowly defined as pages in the same IP range.

The filter in the LR procedure does not only filter on IPs. IP qualifies for "same host" but the filter does not stop there, it also filters "similar pages" and "affiliated hosts". So, (definition):

The total set of: "Pages on the same three octet IP (same host), plus affiliated hosts (hosts on other three octet IPs, serving pages with same or similar content)" is what i call "the neighborhood" of a page

This is as close a description as i can think of based on the patent. It's probably not perfect, but it serves a purpose: The "neighborhood" is not a fixed size, ie. it can not be determined by IP alone.

ciml
>> host affiliation analysis; I don't see how it would be a great leap forward

I almost started another marathon post here. I'll try another way:

It can be compared to a "peer review" in some odd sense. For each page ("peer") that is being ranked ("reviewed"), only the 20 highest ranking other pages ("peers") from 20 different neighborhoods ("cities") will count. This means that popularity within one neighborhood is not as important as popularity across several neighborhoods.

Plus, there's the thing about being well known: In terms of academia, if the subject was biology, one vote from a professor (famous ~high PR) will count more than, say, five votes from students (less known ~low PR). Oh... and then there's reality: This example is from the ideal world. The students might very well have done something that makes them more well known than the professor (infamous ~ real high PR). But that's something else.

>> if a search engine can give links a contextual weighting without sacrificing efficiency, then it must be good.

I agree. If the context that the SE assumes is the one that the searcher is interested in. I really don't want to comment on GoogleGuy's comments, i hope you don't mind. Everybody seems to read his words somewhat differently it seems, and i don't think i'm an exception to the rule.

/claus

littlecloud - msg:48647 - 9:10 pm on Jul 8, 2003 (gmt 0)

GG just posted that it is fine (not penalized) to have multiple domains on the same IP, and I have seen top 5 rankings for huge adult search terms with a ton of domains on the same block of IPs linking to each other. Would this prove Google is not using this new LR system?

swerve - msg:48648 - 9:11 pm on Jul 8, 2003 (gmt 0)

1000 results. Let's say that each of these results is an inner page, not an index page. Would G search for backlinks from all pages within all 1000 domains for ANY backlink to ANY page within the other 999 DOMAINS, or is it looking for backlinks DIRECTLY FROM the inner page in the SERP TO the inner page of the SERPS?

You lost me there at the end, but I think I know what you are asking. Google ranks pages, not sites or domains. Backlinks only ever count for specific pages. So it doesn't matter whether the 1000 sites are inner pages or index pages. The "LocalRank" would be based on links to your page from any of the other pages in the "initial set" (1000, in this example) for each search query. From this number of backlinks, same "neighborhood" backlinks are excluded.

[edited by: swerve at 9:21 pm (utc) on July 8, 2003]

swerve - msg:48649 - 9:20 pm on Jul 8, 2003 (gmt 0)

The filter in the LR procedure does not only filter on IPs.

Agreed. My "same IP range" was a simplification. For better clarity, I should have continued your "neighborhood" analogy.

mipapage - msg:48650 - 9:25 pm on Jul 8, 2003 (gmt 0)

Claus,

While my brain reforms a bit after your last post (thanks again), a question:

"similar pages"

Is this then akin to the current 'similar pages' available on Google?.. These do indeed appear to be a neighborhood of sorts, although there are a few 'outliers' in my neighborhood ;-]

claus - msg:48651 - 9:55 pm on Jul 8, 2003 (gmt 0)

right, catching up slow and steady...

first, i believe that i should have written this in a larger font in my first post:

Don't Panic!

This is not something that makes all your pages invisible because they are on the same host. Your pages are found and delivered with the PR algo, just as they have always been. If your site gets returned for a query without Local Rank, it will also get returned for a query with Local Rank.

Local Rank is only a fine-tuning of the ranking done by PR. It can move you some places up or down, but it will not remove your listing from the SERPS, as this is nothing but a re-ranking of the very same SERPS.

All search results will have two ranks: The local rank and the good old pagerank. Pagerank always wins. Local Rank is built on Pagerank. It is probably less than 50% of the total so-called NewRank.

It can make a very big difference for very spammy sites, as well as for sites in industries that are clustered in odd ways and have high competition, but not really a big difference for the kind of sites that are built by the "good book" (for lack of a better word). Example (a very simplistic one, just to make the point):

Tropical Island, let's say that your three sites perform very well because of inbound links. Let's say that you have, eg. 1,234 inbound links. This gives you a great position, say, a #1 for one of the sites.

Local Rank will now fine-tune that position. In this process, the two links from the same IP as the #1 site will be ignored. You lose two links out of 1,234. You will probably keep that #1 position.

<edit>due to averaging out, you lose only one link, see below</edit>

If you had, say, 1,000 sites on the same IP linking to the number one site, your position would drop. But not as much as you think. PageRank is always there.

You would get only, say, 234 links from Local Rank (for the sake of simplicity), but you would keep the 1,234 links in your Page Rank.

Then PR and LR would get averaged out. Assuming equal weights, you would end up with 734 inbound links and not 234, as you still keep your PR.

(1,234 + 234) / 2 = 734

This would be a major loss in a highly competitive market, but remember, all sites get treated in the same way (including your competitors), so it will even out. Quite possibly leaving the most spammy sites a bit lower on the SERPS, but still retaining quality sites at the top.

Clark
>> is it looking for backlinks DIRECTLY FROM the inner page in the SERP TO the inner page of the SERPS

Swerve has already been there i see. This does seem to be the case, as (quote from patent):

Re-ranking component 122 begins by identifying the documents in the initial set that have a hyperlink to document x.

Documents it is, not domains.

<added>No need to make a new post just for this:</added>

mipapage
I can come no closer to what it is exactly, than a direct quote:

Documents from the same host as document x tend to be similar to document x but often do not provide significant new information to the user.

...

On occasion, multiple different hosts may be similar enough to one another to be considered the same host (...). For example, one host may be a "mirror" site for a different primary host and thus contain the same documents as the primary host. Additionally, a host site may be affiliated with another site, and thus contain the same or nearly the same documents. Similar or affiliated hosts may be determined through a manual search or by an automated web search that compares the contents at different hosts.

- as you can see, it can be most anything, really. "The same, or nearly the same documents" is key i believe. Perhaps it's keyword density or such, but really, only Google knows.
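
Purely as a guess at what an automated comparison of "the same or nearly the same documents" could look like, here's one standard near-duplicate test (Jaccard overlap of word shingles) in Python. The patent doesn't say how it's done - this is only an illustration:

```python
def shingles(text: str, size: int = 4) -> set:
    """The set of consecutive word n-grams ("shingles") in a document."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def similarity(doc_a: str, doc_b: str) -> float:
    """Jaccard overlap of two shingle sets: 1.0 = identical, 0.0 = nothing shared."""
    a, b = shingles(doc_a), shingles(doc_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# two near-identical snippets score high; true mirrors would score close to 1.0
print(similarity("widgets for sale in new york and beyond",
                 "widgets for sale in new york and more"))   # about 0.67
```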

/claus

SlyOldDog - msg:48652 - 10:49 pm on Jul 8, 2003 (gmt 0)

Hang on a minute. Is this a joke? Pagerank is static and is independent of the SERPs position.

Local rank it seems is dependent on the query. Obviously the same 1000 pages don't appear for every query so it needs to be calculated every time a query is made.

This means your local rank (and hence new rank) is transient and also dependent on other sites in the SERPs. Is Google really going to use all that processing power every time there is a keyword search?

This patent wasn't filed on April 1st was it? That might explain why it was never used.

[edit]I see it wasn't. Perhaps it should have been [/edit]

claus - msg:48653 - 11:23 pm on Jul 8, 2003 (gmt 0)

SlyOldDog, AFAIK you're absolutely right regarding static PR and transient LR. Some shortcuts can probably be made to reduce the demand for processing power, but i don't know if the patent is already in use, will be so, or never will be so. It's not a joke, though, it's a patent granted by the USPTO - as to the date, that was a rhetorical question ;)

I'm about as puzzled as everybody else, all i know is what the patent can possibly do, the actual use or non-use, is only known by Google.

/claus

<edit>Changed dynamic to transient, it's more accurate</edit>

<added>There are millions of patents in the world, including good unemployed ones - i have seen no firm evidence that this one is either employed or unemployed, and i have no insight into the plans or processing capacity of Google, but patent authorities have spent two years on it, so i wouldn't exactly call it a joke. It might turn out to become just another unemployed patent, but for now it seems that nobody outside Google really knows.</added>

bokesch - msg:48654 - 11:56 pm on Jul 8, 2003 (gmt 0)

I bet it's safe to assume googleguy won't be commenting on this one.

GrinninGordon - msg:48655 - 1:41 am on Jul 9, 2003 (gmt 0)

bokesch

GoogleGuy was around for a moment, but the money is on him not spending too much time away from the "Buy, buy, buy" screen and button on his PC (rumours of Google going public). Also I understand the real surf is still real good.

universetoday - msg:48656 - 2:08 am on Jul 9, 2003 (gmt 0)

If I was Google, I'd probably compare elements of the HTML on various websites to sense if they were authored by the same person. That would avoid the whole, different domain workaround.

hurlimann - msg:48657 - 2:48 am on Jul 9, 2003 (gmt 0)

Great Post Claus and the patent gives us a view of something Google may use elements of if they don't already.

1) It does not look enforceable as it is not unique. Businesses use this type of model daily.
2) Patents are cheap to get.
3) Google sought protection in 2001: let's say it was 1999 when they thought of this.

I doubt it will ever make a major change but it certainly gives an insight into their thinking then.

claus - msg:48658 - 1:53 pm on Jul 9, 2003 (gmt 0)

For the first time since i started this thread, i've been adding some figures to the formulas, just for the kick of it - to see if what i thought would happen also did happen. It did.

Right now, i've just been playing around with the a's, b's, k's, m's, the PR, as well as the LocalScores and the NewScores in a very limited experiment: one given page ... i simply built the model in a standard spreadsheet, as i did not want to take the extra time to make it a Javascript. If the required data was there (*), i find it very improbable that the computing part would crash the (50,000?) Google Linux boxes. Seriously, this model is very good work. The trick is to bring the number of units requiring some kind of calculation down to a very small set, without losing vital (**) information (***). And this is exactly what this patent does.

Instead of considering all the billions of pages out there for each of the 1000 pages in the set, it only considers, say, 20 pages per page that needs re-ranking. It does not even compute pageranks for these 20 pages; it's just a look-up and some basic manipulation of the PR that has already been calculated. This is good stuff. D*** efficient.

Also effective, though. Or perhaps even is the word. You decide. This model can effectively make a page appear higher in the serps than another page with higher PR. And the other way round. It all depends on the related pages that are not in the immediate neighborhood of this page and their PR. One could say, that it pays off to have friends in high places. Real friends, that is, independent ones, not string puppets. That was the popular (optimist/hippie/whatever), version.

I think i'll just leave it there for now. It's all in the posts, waiting for Google to implement the patent, for us to discover it, or whatever seems to be the flavor of the day

Very nice work indeed. I'll get out and get a beer on that. Cheers all ;)

/claus


Notes:
(*) The labour-intensive part is identifying the neighborhoods, not the computation per se once this is done, as the computation is done on the basis of the PageRanks, which have already been batch-created. It's a large job running through all pages recursively at fixed intervals, but the Google setup seems to handle this (type of) task well.

(**) Vital does not necessarily mean that the pages you like the most are also the ones this model would like to keep.

(***) Is "Occam's razor" the right term for this?

cabbie - msg:48659 - 2:26 pm on Jul 9, 2003 (gmt 0)

Sensational work, Claus! For the clarity that you have given this subject you should be promoted to senior member for sure.
Thanks again.
Alan

Chris_D - msg:48660 - 4:29 pm on Jul 9, 2003 (gmt 0)

Claus,
Excellent post, excellent clarification & your follow-up is above & beyond the call of duty!

Really well Done.

: )

Chris_D
Sydney Australia

Xylem - msg:48661 - 9:37 pm on Jul 9, 2003 (gmt 0)

So basically if you are the Webmaster of many domains and they happen to link to each other and happen to be hosted by the same hosting provider; you will be less relevant (per the patent) in Google?

Say I have a site that sells oranges, and another that sells apples on the same host. Let's say that these two sites link to each other. Those links (being in the same neighborhood (class C)) are deemed (in theory) less significant because the LocalRank deems it so?

If so, I see this messing with people that use the same host but have genuine inner linking from others on the same host. Wouldn’t it?

Say you use the web's most popular hosting service. Say so does someone who links to you genuinely. Under this new patent that *real* link is less significant because of the neighborhood idea. That kind of stinks, doesn't it?

Sorry if any of this has been said already.

:)

Kirby - msg:48662 - 10:41 pm on Jul 9, 2003 (gmt 0)

Thanks for all the work, Claus.

...it does not get returned in the same initial set for your key search queries - it won't get counted at all in calculating the LocalRank for that query. So it is very important that links are highly related to your targeted search terms.

I have industry specific links from sites around the world relating to the sale of widgets, like Los Angeles widgets, Dallas widgets, London widgets, and also links from Widgets.com. If my key search term is New York widgets, will the 1000 that are counted mostly be for that very specific term?

claus - msg:48663 - 5:13 pm on Jul 10, 2003 (gmt 0)

back again...after this one, my posts on this subject will get much smaller in size... seriously, i hope this one does the trick... not sure though. It's a long one, and it has numbers and calculations - but everything should be easy. I think this one might show the point better than all the other posts... at least i hope so.


In a thread with a similar name [webmasterworld.com], Brett_Tabke wrote:

You build a good site and you need never worry about it again. Get to the point, you don't care what they do and it doesn't affect you one way or another

I believe this is absolutely true, and even more with this patent. The patent simply honors the way the web works. Build good sites and other good sites will want to link to you. The primary and most important SEO work is making your site better and better.

This formula is robust. It's generally hard to mess with. So is the inner structure of the web itself. Good backlinks are simply a statement that your page is so good that it matters to somebody that matters to others. It's that simple, just explained by algebra and not words.

quality and quantity

Consider this statement for a minute, it's not new, but it's as accurate as ever: "a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. [www7.scu.edu.au]"

Think of it (Local/NewRank/Score) as an elaboration on "some pages", "well cited" and "places around". Only 20 votes, but taken from 1,000 possible, and only the best ones count. If they have higher PR than your page they actually improve your position.

I can have 100 stupid, ugly, uninteresting, but very focused 10-page sites up and running with links to me and across in no time. But is that really a sign of quality?

If the 1,000 pages that link to me have only keyword density, but no incoming links from anybody important, well then, why should it matter which pages they link to? These are the kind of pages that will pull your position down. But don't worry about low PR sites linking to you either. It doesn't matter in the long run. A good orange shop can always get a good link somewhere, and there's only room for the 20 very best links in the re-ranking formula.

Doorway pages, cloaking, shadow domains... they're not really that important with this formula. Nobody important links to these pages anyway.


apples and oranges

Let's say you're #1 in oranges. Don't worry. You couldn't be so without backlinks. Let's say 1,234 backlinks. You may lose 1, but you still have lots of backlinks. This means PR. Local Rank builds on the PR of those who link to you. Some of these linking pages might even have a lot higher PR than your apple page. So they get included instead. That's good, not bad. Out of 1,000 pages in the fruit category i think you may find just a few pages with more PR than the apple page. The top 20 make it. No problem.

>> I see this messing with people that use the same host but have genuine inner linking from others on the same host.

- it's all about providing the most relevant serps, not about who likes which host. This genuine inner linking (i guess there's a lot of it) gives high PR. PR is king. Still. Your blogger sites or your pages on tripod will still make it to the serps on the relevant KWs. Don't worry. The very relevant blogger sites or tripod sites or apple/orange sites will have links from outside. Good links. If they hadn't they would not have made it to the serps in the first place.

- unfair?

There's a ranking. And then a re-ranking. After that, the ranking and the re-ranking get averaged out. There are no new concepts, except the "neighborhood" thing. Don't get confused by it and think it's unfair. Don't think you lose influence. You (still) have the single most important influence over your site and its rankings [webmasterworld.com]. Just build a good site. People like good sites. They link to them with *real* links. It's as simple as that, really.


New York widgets

If you do a search for New York widgets, the 1,000 results you got before this (and perhaps still, nobody knows) would be ranked by PR.

After this, you start with exactly the same sample of 1,000 sites ranked by PR. If your Dallas widgets or your LA widgets did show up in the serps for that term before, they are also part of the baseline 1,000 after.

- let's play

An example (with sort-of-random PR values for simplicity) - it does not have to be believable, but it helps explaining stuff:

Geeeeeeeeeoorge search for "New York widgets" (showing 1 to 1,000 of 1,234,567)

1. newmexicowidgets.com = pr6
2. newyorkwidgets.com = pr5
3. newyorkwidgets.com/dallaswidgets = pr4
4. lawidgets.com = pr3
5. better-newyorkwidgets.com = pr2
6. londonwidgets.com = pr1
...
1,000. notreallywidgetsbutfromnewyork.com = pr0

Now, let's assume something. It doesn't have to be true or even possible IRL but it makes it easier to play around if you assume something and don't change the assumptions underway. We'll do exactly that.

a) that you have the #2 at PR 5
b) that all top 10 sites link to you
c) that there's no way you can ever get past PR10
d) that we consider 3 sites for the LocalScore (LocalRank), not 20
e) that PR and LR get the same weights, ie: a=1 and b=1
f) that some odd factor with the odd name m is set to 2

Plus:
g) we know that in this sample the max PR is 6

--------------------
- Your domain (#2) now has 5 out of 10 possible; 5/10 = 50%

(the "5" and the "10" being PR5 and PR10)
--------------------

So, let's compute your LocalScore and the NewScore (Score = Rank), that is, the "NewRank" for #2 in the example-george-serps above:

We do that by identifying all the sites that link to #2. Rather, we know that already, because the PR algo has already found these pages. Anyway, it's 1, 2, 3, 4, 5, 6. Then, some odd "thingy" shows us that #2, #3, and #5 are from the same "neighborhood", so we dump those, as we do not want your own pages to influence your localrank. (After all, the links from those pages are already included in your PR - you can't escape the PR.)

So, basically, we re-sort the list by something like this query: grab the first three pages off this list from the "non-neighborhood" of the page we must calculate rank for. This leaves us with the set "k" of three pages:

1. newmexicowidgets.com = pr6
4. lawidgets.com = pr3
6. londonwidgets.com = pr1

- LocalRank, here we go:

PR for each page gets squared and then the results are added:

6 * 6 = 36
3 * 3 = 9
1 * 1 = 1
-------------
36 + 9 + 1 = 46 (= LocalScore ~ Local Rank)

- NewRank, here we go:

Sidestep: Excuse me for messing with terms. I should have stuck to the term "Score" from the start, but i didn't. The term "Score" is not the same as the term "Rank": the #1 position is your Rank, while your PR6 is your Score. Your rank gets decided by how high a score you have. A "ranking" is essentially a sorting based on a "score".

I hope that you don't get too confused, when i'm referring to "NewScore" and "NewRank" as the same thing, and to "LocalRank" and "LocalScore" as the same thing too. It can be confusing, but since i started this thread using those other terms, i think it's best to continue that way.

a) NewScore(x) = (a+LocalScore(x)/MaxLS)(b+OldScore(x)/MaxOS)

b) NewScore = (1 + 46 / ooops... what's the MaxLS?

Well, MaxLS is the local score you would get for the maximum value found within the set of k pages. That is (hold your breath now): if all the pages in the k set that point to you had the highest PR that any page in that k set actually has, then you would have the MaximumLocalScore.

We know that the highest ranking page linking to you has a PR of 6, so we try to figure out what would happen if all 3 pages linking to you had this high PR. We know the numbers, so we can just calculate:

(6*6)+(6*6)+(6*6) = 108 (MaxLS ~ Max Local Rank)


- here we go again: (Score ~ Rank)

a) NewScore(x) = (a+LocalScore(x)/MaxLS)(b+OldScore(x)/MaxOS)

b) NewScore = (1 + 46 / 108)(1 + 5 / 10)

(5 and 10 comes from a) and c) above)

c1) reducing (1): 46 / 108 = 0.43
c2) reducing (2): 5 / 10 = 0.50

c3) NewScore = (1 + 0.43)(1 + 0.50)

d) NewScore = 1.43 * 1.50

e) NewScore = 2.145 (~ New Rank)


Great. Just great. Before, you had a PR5. Now you have a "NewScore"-something at about 2. What's the big deal? Where did the rest go? You had 5, now you've lost 3 - can this be true?


Don't panic. Repeat: Don't panic.

For one thing, this happens to all pages, not just yours. If the (here comes the thing with the wording: ) "Score" gets evenly reduced for all pages, then the "Ranking" will stay the same.

Say that you "lose 3" at the #2 position (position = rank) - say the guy at #1 also loses 3, and the guy at #3... and so on. Your rank / position will be the same, only your score will be different.

The trick is that not every page will lose the same. Some will lose, some will gain. It all depends on the ran.. hrm.. "Page Rank Score" (or "OldScore" to be exact) of the pages that link to these pages.

Did you get it now? It doesn't depend on your pages. It really depends on the importance of the pages that link to you. Those that are not your own. Importance being the good old PageRank.

And.. this is the sophisticated part: The importance of the pages that link to you is decided by how many, and who, that links to those pages. It's all in the good old pagerank, so you don't lose any information, you just process (a (very) small subset of) it again and refine the (scores to get a similar (but more relevant)) ranking.


Oh, i almost forgot. But i think it's still important to you. You didn't lose three. Don't lose your temper now, but in fact, you never had 5 in the first place. Remember this one?

--------------------
- Your domain (#2) now has 5 out of 10 possible; 5/10 = 50%

(the "5" and the "10" being PR5 and PR10)
--------------------

The number "5" is just a number. It's nothing without the number 10. What you really had was a score of 50%.

This turned out to be good enough for this query (on the keyphrase "New York widgets") as the site actually landed at #2. For other queries you might need, say, only 20% or perhaps 70% to get to #2.

And: You could just as well have had a PR1000, but then the PR10 would have to be PR2000. It's the distance to the top that matters, not the number itself. Numbers are nothing but numbers - it's the relation between numbers that's the interesting thing.


Still not convinced? Well, what's the maximum NewScore then? You got around 2, but is the top still at 10? I wouldn't think so, but let's calculate to feel better about it:

Okay, we know these two:

1) MaxLS is 108
2) MaxOS is 10

- we can just put these figures into the equation, and then we will know what the maximum possible new score for your page would be:

a-2) NewScore(x) = (a+LocalScore(x)/MaxLS)(b+OldScore(x)/MaxOS)

b-2) NewScore = (1 + 108 / 108)(1 + 10 / 10)

c-2) NewScore = (1 + 1)(1 + 1)

d-2) NewScore = 2 * 2

e-2) NewScore = 4 (~ New Rank)

...

Now, what was that New Score of yours? It was 2 was it not? Look at this:

Before you had : PR5 out of PR10 : 5 / 10 = 50%

Now you have this : NS 2 out of NS 4 : 2 / 4 = 50%

Get it? Sure?
Think about it. It's more sophisticated than that. Do you realize that we actually removed your shadowdomain "better-newyorkwidgets.com" and removed your internal cloaked page "newyorkwidgets.com/dallaswidgets" and we still find you at 50%.

Then, what did these two pages contribute? Of course you can guess, but it can be calculated very exactly. After all, it's an investment, so what's the return on it?

ROI = 50% before - 50% now = 0% contribution (*)

Zero. Nada. Zilch. That's your ROI on these pages. The only thing that mattered was that other people liked your page enough to put a link to it on their page. Isn't that just beautiful?

(*) Note: actually, the NS is 2.145 now, which yields 54% so the ROI is strictly negative. The pages may have contributed to ROI in terms of original PR, but this will reduce the value.
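
For anyone who wants to check the arithmetic, here's the whole example above as a few lines of Python - same assumptions as in the play-through (k=3, m=2, a=b=1, MaxOS=10, and MaxLS taken as three times the highest squared PR, as above):

```python
# PR of the three pages left in the k set:
# newmexicowidgets (6), lawidgets (3), londonwidgets (1)
backset_pr = [6, 3, 1]
m, a, b = 2, 1, 1
old_score, max_os = 5, 10          # your page: PR5 out of a possible PR10

local_score = sum(pr ** m for pr in backset_pr)       # 36 + 9 + 1 = 46
max_ls = len(backset_pr) * max(backset_pr) ** m       # 3 * 36 = 108

new_score = (a + local_score / max_ls) * (b + old_score / max_os)
max_new_score = (a + 1) * (b + 1)                     # everything at its maximum = 4

print(local_score, max_ls)                  # 46 108
print(round(new_score, 3))                  # 2.139 (2.145 above, due to rounding 0.43)
print(round(new_score / max_new_score, 2))  # 0.53 - still roughly the 50% you started with
```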


A few thoughts

Now, if i was evaluating patents, it would have taken me some time to figure this one out. Perhaps not two years, but some time. Then, I'd be really convinced that this patent actually did something that was hard to do in a new and better way. I'd grant it for sure.

I really have thought about this: Am I doing G a favor here, elaborating on the patent, or is it the opposite? The more i find out about what it really does, the more i tend to think that this really doesn't matter. Competitors cannot copy it as it's... well, a patent, and creative SEOs really will find it hard to cheat.

As to the question of, "is it unemployed?" i really don't know yet. I have seen evidence in other threads that could be explained by the patent, and other evidence that could point towards that the patent is not used.

It ought to give less spam in the top serps. And it ought to sometimes rank pages higher than other pages having higher PR. There have been some signs of more spam lately, but lower PR pages have also climbed.

What it does not explain is "double indexes", though (pages appearing twice or in different versions across searches and caches), as the patent draws from one and the same index. It does not explain the increased spidering frequency, or the deepbot/freshbot thing either.

The dance? Well, it could be a sign, but i wouldn't call it evidence in the strict sense. They could be messing with lots of other stuff to achieve that; bot spidering patterns and the PR algo have a wide scope for variations already.

The sign-not-evidence: Well, i suppose that if you keep adding fresh pages to serps off-sync with PR-calculations (and/or identification of "neighborhoods"), then there will be a dance of some sort, as it will be different pages that make it into the k set. Then again, this is not necessarily bad, as the (linking patterns of the) web develops 24/7/365. To stabilize, you could either limit the influence of LR (which would let more spam through) or you could avoid adding all those fresh pages to the serps until they get their PR (and neighborhood) identified (which would mean lower update frequency).

I guess it's a trade-off, and if the LR is in fact being used now, then it will settle down at some point, when they find out how much sugar or cream they prefer to add to the coffee and when (the a's, b's, k's and m's).

Anyway, i think this patent will finally let us concentrate on building better pages, and i like that idea so i hope they will implement it (if they haven't done so already). This does not mean no SEO, i have to add, only better SEO. I did not choose link #3 at random. (Not #2 either, added.)


I can add no more new knowledge now. There were a few good points in the numerical examples, but they were really just an illustration of post #1. I don't mind repeating myself a few times if there's a point to it, but i feel that you all ought to know just as much as me about it by now :)

/claus

<edit>Typos. Added ROI. Changed "link #2" to "link #3"</edit>

[edited by: claus at 7:16 pm (utc) on July 10, 2003]

Kirby - msg:48664 - 6:07 pm on Jul 10, 2003 (gmt 0)

Thanks for the recap, Claus. Algebra was never my strong point.

If it works as you explain it, it fits well with Larry and Sergey's original goal, and Brett's advice holds true. It doesn't reinvent the web in Google's image, as many panicked webmasters think Google is trying to do; rather, it seeks to improve the integrity of the serps. Whether or not this can actually be accomplished remains to be seen.

TravelMan - msg:48665 - 6:42 pm on Jul 10, 2003 (gmt 0)

Bravo Claus!

Excellent posts, thanks very much.

Lots of food for thought.
