| This 74 message thread spans 3 pages: < < 74 ( 1 2  ) || |
|Google's 2 rankings & you|
New patent means new way of ranking
| 12:15 am on Jul 8, 2003 (gmt 0)|
Continued from earlier thread: [webmasterworld.com...]
The new Google patent
In this thread of a similar name [webmasterworld.com] zafile pointed me towards a news.com article on a new Google patent. The link in the news.com article turned out to be wrong, but i found the patent anyway. Specifically, it is this one:
|6,526,440 Ranking search results by reranking the results based on local inter-connectivity |
You can find the details at the US Patent & Trademark Office, Patent Full-Text and Image Database [patft.uspto.gov] - search for 6,526,440. Text from this patent - published to the public domain by USPTO - is also quoted in this post, where necessary to make a point. There are no copyrights on patent texts, but i have tried to keep quotes to a minimum anyway.
As the title suggests, this is a patent that deals with ranking and re-ranking results.
The patent was granted on February 25, 2003, and filed January 30, 2001 - so Google researchers have known about it for at least two years already. Still, a patent grant means that the source description is published. This is the reason for (as well as the "Google News" of) this post.
I have spent a few hours studying it, and it clearly has implications for users of this forum. I'll get to the nitty-gritty of it, but let me point out the major points first.
It's not an easy read. And there are 7 unknowns as well as some scope for flexibility and judgement (either by trial-and-error or by manual or automated processes). It's really interesting though.
What is it?
It's a patent. Nothing more and nothing less. A description of some procedure for doing something. This does not mean that it will ever be put to use, as lots of patents are granted and never used. Patents don't come with a release date, but some elements of the confusion we are seeing now could be explained by this.
Chances are, however, that this one will be put to use. Having spent a few hours on it, i must say that it makes some sense. It is intended to provide better and more relevant results for users of the Google SE, and at the same time (i quote the patent text here) :
|... to prevent any single author of web content from having too much of an impact on the ranking value. |
Sounds serious, especially for the SEO community. And it probably is, too. But don't panic. Notice that it says "too much of an" and not "any". It's still a ranking derived from links, not a random Google rank.
What does it do?
We know about the Page Rank algorithm. This is the tool that Google uses to make sure that the pages it has indexed are displayed to the user with the most important pages first. Without being too specific it simply means that, for each and every page Google calculates some value that ranks this page relative to the other pages in the index.
This is something else. Rephrase: This is the same thing plus something else. It is, essentially, a new way to order the top results for any query.
The ultra-brief three-step version:
What the new patent implies is a ranking, then a reranking, then a weighting, and then a display. It goes something like this:
1) The usual pagerank algo (or another suitable method) finds the top ranking (eg.) 1000 pages. The term for this is: the OldScore.
2) Each page in this set is then going through a new ranking procedure, resulting in the LocalScore for this page.
3) Finally, for each page, the LocalScore and the OldScore are normalized, assigned a weight, and then multiplied in order to yield the NewScore for that page.
In this process there will actually be "two ranks", or rather, there will be three: The initially ranked documents (OldScore ~ Page Rank), and the reranked documents (LocalScore ~ Local Rank). The serps will still show only one set of documents, but this set will be ranked according to the "NewScore ~ New Rank" which is (sort of) a weighted average of PR and LR.
Don't be confused by the fancy words. It's more straightforward than it seems. In other words, this is what happens:
a) you search for a keyword or phrase - as usual
b) pagerank finds the top 1000 results (or so) - as usual
c) localrank calculates a new rank for each page - this, and the rest is new
d) each page now has two ranks (PR and LR)
e) the two sets of ranks are multiplied using some weights.
f) the multiplication gives a third rank.
g) each page now has one rank; the NewRank (sort of equal to PR times LR)
h) pages are sorted according to the NewRank
i) and finally displayed with the best "NewRanking" ones on top.
What does it mean to me then?
Well, if you are one of the few who knows all about chea.. hrm... optimizing for Google, then it means that the world just got a bit tougher. And then again, perhaps not, there are still tricks in the bag for the very, say, experienced. Nuff said, just a feeling, i will not elaborate on that.
It will become harder, it seems. If for nothing else, then only because you now have to pass not only one, but two independent ranking filters. Instead of optimizing for PR you will now have to optimize for both PR and LR.
Let's assume, as a very simple example only, that values 0,1,2 are the only values for both PR and LR: If you get a PR of 2 and a LR of 0, then the NewRank will be 0. If you get a PR of 0 you will not even make it to the top set of 1000 that will ever get a LR calculated. On the other hand, if you get a PR and a LR of 1 then you're better off than the guy having a top PR but no LR.
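The toy example above can be checked with a few lines of Python. This is only a sketch of the simplified "PR times LR" picture, not the patent's actual formula; the page names and scores are made up for illustration.

```python
# Toy illustration of the simplified "NewRank = PR * LR" picture.
# Page names and scores are invented; values are limited to 0, 1, 2.
pages = {
    "top-pr-no-lr": {"pr": 2, "lr": 0},
    "modest-both": {"pr": 1, "lr": 1},
    "strong-both": {"pr": 2, "lr": 2},
}


def toy_new_rank(pr, lr):
    """Simplified product of the two ranks (no weights, no normalizing)."""
    return pr * lr


# Sort page names by the toy NewRank, best first.
ranked = sorted(
    pages,
    key=lambda name: toy_new_rank(pages[name]["pr"], pages[name]["lr"]),
    reverse=True,
)
print(ranked)
```

The page with PR 1 and LR 1 ends up above the page with PR 2 and LR 0, which is the point of the example.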
Got it - what's that LR thing then?
It's a device constructed to yield better results and (repeat quote):
|... prevent any single author of web content from having too much of an impact on the ranking value. |
I have been looking at the patent for a while and this intention could very well be enforced by it. That is, if "authorship" is equal to "domain ownership", or "some unspecified network of affiliated authors".
The LocalScore, or Local Rank, is both a filter and a ranking mechanism. It only considers pages among the 1000 or so selected by the PR.
a) The first step in calculating Local Rank for a page is to locate all pages that have outbound links to this page. All pages among the top 1000 that is.
b) Next, all pages that are from the same host as this page or from "Similar or affiliated hosts" get thrown away. Yes. By comparing any two documents within the set, the one having the smallest PR will always be thrown away, until there is only one document left from the (quote) "same host or similar" as the document that is currently being ranked.
Here, "same host" refers to three octets of the IP. That means the first three quarters of it. In other words, these IPs are the same host:
"Similar or affiliated hosts" refers to mirrors or other kinds of pages that (quote) "contain the same or nearly the same documents". This could be (quote) "determined through a manual search or by an automated web search that compares the contents at different hosts". Here's another patent number for the curious: 5,913,208 (June 15, 1999)
That is: Your on-site link structure means zero to LR. Linking to and from others on the same IP means zero. Near-duplicate pages mean zero. Only one page from your "neighborhood", the single most relevant page, will be taken into account.
c) Now, the exact same procedure is repeated for each "host" in the set, until each "host/neighborhood" has only one page left in the set.
d) After this (tough) filtering comes another. Each of the remaining pages has a PR value that they are sorted according to. The top k pages pass, the rest get thrown away. Here "k" is (quote) "a predetermined number (e.g., 20)."
So, although you positively know that you have 1,234 inbound links, only the top "k" of these, that are not from "your neighborhood" or even "part of the same neighborhood" will count.
e) The remaining pages are called the "BackSet". Only at this stage can the LR be calculated. It's pretty straightforward, but then again, the filtering is tough (quote, although not verbatim, but deviations keep current context):
LocalRank = SUM(i=1..k) PR(BackSet(i))^m
Again, the m is one of those annoying unknowns (quote): "the appropriate value at which m should be set varies based on the nature of the OldScore values" (OldScore being PR). It is stated, however, that (quote) "Typical values for m are, for example, one through three".
That's it. Really, it is. There's nothing more to the Local Rank than this.
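Steps a) to e) can be sketched roughly as follows. This is one reading of the description above, not Google's implementation. The `Page` class, `host_key`, the sample URLs and the IPs are all made up for illustration. The sketch keeps at most one linking page per "host" (first three octets of the IP) and does not model the "similar or affiliated hosts" detection at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    ip: str           # dotted-quad IPv4 address of the serving host
    old_score: float  # the initial rank (PR in this post's shorthand)


def host_key(ip: str) -> str:
    """First three octets of the IP, this post's notion of 'same host'."""
    return ".".join(ip.split(".")[:3])


def local_score(linking_pages, k=20, m=1):
    """Sum of old_score**m over the filtered 'BackSet' of linking pages.

    linking_pages: pages from the top set that link to the page being ranked.
    """
    # Steps b) and c): keep only the highest-scoring page per host.
    best_per_host = {}
    for page in linking_pages:
        key = host_key(page.ip)
        current = best_per_host.get(key)
        if current is None or page.old_score > current.old_score:
            best_per_host[key] = page
    # Step d): sort the survivors by old_score and keep the top k.
    back_set = sorted(
        best_per_host.values(), key=lambda p: p.old_score, reverse=True
    )[:k]
    # Step e): the LocalRank sum over the BackSet.
    return sum(p.old_score ** m for p in back_set)


linkers = [
    Page("a.example/1", "192.0.2.10", 5.0),
    Page("a.example/2", "192.0.2.77", 3.0),  # same first three octets as above
    Page("b.example/1", "198.51.100.4", 2.0),
    Page("c.example/1", "203.0.113.9", 1.0),
]
print(local_score(linkers, k=2, m=2))
```

With k=2 and m=2 only the 5.0 and 2.0 pages survive the filtering, so the sum is 25 + 4 = 29.0. The second page on 192.0.2.x adds nothing, however many links it carries.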
What about the New Rank then?
This is getting a very long post you know... Well, luckily it's simple, the formula is here, it's as public as the rest, you can't have a patent that's also a secret (quote):
NewScore(x) = (a+LocalScore(x)/MaxLS)(b+OldScore(x)/MaxOS)
x being your page
a being some weight *
b being some weight *
MaxLS being maximum of the LocalScore values, or some threshold value if this is too small
MaxOS being maximum PR for the original set (the PR set)
* Isn't this just beautiful (quote): "The a and b values are constants, and, may be, for example, each equal to one"
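To get a feel for the formula, here is a small sketch in Python. The weights a and b are set to one as in the quote. The `ls_floor` parameter stands in for the unspecified "threshold value" for MaxLS; its name and its value are my assumption, as are the page names and scores. OldScore values are assumed to be positive.

```python
def new_score(local, old, max_ls, max_os, a=1.0, b=1.0):
    """NewScore(x) = (a + LocalScore(x)/MaxLS) * (b + OldScore(x)/MaxOS)."""
    return (a + local / max_ls) * (b + old / max_os)


def rerank(pages, ls_floor=1.0, a=1.0, b=1.0):
    """Rerank pages given as {name: (local_score, old_score)}.

    Returns (names sorted best first, {name: new_score}).
    ls_floor is a hypothetical stand-in for the patent's unspecified
    threshold that is used when the largest LocalScore is too small.
    """
    max_ls = max(max(ls for ls, _ in pages.values()), ls_floor)
    max_os = max(old for _, old in pages.values())
    scores = {
        name: new_score(ls, old, max_ls, max_os, a, b)
        for name, (ls, old) in pages.items()
    }
    return sorted(scores, key=scores.get, reverse=True), scores


# Invented sample data: (LocalScore, OldScore) per page.
sample = {
    "big-site": (0.0, 10.0),
    "mid-site": (4.0, 8.0),
    "small-site": (8.0, 6.0),
}
order, scores = rerank(sample)
print(order)
print(scores)
```

Note that with a and b both equal to one, a page with a LocalScore of 0 is not zeroed out; it just misses the boost, and here drops from first place to last.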
Inbound links are still important. Very much so. But not just any inbound links, rather: It is important to have inbound links spread across a variety of unrelated sources.
It could be that blogger sites on blogger.com, tripod sites, web rings, and the like will see less impact in serps from crosslinking. Mirror sites, and some other types of affiliate programs, eg. the SearchKing type, will probably also suffer.
My primary advice from the algebra right now is to seek incoming links from "quality unrelated sites" yet still sites within the same subject area. Unrelated means: Sites that are not sharing the first three quarters of an IP or are in other ways affiliated or from "same neighborhood" (at the very least not affiliated in a structured manner). Quality means what it says.
Links from direct competitors will suddenly have great value, as odd as it sounds.
Note: Spam. Cloaking. Shadow domains. Need i say more? I'm sure they will fix that anti spam filter sometime, though.
Candidate for longest post ever... here goes, let's see if the mods like it.
[edited by: Brett_Tabke at 12:01 pm (utc) on July 8, 2003]
| 6:42 pm on Jul 10, 2003 (gmt 0)|
Excellent posts, thanks very much.
Lots of food for thought.
| 6:55 pm on Jul 10, 2003 (gmt 0)|
Hmm, for the last 4 days I have not seen anything strange. I think the update is over, and the index no longer disappears and comes back again; it just moves around a few places, like in a normal everflux.
It really looks to me like everything is back to normal, and we only had a longer update this time because of the last update's failure.
The only thing that is different from 2-3 months ago, when everything was also perfect, is the linkback count. But as Googleguy said, a decrease in your linkback count should not have any influence on your ranking (and visits), because it is the same for everyone.
So claus, I don't know about your theory; it could be in effect or not, but everything looks pretty OK now.
A happy webmaster
| 8:01 pm on Jul 10, 2003 (gmt 0)|
great news zeus! I've been so busy with this very interesting thing that i haven't really noticed much else. Now it's time for work again, gotta get a few sites optimized. Make them better, that is ;)
| 8:10 pm on Jul 10, 2003 (gmt 0)|
Claus, now don't let your theory fall away; it sounds interesting, and I hope you see the same on Google now.
| 1:46 am on Jul 12, 2003 (gmt 0)|
I realise the discussion is now over but I've just stumbled on the thread and spent an hour or so reading and digesting, and just wanted to add my thanks to claus for an awesome original post and follow ups - top job, muchos respectos :)
| 2:14 am on Jul 12, 2003 (gmt 0)|
Hey, thanks. I'd interpret this algorithm as (max(creme) + max(de la creme))= rank order, or results based on [creme de la creme] link structure. This would seem to knock down the large sites a peg and enable very good small sites to rise to the top.
| 4:09 am on Jul 12, 2003 (gmt 0)|
Great post, Claus - plenty of food for thought (my head hurts)
| 9:54 am on Jul 12, 2003 (gmt 0)|
a quick note to focus you on the update cycle.
- since the update in mid June we saw 2 sets of Google results every 3 days.
- but for the last 10 days, we have been seeing them make just one update every 5 days. The current version of Google results is on its 5th day.
| 10:58 am on Jul 12, 2003 (gmt 0)|
Bentler, the "creme de la creme" concept is great, didn't think of that ;)
>>it was more or less acknowledged as being a variation on Kleinberg.
- must have overlooked that somehow. Personally, i tend to think of the Fishbein method widely used in surveys. You take some "ranking" element for each item, and then you add an "importance" element. Then, sort the whole thing so that those that do best are those that score high on both "importance" and "ranking". After all, what is best; "high ranking and low importance" or "low ranking and high importance".. don't bother to answer ;)
Difference being: Here, it's not your individual opinion as a user that counts, it's the "opinion of the web" - which is nice, as 1) it adds the "peer review" dimension, and 2) you're not supposed to know about every single page out there anyway :)
- five days for a complete update, that's pretty fast. Wow. Could that be because of distributing the spiders? And now they're able to aggregate data properly again? Some p2p-se that is. Sure makes one think.
<added>from 3 days buggy to 5 days okay... i'm not sure it's significant, there's a lot of datawash in any case - they might just have added a laundromat, if that was the point that is. Then again, if the washing machines sort by color and the dryers by fabric... you get the point. Either way five days are very fast.</added>
[edited by: claus at 12:07 pm (utc) on July 12, 2003]
| 11:56 am on Jul 12, 2003 (gmt 0)|
Very interesting claus, excellent work.
Sounds like content will still be king....
But, I guess it also wouldn't hurt to call my dellhost rep and see if they are offering any 'baker's dozen' specials on dedicated servers with disassociated IPs. ;)
| 3:32 pm on Jul 12, 2003 (gmt 0)|
Excellent analysis. However...
|b) Next, all pages that are from the same host as this page or from "Similar or affiliated hosts" get thrown away. Yes. By comparing any two documents within the set, the one having the smallest PR will always be thrown away, until there is only one document left from the (quote) "same host or similar" as the document that is currently being ranked. |
Here, "same host" refers to three octets of the IP. That means the first three quarters of it. In other words, these IPs are the same host:
either Google is not enforcing this or they miss some clusters, since as of today I'm still seeing serps that are dominated by domains belonging to the same subnets (not shared IPs) as in...
The only thing I noticed about these domains is that on a whois lookup they appeared to be owned by different entities. However, closer inspection shows that those domains are owned by the same entity.
So, if your analysis about 'same host' is true, could it be Google is using a whois database in identifying these hosts? If so, then all it takes to che.. err... optimize cross linking is to vary the registrar information.
Just my 2 cents ;)
| 3:46 pm on Jul 12, 2003 (gmt 0)|
-as i read it, they look at two things:
a) the IP
b) some content-similarity (however defined)
Pages from same IP could in fact still show up in the serps (reason for "don't worry" to bloggers etc.), it would just affect the relative order of these pages within the set of serps (eg lower PR climbing over higher PR due to "importance").
Guess it's still a "definite maybe".
| 4:02 pm on Jul 12, 2003 (gmt 0)|
Do you really think they use whois data? I think this would violate the TOS of the NICs .
% a WHOIS query, you agree that you will use this data only for lawful
| 10:53 pm on Jul 12, 2003 (gmt 0)|
Using Whois data:
Using Whois would in some ways be more accurate and incriminating to owners in many cases. How many of you use different admins, but the same tech contact within your company for instance? You're just plain unlucky/stupid if you used the same admin contact for all your domains all this time. We switched in 1999 for even our few (<10).
I believe IPs would be far quicker and easier to obtain and analyze however (a mere lookup in the local DNS cache or tracert function).
I don't think Google is using either method yet. A local newspaper that also runs a top ranked local travel site benefits greatly from having tens of thousands of tiny interlinks between every news story on their newspaper side and the home page of the travel site. Both their sites are whois'd to the same exact contact/address info AND only a few IP addresses apart, and they have not budged in a year and in fact have moved up slightly in the last month due to some other optimizations they've made recently.
A second competitor of theirs with almost the exact same story, however, has totally different contact info AND servers for their two sites and also has not moved in the rankings in the last year either. The backlinks, which we have closely traced, have remained virtually unchanged during the period as well, while others have dropped precipitously.
Contrary to the feelings of some of you che.optimizers I hope they DO implement some sort of close internetworking degradation like this. This might have the life-changing effect of reducing the "content is king" aspect (where content here means simply "sheer mass" and not "quality" as judged by others) for the "big guys" (the $mega-million media publishers and News service RE-publishers) as well as, as mentioned before, those who WASTE hundreds of domain names and IP addresses in order to self-promote.
... I swear if I get one more automated cross-linking e-mail with 50 domain names listed offering to exchange links I'll...
Maybe it's time to start buying stock in the under $9.95 month hosting companies since that's where all those types will be fleeing.