I have read posts from people who think that highly ranked sites which do not contain the search term in question must be spam. Often this is not the case at all. It has a lot to do with the latest algorithms Altavista and others have been using.
A few months ago I read a research paper, written by a group of researchers including Altavista's research team, about how they use link popularity in assessing rankings. Basically, this is what I could make of how they rank. I don't guarantee this is precisely what they use, but I have a strong inkling, based on observation, that they use some similar variation.
1. Given a particular search term, search the entire database of crawled sites for a subset of sites with high textual relevance to the term. Let's call this set H.
2. Collect all of the sites that are linked to (above a linking code relevance threshold) from any site in H. Let's call this set A.
3. "Hubs" are sites in H which have lots of relevant links to "authorities" in A. "Authorities" are sites in A which are relevantly linked to by the best "hubs". Sound circular? It is - but some neat linear algebra called eigen-analysis makes sense of it and assigns a quality measure to each hub and each authority.
4. The rankings for the chosen search term are the highest ranking authorities selected as per above. (Hubs seem to be rewarded highly as well - that's why linking to relevant sites can often be beneficial.)
So, as you can see, a high ranking authority for a particular search term need not have that particular search term anywhere on the page PROVIDED it is linked to by a lot of hubs that do.
Without an appreciation of the above, I really don't see how anybody could possibly still have their sanity intact ;)
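Steps 1 to 4 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Altavista's actual implementation: the link graph and the page names (hub1, authA, ...) are invented, the textual filtering of steps 1 and 2 is assumed to have already happened, and the "eigen-analysis" is approximated by simple repeated updating (power iteration) of the hub and authority scores.

```python
import math


def normalize(scores):
    """Scale scores so their squares sum to 1 (keeps the iteration stable)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """links: dict mapping a page to the list of pages it links to.

    Returns (hub_scores, authority_scores) as two dicts.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # A good authority is linked to by good hubs.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # A good hub links to good authorities.
        new_hub = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical graph: three textually relevant hubs (set H) linking out
# to candidate authorities (set A). No page text is consulted here.
links = {
    "hub1": ["authA", "authB"],
    "hub2": ["authA"],
    "hub3": ["authA", "authC"],
}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
print(best_authority, round(auth[best_authority], 3))
```

In this toy graph authA comes out as the top authority purely because every hub links to it; nothing about authA's own text enters the calculation, which is the point made above.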
Most of us are well aware of the link popularity weighting in AV's algorithm, and there has been some discussion over the past year about Clever (IBM)'s new link popularity, authority-hub model...
While AV has been the most visible, as has Google, there are many others (iWon's Inktomi-tuned algo) that are employing this methodology...
Again, welcome to WMW...
Thanks for the welcome.
I know that many understand the importance of link popularity. I have read much of this forum and was a participant in Brett's Buddylink program. However, without an understanding of the specifics of the actual algorithm, link popularity alone is not enough to explain this lack of need for textual relevance.
The impression I get of many people's understanding of the role of link popularity is that a site that isn't textually relevant to the term cannot be ranked highly even with great link popularity. This would seem to make sense because naturally a lot of sites would link to places like search engines, and sponsors which are not specifically relevant to the term. So, the thinking goes, to overcome the problem that such sites will rank highly for any term, the page should also be optimized textually for the term.
However, the way I believe Altavista gets around this is by ensuring that the linking text is relevant. I don't think Altavista always does a great job of this, because you will occasionally find a site on a more general topic, or even an unrelated topic, ranked highly for a term.
That research paper you mentioned, would that be the one presented at the Amsterdam conference in May, where Altavista, Google and Compaq researchers describe the concept of a "Term vector database"?
If so, do you, redzone, or anyone else know whether that is in production yet or still in the labs?
The paper was called "Authoritative Sources in a Hyperlinked Environment" by Jon M. Kleinberg. I don't recall where I got this (probably through an academic paper search), but I will email it to anyone who wants it. However, there was an extension to this paper by Altavista (or DEC or Compaq) researchers discussing how to improve this using link relevance. I am trying to locate this paper - it may have been at this conference.
As to whether it has been implemented, search result evidence says yes, but apart from this I vaguely recall them talking about these techniques in a "we-just-implemented-them" way but I will try to confirm this.
Looks like the Altavista research paper describing the extension to Kleinberg's paper wasn't presented at the conference discussed on the URL Brett posted either. Still looking for it....
My apologies for the sometimes vague recollection - I read a ton of stuff and file away the useful stuff somewhere in my brain and bookmarks (in between disk crashes). I'm going to do a braindump despite possible inaccuracies because I believe there may be some good stuff filed away there somewhere. :)
One thing I recall from either the paper I was referring to or another one along the same lines was another method to determine if two sites that are linked are "relevant".
This is where the term vector database paper from the conference Brett mentioned is relevant to this algorithm, as I recall -
Under this method, two sites are considered "relatively-relevant" if the dot product of their normalized term vectors is above a certain threshold.
In plain language, what this means is that an inward link is considered relevant if there is ANYTHING (of threshold substance) textually in common between the two pages. So if your site talks about famous historians, South Park jokes and train-spotter quotes, and your site links to another site which is entirely about South Park jokes, then the two sites are considered relevant (under this scheme) in link popularity calculations even for the term "famous historians".
Not entirely sure which method is implemented in Altavista at this time - couldn't be too hard to determine this empirically though.
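A minimal sketch of the "relatively-relevant" test described above, using the historians / South Park example. Everything concrete here is an assumption made for illustration: the page texts are invented, term vectors are plain word counts (a real term vector database would presumably weight and prune terms), and the threshold of 0.3 is a made-up value, since the post does not state one.

```python
import math
from collections import Counter


def term_vector(text):
    """Very crude term vector: lowercase word counts."""
    return Counter(text.lower().split())


def normalized_dot(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


THRESHOLD = 0.3  # hypothetical cut-off, not taken from any paper


def link_is_relevant(page_a, page_b, threshold=THRESHOLD):
    """True if the two pages share enough text for the link to count."""
    return normalized_dot(page_a, page_b) > threshold


# Invented page contents.
mixed_site = term_vector("famous historians southpark jokes trainspotter quotes")
southpark_site = term_vector("southpark jokes southpark jokes episodes")
travel_site = term_vector("cheap flights hotel deals")

print(round(normalized_dot(mixed_site, southpark_site), 3))
print(link_is_relevant(mixed_site, southpark_site))
print(link_is_relevant(mixed_site, travel_site))
```

Note that the search term never appears in this test: the link from the mixed site to the South Park site passes because of the shared South Park text, and it would then count in link popularity calculations for any term the linking page is relevant to, including "famous historians".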
ArmchairExec, this is extremely interesting. But do you mean that there would have to be several of these 'south park, famous historian' sites pointing to the 'south park' site, in order that the 'south park only' site also be relevant for 'famous historians'? (Yeucch! Subjunctive!) Because we are talking about link pop. above a certain threshold. Or is it enough that the 'south park, famous historians' site is an authority hub, and that's all you need?
> But do you mean that there would have to
> be several of these 'south park, famous
> historian' sites pointing to the 'south
> park' site, in order that the 'south park
> only' site also be relevant for 'famous
> historians'? .....
> Or is it enough that the 'south park,
> famous historians' site is an authority
> hub, and that's all you need?
According to the algorithm I described, it depends on the "quality" of the hub in your second scenario. The eigen-analysis will give you a spectrum of values for the quality of each hub and authority. Generally, the higher the quality of a hub, the more beneficial the links from it are. However, to be a highly ranked authority would require a "quality package" (which may have just one or a few elements) of hubs pointing to it.
(Remember I am not sure which method of linking relevance they use. I am just telling you what I read in various papers - some written by Altavista researchers)
If this algorithm is indeed what is being used, it's pretty clear what the way to get good placement would be. Contact the highest ranking sites for your terms and ask them for a link AND/OR link to all those sites (using relevant text)!
The closest paper that I can think of would be Computing Web Page Reputations [www9.org], which talks about improving link popularity through hubs and authorities (this is the paper I was referring to in renke's quote). It's from the Department of Computer Science at the University of Toronto.
"Who is Michael Campbell?"
He wrote "Nothing But Net", has a pay-for "SEO secrets" site, and an auto-submission software package that's promoted on Planet Ocean. (Although I've never purchased any of his stuff, so I can't give a critique.)
Note that this paper does NOT confirm what I said earlier about non-query-specific relevance. This paper takes a normalized dot product between the query term vector and the page term vector. Relevance defined here is very query-specific. (So yes, textual optimization is still very important if this is what is being used at AV)
I'm pretty sure this isn't the paper I was thinking of, but it's more good evidence that a version of the Kleinberg algorithm is used in Altavista.
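To make the query-specific relevance in that paper concrete, here is a small sketch of a normalized dot product between a query term vector and a page term vector. It is only an illustration of the idea as described in this post: the page texts are invented and the term vectors are raw word counts, which is an assumption rather than the paper's actual weighting.

```python
import math
from collections import Counter


def normalized_dot(a, b):
    """Dot product of two term-count vectors, each scaled to unit length."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The query itself is turned into a term vector.
query = Counter("famous historians".split())

# Invented page contents.
page_mixed = Counter("famous historians southpark jokes trainspotter quotes".split())
page_southpark = Counter("southpark jokes southpark jokes episodes".split())

score_mixed = normalized_dot(query, page_mixed)
score_southpark = normalized_dot(query, page_southpark)
print(round(score_mixed, 3), score_southpark)
```

Under this measure the South Park-only page scores zero for "famous historians" no matter who links to it, which is why on-page text would still matter if this is the variant in use.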
Thanks for the reference doc. Improved Alg for topic dist etc...
While reading the paper I was continually reminded of the old Dijkstra algorithm (taught in DB Engineering, 1st or 2nd year), which was initially designed as a search for the shortest distance between 2 points and is used today in the airline reservation industry.
My impression of the paper was that it focused mainly on the retrieval of info based on a criterion. But reading data from the database can be time consuming, and therefore there appear to be several index layers between the search interface and the result set.
The information generated by spiders is (1) classified and indexed on a main index, which in turn (2) references a hub index, which eventually leads to (3) a target data set.
I wish I had a document that identified the classification algo by which the data is indexed. With that knowledge, I believe consistent top rankings could be achieved.
Automatic resource compilation by analyzing hyperlink structure and associated text [decweb.ethz.ch]. Same group of names that worked on IBM's clever project. That is the research I've heard Inktomi based their context directory engine on.