I looked through the documents on the Almaden website regarding the CLEVER engine. I skipped over the .ps ones as I don't have a PostScript viewer installed anywhere and didn't feel like looking for one. I also had a professor here at VT explain some of the algorithm concepts and the problems in Google that CLEVER was designed to fix. The gentleman in question works heavily in data mining, and his explanation is what got me to look into CLEVER in more depth. For those who are really interested, the easiest article to read is the Scientific American article:
It gives a reasonably good idea of the technology as well.
I don't think anyone here has posted any summary of this yet, so here's my shot at it:
Basically, one of the largest problems with traditional (first-generation) search engines is that on-page criteria generally don't give a good estimate of what a page is about. IBM's home page, for example, doesn't have the word 'computer' on it anywhere, but that's definitely one of the main single-word keywords related to it.
Google fixes this problem with link analysis, but that generates some other problems. To use the IBM example again, both IBM and Macintosh could be classified as authorities when searching for 'computer manufacturer', but since they are competitors, you'd be hard pressed to find a link from either of their web sites to the other. The authority interlinking that link analysis relies on simply doesn't happen. They are also unlikely to link to hubs/vortals, because that would, albeit more indirectly, promote their competitors. So in some competitive themes, it was found that the real authoritative sites generally don't link out much; they cover most of the needed information on their own sites. Educational or research topics are less prone to this, but as far as CLEVER's engineers are concerned, building search algorithms around just those topics only solves the problem for special cases.
The other problem with Google, one that other search engines (Teoma, for example) have explored recently, is that Google calculates its linking score (PageRank) by summing the links from all other sites on the web, ignoring the themes of those sites. A site like Yahoo would thus have a very high ranking score regardless of what the search is about. Even if only one page on its entire site mentioned 'webmasterworld' in small text, positioned correctly it would likely outrank the real WebmasterWorld under the simplified version of Google's algorithm. Of course Google has also thrown in link text and probably some tweaks for major authorities like Yahoo to improve on that, but the basic algorithm is still fundamentally flawed.
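To make that flaw concrete, here's a toy sketch of that kind of theme-blind link summation. This is my own illustration (loosely PageRank-shaped), not Google's actual code; the damping factor and iteration count are conventional choices, not anything from the CLEVER papers:

```python
def simple_pagerank(links, damping=0.85, iterations=20):
    """links: dict mapping each page to the list of pages it links to.
    Toy, theme-blind link scoring: a page's score is just a weighted sum
    over its in-links. Nothing here ever asks what topic the linking page
    is about, which is exactly the flaw described above."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share  # theme of `page` never considered
        rank = new_rank
    return rank
```

A Yahoo-like node with thousands of in-links scores highly here no matter what the query is, which is the problem the themed approach below tries to avoid.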
CLEVER tries to solve both problems simultaneously. The CLEVER engineers (CE's) begin by recognizing that calculating a themed page rank for every page against every possible theme or search would be rather slow, especially since every possible search can't be pre-stored. So the CE's decided to take an initial set of pages using a traditional (on-page criteria) search engine strategy. The publications indicate the experiments used the first 200 results from AltaVista for a given search, although in a real search situation I would expect them to use their own database/algorithm for this, and possibly a larger set.
After taking this initial root set, CLEVER would spider all of these pages for links and index the linked pages as well. Then it would spider all the pages reported to be linking to the root set, and add all these new pages to the root set. It would repeat this once more to get two levels of links in both directions from the initial root set. They found the expanded root set generally came to between 1000 and 5000 pages, a number quite bearable for a computer to process.
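Here's a rough sketch of that expansion as I read it. `get_outlinks` and `get_inlinks` are hypothetical stand-ins for a spider and a reverse-link index; the papers obviously don't name any such functions:

```python
def expand_root_set(root_set, get_outlinks, get_inlinks, levels=2):
    """Grow the root set by following links in both directions, `levels`
    times (two levels per the description above). get_outlinks/get_inlinks
    are assumed helpers: page -> iterable of pages."""
    pages = set(root_set)
    frontier = set(root_set)
    for _ in range(levels):
        next_frontier = set()
        for page in frontier:
            next_frontier.update(get_outlinks(page))  # pages this page links to
            next_frontier.update(get_inlinks(page))   # pages linking to this page
        next_frontier -= pages  # keep only genuinely new pages
        pages |= next_frontier
        frontier = next_frontier
    return pages  # reportedly ends up around 1000-5000 pages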
The really important step: after collecting this themed set of pages, the CE's calculate two different kinds of link score, rather than one "Page Rank" (Google's term). They hypothesized that the important pages on any subject fall into two types. The first is the IBM/Macintosh type: authorities on a subject. The second is the hub that links to both IBM and Macintosh (Slashdot, say). The hypothesis is that good authorities have a high number of incoming links from high-quality hubs, and that high-quality hubs have a high number of outgoing links to good authorities. A page is allowed to be both a high-quality authority and a high-quality hub. Each page in the root set gets both a hub rank value and an authority rank value. In a method similar to Google's, all hub and authority ranks are set to 1 initially, then iterated: each page's hub score is calculated from the authority scores of the pages it links to, and its authority score from the hub scores of the pages linking to it. Only a very small number of iterations is needed for this to converge (something like 5), so it is reasonably fast. Still not as good as Google, unfortunately, but hardware power is making that less of an issue now.
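Here's a minimal sketch of that iteration (this is essentially Kleinberg's HITS algorithm, which CLEVER builds on). The normalization step is there to keep scores from growing without bound; the papers may normalize differently:

```python
import math

def hub_authority_scores(links, pages, iterations=5):
    """links: dict page -> set of pages it links to (within the root set).
    Returns (hub, authority) score dicts after `iterations` rounds."""
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to you.
        new_auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for t in targets:
                if t in new_auth:
                    new_auth[t] += hub[page]
        # Hub score: sum of authority scores of the pages you link to.
        new_hub = {p: sum(new_auth.get(t, 0.0) for t in links.get(p, ()))
                   for p in pages}
        # Normalize so the scores converge instead of blowing up.
        auth_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in new_auth.items()}
        hub = {p: v / hub_norm for p, v in new_hub.items()}
    return hub, auth
```

On a root set of a few thousand pages, a handful of rounds of this is cheap, which matches the "something like 5 iterations" figure above.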
After all the hub and authority scores were calculated in the experiments, the CE's simply showed the top 15 hubs and top 15 authority sites as the first 30 results. How these were ordered was not discussed (perhaps alternating hub->authority->hub->authority, or perhaps just by score). Of course, a real search engine would return more results, and I would suspect that authorities would get a better ranking than hubs (they're more likely to have unique content within one click), but that's just a guess.
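Continuing the sketch above, here's one guess at that presentation step. The alternating order is purely my speculation, as noted:

```python
def top_results(hub, auth, k=15):
    """Interleave the top-k authorities and top-k hubs (a guess at the
    ordering; a real engine would also need to dedupe pages that score
    highly on both lists)."""
    top_hubs = sorted(hub, key=hub.get, reverse=True)[:k]
    top_auths = sorted(auth, key=auth.get, reverse=True)[:k]
    results = []
    for a, h in zip(top_auths, top_hubs):
        results.extend([a, h])  # authority first, then hub
    return results
```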
A few other topics were discussed in less detail. There was a discussion of how the link structure revealed subsets of websites, such as pro- and anti-abortion sites within a set of plain 'abortion' sites. There was also mention of criteria such as link text and the text surrounding a link within an n-byte window in either direction (n being a parameter determined by experimentation).
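A toy illustration of the window idea, for concreteness; the regex and window size are made up, and I'm using characters where the papers talk about bytes:

```python
import re

def link_contexts(html, n=50):
    """Return (href, surrounding text) pairs for each link in raw HTML,
    grabbing up to n characters on either side of the anchor tag. A real
    implementation would parse properly and weight this text in scoring."""
    contexts = []
    for match in re.finditer(r'<a\s+[^>]*href="([^"]+)"[^>]*>', html, re.I):
        start = max(0, match.start() - n)
        end = min(len(html), match.end() + n)
        contexts.append((match.group(1), html[start:end]))
    return contexts
```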
Of course, all of this probably impacts you very little at the moment. CLEVER is not being used as a consumer search engine, and I am uncertain whether IBM even has any serious interest in making it one. However, the algorithm makes good sense, and at least some of the ideas may show up in similar forms at other search engines (Google has definitely read over some of this, I would guess).
Some very interesting SEO ideas that came to me, should any of this begin to become important:
1) On-page criteria are again a factor to consider. You need to be in the root set chosen before the iterations. Whether you get there through good linking or through on-page criteria is unimportant, though. You can also get into the final root set by simply linking to someone in the initial root set.
2) In theory, it may be possible to roughly reconstruct the initial root set from the final root set by analyzing the links. If you could determine the criteria for getting into the initial root set, you might be able to get more of your own linking structure into it and thus let your own pages affect it more.
3) Hubs can be created with far less work/luck than authorities. All you have to do is run a rough CLEVER calculation yourself (along the lines of the sketches above) and link to the authorities that show up (or analyze the SERPs and decide by hand which ones are authorities). I suspect this could become a major source of spam, so there will probably be some level of incoming linking (authority score) required for hubs to get a decent ranking as well, but definitely less than for authorities.
4) The best strategy for building a good authority, besides a link request system, would be to create a number of seemingly unrelated hubs (different IP classes, different domains, etc.) that link only to the biggest authorities as well as your own site.
Has anyone else read over the papers and come up with some of their own conclusions that they would like to add?