Almaden / CLEVER
Some analysis of the algorithm
ggrot
msg:816476
2:25 am on Mar 19, 2002 (gmt 0)

I looked through the documents on the Almaden website regarding the CLEVER engine. I skipped over the .ps ones as I don't have a PostScript viewer installed anywhere and didn't feel like looking for one. I also had a professor here at VT explain some of the algorithm's concepts and the problems in Google that CLEVER was fixing. The gentleman in question works heavily in data mining, and that conversation is what got me to look into CLEVER in a little more depth. For those who are really interested, the easiest piece to read is the Scientific American article:
[sciam.com...]

It gives a reasonably good idea of the technology as well.

I don't think anyone here has posted any summary of this yet, so here's my shot at it:

Basically, one of the largest problems with traditional (first-generation) search engines is that on-page criteria generally don't give a good estimate of what a page is about. IBM's home page, for example, doesn't have the word 'computer' on it anywhere, but that's definitely one of the main single-word keywords related to it.

Google fixes this problem with link analysis, but that ends up creating some other problems. To use the IBM example again, both IBM and Macintosh could be classified as authorities when searching for 'computer manufacturer', but because they are competitors, you'd be hard pressed to find a link on either of their web sites to the other. Thus they fail to reinforce each other's authority through interlinking. They are also unlikely to link to hubs/vortals, because that would, albeit more indirectly, promote their competitors. So in some competitive themes, it was found that the real authoritative sites generally don't link out much; they cover most of the needed information on their own sites. Educational or research topics are less prone to this, but as far as CLEVER's engineers are concerned, building search algorithms around just those topics only solves the problem for special cases.

The other problem with Google, which has been explored by other search engines (Teoma, for example) recently, is that Google calculates its linking score (PageRank) by summing the links from all other sites on the web while ignoring the themes of those sites. A site like Yahoo would thus have a very high ranking score, regardless of what the search is about. Even if only one page on their entire site mentioned 'webmasterworld' in small text, positioned correctly it would likely outrank the real WebmasterWorld in the simplified version of Google's algorithm. Of course, Google has also thrown in link text and probably some tweaks for major authorities like Yahoo to improve on that, but the basic algorithm is still fundamentally flawed.

CLEVER tries to solve both problems simultaneously. The CLEVER engineers (CEs) begin by recognizing that calculating a themed page rank for every page against every theme or search would be rather slow, especially since every possible search could not be pre-stored. So the CEs decided to take a set of pages using a traditional (on-page criteria) search engine strategy. The publications indicate that the experiments used the first 200 results from AltaVista for a particular search, although in a real search situation I would expect them to use their own database/algorithm for this, and possibly a larger set.

After taking this initial root set, CLEVER would then spider all of these pages for links and index the linked pages as well. Then it would spider all the pages reported to be linking to the root set. All of these new pages would be added to the root set, and the process would be repeated one more time to get two levels of links in both directions from the initial root set. They found that this expanded root set would generally contain between 1,000 and 5,000 pages, a number quite bearable for a computer to process.
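
For anyone who wants to play with the idea, here's a minimal sketch (in Python) of that expansion step as I understand it from the papers. The dictionary-based link data, the helper name and the two-round expansion are my assumptions; the real system obviously works against a live crawler/index rather than in-memory structures.

# Sketch of the root-set expansion described above. The dict-based link data
# is a stand-in; CLEVER itself works against a live crawler/index.
def expand_root_set(root_set, out_links, in_links, rounds=2):
    """Grow the initial root set by following links out of and into it.

    root_set  -- pages returned by the traditional (on-page) search
    out_links -- dict: url -> set of urls that page links to
    in_links  -- dict: url -> set of urls known to link to that page
    rounds    -- expansion passes; the post describes two levels
    """
    pages = set(root_set)
    frontier = set(root_set)
    for _ in range(rounds):
        added = set()
        for url in frontier:
            added |= out_links.get(url, set())  # pages the frontier links to
            added |= in_links.get(url, set())   # pages linking to the frontier
        frontier = added - pages                # only expand from newly found pages
        pages |= added
    return pages

On real data this is the step that supposedly lands in that 1,000-5,000 page range.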

The really important step: after collecting this themed set of pages, the CLEVER engineers decided to calculate two different types of PageRank (Google's term), or simply link scores. They hypothesized that the important pages on any subject could be classified into two types. The first type is the IBM/Macintosh type: authorities on a subject. The second is the hub that links to both IBM and Macintosh, something like Slashdot perhaps. They hypothesized that good authorities would have a high number of incoming links from high-quality hubs, and that high-quality hubs would have a high number of outgoing links to high-quality authorities. They did allow for a page to be both a high-quality authority and a high-quality hub. Each page in the root set would have both a hub rank value and an authority rank value. In a method similar to Google's, they would set all the hub ranks and authority ranks to 1 initially and then iterate. The hub scores would be calculated from the authority scores of the pages they linked to, and the authority scores would be calculated from the hub scores of the pages linking to them. A very small number of iterations (something like five) was needed for this to converge, so it was reasonably fast, though still not as good as Google, unfortunately; hardware power is making that less of an issue now.
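
To make the iteration concrete, here's a rough sketch of those two update rules, continuing the hypothetical structures from the expansion sketch. The per-pass normalization is my own addition to keep the numbers from growing without bound; the articles I read don't go into that level of detail.

import math

def hits(pages, out_links, iterations=5):
    """Iteratively score pages as hubs and authorities.

    pages     -- the expanded root set from expand_root_set()
    out_links -- dict: url -> set of urls that page links to
    Returns (hub_score, auth_score) dicts.
    """
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):  # roughly five passes to converge, per the post
        # Authority score: sum of the hub scores of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for p in pages:
            for q in out_links.get(p, ()):
                if q in new_auth:
                    new_auth[q] += hub[p]
        # Hub score: sum of the authority scores of the pages it links to.
        new_hub = {p: sum(new_auth.get(q, 0.0) for q in out_links.get(p, ()))
                   for p in pages}
        # Normalize so the scores stay comparable between passes (my addition).
        auth_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in new_auth.items()}
        hub = {p: v / hub_norm for p, v in new_hub.items()}

    return hub, auth

Each authority's score comes from the hubs pointing at it, and each hub's score comes from the authorities it points to, which is exactly the mutual reinforcement described above.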

After all the hub and authority scores were calculated in the experiments, the CEs decided simply to show the top 15 hubs and top 15 authority sites as the first 30 results. How these were ordered was not discussed (perhaps alternating hub -> authority -> hub -> authority, or perhaps just by score). Of course, in a real search engine more results would be used, and I would suspect that authorities would rank better than hubs (more likely to have unique content within one click), but that's just a guess.
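
If you wanted to mimic that output from the hits() sketch, something like the following would do. The actual ordering CLEVER used is a guess, as noted; here authorities are simply listed ahead of hubs and a page scoring well as both only appears once.

def top_results(hub, auth, n=15):
    """Take the top-n authorities and top-n hubs (authorities first).
    The real CLEVER ordering wasn't documented; this is one plausible choice."""
    top_auth = sorted(auth, key=auth.get, reverse=True)[:n]
    top_hubs = sorted(hub, key=hub.get, reverse=True)[:n]
    results, seen = [], set()
    for url in top_auth + top_hubs:
        if url not in seen:
            results.append(url)
            seen.add(url)
    return results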

A few other topics were discussed in less detail. There was a discussion of how the link structure revealed subsets of websites, such as pro- and anti-abortion sites within a set of simply 'abortion' sites. There was also mention of criteria such as link text and the text surrounding the link within an n-byte window in either direction (n being a parameter determined by experimentation).
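
As a rough illustration of that window idea, here's how you might pull the text within n bytes on either side of each link. The real weighting scheme and the value of n aren't spelled out in what I read, so treat this purely as a sketch.

import re

def anchor_windows(html, n=50):
    """Yield (href, context) for each link, keeping n bytes of raw text on
    either side of the anchor. Deliberately crude; a real implementation
    would parse the HTML properly and weight terms by distance."""
    link_re = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>.*?</a>',
                         re.IGNORECASE | re.DOTALL)
    for match in link_re.finditer(html):
        start = max(0, match.start() - n)
        end = min(len(html), match.end() + n)
        window = re.sub(r'<[^>]+>', ' ', html[start:end])  # drop tags in the window
        yield match.group(1), ' '.join(window.split())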

Of course, all of this probably impacts you very little at the moment. CLEVER is not being used as a consumer search engine, and I am uncertain whether IBM even has any serious interest in making it one. However, the algorithm does make good sense, and at least some of the ideas may show up in similar forms at other search engines (Google has definitely read over some of this, I would guess).

Here are some very interesting SEO ideas that came to me, should any of this begin to become important:

1) On-page criteria are again a factor to consider. You need to be in the root set chosen before the iterations. Whether you do that by getting good linking or by on-page criteria is unimportant, though. You can also get into the final root set by simply linking to someone in the initial root set.

2) In theory, it may be possible to roughly determine the initial root set by starting from the final root set and analyzing the links. If you could determine the criteria for getting into the initial root set, you might be able to get more of your linking structure within it and thus let your own pages affect it more.

3) Hubs can be created with far less work/luck than authorities. All you have to do is run a rough CLEVER calculation yourself and link to the authorities that show up (or analyze the SERPs and decide by hand which are authorities). I suspect this could become a major source of spam, so there will probably be some level of incoming linking (authority score) required for hubs to get a decent ranking as well, but definitely less than for authorities.

4) The best strategy for building a good authority, besides a link request system, would be to create a number of seemingly unrelated hubs (different IP classes, different domains, etc.) that link only to the biggest authorities plus your own site.

Has anyone else read over the papers and come up with some of their own conclusions that they would like to add?

 

feeder
msg:816477
2:28 am on Mar 19, 2002 (gmt 0)

Excellent post ggrot :) Will take some time to digest...

msgraph
msg:816478
3:18 am on Mar 19, 2002 (gmt 0)

Great post ggrot. There have been some threads on Almaden/Clever but not much about taking it apart.

Like feeder said I have to set some time to digest it too.

The boys at Almaden have been pretty silent lately but I'm hoping that they will release it as a public search service here soon. I think one of their main problems was that it took longer to process than Google's own methods. If they tweak that out then.... ;)

When I have the time I'm going to read over the following patent that was issued last week. I think you will find it interesting since it falls along the lines of this post :)

Method for interactively creating an information database including preferred information elements, such as preferred-authority, world wide web pages [164.195.100.11]

If you search through the Patent DB on a few of the Clever people, you will find a wealth of information associated with Clever and HITS.

Will post some more later.

ggrot
msg:816479
4:43 am on Mar 19, 2002 (gmt 0)

It's interesting to see how the original technology we all know as PageRank is beginning to be expanded on by various companies (Teoma, IBM, and probably others too). After reading the CLEVER material, it is rather apparent that link analysis is still in its infancy and yet is already extremely powerful. Google was definitely ahead of its time, but only time will tell if they stay ahead of the pack.

TallTroll
msg:816480
4:29 pm on Mar 19, 2002 (gmt 0)

I see a spider tagged as "http://www.almaden.ibm.com/cs/crawler" pass through occasionally, out of 66.147.154.3. It's well behaved, but I'm curious as to where it gets its links from. It has requested URLs that don't exist, but that, curiously, I get the odd Google referral to.

Is this a CLEVER bot, or related, or what? Anyone know? The IBM site isn't hugely helpful.

msgraph
msg:816481
4:37 pm on Mar 19, 2002 (gmt 0)

Most likely they received a large part of their lists from the Internet Archive.

[archive.org...]


IBM Research Projects

...to find out how Web sites point to one another and form communities of common interest. Baumgart says that "unleashing" the IDM on the Archive's data reveals "clusters of activity.... You see what was hot, what the breaking story was, say, on a given date two years ago."


TallTroll
msg:816482
4:44 pm on Mar 19, 2002 (gmt 0)

LOL, the site isn't that old! Although it is getting a rejig soon (loads more lovely content), and hopefully a facelift as well. I notice the IBM crawler never asks for robots.txt (that I remember).

Maybe it is getting that from elsewhere also?

Rumbas
msg:816483
6:10 pm on Mar 19, 2002 (gmt 0)

I just noticed that Almaden/IBM crawler today in some logs. I wondered where it came from - thanks for clearing that up.

Great post btw ggrot.

volatilegx
msg:816484
7:32 pm on Mar 19, 2002 (gmt 0)

Beautiful paper, ggrot! I'd love to see a new search engine built using this technology.

rubble88
msg:816485
11:16 pm on Mar 19, 2002 (gmt 0)

The SciAm piece is truly a great article. I was just discussing it the other day at a seminar I was giving.

In many ways, Teoma, soon to release a second beta, is being developed using many of the Clever concepts.

This extended abstract, "DiscoWeb: Applying Link Analysis to Web Search," coauthored by Apostolos Gerasoulis, might also be of interest.
The abstract is at:
[cs.rutgers.edu...]

Gerasoulis is currently the Vice President of Research and Development at Ask Jeeves, Teoma's parent.

Finally, another well-written article on Clever and link analysis in general appeared in the December 2001 issue of Science. It was co-authored by Jon Kleinberg, a member of the Clever team, and Steve Lawrence, one of the developers of ResearchIndex.com.
"The Structure of the Web"
[cs.cornell.edu...]

conor
msg:816486
1:25 pm on Mar 20, 2002 (gmt 0)

Fantastic post, ggrot. Again, I am going to have to print this out and digest it. Looks like I know what's for lunch and dinner today ;)

Winooski
msg:816487
7:20 pm on Mar 20, 2002 (gmt 0)

"(1)On page criteria is again a factor to consider. You need to be in the root set chosen before the iterations. Whether you do that by getting good linking or by on page criteria is unimportant thought. You can also get into the final root set by simply linking to someone in the initial root set.

2)In theory it may be able to roughly determine the initial root set by using the final root set and analyzing the links. If you could determine the criteria to get into the initial root set, you might be able to get more of your linking structure within it and thus let your own pages effect it more. "

I'm not sure I understand. (Hah! That's putting it mildly!) When you write about "getting into the root set", do you mean somehow getting one of the pages/sites that's in the root set to link to your page/site?

Otherwise, it would be like one of those game theory problems where the "winner" of the game (i.e., the top-ranked page) is really the party that makes the first move (i.e. is able to be included in the root set).

ggrot
msg:816488
10:53 pm on Mar 20, 2002 (gmt 0)

According to the articles on CLEVER, only pages in the final root set are even considered as candidates for high rankings, hence the need to be in the final root set. There are three ways of getting there:
1) Being selected as one of the first 200 pages in the initial root set. These are apparently selected by some type of on-page algorithm of the kind that used to be the norm for major search engines (Inktomi still relies heavily on on-page factors).
2) Being linked to (directly, or indirectly through a middle level) by the initial root set. This is generally the most difficult option.
3) Linking to the initial root set (again, directly or indirectly through a middle level).

Having all 3 would be best, but if that isn't possible, the first option is preferable for a few reasons:

If you end up getting several of your pages/sites into the initial root set of 200 pages, you could in theory control the outcome of the link-score algorithms to a greater degree with internal linking (pages within a site) and cross-linking (multiple sites) strategies.

For example, if your pages made up 100 of the pages in the initial root set, then probabilistically half the pages in the final root set would either link to or be linked from your pages, and thus your pages would get an unfair advantage. It's like ballot stuffing, to use the 'voting' analogy. Or better yet, it's like assassinating your competition from the beginning so you are the only candidate.

There is another way of looking at this for optimization purposes as well. If you simplify the model, you will realize that getting into the initial root set alone improves your odds of a better ranking. The reason is that if you are in the initial root set, you are guaranteed to have ALL of the links that point to you included in the final root set, as well as ALL of the links that point to those linking pages. If you only get included at the first iteration, then the final root set is only guaranteed to include the links pointing to your pages directly. Finally, if you are only included at the last iteration, then you aren't even guaranteed to have all the links pointing to you included (your votes might not even be tallied).
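
To put that in terms of the earlier sketches: when the authority scores are computed, only in-links whose source pages are themselves inside the final set contribute anything. A quick way to see how many of your 'votes' actually get tallied would be something like this (hypothetical helper and data, continuing the expand_root_set example):

def counted_inlinks(page, final_set, in_links):
    """Of the pages known to link to `page`, return those that made it into
    the final root set, i.e. the votes that actually get counted."""
    return in_links.get(page, set()) & final_set

# final_set = expand_root_set(root_set, out_links, in_links, rounds=2)
# counted_inlinks("http://www.example.com/", final_set, in_links)
# A page seeded in the initial root set keeps all of its in-links (and their
# in-links); a page that only joined on the last pass may lose some entirely.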

I hope that makes at least some sense. I have a strong math/graph theory background and thus I might not be explaining this logic in the best way. Feel free to ask questions.

caine
msg:816489
11:58 pm on Mar 20, 2002 (gmt 0)

ggrot,

I had to read all that, and yes, it does make sense. Very impressive. I got into studying algorithms and heuristics at uni, though I concentrated more on open-source bots, from Eliza to Alice, which impressed me greatly even though I was approaching them from a natural-linguistics angle. I never knew to what extent some of these bots could be taken in relation to all of the information on the net.

jamsy
msg:816490
11:29 pm on Mar 21, 2002 (gmt 0)

Great post - is your alias Brett?

:)

Lurk
msg:816491
8:49 am on Mar 23, 2002 (gmt 0)

I'm getting hits from [almaden.ibm.com...] and they appear to be using the ODP, or at least part of it.
