Let's look at the nonsense posted here:
"Calculating Web page reputations".
These guys obviously haven't looked at any Web surfing behavioral studies. Their base assumptions are all incorrect. There is no such thing as a "random surfer", except for spiders. Surfers looking for information are directed in their link choices by the Web designer's presentation of the links.
Furthermore, no link to a Web page is an indication of the accuracy of the information on that page. All a link indicates is that one human being was willing to link to another human being's Web site. The fact of the linkage implies neither that the linking Web page (P) is knowledgeable about the topic, let alone authoritative, nor that the linked Web page (Q) is.
A thousand links will not make a badly written paper posted on the Web any more knowledgeable or authoritative on a given topic than one link will.
Finally, and this is where I stop analyzing this nonsense (because, as anyone who has submitted a proof for examination knows, all arguments fail at the first flaw), they write:
"In the setting of the Web, our assumption that a page has at least either one incoming link or one outgoing link may not hold. However, since we are dealing with collections of pages collected by crawling, we feel justified in assuming they all have at least one incoming link."
WRONG. There is absolutely no justification for this secondary assumption. New Web pages are regularly submitted to the search engines by Webmasters to be crawled; the engines do not reach those pages through other pages. The entire concept breaks down because of this one stupid assumption, which anyone who has created a Web page and submitted it to the search engines knows is not true.
The database is being flooded with pages that break the model. Since the model is broken, the algorithm based on the model won't work as expected. It will, in fact, deliberately and consistently filter out pages that are both relevant and authoritative in favor of POPULAR Web pages, which may or may not have merit.
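To make the point concrete, here is a toy power-iteration PageRank of my own devising (the graph, the names, and the numbers are invented by me, not taken from their paper). A page submitted directly to an engine, with zero inbound links, never rises above the teleport floor no matter how good its content is:

```python
# Toy power-iteration PageRank on a hand-made graph. The page "new" was
# "submitted directly to the engine": nobody links to it, so under any
# pure link-propagation model its score is stuck at the teleport minimum.

def pagerank(graph, damping=0.85, iters=50):
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in graph.items():
            if outlinks:
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    nxt[q] += share
            else:
                # dangling page: spread its mass uniformly
                for q in pages:
                    nxt[q] += damping * rank[p] / n
        rank = nxt
    return rank

graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "new": ["a"],   # brand-new page: has outlinks, but no inbound links
}
ranks = pagerank(graph)
# "new" bottoms out at the teleport floor regardless of its content
assert ranks["new"] < min(ranks["a"], ranks["b"], ranks["c"])
```

However relevant or authoritative "new" actually is, the link structure alone dooms it; that is the whole problem with their justification.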
The bottom line is that the best marketer will win in the long run because this model favors a well-marketed Web site over a poorly marketed Web site.
So much for "reputations". Web page reputations mean nothing to the individual surfer, who is more likely to be influenced by wording and tagging, as well as by the placement of links (and THAT has been established in Web surfing behavioral studies, which I, unlike these guys, have actually read).
Now, on to "identifying topics".
"In this section, we show that it is still possible to approximately find the topics a page has a high reputation on, although the ranks will not reflect the real probability distributions."
Well, considering that their probability functions don't work the way they should, that shouldn't be a real problem for them.
"Given a page p and a parameter d > 0, suppose we want to find the reputations of the page within the one-level influence propagation model. If the page acquires a high rank on an arbitrarily chosen term t within the full computation of Algorithm 1, then at least one of the following must hold: (1) term t appears in page p, (2) many pages on topic t point to p, or (3) there are pages with high reputations on t that point to p...."
First flaw in the argument: they are assuming that they have correctly established the reputations of Web pages. They have not. Hence, everything fails from this point on (and this is the beginning of the algorithm).
But I'll be gentle. I'll continue.
"...This observation provides us with a practical way of identifying the candidate terms. We simply start from page p and collect all terms that appear in it. We then look at the incoming links of the page and collect all possible terms from those pages. We continue this process until we get to a point where either there is no incoming link or the incoming links have very small effects on the reputations of page p. Let us denote the maximum number of iterations by k. The algorithm can be expressed as follows:"
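For those following along, the procedure they describe amounts to a backward walk over incoming links, collecting terms, for at most k hops. Here is my own rough sketch of it over toy in-memory data (inlinks and page_terms are stand-ins for a real crawl index; none of this is their code):

```python
# Sketch of the described term-collection procedure: start from page p,
# take its terms, then repeatedly step backward along incoming links,
# collecting terms from each predecessor, for at most k iterations.

def candidate_terms(p, inlinks, page_terms, k):
    terms = set(page_terms.get(p, ()))
    frontier = {p}
    for _ in range(k):
        frontier = {q for page in frontier for q in inlinks.get(page, ())}
        if not frontier:          # no incoming links left to follow
            break
        for q in frontier:
            terms |= set(page_terms.get(q, ()))
    return terms

# Invented toy data: r links to q, q links to p.
page_terms = {
    "p": ["java"],
    "q": ["java", "applet"],
    "r": ["coffee"],
}
inlinks = {"p": ["q"], "q": ["r"]}

assert candidate_terms("p", inlinks, page_terms, k=2) == {"java", "applet", "coffee"}
```

Note what happens with a page that has no inbound links at all: the walk terminates immediately, and the candidate terms are just the page's own words. Which brings us right back to the flaw above.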
The above included just to document their methodology. Since the methodology relies upon an obviously flawed model, we know it won't produce correct results.
"Duality of Terms and Pages"
Granting that their functions DO return result sets, what follows is essentially correct. But the old Garbage In/Garbage Out principle still applies:
"Indeed, if we fix p in equations 1, 3, 4 to a specific page, we will find the reputation ranks of the page for every possible topic t. We may then report the topics with the highest reputation ranks. If we fix instead t in the same equations to be a specific topic, we will find the reputation ranks of every page on topic t. Again, we may report those pages with high reputation ranks first in the answer to a query."
Hence, they have diluted their database with improperly ranked pages, so all they can do at this point is optimize the search over the bad rankings.
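The duality itself is trivial, by the way. If you store the ranks as a table keyed by (page, term), fixing either coordinate is just a filtered sort. A sketch with made-up numbers of my own:

```python
# Sketch of the term/page duality: one rank table, two query directions.
# All pages, terms, and rank values here are invented for illustration.

rank = {
    ("p1", "java"): 0.6, ("p1", "sql"): 0.1,
    ("p2", "java"): 0.2, ("p2", "sql"): 0.7,
}

def topics_of(page):
    """Fix the page: report its topics, best reputation first."""
    pairs = [(t, r) for (p, t), r in rank.items() if p == page]
    return sorted(pairs, key=lambda x: -x[1])

def pages_on(term):
    """Fix the term: report the pages, best reputation first."""
    pairs = [(p, r) for (p, t), r in rank.items() if t == term]
    return sorted(pairs, key=lambda x: -x[1])

assert topics_of("p1")[0] == ("java", 0.6)
assert pages_on("sql")[0] == ("p2", 0.7)
```

Of course, the output is only as good as the numbers in the table. Garbage In, Garbage Out.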
"In this section, we describe a preliminary evaluation of our approach. Since we did not have access to a large crawl of the Web, it was not feasible to do the full rank computations of Section 3.1. We also did not fully implement the approximate algorithms suggested in Section 3.2 due to the limitations imposed by the search engines we used, either on the maximum number of entries returned for a query or on the response time."
Of course, even if they did have access to a "large crawl of the Web", they still would be arbitrarily excluding NEW Web pages, which make up a significant percentage of content in any search engine which accepts submissions (either directly, as Alta Vista and Google do, or indirectly, as Inktomi does).
So, once again, they proceed from a highly flawed model.
Then they don't even bother to implement the full methodology they have just described in the previous sections. May the Food and Drug Administration never approve medications in this fashion. The sloppy approach to validating their incorrect assumptions is simply staggering. That anyone would give credence to this nonsense is even more mind-blowing.
"Known Authoritative Pages"
"In our first experiment, we picked a set of known authoritative pages on queries (java) and (+censorship +net), as reported by Kleinberg's HITS algorithm, and computed the topics that each page was an authority on. As shown in Figure 1, the term "java" is the most frequent term among pages that point to an authority on Java."
Okay, this test need not be dissected in detail. Basically, they picked the test most likely to validate their theory, to see if the theory is easily validated. If it had failed this test, they would have stopped and revised their functions (and, who knows, maybe they had to do a fair amount of revision just to pass it).
"In another experiment, we used Inquirus, the NECI meta-search engine, which computes authorities using an unspecified algorithm. We provided Inquirus with the query ("data warehousing") and set the number of hits to its maximum, which was 1,000, to get the best authorities, as suggested by the system. We picked the top four authorities returned by Inquirus and used our system to compute the topics those pages have high reputations on. The result, as shown in Figure 3, is again consistent with the judgments of Inquirus."
This is a bolder test. They were applying their functions to a data set they did not select themselves (they didn't know in advance how it would be chosen). But notice that they used a term, "data warehousing," that comes up frequently within computer science, data processing, and related subjects. These are not fields normally subjected to intensive marketing techniques, so this supposedly random test set is actually a more predictable one. The meta-search engine was most likely designed by people who make the same kinds of assumptions as these guys.
Since I cannot prove that, all I can say is that this test has questionable value. They have not validated the meta-search engine they selected (let alone their selection methodology). The standard of completeness applied to the experiments is thus minimal at best.
"Personal Home Pages"
"In another experiment, we selected a set of personal home pages and used our system to find the high reputation topics for each page. We expected this to describe in some way the reputation of the owner of the page. The results, as shown in Figure 4, can be revealing, but need to be interpreted with some care. Tim Berners-Lee's reputation on the "History of the Internet," Don Knuth's fame on "TeX" and "Latex" and Jeff Ullman's reputation on "database systems" and "programming languages" are to be expected. The humour site Dilbert Zone seems to be frequently cited by Don Knuth's fans. Alberto Mendelzon's high reputation on "data warehousing," on the other hand, is mainly due to an online research bibliography he maintains on data warehousing and OLAP in his home page, and not to any merits of his own."
I've looked at the table. Basically, the data set is flawed: it's too small a sample to be of any value. The largest number of links available was 1,733, and all four pages evaluated were taken from academic servers (if I may be allowed to count the www.w3.org server as part of the academic world). Most personal pages are not devoted to computer-science themes, and are not found on academic servers.
Their last test went after Computer Science departments on the Web. So they made absolutely no effort to validate their assumptions against real-world conditions or anything approximating them. In fact, they chose a data set which was most likely to produce the kind of validation they were seeking: Web sites devoted to or created by computer scientists.
It doesn't take a genius to see that these results are stacked, and obviously stacked in a very amateurish way. Even the people who claimed to have achieved cold fusion weren't this sloppy and unprofessional in their research.
But, to continue the gentle massaging of egos:
"There are a number of factors that affect our page reputation computations....
"...The first factor is how well a topic is represented on the Web. A company, for instance, may have a high reputation on a specific topic, or a person may be well known for his or her contribution in a specific field, but their home pages may not receive the same recognition mainly because the topic or the field is not well represented on the Web; or even if it is, it may not be visible among other dominant topics. This can be easily seen in some of our experiments."
The first admission of guilt. Of course, they conveniently neglected to point out the serious flaws in their methodology documented above.
"The second factor is how well pages on a topic are connected to each other. There are two extreme cases that can affect the convergence of a topic in our computations. At one extreme, there are a few pages such as the Microsoft home page (www.microsoft.com) with incoming links from a large fraction of all pages on the Web. These pages end up having high reputation on almost every topic represented in the Web; it is not reasonable to identify a small set of highly-weighted topics for them."
A minor nod to the power of marketing, but they obviously don't understand the implications of marketing on the Internet.
"At the other extreme, there are pages with no more than a few incoming links; according to some estimates, a large number of pages fall in this category. Depending on where the incoming links of a page are coming from and the reputations of those links, they can have various effects on the reputation of a page according to our models. Our current implementation, however, may not report any strong reputations on any topic for these pages because all incoming links are simply weighted equally."
Unfortunately, they simply didn't take into consideration the fact that no new Web page has inbound links (unless it's anticipated, but such pages are rare). Their test database should have been seeded with pages that had no inbound links. Had they taken this simple step, they would have learned quickly why their approach won't work.
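Their equal-weighting concession is easy to illustrate. With only a handful of in-links, each on a different topic and each counted the same, no single topic ever clears a reporting threshold. Toy numbers and a hypothetical cutoff of my own invention:

```python
# Sketch of the equal-weighting problem they concede: a page with three
# in-links, each about a different topic, each weighted identically.
# The topic list and the STRONG cutoff are invented for illustration.

inlink_topics = ["tex", "fonts", "typography"]   # three in-links, three topics
weight = 1.0 / len(inlink_topics)                # equal weighting

topic_weight = {}
for t in inlink_topics:
    topic_weight[t] = topic_weight.get(t, 0.0) + weight

STRONG = 0.5   # hypothetical cutoff for reporting a "strong" reputation
strong_topics = [t for t, w in topic_weight.items() if w >= STRONG]
assert strong_topics == []   # nothing clears the bar
```

So the very pages that dominate the Web by their own cited estimates, the sparsely linked ones, come out of their implementation with no reputation at all.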
"We have introduced general notions of page reputation on a topic, combining the textual content and the link structure of the Web. Our notions of reputation are based on random walk models that generalize the pure link-based ranking methods developed earlier. For instance, our ranking based on the one-level weight propagation model becomes PageRank if the rank is computed with respect to all possible topics. We have presented algorithms for identifying the topics that a page has highest reputation on and for computing the reputation rank of a page on a topic. Our current work concentrates on refining the implementation of TOPIC to achieve more accurate rankings and better performance."
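To be fair, the one formal claim in that paragraph does check out on paper: if the topic's teleport set is every page, the biased jump becomes uniform, and you are back to plain PageRank. Here is my own toy verification (invented three-page graph, not their TOPIC implementation):

```python
# Topic-biased random walk: with probability (1-damping) jump to a
# random page matching the topic; otherwise follow an outgoing link.
# When every page matches the topic, the jump is uniform -- PageRank.
# Graph and parameters are mine, purely for illustration.

def topic_rank(graph, topic_pages, damping=0.85, iters=100):
    pages = list(graph)
    m = len(topic_pages)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for p in topic_pages:              # jump only to on-topic pages
            nxt[p] += (1.0 - damping) / m
        for p, outs in graph.items():      # every page here has outlinks
            for q in outs:
                nxt[q] += damping * rank[p] / len(outs)
        rank = nxt
    return rank

graph = {"a": ["b"], "b": ["c"], "c": ["a"]}   # a symmetric 3-cycle

ranks_all = topic_rank(graph, list(graph))     # "topic" = every page
ranks_a = topic_rank(graph, ["a"])             # genuinely topic-biased

# On a symmetric cycle, plain PageRank is uniform; the all-topics run
# reproduces that, while biasing toward "a" pulls rank toward "a".
for v in ranks_all.values():
    assert abs(v - 1/3) < 1e-6
assert ranks_a["a"] > ranks_all["a"]
```

That reduction is the easy part, though. It says nothing about whether the topic-conditioned ranks mean anything, which is the part I've been disputing all along.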
They are pursuing the wrong methodology, so their work will never be finished. I have to ask: how many years will it be before they realize the serious flaw in their approach?
When Brett invited me to visit this forum, I pointed out that I would be brutal. I am unforgiving to the people who bat out these Fermatian Proofs and expect the world to just sit around and applaud their brilliance. To overlook the most fundamental aspects of the Web community in an analysis of how to properly catalogue its member sites is inexcusable.
I hope other research on which search engine technology is based is better done than this travesty, but I've already shown how the PageRank research was also seriously flawed (and how Google doesn't work as advertised) in the search engine forums at JimWorld.
Brett, I'm still looking for answers, and I'll keep my eyes on your forums here. But if this is the way search engine technology is moving, I may just continue to rely upon directories and finding my own links the hard way. Obviously these guys are not equipped to do the job for us.