The Term Vector database
WTMS: A System for Collecting and Analyzing Topic-Specific Web Information
Graph Structure in the web
"Term selection. We select terms for inclusion in a page's vector by the usual Salton TF-IDF methodology (see, for example, [Baeza-Yates et al 99]). That is, we weight a term in the page by dividing the number of times it appears in the page by the number of times it appears in the collection. For a given page, we select the 50 terms with the greatest weight according to this formula."
Is this how they establish "themes" for the site? So it seems pages are identified by these vectors rather than by URLs. Maybe they compare vectors against each other, weighing in off-page statistics (external vectors pointing to the page), and then come up with the top-ranking pages. It sounds interesting, and I'm merely rambling off the top of my head, but I just ran across this comment:
"Despite its usefulness, term vector information for Web pages is not readily available."
although it seems to be in progress ....
"To support such applications, we have built the Term Vector Database. We have populated it with term vectors for all the pages in the AltaVista index."
Great find Seth, useful for the future or even understanding the current engineering approach.
"We select terms for inclusion in a page's vector by the usual Salton TF-IDF methodology"
Here is a document on the TF-IDF weighting, maybe not the one they are referring to but a similar approach:
Check out section 2.2 Weight computation. It is a simple inverse function of mapping a set of documents (or a site) to give a vector score by comparing a page with the entire site relating to the weight of the keyword term. (Sounds like the mathematical explanation to Brett's themes) Now, if only we had all the other ranking factors like boosts if the keyword appears in location "x" or the boost from site "a" that links to site "b" and that vector score and so forth, you'd have yourselves a rudimentary algorithm.
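For anyone who wants to play with the numbers, here is a rough Python sketch of the term selection as the quote describes it: weight each term by its count in the page divided by its count in the collection, then keep the 50 heaviest terms. The function name `select_terms`, the toy pages, and the tie-breaking rule are my own assumptions for illustration, not anything taken from the paper.

```python
from collections import Counter

def select_terms(page_tokens, collection_counts, max_terms=50):
    """Weight each term by (count in page) / (count in collection), keep the top max_terms."""
    page_counts = Counter(page_tokens)
    weights = {
        term: count / collection_counts[term]
        for term, count in page_counts.items()
        if collection_counts.get(term, 0) > 0
    }
    # Sort by weight, highest first; break ties alphabetically (an assumption).
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_terms]

# Hypothetical toy collection: three tiny "pages" already split into words.
pages = {
    "p1": "exotic car engine exotic car".split(),
    "p2": "car insurance car loan".split(),
    "p3": "the engine of the car".split(),
}
collection_counts = Counter(word for tokens in pages.values() for word in tokens)

vector = select_terms(pages["p1"], collection_counts, max_terms=3)
for term, weight in vector:
    print(term, round(weight, 2))
```

In this toy data "car" appears most often on p1 but ends up with the lowest weight of the three, because it is also common across the rest of the collection. A term that is rare in the collection but repeated on the page floats to the top.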
This looks like it could be how they score a site's theme relevance
"the big jump at "50" occurs because the page filter clips all larger vectors to 50 terms"
"45% of terms in a vector have a count of one; 99% of terms in a vector have a count of at most 31"
This looks like how themes and link popularity work together
"Topic distillation is a technique of using hyperlink connectivity to improve ranking of web search results (see [Kleinberg 98], [Chakrabarti et al 98], [Bharat et al 98] for specific instances of topic distillation algorithms). These algorithms work on the assumption that some of the best pages on a query topic are highly connected pages in a subgraph of the web that is relevant to the topic. A simple mechanism for constructing this query-specific subgraph is to seed the subgraph with top-ranked pages from a standard search engine and then expand this set with other pages in vicinity of the seed set ([Kleinberg 98]). However, this expanded set sometimes includes highly connected pages that are not relevant to the query topic, reducing the precision of the final result. [Bharat et al 98b] identifies this problem as topic drift.
[Bharat et al 98b] shows that topic drift can be avoided by using topic vectors to filter the expanded subgraph. A topic vector is a term vector computed from the text of pages in the seed set. A page is allowed to remain in the expanded graph only if its term vector is a good match to this topic vector. Specifically, the inner product of the two vectors is compared to a suitable relevance threshold, and pages below the threshold are expunged from the expanded graph."
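The filtering step in that quote is simple enough to sketch. The Python below is only an illustration under my own assumptions: term vectors are stored as plain `{term: weight}` dicts, the weights and URLs are made up, and the threshold of 0.5 is arbitrary. The quote does not say how the vectors are normalized or what threshold is used.

```python
def inner_product(vec_a, vec_b):
    """Inner product of two sparse term vectors stored as {term: weight} dicts."""
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a  # iterate over the shorter vector
    return sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())

def filter_expanded_graph(page_vectors, topic_vector, threshold):
    """Keep only pages whose term vector matches the topic vector well enough."""
    return {
        url: vec
        for url, vec in page_vectors.items()
        if inner_product(vec, topic_vector) >= threshold
    }

# Hypothetical topic vector built from the seed set, and two expanded-set pages.
topic_vector = {"java": 0.8, "applet": 0.5, "jvm": 0.3}
expanded = {
    "a.example/java-tutorial": {"java": 0.9, "applet": 0.4},
    "b.example/popular-portal": {"news": 0.7, "sports": 0.6, "java": 0.1},
}

kept = filter_expanded_graph(expanded, topic_vector, threshold=0.5)
print(list(kept))
```

The well-linked portal page drops out because its vector barely overlaps the topic vector, which is the "topic drift" fix the quote describes: link count alone does not keep a page in the subgraph.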
BTW, it now outranks the most popular site as determined by link count from other sites. This competitive site has 5 times the number of incoming links.
Themes is what made me come here too. Brett has long been the voice over at forums that has made sense, everything else had been degrading for us until I picked up on this themes thing and we are right back in business.
I don't really post at all anywhere else as I am not thrilled about giving out secrets. I am willing to somewhat here as Brett has been so much help to me I figure I owe to give a little back (for what it is worth). If any of you want to communicate off the record, feel free to send me an email at M_Neo111@go.com . I think this vector thing will pan out huge if we can start thinking like the engineers. All that I read today has really been enlightening, how the engines are just building off of old information retrieval concepts. I really should have seen it sooner but usually there is not much time to research.
Let's keep plugging away and see what we can find.
This looks like a hell of an interesting find. Seth, I'm interested to know where you found this. I, like you guys, moved over here because of Brett. He helped me enormously with figuring out inks. algo in the VP forum when AOL dropped Excite.
One cause for concern: a number of people got their pages nuked from the AV database for disclosing too much about AV in the VP forums. Is there any danger of that happening here?
I was doing some research about the "bowtie theory" in hopes of getting the inside track on AV's use of link popularity (graph structure of the web) when I happened to come across a link with "term vector" in its name. As soon as I saw it I remembered it from a newsletter that I read in April that had some rumors of AV's future plans.
It's a good thing that you reminded me of this, because now that I actually went back and re-read the newsletter, I realized there are some things that I had forgotten about. The newsletter talks about AV's future use of
- Link Popularity
- Click Popularity
- Ownership of Domain
- Term Vector
If they were right about the Term Vector (even though they didn't know what it meant) could they be right about the Click Popularity, and Domain Ownership?
You know, if you read that article on Topic extraction...wow, that is quite a road map. If you understand even half of what they are talking about, read that article folks...that is completely awesome. If you back up to the contents and start reading, it is (to a search engine junkie) fascinating. There must be 30 documents there that are required reading.
And this one on Content Relationships (sure reads like themes to me).
And lastly, my favorite: Computing Web Page Reputations. (if that ain't THEMES, I don't know what is).
From what I've studied so far, I think there are some bottom lines:
1) my new mantra: focus focus focus. Focus on single keyword pages related to your theme. I don't think you can overdo focusing on a single keyword to build your theme.
2) cross link your site to hell and back. Don't let the indexer decide one page is "less" important than another just because it has fewer links to it within your site.
3) Confuse the indexer into not being able to decide whether to go left or right. I think you want the indexer to end up with a 'gray' area about your site. If you don't, they can pigeonhole you into a theme you don't belong in (like sew is still tops for Exotic car engines - why? I let them see "exotic", "car", and "engine" on several different pages. They decided the site was about 'exotic car engines' because of all the related words). If I'd gone for the 'gray area' approach, they wouldn't have been able to deduce that.
4) file this one under: 'harebrained scheme of the day'. Remember back in 95-96 when everyone was putting "keywords: xyz" right on the page? Notice how all those w9 papers have the same thing? I just went and looked at several high ranking pages under weird terms here on sew, and you know what? - They were all pages about "keywords xyz" (with xyz being the weird term). hmmm. sure makes me wonder if that trick is going to start working again.
5) Sad to say, but links-in-context really are important. Alta, Excite, Infoseek, Fast and Google haven't figured it out yet, but I'm dead certain Inktomi has link relevancy nailed stone cold.
>JamesR: wow, this is money
hehe. I like that. A belated welcome to the forums James.
That third paper really hits on some of the objections to the link popularity model and takes it a step further. These guys must have been doing calculus in kindergarten to get an equation that represents the actions of a surfer. (Remember the days when all it took was a simple keyword density analyzer on a page to get a top ranking? Dark ages....) The criteria of deciding what is a "hub" and what is an "authority" would be interesting. How many links in respect to content constitute a hub? It is obvious where a directory would fall. If this gets implemented, I wonder how a commercial business site made largely of product pages would expect to do well in ranking unless there were some partnership with an "authority" site or become the authority through content and clever linking.
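On the hub versus authority question, it may help to see the usual textbook form of Kleinberg's hubs-and-authorities iteration. This is a minimal Python sketch of that general idea, not the algorithm from the reputation paper and certainly not what any engine actually runs. The link graph and page names are invented for the example.

```python
import math

def hits(links, iterations=20):
    """links: {page: [pages it links to]}. Returns (hub, authority) score dicts."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of pages linked out to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        # Normalize both score sets so the numbers stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: a directory linking to three shops, and a blog linking to one.
links = {
    "directory": ["shopA", "shopB", "shopC"],
    "blog": ["shopA"],
}
hub, auth = hits(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this formulation there is no fixed number of links that makes a page a hub. Every page gets both scores, and a page scores as a hub only to the extent that the pages it points to are themselves pointed to by other good hubs. A product page with no outgoing links gets a hub score of zero but can still earn an authority score from whoever links to it.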
Brett, here is a very strong case for your argument of the importance of internal linking. Still on the last paper in section 3.2 "Identifying Topics"
"However, in practice we may not have access to a large crawl of the Web, or we may not be able to afford the full computation. In this section, we show that it is still possible to approximately find the topics a page has a high reputation on, although the ranks will not reflect the real probability distributions. . . .If the page acquires a high rank on an arbitrarily chosen term t within the full computation of Algorithm 1, then at least one of the following must hold: (1) term t appears in page p, (2) many pages on topic t point to p, or (3) there are pages with high reputations on t that point to p. This observation provides us with a practical way of identifying the candidate terms. We simply start from page p and collect all terms that appear in it. We then look at the incoming links of the page and collect all possible terms from those pages. We continue this process until we get to a point where either there is no incoming link or the incoming links have very small effects on the reputations of page p."
In other words, if you have to rank a page on the fly and don't have access to external links, take the same model and apply it internally. This is going to cause us to really think through and carefully position links in a site or you could easily take a wrong turn right off the bat and accidentally mislead a search engine spider.
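The candidate-term step in that quote can be sketched in a few lines. This Python version makes some simplifying assumptions of my own: each page's terms are already available as a set, incoming links are given as a lookup table, and the walk simply stops after `k` hops or when there are no more incoming links. The paper's other stopping rule (incoming links having very small effect on the reputation) is not modeled here. All file names are hypothetical.

```python
def candidate_terms(page, page_terms, incoming, k=2):
    """Collect terms from `page` and from pages up to k incoming-link hops away."""
    terms = set(page_terms.get(page, ()))
    visited = {page}
    frontier = [page]
    for _ in range(k):
        next_frontier = []
        for current in frontier:
            for source in incoming.get(current, ()):
                if source not in visited:
                    visited.add(source)
                    terms.update(page_terms.get(source, ()))
                    next_frontier.append(source)
        if not next_frontier:  # no more incoming links to follow
            break
        frontier = next_frontier
    return terms

# Hypothetical three-page example: partner.html -> index.html -> widgets.html
page_terms = {
    "widgets.html": {"widget", "blue"},
    "index.html": {"widget", "catalog"},
    "partner.html": {"gadget"},
}
incoming = {
    "widgets.html": ["index.html"],
    "index.html": ["partner.html"],
}

one_hop = candidate_terms("widgets.html", page_terms, incoming, k=1)
two_hops = candidate_terms("widgets.html", page_terms, incoming, k=2)
print(sorted(one_hop), sorted(two_hops))
```

Note how the words on the linking pages become candidate terms for the target page even though they never appear on it, which is why the wording on your own internal linking pages could matter under this kind of scheme.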
Here is another important piece further down:
"Again, every page which is not identified in Step 1 is assumed to have a rank of zero. Note that the hubs-and-authorities computation of Kleinberg is a special case of this method; it is based on only identifying pages that either contain term t or are reachable within one link from one such page."
Interesting way to filter. You know what happens when you start multiplying by zero. The "one link away" approach is interesting and could really help to narrow in on who to link to. Can't get a link from Microsoft? That's OK, find out who they link out to and target them. In theory the theme score would filter down.
Also Section 5: Experimental Evaluation
"Only a limited number of incoming links are examined; we obtain at most 500 incoming links of a page, but the number of links returned by the search engine, currently Alta Vista, can be less than that."
AV doesn't seem to have it all together yet, but it is operating this approach on a basic level. All this is pretty incredible. If someone could get some dialogue going with a mathematician, it would be nice to start gathering a clear idea of the desired term weight of the entire site. My question is: should the term frequency on the site be high or low? It is obvious that the phrase needs to be mentioned on every page, but how often? Every instance of the term appearing is going to affect the algorithm in some way. With all the inverting they are doing, I can't tell offhand what would be most favorable.
"If someone could get some dialogue going with a mathematician, it would be nice to start gathering clear idea of the desired term weight of the entire site"
The term vector weight computation is a pretty basic formula (compared to others listed in these articles). I could figure out the term vector weight of a small site pretty easily, the problem would be that we would have nothing to compare it to. We would need to get a sampling of top ranking pages which could get rather difficult when large sites are involved (mainly because you would need to know how many times each word would be used on every page of the site). If you guys would be interested in trying to figure this out we would need to have a spider to collect all this information for us and some place a little more private to figure everything out.
Could anyone please point me to the thread regarding "Brett's 'themes'"?
The one part of the first article I didn't understand is the section 2.2 weight computation. I am hoping 'Brett's themes', which James_R refers to in not so many words as the "non-math" version of the weighting computation, will help me to understand what it is.
Thanks for helping a novice.
Notice the probabilistic ranking methods along with vector space models in that slide show. Ray says these are more popular.
My question to him was, since that slide show is two years old, have there been any findings as to which method (vector or probabilistic) is more effective for information organization?
Here was his email back to me:
These methods are undergoing continuing testing in the NIST TREC program. Generally the "preferred algorithm" these days is the probabilistically-based OKAPI BM-25 algorithm.
see [trec.nist.gov]
NIST - National Institute of Standards and Technology
TREC - Text Retrieval Conference
If any of you are brave enough to venture to that site and wade through the publications section (you thought those term vector database pages were long and confusing), look for the publications by S.E. Robertson. He worked with the Okapi algorithms. BTW, have your Adobe readers and distillers (for the older conferences) ready.
There is a math formula that can be found. It is more complex than the TF-IDF equation. Also, it appears to me to be similar to the vector-based method in that they look at the collection of documents as well, which indicates themes. This is what I got out of it after 45 minutes of staring bug-eyed at a calculus equation and its variables. :)
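For reference, the form of BM25 that is most often quoted can be written out in a few lines of Python. This is a sketch of the commonly cited formula, not a transcription from the TREC papers mentioned above. The parameter values `k1=1.2` and `b=0.75` are typical defaults I am assuming, the IDF uses a common variant with 1 added inside the logarithm so it never goes negative, and the toy corpus is made up.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query using a common BM25 variant."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    counts = Counter(doc_tokens)
    # Longer-than-average documents are penalized, shorter ones boosted.
    length_norm = 1 - b + b * len(doc_tokens) / avg_len
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        tf = counts[term]
        # Term frequency saturates: each extra occurrence adds less than the last.
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

# Hypothetical three-document collection, already split into words.
corpus = [
    "term vector database".split(),
    "link popularity and click popularity".split(),
    "term vector weights for term selection".split(),
]
scores = [bm25_score(["term", "vector"], doc, corpus) for doc in corpus]
print([round(s, 3) for s in scores])
```

Like TF-IDF, it looks at the whole collection (through the document frequency and the average document length), which fits the observation above. The main differences are that repeated occurrences of a term count for less and less, and that document length is factored in.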
Seriously, there are a lot of publications from these conferences that go back to like 1991. I can't spend all the time they require to do the research, but it would be interesting to talk to someone who did. Maybe a student.
Next, I mentioned to Ray that it seemed AltaVista was using a vector-based algorithm and that this would especially make sense since they are at the cutting edge of this stuff.
I assume that AltaVista is using a vector approach. That still gives good performance, in general, and many of the research vector systems are using a form of the OKAPI weights.
So, it seems everyone was on track with term vectors anyway, at least here with Alta. Maybe it was a waste of time but I find this research very interesting.
Just thought I'd share my misadventures in SE research with everyone...
Also I've been doing some additional research into the okapi research methods. From everything that I found, it seems as though these methods are still in the experimental stages. I couldn't find any example of commercial uses, however below is a link to the Brits that developed the okapi system, and on their site is a link to an okapi system that is available through a VT100 interface.
I'm still having some doubts about AV using what seems to be a somewhat experimental weighting method, but it's definitely worth further research.
After getting two college degrees and coming within 3 classes of getting two more, I reached the conclusion that sometimes the academic community takes itself far too seriously.
I will admit I am still wading through all the hokum that you guys have found on these other Web sites, but basically what it boils down to (so far) is this:
If you create a Web site, send out press releases and get other Web sites to link to you, and CALL YOURSELF AN AUTHORITY on your own Web site, you will get a high ranking in search engines that rely upon page and link relevance.
As I have maintained in other forums, these sorcerous devices for obtaining relevant search results are causing me no end of grief in my own efforts to be ranked highly AND to use the search engines for my own searches.
There is no intelligent way to determine if the content of a Web page is authoritative from the inbound links. Someday someone is going to explain it to these people and they will understand why surfers get frustrated with their irrelevant results.
Brett, I'm here. I won't promise to hang around any longer than I'm hanging around the other forum, but I do appreciate the efforts people are making to share information.
I just wish it weren't so disappointing. You'd think that people who are supposed to earn their livings from designing search engines would apply a little common sense to what they are doing.
And I CAN show how these engines (Alta Vista particularly, right now) are not doing their job because of too much reliance upon this pseudo-science that pretends it can determine the authoritativeness of a Web site.
I am ashamed to admit I even have computer related degrees based on the absolute nonsense I have waded through in these papers. How did these guys ever get doctorates in the first place? Out of a cereal box?
[Angry podium mode off]
"Calculating Web page reputations".
These guys obviously haven't looked at any Web surfing behavioral studies. Their base assumptions are all incorrect. There is no such thing as a "random surfer", except for spiders. Surfers looking for information are directed in their link choices by the Web designer's presentation of the links.
Furthermore, no link to any Web page is an indication of the accuracy of the information of that page. All it indicates is that one human being was willing to link to another human being's Web site. The fact of the linkage doesn't imply that the linking Web page (P) is knowledgeable about the topic, let alone authoritative, nor that the linked Web page (Q) is knowledgeable about the topic, let alone authoritative.
1000 links will not make a badly written paper posted on the Web any more knowledgeable or authoritative on a given topic than 1 link.
Finally, and this is where I stop analyzing this nonsense (because, as anyone who has submitted a proof for examination knows, all arguments fail at the first flaw), they write:
"In the setting of the Web, our assumption that a page has at least either one incoming link or one outgoing link may not hold. However, since we are dealing with collections of pages collected by crawling, we feel justified in assuming they all have at least one incoming link."
WRONG. There is absolutely no justification for this secondary assumption. New Web pages are regularly submitted to the search engines by Webmasters to be crawled. The engines do not get to those pages through other pages. The entire concept breaks down because of this one stupid assumption that anyone who has created a Web page and submitted to the search engines knows is not true.
The database is being flooded with pages which break the model. Since the model is broken, the algorithm based on the model won't work as expected. It will, in fact, deliberately and consistently filter out pages that are both relevant and authoritative in favor of POPULAR Web pages which may or may not have merit.
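To make the argument concrete, here is what the plain random-surfer arithmetic does with a freshly submitted page that nobody links to yet. This is a minimal Python sketch of ordinary PageRank by power iteration, with a damping factor of 0.85 that I am assuming. It is not the topic-specific algorithm from the paper under discussion, and the four-page graph is invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}, ranks sum to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes its damped rank on in equal shares along its links.
        for src, targets in links.items():
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: "new_page" was submitted directly and has no inbound links.
links = {
    "hub": ["popular", "other"],
    "other": ["popular"],
    "popular": ["hub"],
    "new_page": ["popular"],
}
rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this toy setup the unlinked page does not break the computation, but it can never get more than the "random jump" share, (1 - damping) / n, no matter what is written on it. Whether that counts as a flaw in the model or as the model doing what it was designed to do is exactly what is being argued here.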
The bottom line is that the best marketer will win in the long run because this model favors a well-marketed Web site over a poorly marketed Web site.
So much for "reputations". Web page reputations mean nothing to the individual surfer, who is more likely to be influenced by wording and tagging, as well as placement of links (and THAT has been established in Web surfing behavioral studies, which I have read, unlike these guys).
Now, on to "identifying topics".
"In this section, we show that it is still possible to approximately find the topics a page has a high reputation on, although the ranks will not reflect the real probability distributions."
Well, considering that their probability functions don't work the way they should, that shouldn't be a real problem for them.
"Given a page p and a parameter d > 0, suppose we want to find the reputations of the page within the one-level influence propagation model. If the page acquires a high rank on an arbitrarily chosen term t within the full computation of Algorithm 1, then at least one of the following must hold: (1) term t appears in page p, (2) many pages on topic t point to p, or (3) there are pages with high reputations on t that point to p...."
First flaw in the argument: they are assuming that they have correctly established the reputations of Web pages. They have not. Hence, everything fails from this point on (and this is the beginning of the algorithm).
But I'll be gentle. I'll continue.
"...This observation provides us with a practical way of identifying the candidate terms. We simply start from page p and collect all terms that appear in it. We then look at the incoming links of the page and collect all possible terms from those pages. We continue this process until we get to a point where either there is no incoming link or the incoming links have very small effects on the reputations of page p. Let us denote the maximum number of iterations by k. The algorithm can be expressed as follows:"
The above included just to document their methodology. Since the methodology relies upon an obviously flawed model, we know it won't produce correct results.
"Duality of Terms and Pages"
Granting that their functions DO return result sets, what follows is essentially correct. But the old Garbage In/Garbage Out principle still applies:
"Indeed, if we fix p in equations 1, 3, 4 to a specific page, we will find the reputation ranks of the page for every possible topic t. We may then report the topics with the highest reputation ranks. If we fix instead t in the same equations to be a specific topic, we will find the reputation ranks of every page on topic t. Again, we may report those pages with high reputation ranks first in the answer to a query."
Hence, they have diluted their database with improperly ranked pages, so all they can do at this point is optimize the search over the bad rankings.
"In this section, we describe a preliminary evaluation of our approach. Since we did not have access to a large crawl of the Web, it was not feasible to do the full rank computations of Section 3.1. We also did not fully implement the approximate algorithms suggested in Section 3.2 due to the limitations imposed by the search engines we used, either on the maximum number of entries returned for a query or on the response time."
Of course, even if they did have access to a "large crawl of the Web", they still would be arbitrarily excluding NEW Web pages, which make up a significant percentage of content in any search engine which accepts submissions (either directly, as Alta Vista and Google do, or indirectly, as Inktomi does).
So, once again, they proceed from a highly flawed model.
Then they don't even bother to implement the full methodology they have just described in the previous sections. May the Food and Drug Administration never approve medications in this fashion. The sloppy approach to validating their incorrect assumptions is simply mind-staggering. That anyone would give credence to this nonsense is even more mind-blowing.
"Known Authoritative Pages"
"In our first experiment, we picked a set of known authoritative pages on queries (java) and (+censorship +net), as reported by Kleinberg's HITS algorithm, and computed the topics that each page was an authority on. As shown in Figure 1, the term ``java'' is the most frequent term among pages that point to an authority on Java."
Okay, this test need not be diced in detail. Basically, they picked the test most likely to validate their theory to see if the theory is easily validated. If it had failed this test, they would have stopped and revised their functions (and, who knows, maybe they had to do a fair amount of revision just to pass this test).
"In another experiment, we used Inquirus, the NECI meta-search engine, which computes authorities using an unspecified algorithm. We provided Inquirus with the query (``data warehousing'') and set the number of hits to its maximum, which was 1,000, to get the best authorities, as suggested by the system. We picked the top four authorities returned by Inquirus and used our system to compute the topics those pages have high reputations on. The result, as shown in Figure 3, is again consistent with the judgments of Inquirus."
This is a bolder test. They were applying their functions to a randomly selected data set (they didn't know how it would be selected). But notice that they used a term, "data warehousing", that is frequently brought up within the fields of computer science, data processing, and related subjects. These are not fields which are normally subjected to intensive marketing techniques. So this random test set is actually a more predictable test set. The meta search engine was most likely designed by people who make the same kinds of assumptions as these guys.
Since I cannot prove that, all I can say is that this test has questionable value. They have not validated the meta search engine they selected (let alone their selection methodology). The standard of completeness applied to the experimentation is thus very minimal.
"Personal Home Pages"
"In another experiment, we selected a set of personal home pages and used our system to find the high reputation topics for each page. We expected this to describe in some way the reputation of the owner of the page. The results, as shown in Figure 4, can be revealing, but need to be interpreted with some care. Tim Berners-Lee's reputation on the ``History of the Internet,'' Don Knuth's fame on ``TeX'' and ``Latex'' and Jeff Ullman's reputation on ``database systems'' and ``programming languages'' are to be expected. The humour site Dilbert Zone  seems to be frequently cited by Don Knuth's fans. Alberto Mendelzon's high reputation on ``data warehousing,'' on the other hand, is mainly due to an online research bibliography he maintains on data warehousing and OLAP in his home page, and not to any merits of his own."
I've looked at the table. Basically, the data set is flawed. It's too small a sample to be of any value. The largest number of links available was 1733 and all four pages evaluated were taken from academic servers (if I may be allowed to include the www.w3.org server in the academic world). Most personal pages are not devoted to computer science related themes, and are not found on academic servers.
Their last test went after Computer Science departments on the Web. So they made absolutely no effort to validate their assumptions against real-world conditions or anything approximating them. In fact, they chose a data set which was most likely to produce the kind of validation they were seeking: Web sites devoted to or created by computer scientists.
It doesn't take a genius to see that these results are stacked and obviously stacked in a very amateurish way. Even the people who claimed to have achieved cold fusion weren't this sloppy and unprofessional in their research.
But, to continue the gentle massaging of egos:
"There are a number of factors that affect our page reputation computations....
"...The first factor is how well a topic is represented on the Web. A company, for instance, may have a high reputation on a specific topic, or a person may be well known for his or her contribution in a specific field, but their home pages may not receive the same recognition mainly because the topic or the field is not well represented on the Web; or even if it is, it may not be visible among other dominant topics. This can be easily seen in some of our experiments."
The first admission of guilt. Of course, they conveniently neglected to point out the serious flaws in their methodology documented above.
"The second factor is how well pages on a topic are connected to each other. There are two extreme cases that can affect the convergence of a topic in our computations. At one extreme, there are a few pages such as the Microsoft home page (www.microsoft.com) with incoming links from a large fraction of all pages on the Web. These pages end up having high reputation on almost every topic represented in the Web; it is not reasonable to identify a small set of highly-weighted topics for them."
A minor nod to the power of marketing, but they obviously don't understand the implications of marketing on the Internet.
"At the other extreme, there are pages with no more than a few incoming links; according to some estimates (e.g. ), a large number of pages fall in this category. Depending on where the incoming links of a page are coming from and the reputations of those links, they can have various effects on the reputation of a page according to our models. Our current implementation, however, may not report any strong reputations on any topic for these pages because all incoming links are simply weighted equally."
Unfortunately, they simply didn't take into consideration the fact that no new Web page has inbound links (unless it's anticipated, but such pages are rare). The database should have been diluted with pages that had no inbound links. Had they taken this simple step, they would have learned quickly why their approach won't work.
"We have introduced general notions of page reputation on a topic, combining the textual content and the link structure of the Web. Our notions of reputation are based on random walk models that generalize the pure link-based ranking methods developed earlier. For instance, our ranking based on the one-level weight propagation model becomes PageRank if the rank is computed with respect to all possible topics. We have presented algorithms for identifying the topics that a page has highest reputation on and for computing the reputation rank of a page on a topic. Our current work concentrates on refining the implementation of TOPIC to achieve more accurate rankings and better performance."
They are pursuing the wrong methodology. Their work will therefore never be finished. I have to ask how many years it will be before they realize the serious flaw in their approach?
When Brett invited me to visit this forum, I pointed out that I would be brutal. I am unforgiving to the people who bat out these Fermatian Proofs and expect the world to just sit around and applaud their brilliance. To overlook the most fundamental aspects of the Web community in an analysis of how to properly catalogue its member sites is inexcusable.
I hope other research on which search engine technology is based is better done than this travesty, but I've already shown how the PageRank research was also seriously flawed (and how Google doesn't work as advertised) in the search engine forums at JimWorld.
Brett, I'm still looking for answers, and I'll keep my eyes on your forums here. But if this is the way search engine technology is moving, I may just continue to rely upon directories and finding my own links the hard way. Obviously these guys are not equipped to do the job for us.
You're right, research publication and studying the makeup of the bow tie web is all the rage with the cap & gown set. The whole thing we were interested in with this thread was the fact that so many of those docs mentioned were by current admins at the SEs: Broder, Page, et al. That is good background on their line of thinking.
Are some of them right or wrong? It really doesn't matter since they are going to do as they choose - we are just trying to figure out where it is all going.
Like that one on calculating term vectors, or figuring out how to determine a site's topic solely off a handful of links. It can have an effect on how we optimize and run our sites. No, we aren't going to crack any algos reading term papers for a conference, but it does give us a slight road map. That one on topic distillation is very informative stuff and probably 90% of the reason I shut down bl - it is good real-world applicable information.