SEM Research Topics Forum

Alta Research part 2
Brett_Tabke




msg:818499
 4:17 am on Aug 10, 2000 (gmt 0)

Continuation of this thread [webmasterworld.com].

 

seth_wilde




msg:818500
 4:29 am on Aug 10, 2000 (gmt 0)


Hi Michael, I would first like to welcome you to WebmasterWorld!

All right, now it's time to get down to business. (I've put any additional quotes from the paper in parentheses, so they won't get confused with your quotes.)

"There is no intelligent way to determine if the content of a Web page is authoritative from the inbound links. Someday someone is going to explain it to these people and they will understand why surfers get frustrated with their irrelevant results"

What I've seen on AV totally goes against this. Since they have implemented their new technology, the relevancy of their search results has greatly improved and you see much less spam.

"Let's look at the nonsense posted here: [www9.org...]

Before anyone gets too worried, I would like to point out that this is just a research paper about how to improve "page rank" (written by some professors at the University of Toronto). There is no evidence that these methods are currently used or ever will be.

"Furthermore, no link to any Web page is an indication of the accuracy of the information of that page. All it indicates is that one human being was willing to link to another human being's Web site."

They address this issue; in fact, that's the reason they wrote this paper. They realized the flaws in Google's "page rank" and are trying to improve upon it by only using pages with relevant content when scoring link popularity. This logic might not be perfect, but it definitely decreases the effectiveness of link clubs and press releases. And you can't deny that if a site is about cats and it has a thousand other sites about cats linking to it, it's probably going to be a good authority on cats.
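To make that concrete, here is a rough sketch in Python (my own toy illustration, not the paper's actual algorithm) of what "counting only relevant links" might look like. The keyword-overlap test and the 0.3 threshold are invented for the example; the paper uses more elaborate term-vector measures.

# Toy sketch: count only inbound links from topically similar pages.
def topical_link_score(target_terms, linking_pages):
    relevant = 0
    for terms in linking_pages:
        # fraction of the target's terms the linking page shares
        overlap = len(target_terms & terms) / len(target_terms)
        if overlap >= 0.3:  # invented relevance threshold
            relevant += 1
    return relevant

cat_site = {"cats", "breeds", "kittens", "care"}
linkers = [{"cats", "kittens", "food"},   # on-topic link, counts
           {"poker", "casino"},           # off-topic link, ignored
           {"cats", "breeds", "photos"}]  # on-topic link, counts
print(topical_link_score(cat_site, linkers))  # -> 2

A link club or press release page that shares no vocabulary with the cat site simply never gets counted.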

Below I have included the excerpt from the paper that talks about what links actually mean:

(However, there are some difficulties in formalizing the concept of "reputation" effectively. The assumption that links are endorsement suggests that the number of incoming links of a page indicates its reputation. But in practice, links represent a wide variety of relationships such as navigation, subsumption, relatedness, refutation, justification, etc. In addition, we are interested not just in the overall reputation of a page, but in its reputation on certain topics.)

""In the setting of the Web, our assumption that a page has at least either one incoming link or one outgoing link may not hold. However, since we are dealing
with collections of pages collected by crawling, we feel justified in assuming they all have at least one incoming link."

WRONG. There is absolutely no justification for this secondary assumption. New Web pages are regularly submitted to the search engines by Webmasters to be crawled. The engines do not get to those pages through other pages. The entire concept breaks down because of
this one stupid assumption that anyone who has created a Web page and submitted to the search engines knows is not true."

I agree with you that this reasoning is slightly flawed because you will miss some good sites. But from the search engines' perspective I think it works well. Think how easy this makes it for them to filter out countless doorway pages. Besides, I think that most good sites do have at least a few links to them, and we can probably live without most of the ones that don't. (Inktomi and Google have been doing this for some time now.)

"Since I cannot prove that, all I can say is that this test has questionable value. They have not validated the meta search engine they selected (let alone their selection methodology). The standard of completeness applied to the experimentation is thus very minimal."

I'll agree this experiment doesn't have a whole lot of value. But I don't think it was designed to. Basically, all they did was compare the results from a meta search engine using similar authority methodology to their own results and found that they matched up. Nothing groundbreaking; I guess they found it reassuring.

"Their last test went after Computer Science departments on the Web. So they made absolutely no effort to validate their assumptions against real-world conditions or anything approximating them. In fact, they chose a data set which was most likely to produce the kind of validation they were seeking: Web sites devoted to or created by computer scientists.

It doesn't take a genius to see that these results are stacked and obviously stacked in a very amateurish way. Even the people who claimed to have achieved cold fusion weren't this sloppy and unprofessional in their research"

I don't think these results are stacked at all. Analyzing the homepages of professors is as good as anything else. And they did get results contrary to what these professors are actually known for. I think if the results were stacked, they would have come out more in the paper's favor.

(The results, as shown in Figure 4, can be revealing, but need to be interpreted with some care. Tim Berners-Lee's reputation on the "History of the Internet," Don Knuth's fame on "TeX" and "Latex" and Jeff Ullman's reputation on "database systems" and "programming languages" are to be expected. The humour site Dilbert Zone [6] seems to be frequently cited by Don Knuth's fans. Alberto Mendelzon's high reputation on "data warehousing," on the other hand, is mainly due to an online research bibliography he maintains on data warehousing and OLAP in his home page, and not to any merits of his own.)

Well, I think that's all I have time to say right now, but I think you should check out this paper [www9.org...] It was co-written by AV engineers and is much closer to what AV is using than this experimental paper.

Michael Martinez




msg:818501
 5:16 am on Aug 10, 2000 (gmt 0)

Sorry, Seth. I had so many windows open that I got confused on who I was responding to. I'm changing the "Brett" to "Seth" where I find it.

>"There is no intelligent way to determine if the content
>of a Web page is authoritative from the inbound links.
>Someday someone is going to explain it to these people and
>they will understand why surfers get frustrated with their
>irrelevant results"

>What I've seen on AV totally goes against this. Since they
>have implemented their new technology the relevancy of
>their search results has greatly improved and you see much
>less spam.

Wish I could say the same, Seth. I've found a little more time to do more searches on Alta Vista today. They have blown their reliability out the door with whatever changes they made recently. If they are implementing the system described in that nonsense paper, that would explain why.

>"Furthermore, no link to any Web page is an indication of
>the accuracy of the information of that page. All it
>indicates is that one human being was willing to link to
>another human being's Web site."

>They address this issue; in fact, that's the reason they
>wrote this paper. They realized the flaws in
>Google's "page rank" and are trying to improve upon it by
>only using pages with relevant content when scoring link
>popularity. This logic might not be perfect, but it
>definitely decreases the effectiveness of link clubs and
>press releases. And you can't deny that if a site is about
>cats and it has a thousand other sites about cats linking
>to it, it's probably going to be a good authority on cats.

This DOES NOT WORK, Seth. It never has, it never will. How many examples (from Alta Vista's own database) would you like me to post to make my point?

I have expert knowledge in several fields. I can easily debunk the authoritative ranking of numerous pages in Alta Vista's search results. Why do you think I keep going past the 1st, 2nd, and 3rd pages?

Just because 1000 people link to a page doesn't mean the person who created the page knows what they are talking about. Don't make the mistake of buying into this nonsense.

I understand the search engines believe in it heart and soul, but look at the lousy way these guys are validating their theories. That's like walking into a poker game with a marked deck and demanding to be the dealer.

Here's a topic you've never seen me discuss at the other forum, and I have no Web pages on this topic, so at the very least people cannot accuse me of being upset at not getting a top ranking: "celtic history".

Search for it on Alta Vista (no quotes) and the first page to come up (on the search I've just run -- I understand there may be some variation over time due to several factors) is [wwwvms.utexas.edu...] (titled "Irish History on the Web").

Hey, the Irish are Celts, but there is a lot more to Celtic history than Irish history. Look at the page. You'll see it's pretty focused on Irish history. There is something wrong with an algorithm that puts an Irish history page first in a search for "celtic history".

Number 3 on the list is [ancientsites.com...] which is, I'm sure, a very pretty site about ancient peoples, but it's NOT about "celtic history".


Number 4 is [gliffaeshotel.com...] a hotel's homepage. Now THERE is a lesson in Celtic history if ever I've seen one!

Where are the "celtic history" pages?

"Celtic Connections" comes in at number 6. So much for relevance.

Let's try another search, again for a topic on which I DO NOT have any Web pages: "business basic". This is a programming language I worked with for 20 years. There are probably fewer than 200 Web sites devoted to the subject, but several of the language vendors (including one of my former employers) have Web sites. Who comes up first? [bizinc.com...], a site which has nothing to do with "business basic". Of the first ten results, five have nothing to do with "business basic" (number 2 is the home page for an independent business basic resource -- they got lucky).

This game is too easy to play. If the system really worked, then it wouldn't be so easy to find examples that contradict the claims of its success.

This particular paper does not address the fundamental failures of the basic concept. The "authority" assigned to pages based on linking relationships is at best an arbitrary authority which has nothing to do with relevance to the topic.

>""In the setting of the Web, our assumption that a page
>has at least either one incoming link or one outgoing link
>may not hold. However, since we are dealing
>with collections of pages collected by crawling, we feel
>justified in assuming they all have at least one incoming
>link."

>WRONG. There is absolutely no justification for this
>secondary assumption. New Web pages are regularly
>submitted to the search engines by Webmasters to be
>crawled. The engines do not get to those pages through
>other pages. The entire concept breaks down because of
>this one stupid assumption that anyone who has created a
>Web page and submitted to the search engines knows is not
>true."

>I agree with you that this reasoning is slightly flawed
>because you will miss some good sites. But from the search
>engines' perspective I think it works well. Think how easy
>this makes it for them to filter out countless doorway
>pages. Besides, I think that most good sites do
>have at least a few links to them, and we can probably live
>without most of the ones that don't. (Inktomi and Google
>have been doing this for some time now.)

Seth, it DOES NOT WORK. The reasoning is not SLIGHTLY flawed, it's majorly flawed. What they are filtering out is more than just doorway pages. You and I both know other factors have to be taken into consideration to keep more than just the old-style doorway pages from being knocked out of the database. The assumption stated above is completely wrong. Fallacious logic like this explains why we have so much trouble finding stuff on the search engines.

>"Their last test went after Computer Science departments
>on the Web. So they made absolutely no effort to validate
>their assumptions against real-world conditions or
>anything approximating them. In fact, they chose a data
>set which was most likely to produce the kind of
>validation they were seeking: Web sites devoted to or
>created by computer scientists.

>It doesn't take a genius to see that these results are
>stacked and obviously stacked in a very amateurish way.
>Even the people who claimed to have achieved cold fusion
>weren't this sloppy and unprofessional in their research"

>I don't think these results are stacked at all. Analyzing
>the homepages of professors is as good as anything else.
>And they did get results contrary to what these professors
>are actually known for. I think if the results were stacked,
>they would have come out more in the paper's favor.

No, Seth. What they analyzed was a closed system. Analyzing a closed system doesn't provide any useful insight into the structure of an open system. This paper should not survive peer review. Its assumptions are invalid, its testing methodology is faulty, and the conclusions are weak.

You can't just run around and say, "Oh, we checked this against a group of pages that link to each other, so we know it works." You have to check it against a scientifically valid sample (theirs was way too small).

Solid research doesn't just toss out assumptions like they are indisputable facts. It establishes what the facts are and ensures they cannot be disputed.

I'll look at the AV engineer paper and get back to you. But the experimental one is pure tripe. There is virtually nothing scientifically valid in it.

Edited by: Michael Martinez

Michael Martinez




msg:818502
 5:51 am on Aug 10, 2000 (gmt 0)

[www9.org...]

Initial comment: This appears to be a fairly well-written white paper. But if that is all it is, it reveals nothing useful for analysis of their methodology.

Problem number 1:
"We eliminate the least frequent third because they are noisy and do not provide a good basis for measuring semantic similarity. For example, one such term is hte, a misspelling of the. This term appears in a handful of pages that are completely unrelated semantically. However, because this term is so infrequent, its appearance in term vectors makes those vectors appear to be quite closely related."

They have just killed themselves on jargon-based content. They already conceded that this methodology only works on English-language sites because they are using Porter stemming.

This is not a fatal flaw in the argument, however.

Problem number 2:
"Term selection. We select terms for inclusion in a page's vector by the usual Salton TF-IDF methodology (see, for example, [Baeza-Yates et al 99]). That is, we weight a term in the page by dividing the number of times it appears in the page by the number of times it appears in the collection. For a given page, we select the 50 terms with the greatest weight according to this formula."

They do not define what constitutes "the collection". This makes the paper weak (it should not pass peer review) but it's not necessarily a fatal flaw. They just need to add the definition to provide some clarity.

Term Weight = page occurrences / collection occurrences

The 50 top terms in each page get selected. I can see how this would defeat keyword analysis-driven design. Analyzing keywords on a single page no longer matters (except in 1-page collections -- but what is a "collection"?).
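To spell out the rule as I read it, here is a toy sketch in Python (my illustration, not their code). Note that everything hinges on what you feed in as "the collection":

from collections import Counter

# Quoted rule: weight = (occurrences in page) / (occurrences in
# collection); keep the 50 heaviest terms per page.
def term_vector(page_tokens, collection_counts, k=50):
    page_counts = Counter(page_tokens)
    weights = {t: n / collection_counts[t] for t, n in page_counts.items()}
    return dict(sorted(weights.items(), key=lambda kv: -kv[1])[:k])

docs = [["business", "basic", "news"],
        ["business", "reports"],
        ["basic", "tutorial", "basic"]]
collection_counts = Counter(t for doc in docs for t in doc)
print(term_vector(docs[2], collection_counts, k=2))
# -> {'tutorial': 1.0, 'basic': 0.666...}

Notice how the collection-rare term dominates the vector; that is the same effect that made "hte" noisy, and presumably why they throw out the least frequent third.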

Their weighting definition limits the scope of the database they are describing with respect to an analysis of how the search engine functions. They imply there is a layer of applications running against the database. The applications thus apply their own algorithms, which can be anything.

The API statement says they are using unique integer identifiers for the Web pages. They are returning lists of identifiers that point to and are pointed to by any given URL.
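If I had to guess at the shape of that interface, it would be something like this (purely my guess from their description, not their actual API):

# Hypothetical sketch of a connectivity store: URLs get unique
# integer identifiers, and you can ask who points to, and who is
# pointed to by, any given URL.
class ConnectivityStore:
    def __init__(self):
        self.ids = {}       # URL -> unique integer identifier
        self.inlinks = {}   # id -> ids of pages linking to it
        self.outlinks = {}  # id -> ids of pages it links to

    def url_to_id(self, url):
        return self.ids.setdefault(url, len(self.ids))

    def add_link(self, src, dst):
        s, d = self.url_to_id(src), self.url_to_id(dst)
        self.outlinks.setdefault(s, []).append(d)
        self.inlinks.setdefault(d, []).append(s)

    def pointed_to_by(self, url):  # who links to this URL?
        return self.inlinks.get(self.ids.get(url), [])

    def points_to(self, url):      # where does this URL link?
        return self.outlinks.get(self.ids.get(url), [])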

In the data representation they say they are throwing out terms which don't fit into the 128-bit storage format, so they are reducing their 1/3 "middle terms" (hopefully by an insignificant fraction, but so far no algorithm has been outlined, let alone detailed as a function).

The software components section refers back to the definitions section.

Okay, the experience section says they lost under 2% of the vectors during the build, so the data format is fairly efficient.

TOPIC DISTILLATION

This is where we come to the meat. Unfortunately, they refer you to several examples of topic distillation algorithms and don't tell you which one they are using.

On the other hand, they say:
"These algorithms work on the assumption that some of the best pages on a query topic are highly connected pages in a subgraph of the web that is relevant to the topic."

In Computer Science we have an old saying. When you ASSUME something, you make an A** out of U and ME.

They don't even bother to validate this assumption.

The methodology fails right here at this point. They seeded the results with selections from search engine results. Since they aren't even trying to vet the seed data, the whole ranking scheme is defeated.
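For context, the hub/authority algorithms they cite (Kleinberg's HITS and its variants) boil down to an iteration like this rough sketch (my own simplification). The thing to notice is that the scores only rank pages relative to the seed subgraph, so unvetted seeds in means untrustworthy "authorities" out:

def hits(pages, inlinks, outlinks, rounds=20):
    # Kleinberg-style iteration: good authorities are pointed to by
    # good hubs, and good hubs point to good authorities.
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(rounds):
        auth = {p: sum(hub[q] for q in inlinks.get(p, []) if q in hub)
                for p in pages}
        hub = {p: sum(auth[q] for q in outlinks.get(p, []) if q in auth)
               for p in pages}
        for scores in (auth, hub):  # normalize so values stay bounded
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub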

It's humanly impossible to vet 1,000,000,000+ Web pages. I understand that. I also understand that if you seed a vetting algorithm (which is what this is attempting to be) with unvetted data, you're going to get untrustworthy results.

And that's what you get with Alta Vista right now.

The CLASSIFYING WEB PAGES section further invalidates the methodology. They are using Yahoo!'s top twelve categories for their subject definitions. Yahoo! is hardly the leader in document indexing science. I wonder if they are even academically accepted in document management programs.

Furthermore, they actually state they are using Yahoo! data to establish their seed terms. Doesn't this fact send the alarm klaxons screaming down the hallways for anyone?

Since when is Yahoo! a reliable source of information on what the Web looks like?

Let me demonstrate the serious flaw in this philosophy. I have over 60 well-defined, rich content Web sites on my primary domain. Yahoo! lists a total of FOUR of them. Four. If they're going to screw me (and two of my FOUR sites are listed as COOL sites), then they are certainly going to screw a lot of other people.

So the Yahoo! data sampling is not relevant to mapping the reputation of the Web. That's the most absurd assumption I've seen yet in either of these two papers.

My analysis ends here. There is no further point in trying to salvage anything useful from this paper. It clearly demonstrates WHY Alta Vista is not returning valid result sets.

Michael Martinez




msg:818503
 6:01 am on Aug 10, 2000 (gmt 0)

Brett,

>Hey MM, thanks for dropping by.

I hope I don't make you regret the invitation. :)

>You're right, research publication and studying the makeup
>of the bow tie web is all the rage with the cap & gown
>set. The whole thing we were interested in, in this
>thread, was the fact that so many of those docs mentioned
>were by current admins at the SEs. Broder, Page, et al.
>That is good background on their line of thinking.

I'm sure they are very intelligent, well-educated people. But I'm not going to soft-sell my criticism of their choices.

On the other hand, my extremely belligerent disagreement with their methodologies doesn't mean I feel I have anything better to offer. I just don't believe in using a bad system because it's the best of the bad systems.

>Some of them right or wrong? It really doesn't matter
>since they are going to do as they choose - we are just
>trying to figure out where it is all going.

Agreed. Whatever they do, we have to live with it.

>Like that one on calculating term vectors, or figuring out
>how to determine a site's topic solely off a handful of
>links. It can have an effect on how we optimize and run
>our sites. No, we are not going to crack any algos reading
>term papers for a conference, but it does give us a slight
>road map. That one on topic distillation is very
>informative stuff and probably 90% of the reason I shut
>down bl - it is good real-world applicable information.

Well, my initial reaction to the two papers I've read is that I could probably get back my high rankings by moving a majority of my content to secondary domains and linking between those domains and the main index pages for my content sites. That would not be "spamming" the index by any definition Alta Vista uses, but it would surely be a royal pain in the you-know-what for me.

I have a sneaking suspicion that they are defining a "collection" to be anything from a domain to an "account" within a domain. Maybe a sub-domain also qualifies as a collection.

I am not pleased with what I have learned here today.

seth_wilde




msg:818504
 6:19 am on Aug 10, 2000 (gmt 0)

"Wish I could say the same, Seth. I've found a little more time to do more searches on Alta Vista today. They have blown their reliability out the door with whatever changes they made recently. If they are implementing the system described in that nonsense paper, that would explain why."

I guess the reliability issue is a matter of opinion. AV's current system is by no means perfect, but they've made tremendous progress and I don't remember it ever being more accurate.

As far as AV using the system in this paper, they don't, and I don't think they ever will. (Term vector databases make it much easier to separate relevant and non-relevant links.)

"This DOES NOT WORK, Seth. It never has, it never will. How many examples (from Alta Vista's own database) would you like me to post to make my point?"

Link popularity on its own doesn't work. It's when it's determined only by relevant links and combined with themes that it starts to work. Although we can sit here all night and still disagree on this part, can you think of a better system? We already know that just indexing pages is way too easily abused by spammers; mining the link structure of the web is the next step. It may not be perfected yet, but it's still the next step.

"Just because 1000 people link to a page doesn't mean the person who created the page knows what they are talking about. Don't make the mistake of buying into this nonsense."

It doesn't mean that they know anything, but neither does any other ranking method, so what makes this worse? Besides, if 1000 people link to a site that has similar content to their own (assuming these are true links), wouldn't you expect they found some value in that site's content? (Even if the site author was full of BS.)

"Let's try another search, again for a topic on which I DO NOT have any Web pages: "business basic". This is a programming language I worked with for 20 years. There are probably fewer than 200 Web sites devoted to the subject, but several of the language vendors (including one of my former employers) have Web sites. Who comes up first? [bizinc.com...] a site which has nothing to do with "business basic". Of the first ten results, five have nothing to do with "business basic" (number 2 is the home page for an independent business basic resource -- they got lucky)."

OK, if we're going to talk about relevancy, you can't just count what you had in mind when you performed the search. Sure, bizinc.com may not be about business basic programming, but their company name is Business Basic Inc.; it's not hard to see how AV picked that site. In fact, the first 3 results all have business basic in their company/site name. This is probably more of an issue of themes (company name on every page) than an issue of link popularity.

"No, Seth. What they analyzed was a closed system. Analyzing a closed system doesn't provide any useful insight into the structure of an open system. This paper should not survive peer review. Its assumptions are invalid, its testing methodology is faulty, and the conclusions are weak."

I guess I don't understand what you find "closed" about this system. Sure, they're professors' personal homepages and probably have a defined userbase, but I think this is true of most personal homepages. Anyone can view these pages and anyone can link to these pages. What's closed about this?

"Solid research doesn't just toss out assumptions like they are indisputable facts. It establishes what the facts are and ensures they cannot be disputed."

I don't think that was the point of this paper. I think it was more about recognizing that "page rank" is flawed, and trying to find ways to improve it. They weren't trying to establish indisputable facts; they were trying to come up with new ideas.


seth_wilde




msg:818505
 6:31 am on Aug 10, 2000 (gmt 0)

MM, you're just too damn fast :) Every time I post something, you've already posted another...

It's getting pretty late; I'll have to see if I have time to debate more tomorrow.

Anyway I'm glad that you learned some new info, even if you didn't like it :) Even though we disagree about some of the methodology behind it, you'll have to stop by more often and help us find out how to make it work in our favor.

Michael Martinez




msg:818506
 6:53 am on Aug 10, 2000 (gmt 0)

"Although we can sit here all night and still disagree on this part, can you think of a better system?"

Yes. And they used to have one. All the search engines did. Unfortunately, that system was vulnerable to abuse.

Search engines today are not trying to just index the content. They are also trying to filter it. This harebrained vector/term/link thingee isn't really about getting the best Web sites to the surfer; it's about keeping the surfer away from the spam.

I'm a writer. I produce content. I produce it every day. I don't have time to play these nonsense games with the search engines' filters and ranking algorithms. I provide good, high-quality content, and most of it will never be linked to by hundreds or thousands of Web sites. I have to depend on the main index page for my domain to get those links (and on Alta Vista, over 2,000 sites link to my domain).

You don't really think I'm the only person being hurt by this system, do you?

I was getting better search results last week from Alta Vista. I wouldn't be here now if I were still getting those results. I use the search engines every day to help me with my research. I beat the search engines to death. I know where I'm getting good results. Since I've relied upon Ixquick more and more lately it didn't matter that Alta Vista was dropping out of my search results. Other engines and directories fill the void.

But when I noticed my own rankings on Alta Vista were dropping (but fortunately having virtually no effect on my overall traffic), I became concerned that something had changed which might have long-term effects on my traffic that I need to understand. So, here I am, looking for answers.

Now, to address individual points:

>"Just because 1000 people link to a page doesn't mean the person
>who created the page knows what they are talking about. Don't
>make the mistake of buying into this nonsense."

>It doesn't mean that they know anything, but neither does any
>other ranking method, so what makes this worse? Besides, if 1000
>people link to a site that has similar content to their own
>(assuming these are true links), wouldn't you expect they found
>some value in that site's content? (Even if the site author was
>full of BS.)

I run my own topic-specific directories, Seth. I know the popular sites for many topics. Most of them, believe it or not, are "fluff" sites. They don't have much in the way of content. What they DO have are good graphics and wowsy designs. Oboy.

Judging a Web site by the number of people who link to it is like judging the intelligence of a person by the number of people who pick their picture off the wall. The most attractive pictures will get picked most often. The most attractive Web sites will also get picked most often.

What's popular doesn't mean anything. There are indeed better algorithms for ranking sites by content. Quality can never be determined by an algorithm. You have to let the surfer decide that for him or herself. The surfer knows better than the programmer what he or she is looking for, but he or she may still be fooled by whatever is served up. Helping to fool the people isn't helping them at all.

>OK, if we're going to talk about relevancy, you can't just count
>what you had in mind when you performed the search.

Then what is the point in searching? If anything is relevant just because the search engine says so, there is no point in looking for anything.

>Sure bizinc.com may not be about business basic programming, but
>their company name is Business Basic Inc. it's not hard to see
>how AV picked that site.

But no "business basic" sites are linking to it. NONE. Why should they? "Business Basic Inc." doesn't have anything to do with the programming language.

You've got these two papers that say the links coming and going are evaluated, so why did bizinc fool the system?

It's not a themes issue, it's a gaffe. There are no themes about business basic on "My Music Page", "links", and "lac la biche links" (the titles of the first three pages in the "link:http://www.bizinc.com/" search).

>"No, Seth. What they analyzed was a closed system. Analyzing a
>close system doesn't provide any useful insight into the
>structure of an open system. This paper should not survive peer
>review. Its assumptions are invalid, its testing methodology is
>faulty, and the conclusions are weak."

>I guess I don't understand what you find "closed" about this
>system. Sure, they're professors' personal homepages and probably
>have a defined userbase, but I think this is true of most
>personal homepages.

It's not true of the major search engines. They ran the ranking algorithm test on a database that by the largest measurement contained no more than 1800 sites, most of which (if not all of which) had something to do with computer science. The system is "closed" in that it's a topic-specific system. You can't predict how the algorithm will work in an open system (like the full Alta Vista database, which the other paper used) based on results from a closed system.

All that testing on a closed system shows is that the algorithm works on the closed system. They have essentially defined an algebra: a data set with its own rules.

>"Solid research doesn't just toss out assumptions like they are
>indisputable facts. It establishes what the facts are and
>ensures they cannot be disputed."

>I don't think that was the point of this paper. I think it was
>more about recognizing that "page rank" is flawed, and trying to
>find ways to improve it. They weren't trying to establish
>indisputable facts; they were trying to come up with new ideas.
>Anyone can view these pages and anyone can link to these
>pages. What's closed about this?

The test data is closed. And, yes, they are trying to put a band-aid on a gushing wound. They won't staunch the flow of blood by trying to fix a flawed algorithm as long as they continue to use flawed assumptions.

First they need to vet their assumptions and turn them into axioms. If they try and fail to do that, then they'll be on the path to finding something that actually works.

DrCool




msg:818507
 6:23 pm on Aug 10, 2000 (gmt 0)

I think one thing we need to realize is that the search engines have a different definition of the "best" page for any particular search than we may have. We might be looking for an informative site with pages and pages of text on a subject, but the next person down the road will do the same search and look for a fancy page with lots of pictures and animation.

We must remember that the largest percentage of web users is from AOL. Generally speaking, AOL users are, how should I say it, on the low end of the internet knowledge scale? They won't be looking for huge technical documents on, for example, the technology that goes into encoding a DVD movie. They will be looking for a site that sells DVDs. The search engines must focus their results around returning the results the majority of the people will be looking for. If they only cater to the few of us who have been enlightened by the Internet gods, they won't be in business for long.

Using this thinking, the search engines are somewhat justified thinking that the site with the most and best links will be the one that the majority of people find useful. If, for example, I have a site dealing with "The Lord of the Rings", and it has some cute pictures, some movie information, a history of the books, etc., and it was an all around decent, simple site, many people could find the information helpful and useful as well as like the visual aspect of the site. In turn, they would link their personal home page to the site because they want to share it with others. On the other hand, I could have a much more informative site with pages and pages of text, much more information on the movie cast, and in general tons of information. This page would probably be a more complete and relevant site for "The Lord of the Rings", but it could very easily be over the head of the average internet user. In turn, fewer people would link to it because they didn't find it useful. If I am the search engine I want to appeal to the masses so my thinking would be that the first site will provide more to my users than the second site so I would list it first.

I do not agree with this thinking but then again I am not running a multi-billion dollar search engine with millions of users per day. The bottom line is the search engines need to appeal to the largest market sector they can and please those people as often as they can. In many ways it makes me think that the researchers are finding the most popular (not best) sites on a number of topics and reverse engineering their algorithms to suit those sites.

Michael Martinez




msg:818508
 6:45 pm on Aug 10, 2000 (gmt 0)

>I think one thing we need to realize is that the search
>engines have a different definition of the "best" page for
>any particular search than we may have. We might be
>looking for an informative site with pages and pages of
>text on a subject but the next person down the road will
>do the same search and look for a fancy page with lots of
>pictures and animation.

If you're politely suggesting I may be coming down a little too hard on Alta Vista, then I have to concede I AM being harsh on them. I don't like creating content only to have it buried by an algorithm that is obviously not going to bring real content to the top.

I only ask for 1 top ten listing. In my most precious search phrase, "lord of the rings movie", they give the 1 and 2 slots to the same domain, and I've been bumped to number 12 by a couple of irrelevant sites and a couple of 404 sites. They need to spend less time crunching meaningless numbers and more time cleaning up the database.

>We must remember that the largest percentage of web users
>is from AOL. Generally speaking, AOL users are, how should
>I say it, on the low end of the internet knowledge scale?

[snip]

Ten per cent of my visitors come from AOL. I understand what you are trying to say, but AOL has managed to build some user loyalty despite the high turnover rates they reported during their years of wild growth. The average AOL user may still be less sophisticated than the average Internet user, but they learn.

And many of them are, in fact, technically-trained and oriented people. AOL is the heaviest hitter in the online service marketing game. I don't get flooded with free disks by their competitors.

So, catering to AOL users is not catering to the DVD market (most people looking for music on the Internet are trying to get it for free anyway).

There is simply no justification for using links to determine which site is the best or better in a topic.

Anyone can create a Web page. But you have to be an expert marketer to get your Web page to rank highly on the search engines. What you know about the topic of your Web page, how thoroughly your page covers its topic ... these factors don't matter.

What the search engines should be doing is bringing people to the content sites, not to the link-popular sites. There is no correlation between content and link popularity.

DrCool




msg:818509
 6:53 pm on Aug 10, 2000 (gmt 0)

I agree that there is no correlation between links and content but there is a correlation between links and the sites the majority of the web users find useful.

Michael Martinez




msg:818510
 7:20 pm on Aug 10, 2000 (gmt 0)

That correlation is due to the fact that those links help the surfers find the sites they deem useful. Most people, not being expert in any given subject, are not very discriminatory about what they find on their searches as long as whatever they find appears to be authoritative.

A cute example from Lord of the Rings fandom:

Last week the Sunday Times (of London) ran a story on Cate Blanchett, the actress who will play Galadriel in Peter Jackson's "Lord of the Rings" movies. As part of the article, they provided a brief summary of Galadriel's character.

The summary was lifted from a Web page designed to fool students looking for book report material on the Web. This page comes up with a high ranking on many search engines when you type in "The Lord of the Rings" (or some variation thereof).

The news article was literally laughed off the Web. Tolkien fans and fan sites around the world pointed out that the reporter had cited an obviously wrong account. The Sunday Times took the article down within a matter of days.

Unfortunately, a New Zealand newspaper recently reprinted the article in toto, unaware of the gaffe.

seth_wilde




msg:818511
 8:57 pm on Aug 10, 2000 (gmt 0)

"That correlation is due to the fact that those links help the surfers find the sites they deem useful. Most people, not being expert in any given subject, are not very discriminatory about what they find on their searches as long as whatever they find appears to be authoritative."

You have to understand that this is the best method technologically possible (at this time). They have to rely on surfers determining useful sites because there is no possible way for a computer to do it.

Michael Martinez




msg:818512
 10:13 pm on Aug 10, 2000 (gmt 0)

>You have to understand that this is the best method
>technologically possible (at this time). They have to rely
>on surfers determining useful sites because there is no
>possible way for a computer to do it.

But it is NOT the best method technologically available. It is the WORST method technologically available. This methodology is to document indexing what the Windows operating system is to computer management: the bottom of the barrel.




seth_wilde




msg:818513
 10:54 pm on Aug 10, 2000 (gmt 0)

"But it is NOT the best method technologically available. It is the WORST method technologically available. This methodology is to document indexing what the Windows operating system is to computer management: the bottom of the barrel."

OK, I know you like the old system better, but it already had its chance. It filled up with spam and gave out ten times less relevant results. From what I can see, your absolute hatred for the new system leaves you in the minority.
[webmasterworld.com...]

Anyway, I can see that we are both way too stubborn to have our minds changed :) So I suppose I'll move on to something more productive.


Michael Martinez




msg:818514
 11:25 pm on Aug 10, 2000 (gmt 0)

Being in the minority doesn't make one wrong any more than being in the majority does.

At the very least, I have shown I base my objections to the decision on sound analysis of the methodologies' weaknesses and on the fact that current results from Alta Vista are provably untrustworthy.

Unless Error 404 and irrelevant pages are what we should all be searching for.

I don't believe in buying into fad programming, and that is all the current methodology amounts to.

chiyo




msg:818515
 3:24 am on Aug 13, 2000 (gmt 0)

Michael is right, of course, that keyword weighting is, in a perfect world, a good way to determine relevance. But it is FAR too easy to spam, as Michael acknowledges himself. External links are an imperfect method, but they are less easy to spam and spamming efforts are more easily corrected (banning links-to-you type efforts, for example).

But rather than debating what is better, surely the smartest indexes are now using a combination of both keyword analysis AND link analysis. Both Google and AV do that, and Inktomi, with varying success. I am sure there will be other ways developed in the future also to maximise relevance. Theming seems to be very big at AV (we ourselves have suffered as a consequence), so that's a third way. And incorporating the authority vested by human directory review, such as using Yahoo!, LookSmart and ODP, is a fourth way.

ALL are imperfect methods, but together they reduce spam and increase relevance.

Now, the challenge is how to make each method more intelligent and finally how to put it all together, rather than concentrating on which method is theoretically best.
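As a toy illustration of "putting it all together" (the weights here are invented for the example, not any engine's real numbers):

def combined_score(keyword, link, theme, in_directory,
                   weights=(0.4, 0.3, 0.2, 0.1)):
    # Blend keyword relevance, link analysis, theming, and human
    # directory review into one ranking score. Each input signal is
    # assumed to be normalized to the range 0..1.
    signals = (keyword, link, theme, 1.0 if in_directory else 0.0)
    return sum(w * s for w, s in zip(weights, signals))

print(combined_score(keyword=0.8, link=0.6, theme=0.9, in_directory=True))
# -> 0.78

The hard research question is then how to set those weights intelligently, which is exactly the challenge above.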

Now, the move to specialist engines which focus on an authoritative specialization in an area, rather than mega-databases, is my take on where searching behaviour is heading. The field of searching should be, and my feeling is it is, getting decentralized. People in future may not go to one mega-index but will go to the engines which they trust and which address their search task for the specific question.


Michael Martinez




msg:818516
 7:32 am on Aug 16, 2000 (gmt 0)

Unfortunately, I think that spam will always rear its ugly head. And link popularity can be spammed. Easily spammed. I've yet to see the link popularity-based engines filter out the serious LP spam.

The problem is that it is so easy to get other people to link to your Web site, and I've seen major Web sites open up and link to non-major sites in several categories. You just cannot trust the results of searches which give weight to this kind of relationship. As a long-time search engine user, I know to look beyond the first page of results (heck, at Alta Vista, I often click on page 20 just to see what they've moved back). But the bean-counter guys seem to be saying that frustrated searchers just move from search service to search service.

I read somewhere that someone was suing the Realtime Blacklist people, the folks who maintain a list of domains that they allege supported email spammers. If that's the case, then I suppose there is no hope of the search engines maintaining public blacklists of spammy sites. It seems to me that only identifying known spammers will keep them out of the databases, but it's not legally safe for the search services to share data on who is abusing the system.

We need a better system. Settling for second best is not, in my opinion, the way to go. Webmasters should speak out, criticize the decisions being made, and press for continued research and innovation. But link popularity is the wrong way to go. It simply doesn't work, and there is no scientific merit to the scheme.

seth_wilde




msg:818517
 3:11 pm on Aug 16, 2000 (gmt 0)

continued here [webmasterworld.com]
