Stemming and keyword "families"

grouping keywords into categories

         

Marcia

10:08 am on Mar 13, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I understand some search engines can recognize keyword relationships and relevancies within "families" or broad groupings.

For example, beef, lamb and pork are all meat; chicken, turkey and duck are all poultry; and together they fall into the meat/poultry or protein category. Broccoli and carrots are vegetables, and apples and oranges are fruit, but they are all produce. Put together, this is all food, connected with recipes and cooking, and broadly related to "home" when we're thinking about search terms and broad categories, or to "family" when thinking about what types of sites this info would be found on.

Couple of questions:

Which of the search engines are capable of recognizing search terms within more broadly inclusive categories, possibly on a hierarchical basis, and

With which of the search engines does it matter? With which of them does the algo take into consideration the broader categories and the resulting inbound and outbound links? Are links from food sites good enough for meat or beef sites?

This would be related not only to theming of web sites but, importantly, to choosing the list of keywords to use on sites so that they will be relevant and make a site easy to find for the searchers themselves. It could also help in devising an effective linking strategy, pruning link possibilities down to disregard those that are not relevant, and perhaps broadening the scope of possibilities for finding linking partners for links that might actually be mutually productive.

Taking it a step further, food (or the subcategories) could relate to shopping - as in marketing - the Piggly Wiggly kind, not the Madison Avenue kind. Another whole issue.

Which of the search engines is capable of telling the difference between marketing as in going to the fish market or marketing as in viral marketing? Which is capable of telling the difference between fishing for trout or fishing for answers?

This is just using a simple everyday example - but choosing keywords is so difficult, there has to be more theory behind doing it right.

Basically, this is related to thinking about a site with a broad general category, with "subdivisions" for more specific sub-categories.

heini

12:58 am on Mar 14, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for this post Marcia
I was thinking about this very topic the last few days in the European engines forum. How do search engines group related words together for theming? How good are they really at this? The starting point over there was whether engines would be able to group words together even across language barriers, which is an interesting question for the link strategy of non-English sites. To get some answers on that, we need a better understanding of how this grouping of words works at all.
I really do hope for some enlightenment here!

Marcia

3:11 am on Mar 14, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>I really do hope for some enlightenment here!

Me too, heini. What I neglected to mention is how categories can be confused with derivatives of some words. We know what holiday crafts, paper crafts or wood crafts are, but the singular word "craft" could also mean Wicca. The derivative "craftsman" could refer to Craftsman tools (put out by Sears). Craftsmanship can apply not to little artsy stuff but to the manufacture of household furniture.

So getting into stemming can impact the relevancy issue, and it would also help to know which search engines use it, as well as how it's figured in.

It's easy enough to pick out the keywords Google highlights in the cache, but not quite so easy to figure out the way they are giving weight to issues related to keyword "families" that we're talking about here. This would also be an issue with Ink and AV, particularly in how they are weighting links.

sean orourke

5:27 am on Mar 14, 2001 (gmt 0)



Marcia, you have just asked the $64,000 question. It is something I've been pondering more than answering. One hunch is that keyword families could be based on existing directories.

Another hunch is to play around with Northern Light and observe their folders. I tried some variations of the examples you listed. NL uses stemming [omsee.com], so differences such as singular/plural do not matter. (However, craft and craftsman are not treated as equal, since "-sman" is not a common suffix.)
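As a rough illustration of the singular/plural point above, here is a naive suffix-stripping stemmer. It is only a sketch: the suffix list and the `naive_stem` name are made up for this example, and nothing here is Northern Light's actual algorithm, which the thread does not describe.

```python
# Hypothetical suffix list; real stemmers use many more rules and exceptions.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching common suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Plural and singular collapse to the same stem.
print(naive_stem("crafts") == naive_stem("craft"))
# "-sman" is not in the suffix list, so "craftsman" stays a separate term.
print(naive_stem("craftsman") == naive_stem("craft"))
```

A stemmer this crude will also mangle words such as "glass", which is one reason stemming can hurt relevancy as easily as help it.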

Here is an example of the folders for "broccoli"

Vitamins, Cancer, Gardening, Herb gardens, Food & cooking, Cookbooks, Fruits & vegetables, Agriculture industry, Vitamin C (ascorbic acid), Columbus Ohio, Recipes

That is about as clean as it gets. Many other words do not fare so well. Also, I'm not sure if you can really look at this in pure hierarchical terms. Hmmm... time to go chase down that cornell.edu link that has been floating around here...

seth_wilde

6:25 am on Mar 14, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"I understand some search engines can recognize keyword relationships and relevancies within "families" or broad groupings."

There has been some speculation about this but from what I've seen, I've come to believe that only some parts of this are in use. The content of linking sites definitely affects the level of the link popularity boost but not to the extent of what you described.

The technology that comes the closest would probably be the term vector database. In this scenario an SE would take a page and run it through a filter. This would remove stop words along with the most frequent and least frequent third of terms in the entire database. They then run the page through a weighting algo to determine the 50 most important words on the page. These 50 words are that page's vector. This makes it easy to then compare linking pages' vectors to see how closely they are related and determine how much link popularity weight that page should be given.
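The steps just described (filter, weight, keep the top terms, compare) can be sketched roughly as follows. This is an illustration under stated assumptions, not any engine's real code: the stop-word list is a toy, TF-IDF stands in for the unspecified "weighting algo", cosine similarity stands in for the unspecified comparison, and all function names are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "and", "the", "of", "for", "with", "to", "in"}  # toy list

def mid_frequency_terms(doc_freq):
    """Drop the least and most frequent thirds of the database vocabulary."""
    ranked = sorted(doc_freq, key=lambda t: (doc_freq[t], t))
    third = len(ranked) // 3
    return set(ranked[third:len(ranked) - third])

def term_vector(text, doc_freq, num_docs, size=50):
    """Weight a page's surviving terms (TF-IDF assumed) and keep the top `size`."""
    keep = mid_frequency_terms(doc_freq)
    counts = Counter(w for w in text.lower().split()
                     if w not in STOP_WORDS and w in keep)
    weights = {t: tf * math.log(num_docs / doc_freq[t]) for t, tf in counts.items()}
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    return dict(top)

def similarity(vec_a, vec_b):
    """Cosine similarity between two term vectors (0.0 if either is empty)."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Here `doc_freq` maps each term to the number of pages in the database containing it, and `num_docs` is the database size. A linking page whose vector scores close to 1.0 against the target page's vector would be treated as highly related.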

This technology is being pursued for something more closely related to what you described, but instead of being used for ranking, it's used to auto-classify pages.

For example: if you determined the term vector of all the pages in a Yahoo category, you could come up with an overall vector for that category. This would allow a spidering engine to display a category next to each listing on a SERP. So if you searched for "doors" you would still get normal listings, but some would be marked as matching the music category and some would be marked as matching the construction category.

Since different users would have different needs when searching for "doors", this would allow users to determine for themselves which page matches their needs, rather than the engine guessing what they're looking for.
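A minimal sketch of that "doors" example: average the vectors of pages already filed in a category, then label a new page with whichever category vector it is closest to. The vectors and weights below are invented for illustration, and averaging plus cosine similarity are assumptions about how an "overall vector" might be built and compared.

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Average a list of {term: weight} vectors into one category vector."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def best_category(page_vec, category_vecs):
    """Return the category whose overall vector is closest to the page."""
    return max(category_vecs, key=lambda name: cosine(page_vec, category_vecs[name]))

# Made-up vectors standing in for pages listed in two directory categories.
categories = {
    "Music": centroid([{"doors": 1.0, "morrison": 2.0, "album": 1.5},
                       {"doors": 1.0, "band": 2.0, "album": 1.0}]),
    "Construction": centroid([{"doors": 1.0, "hinge": 2.0, "oak": 1.0},
                              {"doors": 1.5, "frame": 2.0, "hinge": 1.0}]),
}
print(best_category({"doors": 1.0, "hinge": 1.0}, categories))  # Construction
print(best_category({"doors": 1.0, "album": 1.0}, categories))  # Music
```

Both sample pages mention "doors"; it is the surrounding terms that pull each one toward a different category label.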

I guess as an overall answer to your question I would speculate at this time no search engine can do what you described, but similar projects are in the works.

I would concentrate mostly on the content of pages directly linked to a particular page (both internal and external) by trying to work your important keywords into both the content and link text as much as possible. Then do this to a lesser extent with internal pages a few links away. This should help with most engines, including Google (which seems to concentrate on pages one link away) and theme engines that concentrate more on the overall site.

WebGuerrilla

8:06 am on Mar 14, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>This technolgy is being pursued for something more closely related to what you described, but instead of being used for ranking, it's used to auto classify pages.

I think this is definitely where any relational matching technology will go. I remember a while back that Thunderstone was attempting to build an artificial directory that classified sites into Yahoo-type categories based upon recognizing relationships between groups of words. I'm not sure what ever happened to it, and I can't seem to find the link.

The problem with trying to apply any kind of relationship or category matching algo to traditional search results is determining at what point a related phrase is a better match than a similar page with an exact match.

If someone searches for the term chicken recipes, they may find a site containing poultry recipes a good match, but is it a better match than the thousands of pages that focus specifically on chicken recipes?

I think the sheer number of sites containing tightly focused content that now exist somewhat eliminates the need for a search engine to try to figure out potential matches other than the phrase that was actually searched on.

Using the technology to develop a secondary category structure is the only thing that makes sense. Of course, moving forward with developing a technology that eliminates the revenue stream provided by partners like LookSmart doesn't seem likely anytime soon. :(

mivox

7:50 pm on Mar 14, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here [search.thunderstone.com] is a link to Thunderstone's 'about' page...

That's about all I have to contribute... it's morning for me, and this thread is making my head spin. :)

Marcia

12:49 am on Mar 15, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the Thunderstone link mivox - very interesting. Some get dizzy, and others get insomnia from threads like this!

>developing a technology that elimnates the revenue stream provided by partners like LookSmart doesn't seem likely

Currently not appearing to threaten to affect the integrity of Google, WebGuerrilla, and maybe FAST, which are not now dependent on that specific type of revenue.

Hopefully they will continue to remain free of the need for it, so that there will remain available for us some source of objectively accurate search results that are not subject to the corruption and distortions caused by financial manipulations - just mho, but to me in a perfect world there would be level playing grounds both in sports and business.

It's a point well taken, and I'm personally glad it came up in this context. It's an issue that certainly has a lot of impact - I would not have thought about it. I just did get finished looking at 3 pages of paid directory listings before the first Ink results were shown. The directory titles and descriptions were certainly not as relevant or well done as the ones that followed, in this category.

All proving a minor but significant point, that well done search engine technology is far more capable of delivering good search results than poorly done directories.

Now I can see that for Inktomi it's best to choose words and phrases for emphasis that do not have many directory listings, if it's MSN that's being targeted.

Did we get to an answer on which of the search engines are utilizing stemming, or groupings, and which are not?

sean orourke

1:16 am on Mar 15, 2001 (gmt 0)



> Did we get to an answer on which of the search engines are utilizing stemming, or groupings, and which are not?

Stemming yea/nay can be found at the familiar Search Engine Watch Features for Searchers [searchenginewatch.com] page. Grouping of "families" is a whole different matter, one that probably lends itself more to speculation than yea/nay at this point.

wisser

8:45 am on Mar 15, 2001 (gmt 0)



Finding groups of related words is really hard. Most words relate to many groups, and from just one word you can't decide. So
'broccoli' can relate to cooking, gardening or maybe genetics. That means three totally different groups of words.
How do you get the groups? We already heard about the vector database. It is mainly based on the links between documents.
But the really hard thing is how to decide which group gets displayed first.

WebGuerrilla

8:21 am on Mar 16, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>Currently not appearing to threaten to affect the integrity of Google...

My comment was really only directed at AltaVista. (can you tell I'm a bitter ex-user?)

What is really sad is the fact that AV was the pioneer in this type of technology. If you read section 3.3 of the original term vector paper [www9.org] they wrote, you will find that the type of classification system that groups pages into categories has already been built.

Implementing this type of system combined with the use of click tracking (to determine which category or group gets displayed first) could end up producing the most accurate and user friendly search engine yet. Unfortunately, AV's recent financial problems will probably prevent a solid working version from ever being released.
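To make the click-tracking idea concrete, here is one way the ordering of category groups could be derived from click data. It is purely a sketch: the counts are invented, and the smoothing constants (`prior_clicks`, `prior_views`) are an assumption added so that a category with only a handful of impressions cannot jump to the top on one lucky click.

```python
def order_categories(stats, prior_clicks=1, prior_views=20):
    """Sort category names by smoothed click-through rate, highest first.

    `stats` maps a category name to a (clicks, impressions) pair.
    """
    def rate(name):
        clicks, views = stats[name]
        return (clicks + prior_clicks) / (views + prior_views)
    return sorted(stats, key=rate, reverse=True)

# Hypothetical click data for the groups shown on a search for "doors".
doors_stats = {
    "Music": (120, 1000),
    "Construction": (300, 1000),
    "Film": (1, 5),
}
print(order_categories(doors_stats))  # ['Construction', 'Music', 'Film']
```

Without the smoothing, "Film" would rank second on a raw rate of 1 in 5 despite having almost no data behind it.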

The question then becomes who, if anyone, will pick up the ball and run with it? Google seems like the obvious answer, but at some point in the very near future, Google will also have to deal with the reality that producing quality free results doesn't seem to translate into profits.

All in all, I think the development of the majority of new "3rd generation" search technology is going to come to a screeching halt. The current financial climate will force most to scale back to a leaner, simpler (and more profitable) operation, which will generally mean a move back to an old-school type of spider combined with a revenue partner like LookSmart.

Robert Charlton

7:52 am on Mar 21, 2001 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



>>I remember awhile back that Thunderstone was attempting to build an artifical directory that classified sites into Yahoo type categories based upon recognizing relationships between groups of words. I'm not sure what ever happen to it, and I can't seem to find the link.<<

I thought I'd remembered that Dogpile's directory had been built by Thunderstone, but I just checked and they credit it to Infospace.

Here's a link to a Dogpile Web Catalog info page, and two paragraphs that might be of general interest:

[dpcatalog.dogpile.com ]

The 'Distillation' process
We download the latest .com, .net, .org and .gov domain listings on a regular basis. We then crawl each site to obtain their web pages. Then, each site's pages are examined as a whole to determine the principal subject matter areas that would best characterize the entire site using our automatic categorization technology. Additions and updates are performed at the rate of about 100 sites per minute. The growth rate is about 300,000 newly discovered sites per week.

The Categorization process
After a site's content has been acquired they are passed to Our Automated Categorization Engine. This process seeks to identify the general classifications under which a site belongs. The % figure that follows a site's category indicates the degree of confidence that the categorization engine had in its answer.

Brett_Tabke

9:05 am on Mar 21, 2001 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Dogpile is owned by Infospace (go2net.com/metacrawler..etal)

> beef, lamb and pork are all meat,

Yes, it is called a term vector relational database.
How can they build such a database without an entire team of linguists? Far more easily than you would believe. Fairly low tech, in fact. They reverse engineer it from the search terms themselves:

Bob searches for "beef"
Then he searches for "lamb"

There is now a mathematical relationship between beef and lamb. A 1-to-1 correlation. That alone isn't enough to build a quality term vector relational db, but when you consider the volume of searches a day, they can build vector relationships very fast.

Let's give the above style of two searches a value of "5".

Now Sally does a search for "beef steak"
Next she searches for "beef market"
And later, "market trends"

Let's give that search style a value of "20". Now we have a small vector database:

beef is to lamb: 5 points
beef is to steak: 20 points
beef is to market: 20 points
market is to beef: 20 points
market is to trends: 20 points
market is to steak: 5 points
trends is to beef: 5 points
trends is to lamb: 0 points
market is to lamb: 0 points

Multiply that by 20 to 70 million searches per day, and a relational term vector database can be built in a week using off-the-shelf database software.
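The point table above can be reproduced with a few lines of code. The scoring rules here are one reading of the example rather than anything stated outright: terms typed together in a single search get 20 points, terms appearing in back-to-back searches by the same user get 5, and within one user's session only the stronger of the two counts for a given pair. The names `session_scores` and `build_relations` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product

SAME_QUERY = 20   # terms typed together in one search
NEXT_QUERY = 5    # terms in back-to-back searches by the same user

def session_scores(queries):
    """Score term pairs for one user's ordered list of search queries."""
    scores = {}

    def bump(a, b, points):
        if a != b:
            pair = tuple(sorted((a, b)))
            scores[pair] = max(scores.get(pair, 0), points)

    term_lists = [q.lower().split() for q in queries]
    for terms in term_lists:
        for a, b in combinations(terms, 2):
            bump(a, b, SAME_QUERY)
    for prev, curr in zip(term_lists, term_lists[1:]):
        for a, b in product(prev, curr):
            bump(a, b, NEXT_QUERY)
    return scores

def build_relations(sessions):
    """Add up pair scores across every user's session."""
    totals = defaultdict(int)
    for queries in sessions:
        for pair, points in session_scores(queries).items():
            totals[pair] += points
    return dict(totals)

relations = build_relations([
    ["beef", "lamb"],                                # Bob
    ["beef steak", "beef market", "market trends"],  # Sally
])
print(relations[("beef", "steak")])           # 20
print(relations[("market", "steak")])         # 5
print(relations.get(("lamb", "trends"), 0))   # 0
```

Pairs are stored alphabetically, so "market is to beef" and "beef is to market" share one entry. With millions of sessions a day the totals would grow quickly, which is the point of the example.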

Google's Krishna Bharat talks about the Capture of Search Context to Support Web Search [www9.org] (just what I detailed above in layman's terms).

Where else can they apply it, and to what data? The easiest way to determine if an SE is using a topic or keyword relational database is to look on their own results pages. If they have some type of "show related" option, then they have a term vector database; that is where that data is generated.

Alta: Yes, in depth. Also their partnership with Teragram is interesting. How deep they are taking it is unknown. alta v3 engine press release [doc.altavista.com]
Google: PageRank is really just a relational database using weighted link data as its core. How far they take that is open to interpretation. I had believed they were using a full term vector db style analysis for each page. However, direct from an engineer: we do no such thing on the whole page, only on the link text.
Ink: Yes. Little question to me that Ink does. They have that "topic spider" that can relate data in broad terms or micro terms. It is quite good. That is all built from relational databases. Ink Gen3 SE press release [www1.inktomi.com].
Direct Hit: Yes, however the name of the company that they work with escapes me. (Anyone?) There is a linguistic company that does their keyword work...
Fast: No, but stay alert for changes rsn (real soon now).
NL: Don't think so.
Excite: Yes, and very good at that.

One of the most important modern papers on "topic engines" by Sougata Mukherjea of NEC: Analyzing Topic-Specific Web Information [www9.org] (be sure to notice Figures 8 and 9).

Marcia

9:33 am on Mar 21, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There's a LexiQuest logo on bottom of some of the HotBot search pages - is that possibly the company?

Brett_Tabke

8:55 pm on Mar 29, 2001 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Yes, that was it: LexiQuest [LexiQuest.com]. They also did Infoseek for a while, if I'm not mistaken.

Brett_Tabke

1:19 pm on Jan 8, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



*ahem*

IITian

6:36 pm on Jan 8, 2004 (gmt 0)

10+ Year Member



Since this old topic has been brought up, with so many excellent analyses, I will just add one observation of mine.

About two weeks ago, looking through my referral file, I found that a visitor had made it to my site by searching for "companyname internship". That surprised me, since my site has nothing about "internship" even though "companyname" is present. When I recreated the search, the word "international" was highlighted in the SERP snippets.

Therefore, Google seems to treat internship and international as having the same stem.

Why? My guess is that it's because the first six characters are the same.
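That guess is easy to test in miniature. The sketch below shows what a stemmer would do if it simply truncated every word to its first six characters. It only illustrates the guess; nothing in this thread confirms how Google actually stems, and `prefix_stem` is a hypothetical name.

```python
def prefix_stem(word, length=6):
    """Truncate a word to its first `length` characters (illustration only)."""
    return word.lower()[:length]

# Both words collapse to the same six-character prefix.
print(prefix_stem("internship"), prefix_stem("international"))  # intern intern
```

Under such a rule "internal" and "internet" would fall into the same bucket, so pure truncation would produce many unrelated matches.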

rmjvol

7:05 pm on Jan 8, 2004 (gmt 0)

10+ Year Member



geeze, talk about a needle...

funny how things change & yet stay the same ;)

Marcia

10:40 pm on Jan 31, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



*ahem*

That which was, is and will be again. Funny how this one just passed right on by without a stir! ;)

totter

10:12 am on Feb 3, 2004 (gmt 0)

10+ Year Member



Probably a couple of dumb questions but,

Would you guys consider Google Sets a good example of this?

If so, do you think that Google Sets taps into the entire group of Google searches?

macrost

2:38 pm on Feb 4, 2004 (gmt 0)

10+ Year Member



From reading this thread, (thanks Brett for bringing it back to life) it does seem that sets does actually incorporate this more than their regular search.

(Too early in the morning, me head is already spinning!)
;)

Mac

broniusm

6:18 pm on Feb 13, 2004 (gmt 0)

10+ Year Member



I have been reading a really interesting thread about the Austin Update in which Pimpernel describes Google weighting "authority sites" first [webmasterworld.com] in a search result, as in auto-categorizing your search phrases into more than just keywords. I would agree that they're probably using their own Google Sets [labs.google.com] to achieve this.

Marcia

5:45 am on Feb 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is always an exciting topic, more and more so as time goes on and technology develops. And it's so fundamental that it's literally timeless. You can almost pick up where it leaves off, just filling in the pieces along the way to get up to date on it.

WG

...type of classification system that groups pages into categories has already been built.

Implementing this type of system combined with the use of click tracking (to determine which category or group gets displayed first) could end up producing the most accurate and user friendly search engine yet.

How much capability has the emergence of toolbar data made available to Google, along with the topical targeting of AdSense, clickthroughs, and the stats that come along with them?

BT

Bob searches for "beef"
Then he searches for "lamb"

Following right along with Bob, he then goes to a few sites, clicks on an AdSense ad, stays at the next site he arrives at, backs up to the previous one, or goes back and searches further - all with his toolbar motor running.

Then there's Yahoo having all that technology in hand with their acquisitions, moving forward with Yahoo Labs

[webmasterworld.com...]

and remarking recently that they'll be focusing on Local search.

Not only will they have to decipher whether the searcher wants market research or the meat market, they'll also have to eventually deliver to him the right Hollywood, depending on whether he's in Florida or California.