Question about term frequency/IDF

     

chornbeck

3:48 am on Oct 19, 2006 (gmt 0)

5+ Year Member



I want to make sure I understand this right. My site is centered around a keyword that happens to be the name of a city. My site is a guide to that city, with hundreds of pages of content.

If I understand it right, the term frequency-inverse document frequency (tf-idf) model of text retrieval that most SEs use implies that, ideally, I shouldn't use the name of the city on ANY of the site's pages except the home page.

What I understand about tf-idf is that the more pages in the entire corpus (site) use the word, the lower its IDF and hence the lower its overall relevancy score.

Is this a correct line of thinking?

Marcia

4:18 am on Oct 19, 2006 (gmt 0)

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



I don't think tf-idf implies at all that the city name should be used only on the home page. And BTW, IDF also comes up often in connection with LSI.

Here's a basic paper that gives a clear definition of IDF and covers a few other related concepts:

[www-static.cc.gatech.edu...]

ciml

11:48 am on Oct 19, 2006 (gmt 0)

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



What I understand about tf-idf is that the more pages in the entire corpus (site) use the word, the lower its IDF and hence the lower its overall relevancy score.

In this case, the corpus would be the Web as indexed by that engine.

So in an engine using IDF, if you searched for [widgets in cityname], then either widgets or cityname would be given more weight when matching documents, depending on which is rarer across the index.
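To make that concrete, here is a minimal Python sketch. It is not how any particular engine actually scores pages. It assumes the textbook definition idf = log(N / df), where N is the number of documents in the index and df is the number of documents containing the term. The `idf` helper and the tiny five-document index are made up for illustration.

```python
import math

def idf(term, docs):
    """Textbook inverse document frequency: log(N / df).

    `docs` is a list of sets of tokens (a hypothetical toy index).
    Returns 0.0 for a term that appears in no document.
    """
    df = sum(1 for doc in docs if term in doc)  # documents containing the term
    if df == 0:
        return 0.0
    return math.log(len(docs) / df)

# Hypothetical index: the corpus is everything the engine has indexed,
# not just one site. "cityname" is common here, "widgets" is rare.
index = [
    {"cityname", "hotels"},
    {"cityname", "restaurants"},
    {"cityname", "widgets"},
    {"cityname", "museums"},
    {"weather", "forecast"},
]

query = ["widgets", "cityname"]
weights = {term: idf(term, index) for term in query}

# The rarer query term gets the larger weight.
rarest = max(weights, key=weights.get)
print(weights, rarest)
```

In this toy index "widgets" appears in one document out of five and "cityname" in four, so "widgets" gets the larger weight. Against a whole Web index the balance could easily go the other way.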

I think that most people would agree that mentioning the city name on pages about that city is helpful for search engines.

chornbeck

2:21 pm on Oct 19, 2006 (gmt 0)

5+ Year Member



Excellent. My mistake was considering my entire site as the corpus rather than the entire index. That makes much more sense... That's what I get for tackling these kinds of topics when I'm sleepy.