| Question about term frequency/IDF
|
chornbeck

msg:3126501 | 3:48 am on Oct 19, 2006 (gmt 0) | I want to make sure I understand this right. My site is centered around a keyword that happens to be the name of a city. My site is a guide to that city, with hundreds of pages of content. According to the term frequency-inverse document frequency (tf-idf) model of text retrieval that most search engines use, if I understand it right, I shouldn't use the name of the city in ANY of the site's pages (ideally) except the home page. What I understand about tf-idf is that the more the word is used throughout the entire corpus (site) the higher the IDF and hence a lower overall relevancy score. Is this a correct line of thinking?
|
Marcia

msg:3126508 | 4:18 am on Oct 19, 2006 (gmt 0) | I don't think it implies at all that the city name should be used only on the homepage. And BTW, IDF is also frequently mentioned in conjunction with LSI. Here's a basic paper that gives a clear definition of IDF and covers a few other related concepts: [www-static.cc.gatech.edu...]
|
ciml

msg:3126833 | 11:48 am on Oct 19, 2006 (gmt 0) | | What I understand about tf-idf is that the more the word is used throughout the entire corpus (site) the higher the IDF and hence a lower overall relevancy score. |
| In this case, the corpus would be the Web as indexed by that engine, not your own site. So in an engine using IDF, if you searched for [widgets in cityname], then either widgets or cityname would be given more weight when matching documents, depending on which is rarer across that index. I think most people would agree that mentioning the city name on pages about that city helps search engines.
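To make that concrete, here is a minimal sketch of IDF computed over a toy index. The five-document `index`, the `idf` helper, and the plain log(N/df) formula are all assumptions for illustration; real engines use their own variants and a vastly larger corpus.

```python
import math

def idf(term, documents):
    """Plain inverse document frequency: log(N / df).

    N is the number of documents in the corpus and df is the number
    of documents that contain the term at least once.
    """
    df = sum(1 for doc in documents if term in doc.split())
    if df == 0:
        return 0.0  # term never appears; treat as carrying no weight here
    return math.log(len(documents) / df)

# Hypothetical toy "index" standing in for everything the engine has crawled.
index = [
    "cityname hotels and restaurants",
    "cityname weather forecast",
    "cityname widgets shop",
    "history of cityname",
    "gardening tips",
]

idf_city = idf("cityname", index)    # appears in 4 of 5 documents
idf_widgets = idf("widgets", index)  # appears in 1 of 5 documents

print(f"idf(cityname) = {idf_city:.3f}")
print(f"idf(widgets)  = {idf_widgets:.3f}")
```

In this toy index "widgets" is the rarer term, so it gets the larger IDF and would count for more in a query like [widgets in cityname]. IDF is a property of the term across the whole index, so a page only picks up that weight (via the tf part) if it actually contains the term.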
|
chornbeck

msg:3126998 | 2:21 pm on Oct 19, 2006 (gmt 0) | Excellent. My mistake was considering my entire site as the corpus rather than the entire index. That makes much more sense... That's what I get for tackling these kinds of topics when I'm sleepy.
|