Moderators: phranque

SEM Research Topics Forum

    
Question about term frequency/IDF
chornbeck




msg:3126501
 3:48 am on Oct 19, 2006 (gmt 0)

I want to make sure I understand this right. My site is centered around a keyword that happens to be the name of a city. My site is a guide to that city, with hundreds of pages of content.

According to the term frequency-inverse document frequency (tf-idf) model of text retrieval that most search engines use, if I understand it right, I shouldn't use the name of the city on ANY of the site's pages (ideally) except the home page.

What I understand about tf-idf is that the more the word is used throughout the entire corpus (site), the lower its IDF and hence the lower the overall relevancy score.

Is this a correct line of thinking?

 

Marcia




msg:3126508
 4:18 am on Oct 19, 2006 (gmt 0)

I don't think it implies at all that the keyword should be used only on the homepage. And BTW, IDF also comes up a lot in connection with LSI.

Here's a basic paper that gives a clear definition of IDF and covers a few other related concepts:

[www-static.cc.gatech.edu...]

ciml




msg:3126833
 11:48 am on Oct 19, 2006 (gmt 0)

"What I understand about tf-idf is that the more the word is used throughout the entire corpus (site), the lower its IDF and hence the lower the overall relevancy score."

In this case, the corpus would be the Web as indexed by that engine.
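
For reference, one common textbook form of the weighting looks like this. It is only a sketch: the symbols are the usual ones (N for the number of documents in the engine's index, df(t) for the number of those documents containing term t, tf(t, d) for the count of t in document d), and real engines use variants of it alongside many other signals.

$$
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Since N and df(t) are counted over the whole index, repeating a word across one site's few hundred pages barely changes df(t) for a corpus the size of the Web.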

So in an engine using IDF, if you searched for [widgets in cityname], whichever of widgets or cityname is rarer in the index would be given more weight when matching documents.
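
A minimal Python sketch of that point, assuming the plain log(N / df) form of IDF. The six-page "index", the page texts, and the function names are all made up for illustration; no real engine is claimed to work exactly like this.

```python
import math

# Hypothetical toy "index": each string stands in for one indexed page.
index = [
    "cityname hotels and restaurants",
    "cityname weather forecast",
    "cityname events calendar",
    "widgets in cityname",
    "history of the railway",
    "garden planting guide",
]

def idf(term, docs):
    """log(N / df): N = pages in the index, df = pages containing the term."""
    df = sum(1 for doc in docs if term in doc.split())
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    """Raw count of the term in one page, scaled by its IDF over the index."""
    return doc.split().count(term) * idf(term, docs)

# Compare the two query terms from [widgets in cityname].
weights = {term: idf(term, index) for term in ("widgets", "cityname")}
for term, weight in weights.items():
    print(term, round(weight, 3))
```

In this toy index "widgets" appears on one page and "cityname" on four, so "widgets" gets the larger IDF. A page that mentions "cityname" still scores above zero for it, while a page that never mentions it scores zero.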

I think that most people would agree that it helps search engines if you mention the city name on pages about that city.

chornbeck




msg:3126998
 2:21 pm on Oct 19, 2006 (gmt 0)

Excellent. My mistake was considering my entire site as the corpus rather than the entire index. That makes much more sense... That's what I get for tackling these kinds of topics when I'm sleepy.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved