
Forum Moderators: Robert Charlton & goodroi


Can TF-IDF or Keyword Density tools help rankings?

     
2:55 pm on Jan 7, 2019 (gmt 0)

Administrator from US 

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:June 21, 2004
posts:3446
votes: 335


I was recently asked about TF*IDF & its possible impact on Google rankings, so I figured it would be more efficient to post my opinion here and let others share theirs. If you don't know about TF*IDF, the oversimplified answer is that it's basically a more sophisticated way of looking at keyword density & comparing it to other pages. Google has not said it uses TF*IDF in its secret ranking formula with over 100 moving pieces, and Google has said keyword density is not a very useful ranking metric for them.

So what's my opinion on TF*IDF?

Yes, TF*IDF has helped me improve rankings for some sites. I am not saying it should be chased blindly; rather, I use it as one of many general quality control metrics. It indirectly helps me provide a better experience, which helps with many different ranking factors.

TF*IDF & even simple keyword density tools can help authors make sure they didn't forget to cover an important aspect of a topic. You spend so much time writing & editing an article that you can sometimes forget to mention things or repeat something too much.

I don't see the danger in pushing a button and immediately seeing which keywords/phrases you forgot to use or are overusing. That insight can help improve the quality of the page. I've gained more traction from users who have commented that my content is more thorough than the competition and more enjoyable to read. That's in part because I use these keyword tools for my general QC process.
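As a rough illustration of that kind of push-button check, here is a minimal sketch. It assumes plain-text copies of your page and a few reference pages; the function names and the `overuse_ratio` threshold are made up for this example and are not taken from any particular tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def keyword_density(text):
    """Return each word's share of the total word count (0.0 to 1.0)."""
    words = tokenize(text)
    if not words:
        return {}
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}

def coverage_report(my_text, reference_texts, overuse_ratio=2.0):
    """Compare one page against reference pages.

    Returns (missing, overused):
      missing  - words used on every reference page but absent from mine
      overused - words whose density on my page is more than overuse_ratio
                 times their average density on the reference pages
    The threshold is an arbitrary assumption, not a known ranking rule.
    """
    if not reference_texts:
        return set(), set()
    mine = keyword_density(my_text)
    refs = [keyword_density(text) for text in reference_texts]
    common = set.intersection(*(set(ref) for ref in refs))
    missing = common - set(mine)
    overused = set()
    for word, density in mine.items():
        average = sum(ref.get(word, 0.0) for ref in refs) / len(refs)
        if average > 0 and density > overuse_ratio * average:
            overused.add(word)
    return missing, overused
```

A real tool would also filter stop words and handle phrases, but even this much is enough to flag a forgotten subtopic or an overworked word.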

I don't care if Google is looking at this metric or ignoring this metric. It's useful to me so I use it and my rankings have gone up. Is that because my TF*IDF more closely matches what Google wants or is it because my content quality has improved and attracts more repeat users & backlinks? Honestly, it doesn't matter to me.

I would not blindly chase after some magic TF*IDF number, and I have had many pages rank with poor TF*IDF. I still like these keyword tools because they are usually an easy way to run a more effective QC process and deliver better content.

What do you think?
3:53 pm on Jan 7, 2019 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38200
votes: 96


I believe that with the overall devaluation of linking as a reliable metric, Google has to lean on other factors. Google has rewritten the web in its PageRank image (e.g., linkage is currency and no one does it on the up-and-up anymore).
So, what other factors can Google use? I think it comes back to on-the-page metrics. Regardless of all the fancy AI at work, you still have to evaluate the page to know what the page is about. I still occasionally tweak pages with a keyword density analyzer. As goodroi says, it helps you find missed keywords and helps you highlight those you want to.
4:03 pm on Jan 7, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts:2419
votes: 647


@goodroi your simple description of the "tool" (it's more of a metric than a tool) and of how you use it is great and makes perfect sense. But it is oversimplified and fully misses the crux of TF-IDF.

TF-IDF is a metric used in data science and natural language processing. It measures the frequency of some term in a document (web page) and compares it to the average frequency with which the term appears in a larger collection of related documents. Take a document with the term "red socks": is the document about sports or fashion? TF-IDF can help you answer that question. Take the term, measure its frequency in the document, then compare that frequency to the frequency with which the term appears in sports documents and in fashion documents. If it is closer to one, choose that one; otherwise choose the other. If it is near neither, conclude that it is neither. If it is equally near both, then pick at random? As you can see, this is an imperfect metric on many levels.
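For reference, the usual textbook formulation multiplies a term's frequency in the document by the logarithm of the inverse fraction of corpus documents that contain the term. The sketch below assumes that formulation, a very simple tokenizer, and single-word terms only; the function names are illustrative, and nothing here is claimed to match what any search engine computes.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tf_idf(term, document, corpus):
    """TF-IDF of a single-word term for one document against a corpus.

    tf  = occurrences of the term / number of words in the document
    idf = log(number of documents / number of documents containing the term)
    """
    words = tokenize(document)
    if not words:
        return 0.0
    tf = Counter(words)[term] / len(words)
    containing = sum(1 for doc in corpus if term in tokenize(doc))
    if containing == 0:
        return 0.0  # the term never appears in the corpus, so IDF is undefined
    idf = math.log(len(corpus) / containing)
    return tf * idf
```

A word that appears in every document of the corpus scores zero however often it is used, while a word that is rare across the corpus is weighted up. That is why the choice of corpus matters so much to the result.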

What group of documents is being used as the reference group (corpus)? Are you using the same corpus as Google, Bing, etc.?
How is term frequency measured? Are stop words removed ("the", "a", "an", "and"... the most commonly occurring short words)? Are only nouns counted?
What about misspellings, intentional or not? "red socks" vs "red sox": are those corrected or counted as unique?
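To show how much these choices move the number, here is a small sketch that measures one term's frequency under different preprocessing settings. The stop-word list contains only the four words named above, and the spelling table is a hypothetical stand-in for whatever correction a real system might apply.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and"}   # only the examples named above
SPELLING_VARIANTS = {"sox": "socks"}     # hypothetical normalization table

def term_frequency(text, term, remove_stop_words=False, normalize_spelling=False):
    """Share of the (optionally preprocessed) words that equal the term."""
    words = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if normalize_spelling:
        words = [SPELLING_VARIANTS.get(w, w) for w in words]
    if not words:
        return 0.0
    return Counter(words)[term] / len(words)

text = "The red sox and the red socks"
plain = term_frequency(text, "socks")
no_stops = term_frequency(text, "socks", remove_stop_words=True)
merged = term_frequency(text, "socks", remove_stop_words=True,
                        normalize_spelling=True)
```

For this one sentence the frequency of "socks" rises from 1/7 to 1/4 to 1/2 depending on the settings, so two tools can report quite different figures for the same page.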

What are you really measuring, and how relevant is it to what Google is measuring?

But the bottom line is that this metric is used to determine relevance; there is absolutely no measure of "quality". In fact, if one is missing three occurrences of "red socks" for the document to be considered a sports document, one could simply add a "sentence" to the end of the document: "red socks, red socks, red socks!".
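The point is easy to demonstrate with a bare phrase counter. This is only a sketch of a raw frequency count, with made-up sample text; it says nothing about how a search engine would treat the stuffed sentence.

```python
import re

def phrase_count(text, phrase):
    """Count non-overlapping, case-insensitive occurrences of a phrase."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))

original = "Our red socks are knitted from merino wool."
stuffed = original + " Red socks, red socks, red socks!"

before = phrase_count(original, "red socks")
after = phrase_count(stuffed, "red socks")
```

The count goes from 1 to 4 even though the added "sentence" gives the reader nothing, which is exactly why a frequency score on its own cannot stand in for quality.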

Why waste your time with this? Is Google having trouble classifying your web page as a baseball page when it is really about socks that are red? I doubt it. Is it leading to more natural text? Maybe, but my guess is that it is not. Google said that they don't use it, and it is certainly not a measure of quality.

The only reason to use it is that it sounds really cool, Term Frequency by Inverse Document Frequency, and most people do not understand how or why it works, and that makes you sound like a really smart and sophisticated SEO.
4:46 pm on Jan 7, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4982
votes: 42


I'm not sure which data source you're using for the IDF score, but it's worth mentioning that Mojeek provides exact search result counts rather than ballpark figures like other major search engines (with a couple of caveats: I work for them, and their index has 2 billion pages and is primarily English/European pages, so it is more effective in that area).
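If you do have an exact result count and a known index size, an IDF figure could be estimated along these lines. This is a sketch under the assumption that the result count equals the number of indexed pages containing the term; the function name is made up, and you would enter the counts by hand, since no search engine API is called here.

```python
import math

def idf_from_result_count(result_count, index_size):
    """Estimate IDF as log(index size / pages that contain the term)."""
    if result_count <= 0 or index_size <= 0:
        raise ValueError("counts must be positive")
    return math.log(index_size / result_count)

INDEX_SIZE = 2_000_000_000  # the 2 billion page index size mentioned above

# Hypothetical result counts, purely for illustration.
common_term_idf = idf_from_result_count(200_000_000, INDEX_SIZE)
rare_term_idf = idf_from_result_count(2_000, INDEX_SIZE)
```

The rarer the term, the higher the IDF, so ballpark result counts translate directly into ballpark IDF scores.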
5:03 pm on Jan 8, 2019 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Sept 10, 2018
posts: 184
votes: 27


There are some interesting additional complexities to add to this, because there are three types of keywords:

Exact match: "red apples"
Partial match: "red" ... "apples" ... "apple"
LSI keywords: "red delicious", "pink lady", "braeburn", "gala"

Another question is how different tools measure exact match keywords. For example, a complete sentence such as "We grow the greatest apples in shades of red and green." might be measured by one keyword tool as an exact match because it contains both "red" and "apples", whereas another might measure only exact phrases and count it differently, as partial matches.
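The two counting styles can be sketched side by side. Both functions are illustrative guesses at how a keyword tool might work, not a description of any specific tool or of Google.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def exact_matches(text, phrase):
    """Count places where the phrase's words appear adjacent and in order."""
    words, target = tokenize(text), tokenize(phrase)
    n = len(target)
    if n == 0:
        return 0
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)

def contains_all_words(text, phrase):
    """True if every word of the phrase appears somewhere in the text."""
    return set(tokenize(phrase)) <= set(tokenize(text))

sentence = "We grow the greatest apples in shades of red and green."
strict = exact_matches(sentence, "red apples")
loose = contains_all_words(sentence, "red apples")
```

The strict counter finds no match in the example sentence while the loose one reports a hit, so the same page can look under-optimized in one tool and fine in another.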

We can't be sure exactly how Google is measuring keywords but we can make a guess by careful study of our competitors and through experimentation.

But the bottom line is that this metric is used to determine relevance; there is absolutely no measure of "quality". In fact, if one is missing three occurrences of "red socks" for the document to be considered a sports document, one could simply add a "sentence" to the end of the document: "red socks, red socks, red socks!".


Google has separate measures for quality, so to rank well you must cover all of your bases; you can't rank on a minority of factors. Unfortunately Google seems to be favouring keywords quite strongly at the moment. Keywords aren't incompatible with quality, though, if they make you think about the depth of your topic and rewrite on that basis.

Though I can't confirm it, I suspect writing "red socks, red socks, red socks" in one sentence could well count as only one mention of red socks, or might even trigger a penalty score for that sentence. I learned the hard way that accidentally duplicating words in the URL or page title can cause a penalty, even if they're part of your brand name, so I don't see why they wouldn't be doing the same trick on the body text, even down to the paragraph and sentence level.

Regarding keyword placement: you have to treat each element on the page separately, as well as look at the overall frequency. As I've said before, don't go much over 1.5x the SERP top 10 average (removing anything that looks like query deserves diversity first). You will get a spam penalty for going too high. Try to mimic the location of the keywords used by your competitors in heading tags and body text. You can use partial match keywords much more liberally than exact match, and you can fill your page with plenty of LSI keywords without penalty *at the moment* (the competitor at the top of my niche has pages covered in almost pure LSI keyword spam, not even in sentences). That's not to say you should repeat the same LSI keywords; just add all the variety you can think of.
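The 1.5x rule of thumb is simple to check once you have the counts. The sketch below assumes you have already counted the keyword on each top 10 page and removed any outliers by hand; the 1.5 multiplier is my own guideline from above, not a documented threshold, and the sample counts are invented.

```python
def keyword_ceiling(competitor_counts, multiplier=1.5):
    """Upper bound: multiplier times the competitors' average keyword count."""
    if not competitor_counts:
        raise ValueError("need at least one competitor count")
    return multiplier * sum(competitor_counts) / len(competitor_counts)

def is_over_ceiling(my_count, competitor_counts, multiplier=1.5):
    """True if my keyword count exceeds the ceiling."""
    return my_count > keyword_ceiling(competitor_counts, multiplier)

# Hypothetical keyword counts for the ten pages ranking for a query.
top_10_counts = [4, 6, 5, 5, 3, 7, 5, 4, 6, 5]
ceiling = keyword_ceiling(top_10_counts)
```

With an average of 5 mentions the ceiling is 7.5, so 7 mentions would pass and 8 would be flagged. Run the check separately for each page element (title, headings, body) rather than only on the page total.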

Google seems to pick up LSI keywords from looking at the web pages already ranking and finding additional *meaningful* words in common. I have an anecdote about this too: I've been having real trouble ranking for the short tail in my niche. I'm in a niche where all of my page titles necessarily contain the short tail keyword along with long tail specific keywords, and one of my pages in particular was causing keyword cannibalisation issues. I recently managed to confirm to myself that Google really really loves to reward LSI keywords, and that the page causing a conflict contained an LSI keyword in the title. Everyone else in my niche was using that keyword on many pages, whereas I had a specific page dealing with that topic, and Google was trying to rank that page instead of my home page, even though it had less overall relevancy! I had to move the copy from it to my home page and 301 redirect it to get my home page to appear on the short tail.