Term Frequency-Inverse Document Frequency (TF-IDF) is a metric that search engines use, in various forms, in their search algorithms.
The standard way we are used to deciding how often to use a keyword that a page is being optimized for relies on the simple concept of keyword density.
Keyword Density is a simple metric and can be calculated easily and quickly…
Keyword Density = (number of occurrences of keyword on page) / (total number of words on page) x 100
Most SEO experts will tell you to keep your Keyword Density ratio at around 1.5% to 3% and as high as 4% at times.
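If you want to check the number yourself, here is a minimal Python sketch of the keyword density formula above. The function name keyword_density and the sample page are made up for illustration, and it only handles single-word keywords with a very naive word splitter.

```python
import re

def keyword_density(text: str, keyword: str) -> float:
    """Keyword density as a percentage: occurrences / total words x 100."""
    # Naive tokenizer (an assumption): lowercase runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    occurrences = words.count(keyword.lower())
    return occurrences * 100 / len(words)

# Hypothetical 100-word page that uses the keyword "tea" 3 times.
page = "tea " * 3 + "filler " * 97
print(keyword_density(page, "tea"))  # 3.0
```

A real page would need HTML stripped first, and multi-word keywords would need phrase matching rather than a simple word count.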
What is TF-IDF?
TF-IDF is an algorithmic concept that has long been used in linguistics and information retrieval because it can process large quantities of data quickly and efficiently. It has been used in many research fields.
By studying the occurrence of keywords on a specific page and comparing it to the occurrence of those keywords across a larger set of pages, the algorithm can start grouping similar pages together and build an entity profile for each page.
The algorithm helps Google understand what a page is about and which concepts it discusses by comparing the occurrences of keywords on that page with other pages in its database that it has already grouped together.
TF (Term Frequency) is a pretty simple concept to understand, IDF (Inverse Document Frequency) is slightly more complex, and their combination, TF-IDF, is indeed complicated. Thankfully you don’t need to do any complex math, as the tools I talk about in this post do all the heavy lifting for you!
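TF-IDF can be calculated using the following formula:
TF-IDF = TF x IDF, where IDF = log (total number of documents / number of documents that contain the keyword)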
With TF-IDF, Google can determine which terms are topically relevant or irrelevant by analysing how frequently a term appears on a page (term frequency, TF) and comparing it to how often it is expected to appear on an average page, based on the much larger set of documents it has in its index (inverse document frequency, IDF).
Google determines the relevancy of a page to a keyword or query by analysing the pages in its index against a number of specific features that it sees as relevant to the query.
These “features” are usually the presence or absence of certain terms and phrases on a page – and their prominence on the page as compared to all the other pages on the web.
In this manner Google understands the “context” in which the keyword was used in the page and matches up its use with other pages – so it can group the pages as being similar.
TF-IDF helps Google avoid serving junk pages whose content is irrelevant to the topic being searched.
Term Frequency does away with keyword stuffing. Because TF is based on a logarithmic value, you cannot simply stuff your page with the keyword and expect it to perform better in rankings.
TF uses a logarithmic scale, so each additional use of a keyword does not increase its prominence or TF value by a fixed amount. The incremental effect depends on other factors.
For example, a 1000-word page on which a keyword appears 10 times (1% keyword density) has a TF (term frequency) value of 4.32/9.97 = 0.43 (if log base 2 is used).
Now, if you double the frequency of the keyword to 20 (2% keyword density), its TF is not affected much: it becomes 5.32/9.97 = 0.53.
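The numbers in this example can be reproduced with a few lines of Python. This is only a sketch: the post does not spell out the exact TF formula, so I am assuming TF = (1 + log2(occurrences)) / log2(total words), which is the variant that gives 4.32/9.97 and 5.32/9.97. The function name log_tf is made up for illustration.

```python
import math

def log_tf(count: int, total_words: int) -> float:
    """Log-scaled TF, assuming (1 + log2(count)) / log2(total_words)."""
    if count <= 0 or total_words <= 1:
        return 0.0
    return (1 + math.log2(count)) / math.log2(total_words)

# Each doubling of the keyword count on a 1000-word page adds the same
# small step to TF instead of doubling it.
for count in (10, 20, 40):
    print(count, round(log_tf(count, 1000), 2))
# 10 0.43
# 20 0.53
# 40 0.63
```

Going from 10 to 40 uses of the keyword quadruples the keyword density but lifts TF only from 0.43 to 0.63, which is why stuffing pays off so poorly.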
Term Frequency basically tells you how often, or how rarely, you are using a keyword. In isolation it is not powerful enough to be of use to search algorithms.
It only becomes valuable to search engines when it is combined with the IDF concept.
Inverse Document Frequency
IDF helps compute the real value of a specific keyword by measuring the ratio of the total number of documents in a set to the number of documents that actually contain the keyword.
It basically allows a search engine to compute what a page is about using semantic analysis through TF-IDF formulas.
If the keyword is a common word (one that occurs in a large number of documents), its IDF value will be very small, and multiplying it by its TF gives a small TF-IDF score. On the other hand, if the term is rare and found in only a few documents, its IDF value will be much larger, resulting in a larger TF-IDF score.
This makes TF-IDF a more advanced and sophisticated formula that reflects the true importance of a given keyword to a given page. It reduces the weight of unimportant words and phrases (stop words, etc.), while rare, meaningful terms are scaled up in prominence.
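To see the effect on a toy example, here is a rough Python sketch that scores a common word and a rare word on the same page. It is not Google's actual implementation, which is not public. I am assuming IDF = log2(total documents / documents containing the term) and the log-scaled TF variant (1 + log2(count)) / log2(total words); the four tiny "pages" and the function name tf_idf are made up for illustration.

```python
import math

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF of a term for one document, under the assumptions above."""
    count = doc.count(term)
    doc_freq = sum(1 for d in corpus if term in d)
    if count == 0 or doc_freq == 0 or len(doc) < 2:
        return 0.0
    tf = (1 + math.log2(count)) / math.log2(len(doc))
    idf = math.log2(len(corpus) / doc_freq)
    return tf * idf

# Hypothetical corpus of four very short pages, already split into words.
corpus = [
    "the espresso machine makes the best espresso".split(),
    "the weather is nice in the morning".split(),
    "the train leaves in the evening".split(),
    "the cat sleeps on the sofa".split(),
]
page = corpus[0]

print(tf_idf("the", page, corpus))       # in every page, so IDF = log2(4/4) = 0
print(tf_idf("espresso", page, corpus))  # in one page only, so IDF = log2(4/1) = 2
```

Both words appear twice on the first page, so their TF is identical. The stop word "the" still scores zero because it appears on every page, while "espresso" gets a high score because it is rare across the set.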
How To Use TF-IDF to Optimize for SEO
It certainly is possible to optimize the TF-IDF of each page's keywords and skew the data in your favour, so that your page passes stronger on-page ranking signals than other pages about the same topic (assuming all other ranking factors are constant).
The aim is to optimize the content on your page with topically relevant keywords – so your page becomes more topically relevant and ranks higher and faster.
Here is a snippet from Google Webmasters Hangout where John Mueller addresses TF-IDF
John Mueller does say that it's a metric that has been around for a while, that you should not focus on it to artificially doctor your on-page keywords, and that you should focus on other SEO factors instead. In my view, this is just another smoke-and-mirrors attempt to steer people away from a method that does work scientifically, because their algorithm uses it.
My take is that as long as you do this right and understand the data while optimizing your keywords – this is one metric that is highly effective and will boost your rankings.
Tools that can help you with TF-IDF for SEO
The two tools that I use in my on-page SEO to help with TF-IDF values are ntopic.org and Website Auditor.
I like ntopic a lot, and I have not used Website Auditor much for this – although I have read it can help achieve the results.
Ntopic costs around $10 for 100 queries, while Website Auditor comes at a heavier price point – but also does a ton of other SEO related functions. I was lucky enough to get a lifetime copy of Website Auditor.
Both these tools will give you suggestions of new keywords based on Google’s analysis of larger datasets and the top ranking pages in the SERPs which are semantically related to the keyword you are optimizing for.
All you need to do is go through each keyword, insert it in context on your page, and run the TF-IDF analysis again to see how much your optimization helped. Simply rinse and repeat so that you gradually insert all the missing keywords into the page, each with a TF-IDF value that is as close as possible to the suggested value.
In the near future, I will make a video walkthrough that shows how to go about using TF-IDF to optimize your content, so you can get a better understanding of the process involved and the tools. Make sure to sign up for updates below if you want to be alerted when this is posted!