TF-IDF is a method for creating a vector to represent a document. The two parts of the name stand for term frequency and inverse document frequency. It's a relatively simple concept and the math behind it isn't too wild.
Let's say you have a collection of documents and you want to search through them for a particular term.
TF-IDF helps with the search by giving weight to documents where the term frequency within a document is high and the document frequency for a given term is low.
It's normal in practice to start by cleaning the data. In most situations, this means filtering out common terms (called stop words in natural language processing) which don't contribute to the meaning of a document, like the, an, etc.
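As a rough sketch of what that cleaning might look like (the stop word list and `tokenize` helper here are made up for illustration, not any standard implementation):

```python
# A minimal sketch of stop-word filtering. The stop-word list is a tiny,
# made-up sample, not a complete or standard one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return [w for w in words if w and w not in STOP_WORDS]

print(tokenize("The cat sat on the mat."))  # ['cat', 'sat', 'on', 'mat']
```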
Term Frequency
In the first part, term frequency, you would go through each document term-by-term (words, key/value pairs, pixel position and colour values). As you work your way through the documents, you add each term to a numbered index and make note of the number of times it appears in each document. By the end you'll have two results:
- A vocabulary, the index of terms, mapping each unique term to a position, and
- A set of vectors, one for each document, where each value is the number of times the corresponding term from the vocabulary showed up in that document.

These can be defined mathematically as $\mathrm{tf}(t, d) = f_{t,d}$, the raw count of term $t$ in document $d$.
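Here's a rough sketch of that counting step with a few made-up documents (skipping the stop word cleaning to keep it short):

```python
from collections import Counter

# Build a vocabulary (term -> column index) and a raw count vector per document.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

vocabulary = {}       # term -> column index
counts_per_doc = []   # one Counter per document

for doc in documents:
    counts = Counter(doc.split())
    counts_per_doc.append(counts)
    for term in counts:
        vocabulary.setdefault(term, len(vocabulary))

# Turn the counts into fixed-length vectors, one per document.
tf_vectors = [[counts.get(term, 0) for term in vocabulary] for counts in counts_per_doc]

print(vocabulary)
print(tf_vectors)
```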
The next part, inverse document frequency, is a little more complicated, but not by much.
Inverse Document Frequency
So we have a list of terms and their frequency in each document. Does that tell us much though? If a word appears 100 times across all documents and another word appears 10 times, does that mean the first word is 10x more important than the second? Likely not. To mitigate this, the IDF values will be used to weight the term frequencies on a logarithmic scale, scaling up the importance, or weight, of rare terms while scaling down the importance of common terms.
The IDF value for each term is the logarithm of the total number of documents divided by the number of documents that contain the term:

$$\mathrm{idf}(t) = \ln\left(\frac{N}{n_t}\right)$$

where $N$ is the total number of documents and $n_t$ is the number of documents in which term $t$ appears.
The result here is that terms which only appear in low numbers of documents will end up with a higher IDF value than terms which appear in many documents.
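A sketch of that calculation, using the unsmoothed $\ln(N / n_t)$ form (library implementations often add smoothing terms, which I'm skipping here). The count vectors are the made-up ones from the sketch above:

```python
import math

# Raw term counts from the earlier sketch: rows are documents, columns are terms.
tf_vectors = [
    [2, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [2, 0, 1, 1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
]

n_docs = len(tf_vectors)
n_terms = len(tf_vectors[0])

# For each term, count how many documents it appears in at least once.
doc_freq = [sum(1 for doc in tf_vectors if doc[t] > 0) for t in range(n_terms)]

idf = [math.log(n_docs / df) for df in doc_freq]
print([round(v, 3) for v in idf])
# Terms in 2 of 3 documents get ~0.405; terms in only 1 document get ~1.099.
```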
Combining TF and IDF:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$
This completes the weighting formula and illustrates how TF-IDF isolates highly relevant connections. The combined TF-IDF value will only be large if the term appears frequently in a document and infrequently across all documents.
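For instance, with made-up numbers: a term that appears 3 times in a document but shows up in only 2 of 10 documents scores $3 \times \ln(10/2) \approx 4.83$, while a term that also appears 3 times but shows up in 9 of 10 documents scores only $3 \times \ln(10/9) \approx 0.32$.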
Normalization
So now we have the TF-IDF values, but we still need to normalize them to correct for their wide range. To do this, we'll calculate the unit vector for each document using Lebesgue spaces ($L^p$ spaces).
Lebesgue spaces have a ton of applications across different fields, from probability to statistics, across finite and infinite dimensions, but anyone who remembers trigonometry will have encountered them too.
The gist is that they're used to measure (and normalize) distances in vector spaces. Let's say you draw two points on a sheet of graph paper, one at the origin and one somewhere up and to the right, then draw a horizontal line out from the first point and a vertical line down from the second.

The two lines would intersect at a right angle, forming a right triangle whose hypotenuse is the straight-line distance between the two points.

If you dig into your deep past to pull out those middle school math lessons, you'll probably find this formula:

$$c = \sqrt{a^2 + b^2}$$

That's the Pythagorean theorem, and it's also the L2 norm of a two-dimensional vector: the straight-line length from the origin to the point $(a, b)$.
The L1 norm would be the sum of the absolute values of the components of the vector. Going back to the graph paper, it would be the total length covered if you zigzagged along the paper's grid instead of drawing a straight line.
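With a made-up 2D vector, the two norms look like this:

```python
import math

v = [3.0, 4.0]

l1 = sum(abs(x) for x in v)             # "zigzag" distance along the grid
l2 = math.sqrt(sum(x * x for x in v))   # straight-line (Euclidean) distance

print(l1)  # 7.0
print(l2)  # 5.0
```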
We don't really need to worry about the higher-order norms here; the L2 norm is the one we'll be using to normalize our TF-IDF vectors.
To simplify things, let's run through this process with a single vector rather than the full matrix first.
Since this is a normalization process, we can feed the result back into a similar equation and expect to get 1 back out; the L2 norm of a unit vector is always 1.
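A sketch of that single-vector run, with a made-up TF-IDF vector:

```python
import math

v = [0.81, 0.0, 2.2, 1.1]  # made-up TF-IDF values for one document

l2 = math.sqrt(sum(x * x for x in v))
unit = [x / l2 for x in v]

# Feeding the result back in gives a length of 1, as expected.
print(math.sqrt(sum(x * x for x in unit)))  # 1.0 (within floating-point error)
```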
Putting it all together
Now that we have a theoretical foundation in place, we can move on to applying these concepts to matrices.
We start with our separate TF and IDF data.
In order to apply the IDF weights, we multiply each column of the TF matrix by the corresponding term's IDF value.
Then finally we apply the L2 norm to the matrix, row by row.
As before, the result can be verified by taking the L2 norm of each row and they'll all come out to 1.0.
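Here's a minimal end-to-end sketch with numpy, using a made-up count matrix and the same unsmoothed $\ln(N / n_t)$ IDF as above (real implementations often differ in smoothing and other details):

```python
import numpy as np

# Made-up term-count matrix: rows are documents, columns are vocabulary terms.
tf = np.array([
    [2, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [2, 0, 1, 1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
], dtype=float)

n_docs = tf.shape[0]
doc_freq = (tf > 0).sum(axis=0)      # number of documents containing each term
idf = np.log(n_docs / doc_freq)

# Weight each column by its IDF value, then L2-normalize each row.
tfidf = tf * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

print(np.linalg.norm(tfidf, axis=1))  # [1. 1. 1.]
```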
Cosine Similarity
So now we have a matrix with a vector for each document containing weighted scores for each term in the vocabulary. That on its own is useful for ranking documents by individual terms, keyword extraction, clustering, and anomaly detection. Cosine similarity adds another layer of usefulness on top of that though by enabling us to search and compare documents effectively.
The process involves taking the dot product of two vectors and dividing it by the product of their magnitudes, which gives the cosine of the angle between them. The vectors could come from two documents, or from an external source (like a search query) and a document.
Dot products are fairly straightforward conceptually: they're the sum of the products of corresponding elements of the two vectors.
An interesting property appears when the dot product is 0, which happens when two vectors are orthogonal to one another. This is easy to visualize in 2-dimensional space: two vectors meeting at a 90 degree angle don't project onto each other at all, so no triangle forms between them. The neat thing is that this also holds true in higher dimensional spaces.
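A quick sketch with made-up vectors:

```python
# The dot product is the sum of the products of corresponding elements.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(dot([1, 2, 3], [4, 5, 6]))         # 1*4 + 2*5 + 3*6 = 32

# Orthogonal vectors have a dot product of 0, in 2D and beyond.
print(dot([1, 0], [0, 1]))               # 0
print(dot([1, 2, 0, 0], [0, 0, 3, 4]))   # 0
```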
The closer two vectors are to orthogonal, the less similar they are. That is, the angle formed between two vectors is a measure of how closely related they are. This is useful for us since two documents where related terms are used but in very different frequencies will still have a small angle, making it easy to identify them as being related.
Two documents which share a few of the same highly weighted words but not much else, on the other hand, will have an angle much closer to orthogonal. Basic keyword matching would have marked them as similar, but cosine similarity is able to see past that.
The formula for cosine similarity is straightforward as well; this is trigonometry, after all. As a refresher:

$$\cos\theta = \frac{\text{adjacent}}{\text{hypotenuse}}$$

In vector terms, the ratio works out to the dot product of the two vectors over the product of their magnitudes:

$$\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$$

The angle $\theta$ itself isn't something we usually need; the cosine value is used directly as the similarity score.

If the value is close to 1, the two documents are very similar; if it's close to 0, the vectors are nearly orthogonal and the documents have little in common.
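In code, a minimal sketch might look like this (the vectors are made up; in practice they'd be rows of the TF-IDF matrix, or a query vector built against the same vocabulary):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up TF-IDF rows: doc_a and doc_b share weighted terms, doc_c doesn't.
doc_a = [0.0, 0.7, 0.3, 0.0]
doc_b = [0.0, 0.5, 0.5, 0.1]
doc_c = [0.9, 0.0, 0.0, 0.4]

print(cosine_similarity(doc_a, doc_b))  # ~0.92, very similar
print(cosine_similarity(doc_a, doc_c))  # 0.0, orthogonal, unrelated
```

Note that if the vectors have already been L2-normalized (as the rows of our TF-IDF matrix were), the denominator is 1 and cosine similarity reduces to a plain dot product.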
While we're using it in the context of text analysis here, it can be used for any data which can be expressed as matrices, e.g. images.