TF-IDF is a method for creating a vector to represent a document. The two parts of the name stand for term frequency and inverse document frequency. It's a relatively simple concept and the math behind it isn't too wild.

Let's say you have a collection of documents D and you want to make targeted connections between those documents for searching or graphing or whatever. The documents could be the collected works of Shakespeare or the last 90d of log data or a hard drive full of spicy memes, it doesn't really matter.

TF-IDF helps with the search by giving a high weight to terms whose frequency within a document is high but whose document frequency across the collection is low.

It's normal in practice to start by cleaning the data. In most situations, this means filtering out common terms (called stop words in natural language processing) which don't contribute to the meaning of a document, like the, an, etc.
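For illustration, here's a rough sketch of that cleaning step in Python; the STOP_WORDS set and the helper names are just stand-ins, not a canonical list or API:

```python
# Tiny illustrative stop word list -- a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def tokenize(document: str) -> list[str]:
    # Lowercase and split on whitespace; real tokenizers do much more.
    return document.lower().split()

def clean(document: str) -> list[str]:
    # Drop stop words so they don't dominate the term counts.
    return [term for term in tokenize(document) if term not in STOP_WORDS]

print(clean("The quality of mercy is not strained"))
# ['quality', 'mercy', 'is', 'not', 'strained']
```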

Term Frequency

In the first part, term frequency, you would go through each document term-by-term (words, key/value pairs, pixel position and colour values). As you work your way through the documents, you add each term to a numbered index and make note of the number of times it appears in each document. By the end you'll have two results:

  1. A vocabulary, the index of terms, which looks like {(1,romeo),(2,juliet),...}, and
  2. A set of vectors, one per document, which looks like {(12,17,3,0,...),...} where each value is the number of times the corresponding term in the index showed up in that document.

These can be defined mathematically, where F is the total number of terms in the index.

$$
\begin{aligned}
\text{Vocabulary} \quad & E(t) = \begin{cases} 1 & \text{if } t = \text{romeo} \\ 2 & \text{if } t = \text{juliet} \\ & \vdots \end{cases} \\
\text{Counting function} \quad & \mathrm{fr}(x,t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{else} \end{cases} \\
& \mathrm{tf}(t,d) = \sum_{x \in d} \mathrm{fr}(x,t) \\
\text{Document vector} \quad & \vec{v}_d = \big(\mathrm{tf}(t_i, d) : i = 1 \ldots F\big)
\end{aligned}
$$
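To make that counting pass concrete, here's a minimal sketch in Python; build_vocabulary and term_frequency_vector are hypothetical helpers, not a standard library:

```python
from collections import Counter

def build_vocabulary(documents: list[list[str]]) -> dict[str, int]:
    # Assign each unique term an index position, in order of first appearance.
    vocabulary: dict[str, int] = {}
    for doc in documents:
        for term in doc:
            if term not in vocabulary:
                vocabulary[term] = len(vocabulary)
    return vocabulary

def term_frequency_vector(doc: list[str], vocabulary: dict[str, int]) -> list[int]:
    # One slot per vocabulary term, holding how often it appears in this document.
    counts = Counter(doc)
    return [counts.get(term, 0) for term in vocabulary]

docs = [["romeo", "juliet", "romeo"], ["juliet", "nurse"]]
vocab = build_vocabulary(docs)
tf_matrix = [term_frequency_vector(d, vocab) for d in docs]
print(vocab)      # {'romeo': 0, 'juliet': 1, 'nurse': 2}
print(tf_matrix)  # [[2, 1, 0], [0, 1, 1]]
```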

The next part, inverse document frequency, is a little more complicated, but not by much.

Inverse Document Frequency

So we have a list of terms and their frequency in each document. Does that tell us much though? If a word appears 100 times across all documents and another word appears 10 times, does that mean the first word is 10x more important than the second? Likely not. To mitigate this, the IDF values will be used to weight term frequencies on a logarithmic scale, scaling up the importance, or weight, of rare terms while scaling down the importance of common terms.

The IDF value for each term is the log of the total number of documents divided by the number of documents where that term appears. To avoid dividing by zero, we can simply add 1 to the divisor rather than filtering the 0s out of every vector. This won't impact our results since, for a term that appears in no documents, a) we'd just be dividing by 1, b) the normalization process will flatten those values out, and c) we'll be multiplying the IDF value by the term frequency, which is zero.

$$
\mathrm{idf}(t) = \log\left(\frac{|D|}{1 + |\{d \in D : t \in d\}|}\right)
$$
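In code, the formula above maps to something like this sketch (inverse_document_frequency is a made-up helper, shaped to work with the toy count matrix from the earlier snippet):

```python
import math

def inverse_document_frequency(tf_matrix: list[list[int]], num_terms: int) -> list[float]:
    # idf(t) = log( |D| / (1 + number of documents containing t) )
    num_docs = len(tf_matrix)
    idf = []
    for i in range(num_terms):
        docs_with_term = sum(1 for row in tf_matrix if row[i] > 0)
        idf.append(math.log(num_docs / (1 + docs_with_term)))
    return idf

print(inverse_document_frequency([[2, 1, 0], [0, 1, 1]], num_terms=3))
# approximately [0.0, -0.41, 0.0] -- the +1 in the divisor can nudge
# terms that appear in every document slightly below zero
```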

The result here is that terms which appear in only a few documents will end up with a higher IDF value than terms which appear in many documents.

Combining TF and IDF:

$$
\text{tf-idf}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t)
$$

This completes the weighting formula and illustrates how TF-IDF isolates highly relevant connections. The combined TF-IDF value will only be large if the term appears frequently in a document and infrequently across all documents.
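As a quick worked example with toy numbers (both the counts and the weights here are placeholders):

```python
# Term counts for one document and the matching IDF weights.
tf_row = [12, 17, 3, 0]
idf = [0.57, 0.89, 0.14, 0.42]

# tf-idf(t, d) = tf(t, d) * idf(t), element by element.
tf_idf_row = [tf * weight for tf, weight in zip(tf_row, idf)]
print(tf_idf_row)  # approximately [6.84, 15.13, 0.42, 0.0]
```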

Normalization

So now we have the TF-IDF values, but we still need to normalize them to correct for their wide range. To do this, we'll calculate the unit vector using Lebesgue spaces ($L^p$):

$$
\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}
$$

Lebesgue spaces have a ton of applications across different fields, from probability to statistics, across finite and infinite dimensions, but anyone who remembers trigonometry will have encountered them too.

The gist is that they're used to normalize distances in vector spaces. Let's say you draw two points on a sheet of graph paper, one at (6,2) and another at (2,5). Then you draw a line straight down from (2,5) toward the x-axis and another line from (6,2) to the left toward the y-axis.

The two lines would intersect at (2,2). The lengths of the lines between each point and the intersection would be 4 and 3 along the x and y axes respectively, and the lines would meet at a 90 degree angle. Now draw a line directly between the two points. Starting to look familiar?

If you dig into your deep past to pull out those middle school math lessons, you'll probably find this formula: $a^2 + b^2 = c^2$, or $c = \sqrt{a^2 + b^2}$. This is the Euclidean distance between two points. It's also the L2-norm ($L^2$). The p in $L^p$ indicates that any real value $p \geq 1$ can be used in this kind of normalization, so this can be generalized to $\|u\|_p = \left(|u_1|^p + |u_2|^p + \ldots + |u_n|^p\right)^{\frac{1}{p}} = \left(\sum_{i=1}^{n} |u_i|^p\right)^{\frac{1}{p}}$ where $u$ is a vector in a normed vector space.

The L1-norm would be the sum of the absolute values of the vector's components. Going back to the graph paper, it would be the sum of the lengths of each line if you zigzagged along the paper's grid instead of drawing a straight line.

We don't really need to worry about p > 2 here though. While our vectors have far more than 2 coordinates, the L2-norm works on vectors with an arbitrary number of dimensions.
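For the curious, the generalized formula translates almost directly into code; lp_norm here is just an illustrative helper:

```python
def lp_norm(u: list[float], p: float = 2.0) -> float:
    # (|u1|^p + |u2|^p + ... + |un|^p)^(1/p)
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

print(lp_norm([4.0, 3.0], p=2))  # 5.0 -- the Euclidean distance from the graph paper example
print(lp_norm([4.0, 3.0], p=1))  # 7.0 -- the zigzag along the grid
```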

To simplify things, let's run through this process with a single vector rather than the full matrix first.

$$
\begin{aligned}
p &= 2 \\
\vec{v}_d &= (12, 17, 3, 0) \\
\hat{v}_d &= \frac{\vec{v}_d}{\|\vec{v}_d\|_p} \\
&= \frac{(12, 17, 3, 0)}{\left(12^2 + 17^2 + 3^2 + 0^2\right)^{\frac{1}{2}}} \\
&= \frac{(12, 17, 3, 0)}{\sqrt{12^2 + 17^2 + 3^2 + 0^2}} \\
&= \frac{(12, 17, 3, 0)}{\sqrt{442}} \\
&= (0.57, 0.81, 0.14, 0)
\end{aligned}
$$

Since this is a normalization process, we can feed the result back into the norm and expect to get 1.0 out the other side:

$$
\begin{aligned}
p &= 2 \\
\hat{v}_d &= (0.57, 0.81, 0.14, 0) \\
\|\hat{v}_d\|_p &= \sqrt{0.57^2 + 0.81^2 + 0.14^2 + 0^2} \\
&\approx 1.0
\end{aligned}
$$
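Here's the same walkthrough in code, using numpy for brevity (my choice, not a requirement):

```python
import numpy as np

v = np.array([12.0, 17.0, 3.0, 0.0])

# Divide the vector by its L2 norm to get the unit vector.
v_hat = v / np.linalg.norm(v, ord=2)
print(np.round(v_hat, 2))            # [0.57 0.81 0.14 0.  ]

# Sanity check: the norm of the normalized vector comes out to 1.0.
print(np.linalg.norm(v_hat, ord=2))  # 1.0 (up to floating point)
```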

Putting it all together

Now that we have a theoretical foundation in place, we can move on to applying these concepts to matrices.

We start with our separate TF and IDF data: $M_{tf}$ is a matrix sized $|D| \times F$, and $v_{idf}$ is a vector of length $F$.

$$
M_{tf} = \begin{bmatrix} \mathrm{tf}(t_i, d_1) : i = 1 \ldots F \\ \mathrm{tf}(t_i, d_2) : i = 1 \ldots F \\ \vdots \\ \mathrm{tf}(t_i, d_{|D|}) : i = 1 \ldots F \end{bmatrix} = \begin{bmatrix} 12 & 17 & 3 & 0 & \ldots \\ 8 & 3 & 21 & 18 & \ldots \\ \vdots & & & & \vdots \\ 9 & 2 & 7 & 11 & \ldots \end{bmatrix}
$$

$$
v_{idf} = \big(\mathrm{idf}(t_i) : i = 1 \ldots F\big) = (0.57, 0.89, 0.14, 0, \ldots, 0.42)
$$

In order to apply the $v_{idf}$ weights to $M_{tf}$, we'll have to transform $v_{idf}$ into a square diagonal matrix with both dimensions equal to $F$ and then multiply the two matrices:

$$
M_{idf} = \begin{bmatrix} 0.57 & 0 & 0 & \ldots & 0 \\ 0 & 0.89 & 0 & \ldots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & 0 & \ldots & 0.42 \end{bmatrix}
$$

$$
M_{\text{tf-idf}} = M_{tf} \times M_{idf} = \begin{bmatrix} 12 & 17 & 3 & 0 & \ldots \\ 8 & 3 & 21 & 18 & \ldots \\ \vdots & & & & \vdots \\ 9 & 2 & 7 & 11 & \ldots \end{bmatrix} \times \begin{bmatrix} 0.57 & 0 & 0 & \ldots & 0 \\ 0 & 0.89 & 0 & \ldots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & 0 & \ldots & 0.42 \end{bmatrix} = \big[\ldots\text{let's just pretend I did the math}\ldots\big]
$$
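Here's a sketch of that step with numpy; the numbers are placeholders shaped like the toy matrices above, and in practice you'd more likely broadcast the weights than build a full diagonal matrix:

```python
import numpy as np

# Toy |D| x F term frequency matrix and length-F idf vector.
M_tf = np.array([[12, 17,  3,  0],
                 [ 8,  3, 21, 18],
                 [ 9,  2,  7, 11]], dtype=float)
v_idf = np.array([0.57, 0.89, 0.14, 0.42])

# Turn the idf vector into an F x F diagonal matrix, then multiply.
M_idf = np.diag(v_idf)
M_tf_idf = M_tf @ M_idf

# Equivalent (and cheaper): broadcast the weights across each row.
assert np.allclose(M_tf_idf, M_tf * v_idf)
```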

Then finally we apply the L2 norm to the matrix, row by row.

$$
M_{\text{tf-idf}} = \frac{M_{\text{tf-idf}}}{\|M_{\text{tf-idf}}\|_2}
$$

As before, the result can be verified by taking the L2 norm of each row and they'll all come out to 1.0.
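A sketch of the row-wise normalization and that verification, carrying over the placeholder numbers from the multiplication above:

```python
import numpy as np

M_tf_idf = np.array([[6.84, 15.13, 0.42, 0.00],
                     [4.56,  2.67, 2.94, 7.56],
                     [5.13,  1.78, 0.98, 4.62]])

# Divide each row by its own L2 norm (keepdims preserves the row shape).
row_norms = np.linalg.norm(M_tf_idf, ord=2, axis=1, keepdims=True)
M_normalized = M_tf_idf / row_norms

# Every row of the result has an L2 norm of 1.0.
print(np.linalg.norm(M_normalized, ord=2, axis=1))  # [1. 1. 1.]
```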

Cosine Similarity

So now we have a matrix with a vector for each document containing weighted scores for each term in the vocabulary. That on its own is useful for ranking documents by individual terms, keyword extraction, clustering, and anomaly detection. Cosine similarity adds another layer of usefulness on top of that though by enabling us to search and compare documents effectively.

The process involves taking the dot product of two vectors and dividing by the product of their magnitudes, which gives the cosine of the angle between them. The vectors could come from two documents or from an external source (like a search query) and a document.

Dot products are fairly straightforward conceptually. They're the sum of the products of corresponding elements.

$$
\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n
$$
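In code it's nearly a one-liner; plain Python here to keep the arithmetic visible:

```python
def dot(a: list[float], b: list[float]) -> float:
    # Sum of the products of corresponding elements.
    return sum(x * y for x, y in zip(a, b))

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
print(dot([1.0, 0.0], [0.0, 1.0]))            # 0.0 -- orthogonal vectors
```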

An interesting property appears when the dot product is 0. This happens when two vectors are orthogonal to one another. This is easy to visualize in 2-dimensional space: two vectors meeting at a 90 degree angle have no projection onto each other, so no triangle forms between them. The neat thing is that this also holds true in higher dimensional spaces.

The closer two vectors are to orthogonal, the less similar they are. That is, the angle formed between two vectors is a measure of how closely related they are. This is useful for us since two documents where related terms are used but in very different frequencies will still have a small angle, making it easy to identify them as being related.

Meanwhile, two documents which share a few of the same highly weighted words but not much else will have an angle closer to orthogonal. Basic keyword matching would have marked them as similar, but cosine similarity is able to see past that.

The formula for cosine similarity is straightforward as well, this is trigonometry after all. As a refresher:

$$
\cos\theta = \frac{\text{adjacent}}{\text{hypotenuse}}
$$

Here, the adjacent side corresponds to the dot product of the two vectors, and the hypotenuse corresponds to the product of their magnitudes:

$$
\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
$$

The value $\cos\theta$ is the document's score and can be used to compare it against other documents.

$$
\cos\theta_a = \frac{\vec{x} \cdot \vec{a}}{\|\vec{x}\| \, \|\vec{a}\|} \qquad \cos\theta_b = \frac{\vec{x} \cdot \vec{b}}{\|\vec{x}\| \, \|\vec{b}\|}
$$

If $\cos\theta_a > \cos\theta_b$ then $a$ is more similar to $x$ than $b$ is.
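Pulling the scoring together in a small sketch; the function and the vectors are illustrative, with x standing in for a query expressed in the same term space as the documents:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

x = np.array([1.0, 2.0, 0.0, 1.0])  # the query vector
a = np.array([2.0, 4.0, 0.0, 2.0])  # same direction as x, very different magnitudes
b = np.array([0.0, 0.0, 5.0, 0.0])  # shares nothing with x

print(cosine_similarity(x, a))  # ~1.0 -- about as similar as it gets
print(cosine_similarity(x, b))  # 0.0 -- orthogonal, nothing in common
```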

While we're using it in the context of text analysis here, it can be used for any data which can be expressed as matrices, e.g. images.