Service Symphony

ML Clustering – Primer

Clustering is an ML discipline of finding similarity between a large amount of recorded observations. Clustering can be either supervised, were there are known mappings from recorded observations to known labels, in which case it is essentially multi-label classifications, or it can be unsupervised where large amount of recorded observations are arbitrarily grouped into clusters based on similarity between them.

One of the popular applications of clustering is in document classification and retrieval. An example would be retrieving articles you may be interested in based on articles you already like. This relies on ascertaining similarity between articles you have liked and articles present in a large corpus. One of the key challenges of clustering is about establishing similarity between articles.

Document Similarity

One of the most simplistic means of representing documents to ascertain their similarity is the “bag of words” model, where the word order in a documented is ignored and the document is basically transformed to a map of each word in that document and the number of times that word appears in that document. This can be a simple indexed vector, where each index represents a given word in the document vocabulary and contains the count of occurrences for the indexed word. To represent the similarity between two documents we compute the matrix dot product of the two word count vectors, by performing element-wise product of the two vectors, and summing them up. Higher the dot-product, higher the similarity is between the two documents.

Normalised Word Count Vectors

One of the drawbacks of using the dot-product of simple word count vectors to ascertain similarity is that it biases towards long documents. Let us say, if we duplicate the contents of two documents we are comparing, and compute the dot product, count of each word count in each document is doubles and hence the dot-product is quadrupled. One way to circumvent this issue is to normalise the word count vector. The from of a vector is by dividing each element the the square root of the sum of squares of each element in the vector. This puts every document that is being compared on a normalised scale regardless of their length.

TF-IDF Term Frequency Inverse Document Frequency

A normalised word count vector is still rather simplistic, that it doesn’t distinguish between commonly occurring words and words that are deemed more important within a corpus. Common words are the ones that appear frequently across all the documents in a corpus. So if we give the same weighting for all the words, albeit normalised, they marginalise rare words that appear only in certain documents within a corpus that signify the higher similarity between those documents. One way to address this is de-emphasising the weight of each word based on how many documents that word appears in the entire corpus. One key dimension here is to ascertain the relevance of a word within a given document, by determining how common that word is locally to the document to how rare it is globally within the corpus.

One algorithm to address this is TF-IDF. Term frequency if the simple word-count vector within a document. Now each entry is down weighted by calculating the inverse document frequency. Inverse document frequency of a word is log(number of documents / (1 + documents using the word)). This means for words that appear in a large number of documents, the IDF will be close to log(1), which is zero. Hence, we significantly down weight words that appear in majority of the documents in a corpus. With TF IDF, the vector is formed by multiplying the corresponding elements of the TF and IDF vectors. By doing this words which are common locally and rare globally are up-weighted, whereas words that are common globally down-weighted.

Document Retrieval – Nearest Neighbour

Now we have looked at means to represent a document to ascertain its similarity, let us have a look at how this can be used for searching similar documents in a corpus. Nearest neighbour search is a mechanism for retrieving similar articles to a given article. This is based on a distance metric. In 1 – nearest neighbour algorithm, every document in the corpus is searched by calculating the dot product between the TF-IDF vector of the query article and every other article, and the article with the highest dot product is returned. A variant of this is the k – nearest neighbour search, where the k most related articles are returned.

Clustering and Unsupervised Learning

As mentioned earlier clustering can be a supervised learning exercise, where a document is assigned to one of a known category, where it becomes a multi-class classification. However, clustering can also be unsupervised, where a large number of known observations can be grouped based on similarity. This involved identifying clusters of related information. It is always possible to retrospectively label the identified clusters of information. A cluster is defined by its centre and its shape. Assigning an observation to a cluster depends on how close it is to a cluster centre, which denotes how similar the observation is to the other observations in the same cluster.

K-Means Clustering Algorithm

This algorithm assumes K number of clusters and start with an arbitrary K cluster centres from the recorded observations. Then every observation is assigned to the cluster centre, based on the similarity metric that is used. The process is repeated by adjusting the cluster centre as a mean of all observations in the assigned cluster. This whole process is iterated until convergence, where there are no further movement of observations within the cluster.

Putting Theory into Practice

In this section we will have a look at putting the theory we have covered in this post into practice. We will look at ascertaining similarity between Wikipedia documents and also look at K-Means clustering using Apache Spark. All the code can be downloaded from the GitHub repository here.

Document Similarity

In this example, we look at documents from Wikipedia, compute their TF-IDF and compute the dot product of the TF-IDF vectors of two documents that are similar and two documents that are dissimilar. As an example, for similar documents we will look at the US presidents Barack Obama and Bill Clinton, and as for dissimilar documents we will look at Barack Obama and the English footballer, David Beckham.

The Wikipedia documents are stored as a three field CSV file, indicating the URI, name and the actual document content. First we load the CSV into a data frame as shown below,

val spark = SparkSession.
appName("Clustering Basics").
val df =
option("header", "false").

Next we initialise the transformers to tokenise the document content, create the term frequency vector and then the inverse document frequency vector.

val tk = new Tokenizer().
val tf = new HashingTF().
val idf = new IDF().

Then we transform the data frame to create the TF-IDF vectors. Note that generating IDF, will require two passes through the document.

val terms = tf.transform(tk.transform(df))
val idfs =

We define a function, that returns the TF-IDF vector for a document, given the URI and use that to ascertain the TF-IDF vector for Obama, Clinton and Beckham.

def getTfIdf(uri: String, ds: DataFrame) = {
val r = ds.filter(s"_c0 = '$uri'").take(1)
val obamaTfIdf = getTfIdf(
"<>", idfs)
val clintonTfIdf = getTfIdf(
"<>", idfs)
val beckhamfIdf = getTfIdf(
"<>", idfs)

Next, we define a function to calculate the dot product of the two TF-IDF vectors as a similarity Metric.

def dotProduct(v1: Vector, v2: Vector) = {
var dp = 0.0
var index = v1.size - 1
for (i <- 0 to index) {
dp += v1(i) * v2(i)

We use the above method to print the metric between Obama and Clinton as well as Obama and Beckham.

println("Similarity metric between Obama and Clinton is " 
+ dotProduct(obamaTfIdf, clintonTfIdf))
println("Similarity metric between Obama and Beckham is " 
+ dotProduct(obamaTfIdf, beckhamfIdf))

This produces the following output, as you can see the similarity between Obama and Clinton are really high and that between Obama and Beckham are very low.

Similarity metric between Obama and Clinton is 1660.4290319305376
Similarity metric between Obama and Beckham is 190.16986123178032

Nearest Neighbour Search

Spark doesn’t provide a nearest neighbour search algorithm at the moment. However, there is work in progress for a 2.2 release. The code below does a brute force search by creating a dot product for the IDF vectors for the query object and the rest of the data set and sorting by the dot product. However, this is extremely inefficient

val x = => (r.getString(1), 
dotProduct(obamaTfIdf, r.getAs[Vector]("tf-idf"))))
import org.apache.spark.sql.functions._

This will print the following, with unsurprisingly, with Obama’s vice president, Joe Biden, and secretary of state, Hillary Clinton at the top of the list.

|                  _1|                _2|
|           Joe Biden|3112.9982598703614|
|Hillary Rodham Cl...| 2797.205236948505|
|          Lauro Baja|  2463.53022658448|
|       Chris Redfern|2360.4456928199847|
|           Abe Gupta| 2298.764992961675|
|        Ruud Lubbers| 2282.137040439511|
|       Joe Lieberman|2229.8959270210435|
|Saber Hossain Cho...|2118.0095494329726|
|Steve Rauschenberger|  2106.03287145978|
| M. Cherif Bassiouni|2059.5591850310248|

Share this:

Leave a Reply

Your email address will not be published. Required fields are marked *

nine + nine =

Some of our