
ML Recommender – Primer

Continuing our series of articles on the basics of machine learning, in this post we are going to look at recommender systems and dimensionality reduction. Recommender systems in ML are primarily used to recommend products to users, based on a variety of inputs including past purchase history. Applications of recommenders include product recommendation on e-commerce sites, music streaming services, online movie and video streaming platforms like Netflix and YouTube, friend recommendations on social networking sites, drug-target interaction in the medical field, and so on. The key motivation behind recommender systems is personalising the experience of the user.

Recommender Systems using Classification

One of the basic approaches to building a recommender system is to recommend products to users based on overall popularity. You see this on most news websites, where the most read or most shared articles are listed. Of course, this doesn't take the preferences and likes of the current reader into consideration, and as a result there is no personalisation.
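As a minimal sketch (the article IDs and read counts below are hypothetical), popularity-based recommendation is simply a ranking by a global count, identical for every user:

// A minimal sketch of popularity-based recommendation, with hypothetical
// article IDs and read counts; every user sees the same list.
val readCounts: Map[String, Long] = Map(
  "election-results" -> 120000L,
  "transfer-news"    -> 85000L,
  "recipe-of-week"   -> 15000L
)

// Recommend the top-N most read articles, regardless of who is asking.
def topArticles(n: Int): Seq[String] =
  readCounts.toSeq.sortBy(-_._2).take(n).map(_._1)

println(topArticles(2)) // List(election-results, transfer-news)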

The second approach is to build a binary classification model whose input features include information about both the user and the article, with the prediction label being whether the user will like a particular article or not.
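As a hedged sketch of this idea using Spark ML (the feature columns age, articleLength and sharedTopics are made up for illustration, and this assumes a spark-shell session where spark.implicits._ is already in scope):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical training data: user and article features plus a binary
// label indicating whether the user liked the article.
val interactions = Seq(
  (25, 1200, 3, 1.0),
  (42,  300, 0, 0.0),
  (31,  800, 2, 1.0)
).toDF("age", "articleLength", "sharedTopics", "liked")

// Combine the user and article features into a single feature vector.
val assembler = new VectorAssembler().
  setInputCols(Array("age", "articleLength", "sharedTopics")).
  setOutputCol("features")

// Train a binary classifier predicting whether a user will like an article.
val model = new LogisticRegression().
  setLabelCol("liked").
  setFeaturesCol("features").
  fit(assembler.transform(interactions))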


This approach provides a great degree of personalization, taking into consideration multiple contextual parameters regarding the user to whom recommendations are being made, the product being recommended, and other pertinent factors. For example, a clothing website may recommend warm clothes during winter. However, this approach relies heavily on the quality of data for the input features; for a new user, for example, there may not be any purchase history, and we may not know enough details about the user.

An alternative to using classification for building recommender systems is collaborative filtering.

Collaborative Filtering

Collaborative filtering works on the idea of making recommendations to one user based on the purchase history of other users. For example, if you bought a book on Scala, Amazon may recommend a book on Haskell, based on the fact that many users who bought a book on Scala also bought a book on Haskell.

This leads to building a two-dimensional matrix, called the co-occurrence matrix, that captures which products are purchased together. If there are N different items on offer, this will be an N x N matrix, where each row and each column represents an item. The cell at row "i" and column "j" holds the number of people who purchased both the product in the i-th row and the product in the j-th column. Please note that these purchases don't need to be made together; they can also be separate purchases by the same user.
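A minimal sketch of building such a matrix from raw purchase data (the (userId, item) pairs below are hypothetical, and a map keyed by item pairs stands in for an explicit N x N matrix):

// Hypothetical purchase history as (userId, item) pairs.
val purchases = Seq(
  (1, "Scala"), (1, "Haskell"),
  (2, "Scala"), (2, "Python"),
  (3, "Scala"), (3, "Haskell"), (3, "Clojure")
)

// Group items by user, then count every ordered pair of distinct items
// bought by the same user. The resulting map plays the role of the matrix.
val coOccurrence: Map[(String, String), Int] =
  purchases.groupBy(_._1).values.toSeq.
    flatMap { userPurchases =>
      val items = userPurchases.map(_._2).distinct
      for (i <- items; j <- items if i != j) yield (i, j)
    }.
    groupBy(identity).map { case (pair, xs) => pair -> xs.size }

println(coOccurrence(("Scala", "Haskell"))) // 2 users bought both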

[Figure: an example co-occurrence matrix for books on various programming languages]

In the above matrix, users who bought books on Python also bought books on Scala eight times. Similarly, users who bought books on F# also bought books on Clojure eight times. Please note that the matrix is symmetric, reflecting along the diagonal, because the number of users who bought both the i-th and j-th products is the same as the number of users who bought both the j-th and i-th products.

Using the co-occurrence matrix is as simple as looking up the row for the product that has just been purchased and recommending the products along the columns with the highest co-occurrence counts. So for someone who just bought a book on Scala, we could recommend books on Python, Clojure and Java.
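Continuing the sketch above, recommending from the matrix amounts to selecting the row for the item just bought and sorting its co-occurrence counts:

// Look up everything co-purchased with the item just bought and
// recommend the items with the highest co-occurrence counts.
def recommend(justBought: String, topN: Int): Seq[String] =
  coOccurrence.collect { case ((i, j), count) if i == justBought => (j, count) }.
    toSeq.sortBy(-_._2).take(topN).map(_._1)

println(recommend("Scala", 2)) // e.g. List(Haskell, Python)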

Normalization

Let us say, in the above matrix, there is also a book on introduction to functional programming. Anyone who has bought a book on any of the listed programming languages is likely to have bought the introduction to functional programming as well. This means the cells where any of the other items intersect with that book will always have a high count. This dilutes the usefulness of the co-occurrence matrix and sways recommendations towards sheer popularity. To avoid this, we should consider normalizing the co-occurrence matrix.

One mechanism for normalizing the co-occurrence matrix is Jaccard similarity, where the count in each cell, the number of people who purchased both the i-th and the j-th product, is divided by the number of people who bought either the i-th or the j-th product.
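Continuing the same sketch, Jaccard similarity also needs the number of purchasers of each individual item (the purchasers map below is hypothetical, consistent with the purchase data above):

// Number of distinct purchasers per item (hypothetical).
val purchasers: Map[String, Int] =
  Map("Scala" -> 3, "Haskell" -> 2, "Python" -> 1, "Clojure" -> 1)

// jaccard(i, j) = |bought i and j| / |bought i or j|
def jaccard(i: String, j: String): Double = {
  val both = coOccurrence.getOrElse((i, j), 0)
  both.toDouble / (purchasers(i) + purchasers(j) - both)
}

println(jaccard("Scala", "Haskell")) // 2 / (3 + 2 - 2) = 0.666...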

Limitations of Co-Occurrence Matrix

A simple co-occurrence matrix makes recommendations based only on the current purchase. However, we may want to make recommendations based on past purchase history as well. A solution is to look up the co-occurrence counts for each past purchase, take a weighted average across those purchases, and sort to make recommendations. In the above example, if the user has bought books on Java and Scala, we find the co-occurrence counts of the remaining books against both Java and Scala and take the average; if Java and Python have a co-occurrence count of 7 and Scala and Python a co-occurrence count of 8, the averaged score for Python becomes 7.5.
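A sketch of this, again on the hypothetical data above and using a simple, equally weighted average of the raw counts:

// Average the co-occurrence counts of each candidate item against every
// item in the user's purchase history, then recommend the highest scores.
def recommendFromHistory(history: Seq[String], topN: Int): Seq[String] = {
  val candidates = coOccurrence.keys.map(_._2).toSet -- history
  candidates.toSeq.
    map { item =>
      val avg = history.map(h => coOccurrence.getOrElse((h, item), 0)).sum.toDouble / history.size
      (item, avg)
    }.
    sortBy(-_._2).take(topN).map(_._1)
}

println(recommendFromHistory(Seq("Scala", "Haskell"), 2)) // List(Clojure, Python)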

However, the co-occurrence matrix has other limitations: it doesn't take into consideration contextual features such as user and product information or other pertinent factors, and only takes purchase history into account.

Matrix Factorization

As we discussed, one of the problems with relying only on the co-occurrence matrix is that it totally ignores attributes of users and products. Another approach builds on the idea we used for classification. Let us say we have a big matrix of books rated by users. Each row represents a user and each column represents a book. However, most users would have rated only a tiny subset of all the books available, resulting in the very sparse matrix shown below.

[Figure: a sparse user-book rating matrix, with most cells empty]

As we can see, the matrix above is extremely sparse, as most of the users have not rated most of the books. The task is to learn the ratings for the unrated books based on the few books that have been rated. To be able to do this, we need further features about books and users. For example, user U1 likes programming and rated "Programming in Scala" highly, and user U2, who also likes programming, rated "Haskell Programming" highly, so there is a good chance that U1 will also like "Haskell Programming" (and that U2 will like "Programming in Scala").

So we can have two feature vectors, one for users and one for books, whose elements are weights for features pertinent to our domain. These vectors could, for example, include numbers indicating how much the user likes, and how much the book is about, topics such as:

  • Programming
  • Fiction
  • History etc

One possible way to solve this is to ascertain the degree of agreement between the feature vector for a given book and the feature vector for a given user, and use that as the basis for predicting the rating of that book by that user. We can take the two feature vectors and compute their dot product as a measure of similarity. Now, rather than looking at individual users and books in isolation, we will use the entire sparse rating matrix and learn a model that estimates the user and book vectors. For each cell that intersects the i-th user with the j-th book and has a known rating, we treat that rating as a function of the dot product of the feature vectors for the i-th user and the j-th book.
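As a small sketch, with hypothetical weights over three topics (programming, fiction, history), the predicted rating is simply the dot product of the two vectors:

val userFeatures = Array(4.5, 0.5, 1.0)   // how much the user likes each topic
val bookFeatures = Array(5.0, 0.0, 0.0)   // how much the book is about each topic

// Dot product of the two feature vectors: the predicted affinity.
def dot(u: Array[Double], b: Array[Double]): Double =
  u.zip(b).map { case (x, y) => x * y }.sum

println(dot(userFeatures, bookFeatures)) // 22.5 -> high predicted affinity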


So to fill in the big sparse rating matrix, we compute the dot product of all the user and book vectors. However, to be able to do that, we need a priori information on all the elements of all the user and book vectors, which we do not have. This is essentially the classic "chicken and egg" problem. So what we do is estimate the feature vectors from the known values in the rating matrix. Thus, the values of the feature vectors become the parameters of the model we are trying to estimate.

This becomes a classic ML model estimation problem. We start with an arbitrary set of values for the user and book vectors, compute the dot product at every intersection for which a rating is known, and calculate the RSS (Residual Sum of Squares): the squared difference between the predicted and actual rating, summed over all known ratings. This becomes the quality metric, and the aim of the optimisation is to minimise this value to find the optimal user and book vectors. Once we have the best values for the feature vectors, we can predict the ratings for the missing user-book combinations.
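A sketch of computing this quality metric over the known ratings only, with hypothetical user and book vectors standing in for the current parameter estimates:

// Current parameter estimates (hypothetical two-dimensional vectors).
val userVecs = Map("U1" -> Array(4.0, 0.0), "U2" -> Array(3.0, 1.0))
val bookVecs = Map("Programming in Scala" -> Array(1.0, 0.0),
                   "War and Peace"        -> Array(0.0, 1.0))

// Only the observed (user, book) ratings contribute to the metric.
val knownRatings = Map(
  ("U1", "Programming in Scala") -> 5.0,
  ("U2", "War and Peace")        -> 2.0
)

// RSS = sum over known (user, book) pairs of (actual - predicted)^2,
// where predicted = dot(userVec, bookVec). The optimisation (for example,
// alternating least squares) adjusts the vectors to minimise this value.
val rss = knownRatings.map { case ((u, b), actual) =>
  val predicted = userVecs(u).zip(bookVecs(b)).map { case (x, y) => x * y }.sum
  math.pow(actual - predicted, 2)
}.sum

println(rss) // (5 - 4)^2 + (2 - 1)^2 = 2.0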

This technique is called matrix factorisation, which still doesn't solve the problem of a new user or a new item for which no ratings are available. One option is to use a blended model of classification based on user features and matrix factorisation. When a new user is on-boarded, we can use features collected from the user to make initial predictions (similar to what Netflix does during user registration), and as we collect more and more ratings from the user, lean more towards matrix factorisation.

Recommender Systems Using Apache Spark

Here, we will have a look at an example from the Spark tutorial on collaborative filtering, using sample movie ratings. The data is available in $SPARK_HOME/data/mllib/als/; the name of the file is sample_movielens_ratings.txt. It is a "::"-delimited file with four fields: user ID, movie ID, the rating given by the user to the movie, and a timestamp.

First, we will import the necessary classes in the Spark shell. Spark provides an ALS (Alternating Least Squares) based recommender implementation.

scala> import org.apache.spark.ml.evaluation.RegressionEvaluator
scala> import org.apache.spark.ml.recommendation.ALS

Next, we will define a case class that represents the rating, and a function to parse each record into an instance of the case class.

scala> case class Rating(userId: Int, 
     | movieId: Int,
     | rating: Float,
     | timestamp: Long)
scala> def parseRating(str: String): Rating = {
     |   val fields = str.split("::")
     |   assert(fields.size == 4)
     |   Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
     | }

We will now read the file and create a dataframe of rating instances, and split the dataframe into training and test data.

scala> val ratings = spark.read.
     | textFile("data/mllib/als/sample_movielens_ratings.txt").
     | map(parseRating).
     | toDF()
scala> val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))

Next we will create the ALS instance, with some rudimentary configuration parameters and train the model,

scala> val als = new ALS().
     | setMaxIter(5).
     | setRegParam(0.01).
     | setUserCol("userId").
     | setItemCol("movieId").
     | setRatingCol("rating")
scala> val model = als.fit(training)

Now, we can use the model to make predictions on the test data,

scala> val predictions = model.transform(test)
scala> predictions.show
+------+-------+------+----------+-----------+                                  
|userId|movieId|rating| timestamp| prediction|
+------+-------+------+----------+-----------+
|    19|     31|   1.0|1424380312|  2.1931064|
|     8|     31|   3.0|1424380312|  2.3882465|
|    28|     85|   1.0|1424380312|  4.8720865|
|    26|     85|   1.0|1424380312|  2.5435927|
|    15|     85|   1.0|1424380312|  3.7016292|
|    28|     65|   1.0|1424380312|   5.811983|
|    23|     53|   1.0|1424380312|  0.8447959|
|    25|     53|   1.0|1424380312| -0.9609778|
|    14|     53|   3.0|1424380312|  -2.645194|
|    24|     78|   1.0|1424380312|  1.1013911|
|     2|     78|   1.0|1424380312| 0.89820814|
|     3|     34|   3.0|1424380312|-0.29750586|
|    19|     34|   1.0|1424380312|-0.39372563|
|    17|     34|   1.0|1424380312| -0.5189664|
|    14|     34|   1.0|1424380312|  1.9340911|
|     0|     34|   1.0|1424380312|  1.0589211|
|     1|     81|   1.0|1424380312|  1.1111434|
|     3|     81|   1.0|1424380312|  2.2576268|
|    19|     81|   1.0|1424380312| 0.25459975|
|    29|     81|   2.0|1424380312|  2.0067518|
+------+-------+------+----------+-----------+

As you can see, the predictions are pretty poor. We will have a look at how to improve them in coming posts. In the meantime, we can evaluate the model's quality metric and print the RMSE as below,

scala> val evaluator = new RegressionEvaluator().
     | setMetricName("rmse").
     | setLabelCol("rating").
     | setPredictionCol("prediction")
scala> val rmse = evaluator.evaluate(predictions)
scala> println(s"Root-mean-square error = $rmse")
Root-mean-square error = 1.8628985194757315
