ML Classification – Primer
ML classification is another important tool for supervised learning. Similar to linear regression, classification is about learning from known features and known labels in order to predict unknown labels from a new set of features. However, unlike linear regression, where the predicted label is a continuous value, in classification the label belongs to one of a set of distinct categories. That said, regression is often used as a means for classification.
Examples of classification include,
- Spam filtering: Based on known email messages that are labelled as spam, build a learning model that can be applied to new email messages to determine whether they are spam or not.
- Sentiment analysis: Based on known review comments and corresponding categorical product ratings, predict the product rating for new review comments.
- Document classification: Based on documents already classified by content as scientific, entertainment, sport etc., classify new documents.
In this post we will look at the basics of classification, and build a practical example using the Spark ML library to predict the product ratings given by customers.
An Example Model
Let us look at review comments against products and the corresponding review ratings, which are either good or bad. For example, if you look at product ratings on Amazon, you have a review text and an associated star rating. An example of a review text may be along the lines – “This is the best Scala book I have ever read. However the content was slightly lacking examples, and the examples that were there were not very good.”
So to analyse the sentiment of the review, one could break it into sentences and look at positive and negative words within each sentence. The words may occur one or more times, and simplistically, one could conclude the review to be positive if the number of positive words is greater than the number of negative words.
Obviously, that would be a very simplistic and rudimentary way to approach the problem. Based on historic information and intelligence gathered, not every word may carry the same positive or negative weight. Also, the word “good” may be considered positive, whereas “not very good” is not necessarily so. This means tokenisation of the text needs to be more intelligent.
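The counting approach described above can be sketched in a few lines of plain Scala. The word lists below are made up purely for illustration and do not come from any real sentiment lexicon, and the object and method names are hypothetical.

```scala
object NaiveSentiment {
  // Hypothetical word lists, chosen only for illustration
  val positiveWords = Set("best", "good", "great", "fantastic")
  val negativeWords = Set("bad", "poor", "lacking", "terrible")

  // Lower-case the text and split on anything that is not a letter
  def tokens(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Positive if positive words outnumber negative words
  def isPositive(text: String): Boolean = {
    val words = tokens(text)
    words.count(positiveWords.contains) > words.count(negativeWords.contains)
  }
}
```

With these word lists, the Scala book review quoted earlier is counted as positive: “best” and “good” outnumber “lacking”, even though “good” appears as part of “not very good”.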
Linear Classification and Decision Boundaries
With linear classification, each word analysed in the text has a positive or negative weight, learned from historic data. Let us have a look at the following table that has the weights for the various words,
If we apply the weighting to the review text, “This is a good Scala book. The topics covered have been fantastic, though the listed examples are poor.”, we get a total score of 1.0 (good) + 1.5 (fantastic) – 1.8 (poor) = 0.7. Since the total score is greater than 0, the review is predicted to be a good review.
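The scoring step can be sketched as follows. The three weights are the ones implied by the sum 1.0 + 1.5 – 1.8 in the example; treating every other word as having weight 0 is an assumption made for this sketch, and the names are hypothetical.

```scala
object WeightedScore {
  // Weights implied by the worked example; unlisted words are assumed to weigh 0
  val weights = Map("good" -> 1.0, "fantastic" -> 1.5, "poor" -> -1.8)

  // Sum the weight of every word in the text
  def score(text: String): Double =
    text.toLowerCase.split("[^a-z]+").map(w => weights.getOrElse(w, 0.0)).sum

  // A score above zero is predicted as a good review
  def isGood(text: String): Boolean = score(text) > 0
}
```

Calling `WeightedScore.score` on the example review gives a value of roughly 0.7 (up to floating-point rounding), so `isGood` returns true.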
Let us say we are working with just two words from the above list, fantastic and poor. For each review text, we plot the number of occurrences of the word fantastic against the number of occurrences of the word poor, as below,
The line is where the weighted sum of the occurrences is zero, which is the decision boundary. Any observation above the line is deemed negative and any observation below is deemed positive. With two features the decision boundary is a line; with more than two features it becomes a plane or, in general, a hyperplane.
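A small sketch of the two-word case, reusing the weights 1.5 for fantastic and -1.8 for poor from the earlier example. The names are hypothetical, and treating a score of exactly zero as negative is a choice made here, not something the text prescribes.

```scala
object TwoWordBoundary {
  val fantasticWeight = 1.5
  val poorWeight = -1.8

  // Weighted sum of the two word counts
  def score(fantasticCount: Int, poorCount: Int): Double =
    fantasticWeight * fantasticCount + poorWeight * poorCount

  // On the boundary the score is zero: poorCount = (1.5 / 1.8) * fantasticCount
  def boundaryPoorCount(fantasticCount: Double): Double =
    -fantasticWeight / poorWeight * fantasticCount

  // Below the line (score > 0) is positive, everything else negative
  def predict(fantasticCount: Int, poorCount: Int): String =
    if (score(fantasticCount, poorCount) > 0) "positive" else "negative"
}
```

For instance, a review with two occurrences of fantastic and one of poor lands below the line and is predicted positive, while one occurrence of each lands above it and is predicted negative.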
What is a good model?
In a classification model, as in regression, we split the data into training and test sets. The classification algorithm learns the weights of the words from the training set. The learned model is then applied to the test set. The prediction error is measured in terms of accuracy, that is, how often the predicted label for the test data matches the true label. So if you have a hundred reviews in the test set, and the model gets 98 of them correct, the accuracy of the classification is 0.98. Conversely, one could also say the error is 0.02. The error and the accuracy always add up to 1.
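As a minimal illustration of the accuracy and error calculation, assuming labels are encoded as integers (the encoding and the names are not from the original example):

```scala
object AccuracyMetric {
  // Fraction of predictions that match the true labels
  def accuracy(trueLabels: Seq[Int], predicted: Seq[Int]): Double = {
    require(trueLabels.nonEmpty && trueLabels.length == predicted.length)
    val correct = trueLabels.zip(predicted).count { case (t, p) => t == p }
    correct.toDouble / trueLabels.length
  }

  // Error is the complement of accuracy
  def error(trueLabels: Seq[Int], predicted: Seq[Int]): Double =
    1.0 - accuracy(trueLabels, predicted)
}
```

With a hundred test labels of which 98 are predicted correctly, `accuracy` comes out at 0.98 and `error` at approximately 0.02.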
Let us say we have two categorical label values that can be predicted. Random guessing would then get the prediction right with a probability of 0.5, so the classifier should at least perform better than that. However, what is deemed a good accuracy? It really depends on the use case. For example, in spam filtering, if you get a false negative, that is, deciding an email is not spam when it is spam, you only have the annoyance of a spam email appearing in your inbox. However, a false positive, where a legitimate email is deemed spam, has far graver consequences. If there is a heavy class imbalance, you will tend to get high accuracy, but that doesn’t necessarily imply the model is good.
So at the end of the day, one has to decide, based on the specific use case, what can be deemed good accuracy, and balance the costs of false positives vs false negatives.
Let us look at a binary classification model, with possible results of good and bad ratings. Predictions based on this can result in the following outcomes,
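To see why high accuracy under class imbalance can be misleading, consider a "model" that always predicts the most common label. The sketch below computes its accuracy; the 95/5 split mentioned afterwards is a made-up example, not data from this post.

```scala
object MajorityBaseline {
  // Accuracy of always predicting the most frequent label
  def accuracy(labels: Seq[Int]): Double = {
    require(labels.nonEmpty)
    val majorityCount = labels.groupBy(identity).values.map(_.size).max
    majorityCount.toDouble / labels.size
  }
}
```

On a hypothetical set with 95 good labels and 5 bad ones this baseline scores 0.95 without learning anything, so a real classifier has to beat that figure rather than 0.5.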
- True label is good and the predicted label is good, which is a true positive
- True label is good and the predicted label is bad, which is a false negative
- True label is bad and the predicted label is bad, which is a true negative
- True label is bad and the predicted label is good, which is a false positive
This can be extended to any multi-category model. Let us say we have a sample space of 100 known observations, with 70 good labels and 30 bad labels. Running the classification algorithm resulted in 60 of the 70 good labels being predicted correctly and 10 being predicted incorrectly as bad. Also, of the 30 true bad labels, 25 were predicted correctly and 5 incorrectly. This is represented in the matrix below, which is termed the confusion matrix.
| | Predicted Label – Good | Predicted Label – Bad |
| True Label – Good | 60 | 10 |
| True Label – Bad | 5 | 25 |
This gives an overall accuracy of (60 + 25) / 100 = 0.85 and an error of (10 + 5) / 100 = 0.15.
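The same numbers can be derived in code. The sketch below holds the four cells of a binary confusion matrix and computes accuracy and error from them; the case class and its field names are illustrative, and the label encoding (1 = good, 0 = bad) is an assumption.

```scala
object ConfusionCounts {
  // tp/fn/fp/tn follow the four outcomes listed above, with "good" as the positive class
  final case class Confusion(tp: Int, fn: Int, fp: Int, tn: Int) {
    val total: Int = tp + fn + fp + tn
    def accuracy: Double = (tp + tn).toDouble / total
    def error: Double = (fp + fn).toDouble / total
  }

  // Build the matrix from (trueLabel, predictedLabel) pairs, 1 = good, 0 = bad
  def fromPairs(pairs: Seq[(Int, Int)]): Confusion =
    Confusion(
      tp = pairs.count(_ == ((1, 1))),
      fn = pairs.count(_ == ((1, 0))),
      fp = pairs.count(_ == ((0, 1))),
      tn = pairs.count(_ == ((0, 0)))
    )

  // The matrix from the example: 60 and 10 for true good, 5 and 25 for true bad
  val example: Confusion = Confusion(tp = 60, fn = 10, fp = 5, tn = 25)
}
```

Keeping the four cells separate, rather than only the accuracy, makes it possible to weigh false positives and false negatives differently for a given use case.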
Learning Curves and Probabilities
We looked at predicted labels in classification. However, when determining the quality of a model, another factor is the probability the model assigns to each prediction. Another important aspect is the amount of training data that is available. More good-quality training data decreases the error, until the error plateaus off. This curve is called the learning curve. The point at which the error plateaus is called the bias of the model.
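One common way to turn a weighted score into a probability is the logistic (sigmoid) function, which is what logistic regression, the classifier used later in this post, is built on. The sketch below shows only that mapping; it ignores the intercept term and is not a description of any particular library's internals.

```scala
object ScoreToProbability {
  // Logistic (sigmoid) function: maps any real-valued score into the range (0, 1)
  def sigmoid(score: Double): Double = 1.0 / (1.0 + math.exp(-score))
}
```

A score of zero, which sits exactly on the decision boundary, maps to a probability of 0.5. Positive scores map above 0.5 and negative scores below it, so the further an observation is from the boundary, the more confident the prediction.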
Applying the Theory
Now, let us apply the theory we have covered so far in a practical example. All the code and data used can be found here. The example uses Spark ML logistic regression to perform the classification. The data we use is review comments and ratings on Amazon baby products. Let us first load the data and have a quick look.
    val df = spark.read.
      option("header", "true").
      option("inferSchema", "true").
      csv("amazon_baby.csv")

    df.show(10)
This will produce the following output.
+--------------------+--------------------+------+
|                name|              review|rating|
+--------------------+--------------------+------+
|Planetwise Flanne...|These flannel wip...|     3|
|Planetwise Wipe P...|it came early and...|     5|
|Annas Dream Full ...|Very soft and com...|     5|
|Stop Pacifier Suc...|This is a product...|     5|
|Stop Pacifier Suc...|All of my kids ha...|     5|
|Stop Pacifier Suc...|When the Binky Fa...|     5|
|A Tale of Baby's ...|Lovely book, it's...|     4|
|    Baby Tracker®...|Perfect for new p...|     5|
|    Baby Tracker®...|A friend of mine ...|     5|
|    Baby Tracker®...|This has been an ...|     4|
+--------------------+--------------------+------+
The dataset contains three columns: the name of the product, the review comment and the rating. The rating is a number between one and five. To improve the model, let us use only a subset of the words from the review comments. Also, for binary classification let us consider ratings of four and above as good and ratings below three as bad, leaving out the neutral three-star ratings. To achieve this, we will define two UDFs (user defined functions) that can be used within Spark data frames to transform the columns.
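    import org.apache.spark.sql.functions.{col, udf}

    // Words whose presence we treat as a sentiment signal
    val keywords = Array("awesome", "great", "fantastic", "amazing", "love",
      "horrible", "bad", "terrible", "awful", "wow", "hate")

    // Keep only the selected keywords from a review; null reviews become ""
    val filterWords = udf(
      (x: String) =>
        if (x != null) x.split(" ").filter(keywords.contains(_)).mkString(" ")
        else ""
    )

    // Ratings of 4 and 5 map to label 1 (good), everything else to 0 (bad)
    val isGood = udf((x: Int) => if (x >= 4) 1 else 0)

    // Drop neutral 3-star reviews and reviews without any keyword
    val data = df.
      where("rating != 3").
      withColumn("label", isGood(col("rating"))).
      withColumn("cleansed", filterWords(col("review"))).
      where("cleansed != ''")

    data.select("name", "cleansed", "label").show(10)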
The snippet above adds two columns to the data frame: a binary label derived from the rating, and a cleansed review text containing only the selected words. We also filter out rows that don’t contain any of the selected words in the review.
+--------------------+---------+-----+
|                name| cleansed|label|
+--------------------+---------+-----+
|Planetwise Wipe P...|     love|    1|
|Stop Pacifier Suc...|love love|    1|
|Stop Pacifier Suc...|    great|    1|
|Stop Pacifier Suc...|    great|    1|
|    Baby Tracker®...|love love|    1|
|Nature's Lullabie...|     love|    1|
|Nature's Lullabie...|  amazing|    1|
|Nature's Lullabie...|    great|    1|
|Lamaze Peekaboo, ...|    great|    1|
|Lamaze Peekaboo, ...|    great|    1|
+--------------------+---------+-----+
Next we create an ML pipeline, with
- A tokeniser that splits the review sentence into words
- A count vectorizer, that converts the words into a feature vector based on the selected words array
- And a logistic regression classifier
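To make the count vectorizer step concrete, here is a plain-Scala sketch of the idea: one count per vocabulary word, in vocabulary order. It is only an illustration with hypothetical names. Spark's own implementation produces sparse vectors and is not shown here.

```scala
object KeywordCounts {
  // Same selected words as in the example, used as a fixed vocabulary
  val vocabulary = Array("awesome", "great", "fantastic", "amazing", "love",
    "horrible", "bad", "terrible", "awful", "wow", "hate")

  // One count per vocabulary entry, in vocabulary order
  def vectorize(words: Seq[String]): Array[Double] =
    vocabulary.map(v => words.count(_ == v).toDouble)
}
```

For the tokens `love love`, the resulting vector has a 2.0 in the position of "love" (index 4) and zeros everywhere else. These counts are the features for which the classifier learns weights.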
We split the data, using a fixed seed, into 80% training and 20% test. We then build the model from the training data, evaluate the test data against it and print the evaluation results.
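    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{CountVectorizerModel, Tokenizer}

    val classifier = new LogisticRegression()

    // Split the cleansed text into words
    val tokenizer = new Tokenizer().
      setInputCol("cleansed").
      setOutputCol("words")

    // Count occurrences of each keyword, using the keyword array as the vocabulary
    val cvm = new CountVectorizerModel(keywords).
      setInputCol("words").
      setOutputCol("features")

    // 80/20 split with a fixed seed so that the split is reproducible
    val Array(training, test) = cvm.transform(tokenizer.transform(data)).
      randomSplit(Array(0.8, 0.2), 1)

    val model = classifier.fit(training)

    // Evaluate on the held-out test set and show a sample of the predictions
    model.evaluate(test).predictions.
      select("words", "label", "prediction", "probability").
      show(10)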
This produces the following result,
+--------------+-----+----------+--------------------+
|         words|label|prediction|         probability|
+--------------+-----+----------+--------------------+
|     [awesome]|    1|       1.0|[0.07779035590467...|
| [love, great]|    1|       1.0|[0.02803028349983...|
|       [great]|    1|       1.0|[0.09041606062371...|
|  [great, bad]|    1|       1.0|[0.22953742954826...|
|[great, great]|    1|       1.0|[0.05088985011924...|
|        [love]|    1|       1.0|[0.05075083853825...|
|        [love]|    1|       1.0|[0.05075083853825...|
|        [love]|    1|       1.0|[0.05075083853825...|
|        [love]|    1|       1.0|[0.05075083853825...|
|        [love]|    1|       1.0|[0.05075083853825...|
+--------------+-----+----------+--------------------+