
Cohen’s Kappa: What It Is, When to Use It, and How to Avoid Its Pitfalls

4 Aug 2020 9:02am, by Maarit Widmann
Maarit Widmann is a data scientist at KNIME. She started with quantitative sociology and holds her bachelor’s degree in social sciences. The University of Konstanz made her drop the “social” as a Master of Science. Her ambition is to communicate the concepts behind data science to others in videos and blog posts. Follow Maarit on LinkedIn. For more information on KNIME, please visit www.knime.com and the KNIME blog.

Cohen’s kappa is a metric often used to assess the agreement between two raters. It can also be used to assess the performance of a classification model.

For example, if we had two bankers and we asked both to classify 100 customers in two classes for credit rating (i.e., good and bad) based on their creditworthiness, we could then measure the level of their agreement through Cohen’s kappa.

Similarly, in the context of a classification model, we could use Cohen’s kappa to compare the machine learning model predictions with the manually established credit ratings.

Like many other evaluation metrics, Cohen’s kappa is calculated based on the confusion matrix. However, in contrast to calculating overall accuracy, Cohen’s kappa takes imbalance in class distribution into account and can, therefore, be more complex to interpret.

In this article, we will guide you through the calculation and interpretation of Cohen’s kappa values, particularly in comparison to overall accuracy values. We will show that where overall accuracy fails because of a large imbalance in the class distribution, Cohen’s kappa might supply a more objective description of the model performance. Along the way, we will also introduce a few tips to keep in mind when interpreting Cohen’s kappa values.

Measuring Performance Improvement on Imbalanced Datasets

Let’s focus on a classification task on bank loans, using the German credit data provided by the UCI Machine Learning Repository. In this dataset, bank customers have been assigned either a “bad” credit rating (30%) or a “good” credit rating (70%) according to the criteria of the bank. For the purpose of this article, we exaggerated the imbalance in the target class credit rating via bootstrapping, giving us 10% with a “bad” credit rating and 90% with a “good” credit rating: a highly imbalanced dataset. Exaggerating the imbalance helps us to make the difference between “overall accuracy” and “Cohen’s kappa” clearer in this article.

Let’s partition the data into a training set (70%) and a test set (30%) using stratified sampling on the target column and then train a simple model, e.g., a decision tree. Given the high imbalance between the two classes, the model will not perform very well. Nevertheless, let’s use its performance as the baseline for this study.
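As a rough illustration of the partitioning step, the following Python sketch performs a stratified 70/30 split on a label column. It is not the KNIME node used in the study: the function name `stratified_split`, the seed, and the 1,000-row label list are assumptions made only for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random order within each class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical target column: 1,000 customers, 10% "bad" and 90% "good".
labels = ["bad"] * 100 + ["good"] * 900
train_idx, test_idx = stratified_split(labels)
test_bad = sum(labels[i] == "bad" for i in test_idx)
print(len(train_idx), len(test_idx), test_bad)
```

Under these assumptions the test set holds 300 customers, 30 of them with a “bad” rating, so the 10/90 class ratio is preserved in both partitions.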

Figure 1 shows the confusion matrix and accuracy statistics for this baseline model. The overall accuracy of the model is quite high (87%) and hints at an acceptable performance. However, the confusion matrix shows that the model correctly classifies only nine of the 30 customers with a bad credit rating. This is also reflected in the low sensitivity for class “bad”: just 30%. In effect, the decision tree classifies most of the “good” customers correctly and neglects the few “bad” customers, and the imbalance in the class prior probabilities masks this weakness in the overall accuracy. Let’s note for now that the Cohen’s kappa value is just 0.244, on a scale that ranges from -1 to +1.

Figure 1: Confusion matrix and accuracy statistics for the baseline model, a decision tree model trained on the highly imbalanced training set. The overall accuracy is relatively high (87%), although the model detects just a few of the customers with a bad credit rating (sensitivity at just 30%).
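To make the gap between accuracy and sensitivity concrete, here is a small Python sketch that recomputes both from the baseline confusion matrix. The four cell counts are not quoted directly in the text; they are derived from the numbers the article gives (30 actual “bad”, 270 actual “good”, nine “bad” customers detected, 27 customers predicted as “bad”).

```python
# Baseline confusion matrix with "bad" as the positive class.
# Counts derived from the figures quoted in the article (an assumption of this sketch).
tp, fn = 9, 21     # actual "bad":  30 customers
fp, tn = 18, 252   # actual "good": 270 customers

total = tp + fn + fp + tn
accuracy = (tp + tn) / total      # share of all customers classified correctly
sensitivity = tp / (tp + fn)      # share of "bad" customers that are detected
specificity = tn / (tn + fp)      # share of "good" customers that are detected

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

The high specificity on the 270 “good” customers dominates the overall accuracy, which is why 87% says little about the minority class.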

Let’s try to improve the model performance by forcing it to acknowledge the existence of the minority class. We train the same model this time on a training set where the minority class has been oversampled using the SMOTE technique, reaching a class proportion of 50% for both classes.
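The idea behind SMOTE can be sketched in a few lines of Python: each synthetic minority record is placed on the line segment between an existing minority record and one of its nearest minority neighbours. This is a simplified illustration, not the KNIME SMOTE node used in the study; the function name `smote_sketch`, the parameters, and the two-feature data are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical numeric features of six "bad" customers (already scaled to [0, 1]).
minority = [(0.10, 0.90), (0.20, 0.80), (0.15, 0.70),
            (0.30, 0.85), (0.25, 0.75), (0.05, 0.95)]
new_points = smote_sketch(minority, n_new=10, k=3)
print(len(new_points))
```

Because every synthetic point is an interpolation, it stays within the region already covered by the minority class, which is the main difference from oversampling by simple duplication.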

Looking at the confusion matrix for this model (Figure 2), 18 out of the 30 customers with a “bad” credit rating are detected, leading to a new sensitivity value of 60%. The Cohen’s kappa statistic is now 0.452 for this model, a remarkable increase from the previous value of 0.244. But what about overall accuracy? For this second model, it’s 89%, not very different from the previous value of 87%.

When summarizing, we get two very different pictures. According to the overall accuracy, model performance hasn’t changed very much at all. However, according to Cohen’s kappa, a lot has changed! Which statement is correct?

Figure 2: Confusion matrix and accuracy statistics for the improved model, a decision tree trained on a more balanced training set where the minority class has been oversampled. The overall accuracy is almost the same as for the baseline model (89% vs. 87%). However, the Cohen’s kappa value shows a remarkable increase from 0.244 to 0.452.

From the numbers in the confusion matrix, it seems that Cohen’s kappa has a more realistic view of the model’s performance when using imbalanced data. Why does Cohen’s kappa take more notice of the minority class? How is it actually calculated? Let’s take a look.

Cohen’s Kappa

Cohen’s kappa is calculated with the following formula:

$$\kappa = \frac{p_0 - p_e}{1 - p_e}$$

where $p_0$ is the overall accuracy of the model and $p_e$ is the measure of the agreement between the model predictions and the actual class values as if happening by chance.

In a binary classification problem like ours, $p_e$ is the sum of $p_{e1}$, the probability of the predictions agreeing with actual values of class 1 (“good”) by chance, and $p_{e2}$, the probability of the predictions agreeing with the actual values of class 2 (“bad”) by chance. Assuming that the two classifiers (model predictions and actual class values) are independent, these probabilities, $p_{e1}$ and $p_{e2}$, are calculated by multiplying the proportion of the actual class and the proportion of the predicted class.

Considering “bad” as the positive class, the baseline model (Figure 1) assigned 9% of the records (false positives plus true positives) to class “bad” and 91% of the records (true negatives plus false negatives) to class “good.” With 90% of the records actually “good” and 10% actually “bad,” $p_e$ is:

$$p_e = p_{e1} + p_{e2} = 0.9 \cdot 0.91 + 0.1 \cdot 0.09 = 0.819 + 0.009 = 0.828$$

And therefore, the Cohen’s kappa statistic is:

$$\kappa = \frac{p_0 - p_e}{1 - p_e} = \frac{0.87 - 0.828}{1 - 0.828} = 0.244$$

which is the same value as reported in Figure 1.
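The calculation above can be checked with a short Python sketch. The function name `cohens_kappa` and the confusion-matrix cell counts are assumptions of this example: the counts are derived from the totals quoted in the article (30 actual “bad”, 270 actual “good”, 9 and 18 detected “bad” customers, 27 and 40 customers predicted as “bad”).

```python
def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a binary confusion matrix."""
    total = tp + fn + fp + tn
    p_0 = (tp + tn) / total              # observed agreement (overall accuracy)
    actual_pos = (tp + fn) / total       # proportion of actual positives
    pred_pos = (tp + fp) / total         # proportion of predicted positives
    # chance agreement: both "positive" by chance plus both "negative" by chance
    p_e = actual_pos * pred_pos + (1 - actual_pos) * (1 - pred_pos)
    return (p_0 - p_e) / (1 - p_e)

# "bad" is the positive class in both matrices.
baseline = cohens_kappa(tp=9, fn=21, fp=18, tn=252)
improved = cohens_kappa(tp=18, fn=12, fp=22, tn=248)
print(round(baseline, 3), round(improved, 3))  # 0.244 0.452
```

Both values agree with the kappa statistics reported for the baseline and the improved model.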

In practice, Cohen’s kappa discounts the agreement that the classifier and a random guess would reach by chance, and measures the share of predictions that cannot be explained by a random guess. In this way, Cohen’s kappa tries to correct the evaluation bias by taking into account the correct classifications that a random guess would produce.

Pain Points of Cohen’s Kappa

At this point, we know that Cohen’s kappa is a useful evaluation metric when dealing with imbalanced data. However, Cohen’s kappa has some downsides, too. Let’s have a look at them.

Full range [-1, +1], but not equally reachable

It’s easier to reach higher values of Cohen’s kappa if the target class distribution is balanced.

For the baseline model (Figure 1), the distribution of the predicted classes follows closely the distribution of the target classes: 27 predicted as “bad” vs. 273 predicted as “good” and 30 being actually “bad” vs. 270 being actually “good.”

For the improved model (Figure 2), the difference between the two class distributions is greater: 40 predicted as “bad” vs. 260 predicted as “good” and 30 being actually “bad” vs. 270 being actually “good.”

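As the formula for maximum Cohen’s kappa shows, the more the distributions of the predicted and actual target classes differ, the lower the maximum reachable Cohen’s kappa value is. The maximum Cohen’s kappa value represents the edge case of either the number of false negatives or false positives in the confusion matrix being zero, i.e., all customers with a good credit rating, or alternatively all customers with a bad credit rating, are predicted correctly.

$$\kappa_{max} = \frac{p_{max} - p_e}{1 - p_e}$$

where $p_e$ is the agreement expected by chance and $p_{max}$ is the maximum reachable overall accuracy of the model given the distributions of the target and predicted classes, i.e., the sum over all classes of the smaller of the actual and the predicted class proportion:

$$p_{max} = \sum_{i} \min(p_{i,actual},\, p_{i,predicted})$$

For the baseline model, we get the following value for $p_{max}$:

$$p_{max} = \min(0.9, 0.91) + \min(0.1, 0.09) = 0.9 + 0.09 = 0.99$$

Whereas for the improved model it is (with the predicted class proportions 260/300 and 40/300 rounded to 0.87 and 0.13):

$$p_{max} = \min(0.9, 0.87) + \min(0.1, 0.13) = 0.87 + 0.1 = 0.97$$

The maximum value of Cohen’s kappa is then for the baseline model, with $p_e = 0.828$:

$$\kappa_{max} = \frac{0.99 - 0.828}{1 - 0.828} = 0.942$$

For the improved model, with $p_e = 0.9 \cdot 0.87 + 0.1 \cdot 0.13 = 0.796$, it is:

$$\kappa_{max} = \frac{0.97 - 0.796}{1 - 0.796} = 0.853$$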

As the results show, the improved model, with a greater difference between the distributions of the actual and predicted target classes, can only reach a Cohen’s kappa value as high as 0.853, whereas the baseline model can reach 0.942 despite its worse performance.
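The maximum reachable kappa can also be computed directly from a confusion matrix. The sketch below is a minimal Python version under the same assumptions as before (hypothetical function name `kappa_max`, cell counts derived from the totals quoted in the article). Note that with unrounded proportions the improved model comes out at roughly 0.839 rather than 0.853; the small gap seems to stem from rounding the predicted class proportions to two decimals, and the ordering of the two models is unaffected.

```python
def kappa_max(tp, fn, fp, tn):
    """Maximum Cohen's kappa reachable with the given row and column totals."""
    total = tp + fn + fp + tn
    actual = [(tp + fn) / total, (fp + tn) / total]       # actual class proportions
    predicted = [(tp + fp) / total, (fn + tn) / total]    # predicted class proportions
    p_e = sum(a * p for a, p in zip(actual, predicted))   # chance agreement
    p_max = sum(min(a, p) for a, p in zip(actual, predicted))
    return (p_max - p_e) / (1 - p_e)

baseline_max = kappa_max(tp=9, fn=21, fp=18, tn=252)   # 27 predicted "bad" vs. 30 actual
improved_max = kappa_max(tp=18, fn=12, fp=22, tn=248)  # 40 predicted "bad" vs. 30 actual
print(round(baseline_max, 3), round(improved_max, 3))  # 0.942 0.839
```

Only the row and column totals enter the calculation, so the result depends on how far the predicted class distribution drifts from the actual one, not on how many individual predictions are correct.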

Cohen’s kappa is higher for balanced data

When we calculate Cohen’s kappa, we make the strong assumption that the distributions of target and predicted classes are independent and that the target class doesn’t affect the probability of a correct prediction. In our example, this would mean that a customer with a good credit rating has the same chance of being predicted correctly as a customer with a bad credit rating. However, since we know that our baseline model is biased toward the majority “good” class, this assumption is violated.

If this assumption were not violated, as in the improved model, which was trained on balanced target classes, we could reach higher values of Cohen’s kappa. Why is this? We can rewrite the formula of Cohen’s kappa as a function of the probability of the positive class, and this function reaches its maximum when the probability of the positive class is 0.5. We test this by applying the same improved model (Figure 2) to different test sets, where the proportion of the positive “bad” class varies between 5% and 95%. We create 100 different test sets per class distribution by bootstrapping the original test data and calculate the average Cohen’s kappa value from the results.

Figure 3 shows the average Cohen’s kappa values against the positive class probabilities — and yes, Cohen’s kappa does reach its maximum when the model is applied to balanced data.

Figure 3. Cohen’s kappa values (on the y-axis) obtained for the same model with varying positive class probabilities in the test data (on the x-axis). The Cohen’s kappa values on the y-axis are calculated as averages of all Cohen’s kappas obtained via bootstrapping the original test set 100 times for a fixed class distribution. The model is the decision tree model trained on balanced data (Figure 2).
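The dependence on the class distribution can be approximated without bootstrapping. The Python sketch below swaps the resampling experiment for a closed-form calculation: it assumes that the improved model keeps a fixed sensitivity (18/30) and specificity (248/270, derived from the quoted totals) on every test set and computes the expected kappa for each proportion of the “bad” class. The function name and this fixed-rate assumption are not part of the original study.

```python
def kappa_at_prevalence(prevalence, sensitivity, specificity):
    """Expected Cohen's kappa when the positive class has the given share."""
    p_0 = prevalence * sensitivity + (1 - prevalence) * specificity
    pred_pos = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
    p_e = prevalence * pred_pos + (1 - prevalence) * (1 - pred_pos)
    return (p_0 - p_e) / (1 - p_e)

sensitivity = 18 / 30     # "bad" customers detected by the improved model
specificity = 248 / 270   # "good" customers detected by the improved model

# Share of the "bad" class from 5% to 95%, keyed by percentage.
curve = {
    pct: kappa_at_prevalence(pct / 100, sensitivity, specificity)
    for pct in range(5, 100, 5)
}
for pct in (10, 50, 90):
    print(pct, round(curve[pct], 3))
```

In this simplified setting the same model scores a noticeably higher kappa at a 50% share of “bad” customers than at 10% or 90%. The exact position of the peak depends on the two rates, so the curve need not coincide with the bootstrapped averages in Figure 3.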

Cohen’s kappa says little about the expected prediction accuracy

The numerator of Cohen’s kappa, $p_0 - p_e$, tells the difference between the observed overall accuracy of the model and the overall accuracy that can be obtained by chance. The denominator of the formula, $1 - p_e$, tells the maximum value for this difference.

For a good model, the observed difference and the maximum difference are close to each other, and Cohen’s kappa is close to 1. For a random model, the overall accuracy is all due to random chance, the numerator is 0, and Cohen’s kappa is 0. Cohen’s kappa could also theoretically be negative. Then, the overall accuracy of the model would be even lower than what could have been obtained by a random guess.
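Two constructed confusion matrices illustrate the zero and negative cases. Both are hypothetical and use the same test set size as the study (30 “bad”, 270 “good”): a model that always predicts “good” and a model that flags 30 customers as “bad” but misses every actual “bad” customer.

```python
def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a binary confusion matrix ("bad" = positive class)."""
    total = tp + fn + fp + tn
    p_0 = (tp + tn) / total
    actual_pos = (tp + fn) / total
    pred_pos = (tp + fp) / total
    p_e = actual_pos * pred_pos + (1 - actual_pos) * (1 - pred_pos)
    return (p_0 - p_e) / (1 - p_e)

# Hypothetical model 1: always predicts "good". Accuracy is 90%, all of it due to chance.
always_good = cohens_kappa(tp=0, fn=30, fp=0, tn=270)

# Hypothetical model 2: flags 30 customers as "bad", none of them correctly. Accuracy is 80%.
worse_than_chance = cohens_kappa(tp=0, fn=30, fp=30, tn=240)

print(round(always_good, 3), round(worse_than_chance, 3))
```

The first model reaches 90% accuracy with a kappa of zero, and the second still reaches 80% accuracy with a kappa of about -0.11, which shows how differently the two metrics treat the majority class.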

Given the explanation above, Cohen’s kappa is not easy to interpret in terms of expected accuracy, and mapping its values to verbal categories is often not recommended. For example, if you have 100 customers and a model with an overall accuracy of 87%, then you can expect to predict the credit rating correctly for 87 customers. A Cohen’s kappa value of 0.244 doesn’t offer an interpretation as simple as this.

Summary

We’ve explained how to use and interpret Cohen’s kappa to evaluate the performance of a classification model. While Cohen’s kappa can correct the bias of overall accuracy when dealing with unbalanced data, it has a few shortcomings. So, the next time you take a look at the scoring metrics of your model, remember:

  1. Cohen’s kappa is more informative than overall accuracy when working with unbalanced data. Keep this in mind when you compare or optimize classification models.
  2. Take a look at the row and column totals in the confusion matrix. Are the distributions of the target/predicted classes similar? If not, the maximum reachable Cohen’s kappa value will be lower.
  3. The same model will give you lower values of Cohen’s kappa for unbalanced than for balanced test data.
  4. Cohen’s kappa says little about the expected accuracy of a single prediction.

The workflow used for this study is shown in Figure 4. In the workflow, we train, apply and evaluate two decision tree models that predict the creditworthiness of credit customers. In the top branch, we train the baseline model, while in the bottom branch, we train the model on the bootstrapped training set using the SMOTE technique. This workflow is downloadable from the Cohen’s Kappa for Evaluating Classification Models page on the KNIME Hub.

Figure 4: This KNIME workflow trains two decision trees to predict the credit score of customers. In the top branch, a baseline model is trained on the unbalanced training data (90% “good” vs. 10% “bad” class records). In the bottom branch, an improved model is trained on a new training dataset where the minority class has been oversampled (SMOTE). The workflow Cohen’s Kappa for Evaluating Classification Models is available on the KNIME Hub.
