Numeric Scoring Metrics: Find the Right Metric for a Prediction Model
Quantitative data have endless stories to tell!
Daily closing prices tell us about the dynamics of the stock market, smart meters about the energy consumption of households, smartwatches about what’s going on in the human body during exercise, and surveys about people’s self-assessment on a topic at a given point in time. Different types of experts can tell these stories: financial analysts, data scientists, sports scientists, sociologists, psychologists and so on. Their stories are based on models, for example, regression models, time series models and ANOVA models.
Why Are Numeric Scoring Metrics Needed?
These models have many consequences in the real world, from the decisions of the portfolio managers to the pricing of electricity at different times of the day, week and year. Numeric scoring metrics are needed in order to:
- Select the most accurate model
- Estimate the real-world impact of the error of the model
In this article, we describe five real-world use cases of numeric prediction models, and in each use case, we measure the prediction accuracy from a slightly different point of view. In one case, we measure whether a model has a systematic bias, and in another, we measure a model’s explanatory power. The article concludes with a review of the numeric scoring metrics, showing the formulas to calculate them, and a summary of their properties. We’ll also link to a few example implementations of building and evaluating a prediction model in KNIME Analytics Platform.
Five Metrics: Five Different Perspectives on Prediction Accuracy
(Root) Mean Squared Error, (R)MSE – Which model best captures the rapid changes in the volatile stock market?
In Figure 1, below, you see the development of the LinkedIn closing price from 2011 to 2016. Within the time period, the behavior includes sudden peaks, sudden lows, longer periods of increasing and decreasing value, and a few stable periods. Forecasting this kind of volatile behavior is challenging, especially in the long term. However, for the stakeholders of LinkedIn, it’s valuable. Therefore, we prefer a forecasting model that captures the sudden changes over one that merely performs well on average over the five-year period.
We select the model with the lowest (root) mean squared error because this metric weights large errors more heavily than small ones. It therefore favors a model that can react to short-term changes and save the stakeholders’ money.
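The effect of squaring can be illustrated with a minimal Python sketch. The prices and the two models below are made-up values for illustration only, not the LinkedIn data from Figure 1. Both hypothetical models have the same total absolute error, but the one that misses a single spike by a wide margin gets a much higher (R)MSE.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, expressed in the unit of the target."""
    return math.sqrt(mean_squared_error(actual, predicted))


# Hypothetical closing prices with one sudden peak on day 4.
actual = [100.0, 102.0, 101.0, 120.0, 103.0]

# Model A is off by 1 on every day (total absolute error: 5).
model_a = [99.0, 101.0, 100.0, 119.0, 102.0]

# Model B is exact except that it misses the peak by 5 (total absolute error: 5).
model_b = [100.0, 102.0, 101.0, 115.0, 103.0]

for name, predicted in [("A", model_a), ("B", model_b)]:
    mse = mean_squared_error(actual, predicted)
    rmse = root_mean_squared_error(actual, predicted)
    print(f"Model {name}: MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```

Under these assumed numbers, model A has an MSE of 1 and model B an MSE of 5, so selecting by (R)MSE favors the model that stays close during the sudden peak.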
Mean Absolute Error, MAE – Which model best estimates the energy consumption in the long term?
In Figure 2, you can see the hourly energy consumption values in July 2009 in Dublin, collected from a cluster of households and industries. The energy consumption shows a relatively regular pattern, with higher values during working hours and on weekdays and lower values at night and during weekends. This kind of regular behavior can be forecast relatively accurately, allowing for long-term planning of the energy supply. Therefore, we select the forecasting model with the lowest mean absolute error. This metric weights every error in proportion to its size, so it is less sensitive to outliers than the squared-error metrics, and it shows which model has the highest forecast accuracy over the whole time period.
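The robustness to outliers can be illustrated with a short Python sketch. The hourly values are hypothetical and not taken from the Dublin data set. A single unusual reading is added to the actual values, and the change in MAE is compared with the change in RMSE.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """RMSE, included only for comparison with MAE."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical hourly energy consumption and a forecast that is off by 1 each hour.
predicted = [11.0, 11.0, 14.0, 15.0, 12.0]
actual_regular = [10.0, 12.0, 15.0, 14.0, 11.0]

# The same series with one outlier in the last hour (31 instead of 11).
actual_outlier = [10.0, 12.0, 15.0, 14.0, 31.0]

for label, actual in [("regular", actual_regular), ("with outlier", actual_outlier)]:
    mae = mean_absolute_error(actual, predicted)
    rmse = root_mean_squared_error(actual, predicted)
    print(f"{label}: MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

With these assumed values, the single outlier raises the MAE from 1.0 to 4.6, whereas the RMSE rises from 1.0 to about 8.54, which shows why MAE is the less outlier-sensitive choice of the two.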
Mean Absolute Percentage Error, MAPE – Are the sales forecasting models for different products equally accurate?
On a hot summer day, the supply of both sparkling water and ice cream should be guaranteed! We want to check if the two forecasting models that predict the sales of these two products are equally accurate.
Both models generate forecasts in the same unit, the number of sold items, but at a different scale, since sparkling water is sold in much larger volumes than ice cream. In such a case, we need a relative error metric and use the mean absolute percentage error, which reports the error relative to the actual value. In Figure 3, in the line plot on the left, you see the sales of sparkling water (purple line) and the sales of ice cream (green line) in June 2020 as well as the predicted sales of both products (red lines). The prediction line seems to deviate slightly more for sparkling water than for ice cream. However, the larger actual values of sparkling water distort the visual comparison. In fact, the forecasting model performs better for sparkling water than for ice cream, as reported by the MAPE values of 0.191 for sparkling water and 0.369 for ice cream.
Notice, though, that MAPE values can be biased when the actual values are close to zero. For example, the sales of ice cream are relatively low during the winter months compared to summer months, whereas sales of milk remain pretty constant through the entire year. When we compare the accuracies of the forecasting models for milk vs. ice cream by their MAPE values, the small values in the ice cream sales make the forecasting model for ice cream look unreasonably bad compared to the forecasting model for milk.
In Figure 3, in the line plot in the middle, you see the sales of milk (blue line) and ice cream (green line) and the predicted sales of both products (red lines). If we take a look at the MAPE values, the forecasting accuracy is apparently much better for milk (MAPE = 0.016) than for ice cream (MAPE = 0.266). However, this huge difference is due to the low values of ice cream sales in the winter months. The line plot on the right in Figure 3 shows exactly the same actual and predicted sales of ice cream and milk, with the ice cream sales shifted up by 25 items for each month. Without the bias from the values close to zero, the forecasting accuracies for ice cream (MAPE = 0.036) and milk (MAPE = 0.016) are now much closer to each other.
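The near-zero effect described above can be reproduced with a small Python sketch. The sales numbers are invented for illustration and are not the values behind Figure 3. The forecast is off by exactly one item in every month, yet the MAPE changes considerably once all values are shifted up by 25 items.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error relative to the actual value (as a fraction, not in %)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical monthly ice cream sales: two low winter months, two summer months.
actual = [2.0, 4.0, 40.0, 50.0]
predicted = [3.0, 5.0, 41.0, 51.0]  # off by one item in every month

# Shift actual and predicted values by the same constant, as in the right-hand plot.
SHIFT = 25.0
shifted_actual = [a + SHIFT for a in actual]
shifted_predicted = [p + SHIFT for p in predicted]

mape_raw = mean_absolute_percentage_error(actual, predicted)
mape_shifted = mean_absolute_percentage_error(shifted_actual, shifted_predicted)

print(f"MAPE on raw values:     {mape_raw:.3f}")      # about 0.199
print(f"MAPE on shifted values: {mape_shifted:.3f}")  # about 0.025
```

The absolute errors are identical in both cases. Only the denominators change, which is why MAPE values should be compared with care whenever the actual values come close to zero.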
Mean Signed Difference – Does a running app provide unrealistic expectations?
A smartwatch can be connected to a running application which then estimates the finishing time in a 10k run. It could be that, as a motivator, the app estimates the time lower than what’s realistically expected.
To test this, we collect the estimated and realized finishing times from a group of runners for six months and plot the average values in the line plot in Figure 4. As you can see, during the six months, the realized finishing time (orange line) decreases more slowly than the estimated finishing time (red line). We confirm the systematic bias in the estimates by calculating the mean signed difference, that is, the average of the estimated minus the actual finishing times. It’s negative (-2.191), which means that the estimates are on average lower than the realized times, so the app indeed raises unrealistic expectations! Notice, though, that this metric is not informative about the magnitude of the error: if a runner actually runs faster than the estimated time, this positive error offsets part of the negative errors.
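A minimal Python sketch of this check follows. It assumes the sign convention "estimated minus actual", which matches the negative value reported above, and it uses made-up finishing times in minutes. The mean absolute error is printed alongside to show how errors of opposite sign cancel in the signed metric.

```python
def mean_signed_difference(actual, predicted):
    """Average of (predicted - actual); negative means the predictions are too low."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average error magnitude, shown for comparison only."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical 10k finishing times in minutes for four runners.
estimated = [50.0, 48.0, 55.0, 60.0]  # what the app predicted
realized = [53.0, 50.0, 54.0, 64.0]   # what the runners achieved

msd = mean_signed_difference(realized, estimated)
mae = mean_absolute_error(realized, estimated)

print(f"Mean signed difference: {msd:.2f} min")
print(f"Mean absolute error:    {mae:.2f} min")
```

In this invented sample, the third runner beats the estimate by one minute, so the mean signed difference (-2.0) is smaller in magnitude than the mean absolute error (2.5). The sign reveals the direction of the bias, but not the full size of the error.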
R-squared – How much of our years of education can be explained through access to literature?
In Figure 5, you can see the relationship between the access to literature (x-axis) and years of education (y-axis) in a sample of the population. A linear regression line is fitted to the data to model the relationship between these two variables. To measure the fit of the linear regression model, we use R-squared.
R-squared tells how much of the variance of the target column (years of education) the model explains. Based on the R-squared value of the model, 0.76, access to literature explains 76% of the variance in the years of education.
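How such a value comes about can be sketched in a few lines of Python. The data points are hypothetical and unrelated to Figure 5, and the least-squares fit is written out by hand so that the example needs no external library.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def r_squared(actual, predicted):
    """1 minus the ratio of residual variation to total variation of the target."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical sample: x is the predictor, y is the target.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

slope, intercept = fit_simple_linear_regression(x, y)
fitted = [slope * xi + intercept for xi in x]

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"R-squared = {r_squared(y, fitted):.2f}")
```

For this toy sample, the fitted line explains about 60% of the variance of the target. A model that always predicted the mean of the target would score 0, and a perfect fit would score 1.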
A Review of the Five Numeric Scoring Metrics
The numeric scoring metrics introduced above are shown in Figure 6. The metrics are listed along with the formulas used to calculate them and a few key properties of each. In the formulas, y_i is the actual value and f(x_i) is the predicted value.
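For reference, the standard textbook definitions of these metrics can be written as follows. The symbols n (the number of observations) and \bar{y} (the mean of the actual values) are introduced here for the formulas. The sign convention of the mean signed difference is assumed to be predicted minus actual, in line with the running app example above.

```latex
\begin{align*}
\mathrm{MSE}  &= \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2} \\
\mathrm{MAE}  &= \frac{1}{n} \sum_{i=1}^{n} \bigl|y_i - f(x_i)\bigr| \\
\mathrm{MAPE} &= \frac{1}{n} \sum_{i=1}^{n} \frac{\bigl|y_i - f(x_i)\bigr|}{|y_i|} \\
\mathrm{MSD}  &= \frac{1}{n} \sum_{i=1}^{n} \bigl(f(x_i) - y_i\bigr) \\
R^2           &= 1 - \frac{\sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2}{\sum_{i=1}^{n} \bigl(y_i - \bar{y}\bigr)^2},
\qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
\end{align*}
```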
In this article, we’ve introduced the most commonly used numeric scoring metrics and the perspectives that they offer on a model’s performance.
It’s often recommended to take a look at multiple numeric scoring metrics to gain a comprehensive view of the model’s performance. For example, by reviewing the mean signed difference, you can see if your model has a systematic bias, whereas by studying the (root) mean squared error, you can see which model best captures the sudden fluctuations. Visualizations, such as a line plot, complement the model evaluation.
For a practical implementation, take a look at the example workflows built in the visual data science tool KNIME Analytics Platform.
Download and inspect these free workflows from the KNIME Hub:
Feature image via Pixabay.