Snowflake sponsored this post.
Every business wants to be more data-driven, and artificial intelligence technologies are an important way to augment traditional statistical analysis. But don’t let the hype around AI persuade you to force it onto problems where it’s not needed. That will only lead to poor outcomes and wasted resources, and it will undermine future initiatives. This article will help you decide when and how to apply AI technologies to maximize their business value.
It’s helpful to step back and consider why AI is so valuable. Traditional approaches limit the size of the data sets you can work with. When you put machine learning behind a problem, you are applying engineering that abstracts away the underlying compute, which lets you work with much larger data sets than traditional approaches allow.
“Traditional” in this sense isn’t a euphemism for old-fashioned or legacy. A lot of what we do with machine learning and deep learning is still grounded in traditional approaches; it simply allows us to apply those approaches at a much larger scale. That means you should bring the same critical thinking you would use for traditional analysis when you scale it up to a machine learning approach.
With this in mind, what are some considerations when thinking about when and how to apply machine learning and deep learning to augment traditional statistical analysis?
The Size of the Data Set
As noted above, the most important consideration is the size of the data. Beyond a certain number of rows and columns, traditional statistical methods no longer give us the capacity to identify all the correlations and understand what’s interesting or important. AI methods augment traditional approaches, as well as our own capacity as humans, by allowing us to analyze more data and identify insights or relationships that would otherwise remain hidden. To give one example, machine learning might allow us to build customer segments based on relationships between hundreds of variables rather than just a few.
As a side benefit, machine learning algorithms can sometimes reveal that the available data sets do not provide enough information for accurate predictions. In this sense, they provide a way to identify gaps in data strategy and to decide whether new variables should be captured, perhaps from third-party data or from partners, to improve a model’s predictive power.
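The segmentation example can be made concrete with a small sketch. The code below is a minimal, standard-library k-means written for illustration only; the customer data is synthetic, and the names `k_means`, `synthetic_customer` and `N_VARIABLES` are assumptions of this sketch, not part of any tool mentioned in the article. It groups 60 made-up customers described by 100 variables each, using a simple farthest-point initialization.

```python
import random
from math import dist
from statistics import mean


def k_means(points, k, iterations=10):
    """Cluster points (tuples of floats) into k groups; return one label per point."""
    # Farthest-point initialization: start from the first point, then keep
    # adding the point that is farthest from every centroid chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(mean(column) for column in zip(*members))
    return labels


# Synthetic data (made up for illustration): two customer groups whose
# 100 variables are centered on different values.
rng = random.Random(7)
N_VARIABLES = 100


def synthetic_customer(center):
    return tuple(rng.gauss(center, 1.0) for _ in range(N_VARIABLES))


customers = [synthetic_customer(0.0) for _ in range(30)] + [
    synthetic_customer(5.0) for _ in range(30)
]

segments = k_means(customers, k=2)
print("segment sizes:", [segments.count(s) for s in range(2)])
```

Real customer data is rarely this cleanly separated. In practice the variables would need scaling, and the number of segments would have to be chosen and validated with domain experts.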
But working with larger data sets also opens the door to more algorithms, which these days are only a click away. That can be a problem, because we must still think critically and be selective about the algorithms and frameworks we choose.
Explainability vs. Performance
Another consideration is the level of explainability required for the problem you’re solving. If you’re using deep learning to control a self-driving car, you don’t need to understand the impact of every variable on the output (sometimes it is not even possible, if the model of interaction is unknown or computationally intractable). You only need a high degree of confidence that it is safe. But if you’re predicting whether a piece of machinery will fail, you need to understand the impact of each variable — temperature, humidity, rotation rate of the machine, etc. — otherwise the output isn’t useful because it doesn’t allow you to take corrective action.
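One common way to estimate the impact of each variable is permutation importance: shuffle one input column at a time and measure how much the model’s accuracy drops. The sketch below illustrates the idea with the standard library only. The sensor readings are synthetic, and `predict_failure` is a hypothetical stand-in for a trained model, with thresholds invented for this example.

```python
import random


def permutation_importance(predict, rows, labels, feature_names, seed=0):
    """Return the accuracy drop caused by shuffling each feature column."""
    rng = random.Random(seed)

    def accuracy(data):
        hits = sum(predict(row) == label for row, label in zip(data, labels))
        return hits / len(labels)

    baseline = accuracy(rows)
    importance = {}
    for name in feature_names:
        # Shuffle one column while leaving every other column untouched.
        shuffled = [row[name] for row in rows]
        rng.shuffle(shuffled)
        permuted = [{**row, name: value} for row, value in zip(rows, shuffled)]
        importance[name] = baseline - accuracy(permuted)
    return importance


# Synthetic readings (made up for illustration).
data_rng = random.Random(42)
rows = [
    {
        "temperature": data_rng.uniform(40, 120),
        "humidity": data_rng.uniform(10, 90),
        "rotation_rate": data_rng.uniform(500, 3000),
    }
    for _ in range(500)
]


def predict_failure(reading):
    # Hypothetical stand-in for a trained model: it looks at temperature
    # and rotation rate and ignores humidity entirely.
    return reading["temperature"] > 90 and reading["rotation_rate"] > 1500


# For this sketch the labels are generated by the same rule, so the
# baseline accuracy is perfect and any drop comes from the shuffling.
labels = [predict_failure(row) for row in rows]

importance = permutation_importance(
    predict_failure, rows, labels, ["temperature", "humidity", "rotation_rate"]
)
for name, drop in sorted(importance.items(), key=lambda item: -item[1]):
    print(f"{name:>14}: accuracy drop {drop:.3f}")
```

A variable the model ignores, humidity in this sketch, shows no drop at all, which signals that adjusting it would not change the prediction. Variables with a large drop are the ones worth monitoring or correcting.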
Explainability is important in other contexts, too. If a model produces results that don’t align with intuition, then the model you spent time building may not get used. For instance, if a model predicts that a certain change will increase sales by 40%, but you can’t explain why that is, people may reject it because they don’t trust the result.
A further consideration is the ethical use of AI. There’s been much discussion about how bias can creep into the decisions made by computers. If the training data used for supervised learning is biased, for example, then the output is likely to reflect that bias. Explainable AI seeks to address this by ensuring that humans understand how the system arrived at a particular output. Again, the use case matters. If the model is recommending prison sentences for convicted offenders, ensuring a bias-free output is far more critical than if it is recommending which movie to stream next.
Selecting the Right Algorithm
Different algorithms have different advantages and disadvantages. To take a simple example, in forecasting you might choose between ARIMA, ARMA, Prophet, LSTM or something else. Which of these is best supported by the broader community? What are the feature differences across them? What are the limitations? And what biases might they inherently introduce? These are nuanced considerations that are not equally weighted and will depend on the business or analytical problem at hand. And that’s just for time-series forecasting. Working with domain experts and carefully considering the use case will be critical to choosing the most appropriate algorithm.
Mapping domain knowledge to the problem space is also an important factor when selecting the right algorithms. For example, you may find that there are multiple algorithms to handle classification. But classification is a diverse area, and what constitutes the right algorithm may vary depending on the nuance of an industry or use case. As a result, domain knowledge is important in helping to identify the most appropriate algorithm for the analytical question at hand. Conversely, not applying domain knowledge to select the appropriate algorithms can lead to meaningless results, even if the mechanics of the machine learning workflow have been executed well.
For most organizations, the true value is in the data and not in tuning the algorithm, so practitioners can accelerate their use of machine learning with AutoML tools. These tools allow you to test 20 or 30 algorithms against the same data and compare the performance of those models. That allows for a level of iteration and visibility we didn’t previously have. It’s a powerful way to test classic algorithms in a much more expansive way.
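The core loop behind this kind of comparison can be sketched in a few lines. The example below is an illustration, not how any particular AutoML product works: it scores four deliberately simple forecasting baselines (not ARIMA, Prophet or LSTM) on the same made-up sales series using walk-forward validation, then ranks them by mean absolute error. All function names and the data are assumptions of this sketch.

```python
from statistics import mean


def naive_last(history):
    # Forecast the next value as the most recent observation.
    return history[-1]


def historical_mean(history):
    # Forecast the next value as the mean of everything seen so far.
    return mean(history)


def moving_average_3(history):
    # Forecast the next value as the mean of the last three observations.
    return mean(history[-3:])


def drift(history):
    # Extend the average step between the first and last observation.
    step = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + step


def walk_forward_mae(forecast, series, min_history=4):
    """Mean absolute error of one-step-ahead forecasts over the series."""
    errors = [
        abs(forecast(series[:t]) - series[t])
        for t in range(min_history, len(series))
    ]
    return mean(errors)


# Made-up monthly sales with an upward trend (for illustration only).
series = [100, 103, 109, 112, 115, 121, 124, 127, 133, 136, 139, 145]

candidates = {
    "naive_last": naive_last,
    "historical_mean": historical_mean,
    "moving_average_3": moving_average_3,
    "drift": drift,
}

# Every candidate sees the same data and is scored the same way.
leaderboard = sorted(
    ((name, walk_forward_mae(fn, series)) for name, fn in candidates.items()),
    key=lambda item: item[1],
)
for name, mae in leaderboard:
    print(f"{name:>17}: MAE {mae:.2f}")
```

Because every candidate is evaluated on identical data with an identical metric, the resulting ranking is directly comparable, which is the property that makes this style of iteration useful.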
But there’s a trade-off here too. If five models produce similarly high performance scores, how do you decide which one is preferable? One option is to conduct A/B tests using a subset of the data to gather empirical evidence about which model truly performs best. This is a good general MLOps practice, both when a model is first created and whenever the model or its data is updated.
If two or more models seem to perform equally well, one solution is to use explainability as a tie-breaker. That’s because, in the long run, understanding feature impact and importance in the context of a model will be more useful for maintaining the model, updating it, selecting more training data and addressing model drift.
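The tie-breaker rule can be written down explicitly. In the sketch below, models whose score falls within a chosen tolerance of the best score are treated as tied, and the most explainable of them wins. The `Candidate` fields, the 1-to-5 explainability rating, the tolerance and all the scores are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float         # higher is better, e.g. validation accuracy
    explainability: int  # hypothetical 1-5 rating assigned by the team


def pick_model(candidates, tolerance=0.01):
    """Among models within `tolerance` of the best score, prefer explainability."""
    best_score = max(c.score for c in candidates)
    finalists = [c for c in candidates if best_score - c.score <= tolerance]
    # Explainability decides first; the raw score only breaks remaining ties.
    return max(finalists, key=lambda c: (c.explainability, c.score))


# Made-up results for illustration.
candidates = [
    Candidate("deep_neural_net", 0.912, 1),
    Candidate("gradient_boosting", 0.910, 3),
    Candidate("logistic_regression", 0.905, 5),
    Candidate("decision_tree", 0.861, 5),
]

chosen = pick_model(candidates)
print("selected:", chosen.name)
```

The tolerance encodes what “perform equally well” means for a given use case. Setting it to zero reduces the rule to picking the top score, while a wider tolerance gives explainability more weight.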
Applying AI techniques to data has greatly expanded the realm of what’s possible and how data can be leveraged to unlock business value. But there are many nuances and considerations to how and when AI models should be used. This starts early in the process. In many ways, tuning model parameters has become a commodity, and performance is driven by data selection and preparation. The cloud has given data practitioners immense power to leverage AI, but it needs to be applied thoughtfully and deliberately to reap the incredible benefits it can produce.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Velocity.