Machine Learning at Scale: Getting Models to Production
For organizations, getting machine learning (ML) models to production and into the hands of customers has been a significant challenge. According to reports in VentureBeat and Redapt, almost 90% of the models that data scientists build never make it to production.
This is extremely unfortunate because machine learning initiatives are supposed to solve real-world problems. If a machine learning model fails to reach customers at scale, it fails to:
- Take a shot at the problem it was intended to solve.
- Get feedback about the accuracy of the solution and hence improve with time (the essence of ML).
There are several reasons why getting models to production is a challenging task. I will highlight and elaborate on the top four issues below.
Issue 1: Background and Training
Data science is an evolving field that draws candidates from diverse backgrounds. The tools used to train data scientists in boot camps and in academia to build models are often a poor fit for production use. It’s usually up to the individual to learn specific tools for tasks like containerization (e.g. Docker) or API creation (e.g. Flask). While these tools work well in a data scientist’s own environment, it’s not always clear to data scientists how to use them to deploy into production. Furthermore, parallelization and other performance-oriented coding often depend on complex system architectures, and data scientists do not necessarily learn these skills before they join industry roles.
Organizations have tried to solve this problem in two ways: 1) hire a full-stack data scientist, or 2) create a team that pairs data scientists with software engineers. Full-stack data scientists are unicorns who can take a data-driven concept from identification/ideation to prototype building and execution. They are also responsible for measuring the impact of their data science feature and continuously improving it over time, with a heavy emphasis on addressing challenges in the production environment. These are the senior data scientists who have gone through the journey and have increased their skill sets over time, learning new tools and encountering production challenges at different organizations.
“If you’re concerned about scalability, any algorithm that forces you to run agreement will eventually become your bottleneck. Take that as a given.”
— Werner Vogels, CTO Amazon
A hybrid approach seems to work best: a data science team with a few full-stack data scientists alongside a combination of data scientists and data engineers with software engineering skills. Even hiring one full-stack data scientist is a considerable challenge. An alternative might be upskilling data scientists with cloud platform (architect) certifications and getting teams focused on delivery. Not every data scientist enjoys software development-driven learning, but it is increasingly a necessary skill set, particularly if data scientists want to get their models running in production without having to depend on others.
Issue 2: Collaboration of Two Different Worlds
Getting a model into production requires data scientists and engineering teams to collaborate. Data scientists want to work on business problems that are innovative and challenging. Once they have built a prototype, they often want software engineers to take care of the next steps. Software engineers, on the other hand, are experts in building front ends, wiring models into production code, and dealing with continuous deployment and packaging issues. As a best practice, it’s incumbent on senior data scientists and engineers, or sometimes an SRE team, to set up the right frameworks to help data scientists deploy successfully in production, so that they don’t need to invent the entire process from whole cloth every time. Even so, data scientists and software engineers must collaborate during this transition to exchange expected payload information, model tuning details, accuracy measurements and other parameters.
Accordingly, automation plays a vital role. It is crucial to understand which parts of the deployment process are generic and can be automated, and which are specific to a particular data science feature. For example, in AWS, setting up a SageMaker environment to create an API handshake point is generic, but defining the payload is specific to the project. We have built accelerators, or automation, around these generic touchpoints, so that team collaboration is required only for customized actions.
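To make the generic versus project-specific split concrete, here is a minimal sketch in plain Python. It does not use SageMaker or any AWS API. The names PayloadSpec, make_handler, alert_spec and toy_predict are hypothetical and exist only for illustration. The idea is that the request-handling wrapper is written once, while each project supplies only its payload definition and its model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class PayloadSpec:
    """Project-specific part: the fields a request must carry and their types."""
    required_fields: Dict[str, type]

    def validate(self, payload: Dict[str, Any]) -> List[str]:
        """Return a list of problems; an empty list means the payload is valid."""
        errors = []
        for name, expected_type in self.required_fields.items():
            if name not in payload:
                errors.append(f"missing field: {name}")
            elif not isinstance(payload[name], expected_type):
                errors.append(
                    f"wrong type for {name}: expected {expected_type.__name__}"
                )
        return errors


def make_handler(
    spec: PayloadSpec, predict: Callable[[Dict[str, Any]], Any]
) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Generic part: validate the payload, call the model, shape the response."""
    def handler(payload: Dict[str, Any]) -> Dict[str, Any]:
        errors = spec.validate(payload)
        if errors:
            return {"status": 400, "errors": errors}
        return {"status": 200, "prediction": predict(payload)}
    return handler


# Hypothetical project: the payload fields and the toy model are assumptions.
alert_spec = PayloadSpec({"summary": str, "severity": int})


def toy_predict(payload: Dict[str, Any]) -> str:
    # Stand-in for a real model; the threshold is arbitrary.
    return "urgent" if payload["severity"] >= 3 else "routine"


handle = make_handler(alert_spec, toy_predict)
```

In this sketch, data scientists and engineers would only need to agree on alert_spec for each new feature. The handler itself is reused unchanged.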
Issue 3: Real-Time Performance and Accuracy Might Deviate
Real-time performance and accuracy might deviate substantially from offline testing. When data scientists create a model, they train it on historical data and check its accuracy on a subset of that same historical data. Hence, when organizations release the model to new, real-time customer data, even for a handful of customers, the results are not always as expected, and carefully monitored experiments are needed to ensure that the trained model is performant and accurate. To minimize uncertainty at this stage, data scientists will need to:
- Create products that are incrementally agile (can be improved in steps) and create metrics to measure performance before a product is released.
- Release data science products using a percentage-based rollout, or in phases — alpha, beta, etc., moving in a controlled incremental way.
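One common way to implement a percentage-based rollout is to hash each customer into a stable bucket and enable the feature only for buckets below the current percentage. The sketch below assumes string customer IDs. The function names and the "alert_grouping" feature key are hypothetical, and this is one possible approach rather than a description of any particular product’s rollout mechanism.

```python
import hashlib


def rollout_bucket(customer_id: str, feature: str) -> int:
    """Map a customer to a stable bucket in the range 0..99.

    The feature name is part of the hash input so that different features
    do not always select the same customers first.
    """
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(customer_id: str, feature: str, percent: int) -> bool:
    """Return True if the customer falls inside the current rollout percentage."""
    return rollout_bucket(customer_id, feature) < percent


# Example: widen the rollout in phases by raising the percentage.
customers = [f"customer-{i}" for i in range(1000)]
alpha = [c for c in customers if is_enabled(c, "alert_grouping", 5)]
beta = [c for c in customers if is_enabled(c, "alert_grouping", 25)]
```

Because the bucket depends only on the customer and the feature, raising the percentage adds customers without removing anyone already enabled. Every customer in alpha is also in beta.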
For example, in our AIOps data science feature Intelligent Alert Grouping, which is part of PagerDuty’s Event Intelligence (intelligent event management) product, the backend algorithm is built to observe actual alert data and incident history, and adapt as it sees new alerts. The model is based on reinforcement learning (automated learning) and hence the success rate is specific to the scale and nuances of customer data. Although it was easy building a prototype based on historical data, it was pretty complex to tune our algorithm once we released it to the first batch of early-access customers. In this phase of data science product development, we need data scientists to research and understand how to tune available model parameters such that accuracy does not change unless there is a significant change in the underlying data. Only then can the ML model improve over time so customers can realize the benefits of ML without constant intervention by data scientists.
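A simple way to watch for accuracy changes during an early-access phase is to compare accuracy over a rolling window of recent predictions against the offline baseline. The sketch below is a generic illustration and not the Intelligent Alert Grouping implementation. The class name, window size and tolerance are assumptions chosen for the example.

```python
from collections import deque
from typing import Any, Optional


class RollingAccuracyMonitor:
    """Track accuracy over the most recent predictions and flag large drops."""

    def __init__(self, baseline: float, window: int = 100, tolerance: float = 0.05):
        self.baseline = baseline      # accuracy measured offline on historical data
        self.tolerance = tolerance    # drop below baseline that we accept as noise
        self.outcomes = deque(maxlen=window)  # True for each correct prediction

    def record(self, predicted: Any, actual: Any) -> None:
        """Store whether one live prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> Optional[float]:
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self) -> bool:
        """True once a full window shows accuracy below baseline minus tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough live data to judge yet
        return self.accuracy() < self.baseline - self.tolerance


# Example: ten correct predictions, then three misses within a window of ten.
monitor = RollingAccuracyMonitor(baseline=0.9, window=10, tolerance=0.05)
for _ in range(10):
    monitor.record("group-a", "group-a")
healthy = monitor.needs_review()
for _ in range(3):
    monitor.record("group-a", "group-b")
degraded = monitor.needs_review()
```

A check like this only signals that something changed. Deciding whether the cause is a shift in the underlying data or a poorly tuned parameter still requires investigation by a data scientist.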
Issue 4: Ever-Changing Data Requirements and Solutions
The Big Data and machine learning landscape is changing continuously. At the beginning of the Big Data revolution, it was mainly the volume of data that was critical, while now it is also velocity and variety, so tools data scientists use in production must adapt. For example, relational databases have dominated the landscape as a commercial solution for data storage and data manipulation for decades. However, today we need real-time data streaming, persistent data stores and fast-access data stores to handle structured and unstructured data.
To solve this, organizations need to codify the access patterns for different categories of data (transactional, non-transactional, etc.) and invest in more than one type of data store that is fit-to-purpose, rather than a one-size-fits-all choice.
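Codifying access patterns can start as a small, explicit mapping that teams review together. The sketch below is illustrative only: the pattern names, the dataset names and the store categories assigned to them are assumptions, and a real organization would substitute its own categories and concrete products.

```python
from enum import Enum


class AccessPattern(Enum):
    """Hypothetical categories of data access."""
    TRANSACTIONAL = "transactional"
    STREAMING = "streaming"
    FAST_LOOKUP = "fast_lookup"
    LONG_TERM = "long_term"


# One fit-to-purpose store category per access pattern (assumed pairing).
STORE_FOR_PATTERN = {
    AccessPattern.TRANSACTIONAL: "relational database",
    AccessPattern.STREAMING: "real-time streaming platform",
    AccessPattern.FAST_LOOKUP: "fast-access data store",
    AccessPattern.LONG_TERM: "persistent data store",
}

# Hypothetical datasets, each tagged with how it is accessed.
DATASET_PATTERNS = {
    "billing_records": AccessPattern.TRANSACTIONAL,
    "incoming_alerts": AccessPattern.STREAMING,
    "model_features": AccessPattern.FAST_LOOKUP,
    "incident_history": AccessPattern.LONG_TERM,
}


def store_for(dataset: str) -> str:
    """Look up the store category for a dataset, failing loudly if uncodified."""
    try:
        pattern = DATASET_PATTERNS[dataset]
    except KeyError:
        raise ValueError(
            f"no access pattern codified for dataset: {dataset}"
        ) from None
    return STORE_FOR_PATTERN[pattern]
```

Raising an error for an unknown dataset is a deliberate choice in this sketch: it forces a decision about a new dataset’s access pattern instead of silently defaulting to one store.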
The pursuit of machine learning and AI-driven decisions can sometimes lead business leaders to unrealistic expectations: that machine learning and artificial intelligence can transform their business instantaneously, or within a brief period, without adequate investment. Data scientists, engineers and business leaders need to have a healthy discussion about:
- What’s possible, how long it will take and the associated risks.
- What investments are required so that data scientists and engineers can deploy models to production at scale.
Successful deployments of machine learning/AI technology are, at their heart, cross-functional. Teams with different skill sets must be encouraged to collaborate, automate generic work and take well-calculated, real-time risks to drive machine learning model results. Doing so will also surface organization-specific nuances that various teams, not just data science, will have to adapt to. We hope you have been able to learn some lessons from our experience of successfully putting machine learning models in production at scale.