How Kubernetes Could Orchestrate Machine Learning Pipelines
As a scalable orchestration platform, Kubernetes is proving a good match for machine learning deployment — in the cloud or on your own infrastructure.
The cloud is an increasingly attractive location for machine learning and data science because of the economics of scaling out on demand when training a model or serving results from the trained model, so data scientists aren’t wasting time waiting for long training runs to complete. Ovum has been predicting that in 2019 half of all new big data workloads would run in the cloud, and in a recent survey some 45 percent of organizations said they were running at least one big data workload in the cloud.
That can mean cloud machine learning platforms like Azure Machine Learning Studio, Amazon SageMaker and Google Cloud AutoML that offer built-in data preparation tools and algorithms, or cloud versions of existing tools like Databricks (for running Spark workloads on Azure or AWS) or the upcoming Cloudera Machine Learning service, a version of Cloudera Data Science Workbench that will run on public cloud Kubernetes services.
Orchestrating Machine Learning
The reason Hadoop and Spark have been so popular for data science (and following that, for machine learning) is that they use clusters and parallel processing to speed up the parallelizable parts of data processing pipelines. They’re dedicated software stacks where clusters are managed with the project’s own cluster management solution, like Apache YARN or Mesos Marathon.
But as Kubernetes has become popular as an orchestrator for building scalable distributed systems, it is starting to look attractive for machine learning too. It offers the flexibility data scientists want to use their choice of machine learning libraries and frameworks, the scalability and repeatability that the team running machine learning systems in production needs, and the control over resource allocation (including GPUs for fast training and inferencing) that the operations team requires. Those are the problems Kubernetes already solves for other workloads, and now it’s being applied to machine learning and data science.
Instead of separate data science and deployment paths, where data scientists build experiments with one set of tools and infrastructure and development teams recreate the model in a production system with different tools on different infrastructure, teams can have a combined pipeline. Data scientists can use Kubeflow (or environments built on Kubeflow, like Intel’s open source Nauta) to train and scale models built in frameworks like PyTorch and TensorFlow on Kubernetes without having to be infrastructure experts.
Instead of giving everyone their own infrastructure, with expensive GPU systems tucked under the desk, multiple users can share the same infrastructure with Kubernetes namespaces used to logically isolate the cluster resources for each team. “Distributed training can make the cycle of training much shorter,” explained Lachlan Evenson, from Microsoft’s Azure Containers team. “You want a trained model with a certain level of accuracy and data scientists are changing the model until they get the accuracy they want but with large data sets it takes a long time to train and if they don’t have the infrastructure to scale that out, they’re sitting around waiting for that to complete.”
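As a rough sketch of what that kind of isolation can look like, the snippet below builds a Namespace and a ResourceQuota manifest as plain Python dictionaries and prints them as JSON, which kubectl accepts alongside YAML. The team name, the GPU count and the `nvidia.com/gpu` resource name are illustrative assumptions (the resource name depends on the device plugin installed in the cluster), not details of any deployment described here.

```python
import json


def team_namespace_manifests(team, gpu_limit):
    """Build a Namespace and a GPU ResourceQuota for one team.

    `team` and `gpu_limit` are hypothetical inputs used for illustration.
    """
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": team},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{team}-gpu-quota", "namespace": team},
        "spec": {
            # Quotas on extended resources such as GPUs use the "requests." prefix.
            "hard": {"requests.nvidia.com/gpu": str(gpu_limit)},
        },
    }
    return [namespace, quota]


def to_kubectl_json(manifests):
    # A v1 List wraps several objects so they can be applied in one call.
    wrapper = {"apiVersion": "v1", "kind": "List", "items": manifests}
    return json.dumps(wrapper, indent=2)


if __name__ == "__main__":
    print(to_kubectl_json(team_namespace_manifests("vision-team", 4)))
```

Saved to a file, the output could be applied with `kubectl apply -f`. Pods created in that namespace would then be unable to request more than four GPUs in total.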
“In recent years, the price of both storage and compute resources has decreased significantly and GPUs have become more available; that combined with Kubernetes makes machine learning at scale not only possible but cost-effective,” said Thaise Skogstad, director of product marketing at Anaconda. “Platforms like Anaconda Enterprise combine the core ML technologies needed by the data scientists, the governance demanded by IT departments, and the cloud native infrastructure that makes running ML at scale possible.”
Once trained, the model can be served on the same infrastructure, with automatic scaling and load balancing; NVIDIA’s TensorRT Inference Server uses Kubernetes for deployment of TensorRT, TensorFlow or ONNX models. There’s the option of bursting up to a cloud Kubernetes service for training or inferencing when you need more resources than your own infrastructure provides. OpenAI uses a mix of Azure and local Kubernetes infrastructure in a hybrid model with a batch-optimized autoscaler.
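One common mechanism for that kind of automatic scaling is the Kubernetes Horizontal Pod Autoscaler, although nothing here says which autoscaler any of these systems actually use. Its documented core rule scales the replica count by the ratio of the observed metric to the target metric and rounds up. The sketch below implements only that rule plus min/max clamping; it deliberately leaves out the tolerance band and stabilization behavior of the real controller, and the default bounds are arbitrary.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Simplified Horizontal Pod Autoscaler rule.

    desired = ceil(current_replicas * current_metric / target_metric),
    clamped to [min_replicas, max_replicas].
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Four inference pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90.0, 60.0))
```

With four pods running at 90 percent against a 60 percent target, the rule asks for six pods; at 30 percent it would shrink the deployment to two.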
Machine learning developers use a wide range of frameworks and libraries; they want to get the latest versions of the tools to work with, but they might also need to use one very specific older version on a particular project, and that version needs to be available in every environment. And as you move from development to deployment, you can end up with different versions of the same model running in different environments. That causes problems for reproducibility as well as orchestration and scalability, especially if it’s complicated to update to a new model or revert to an older one if an experiment wasn’t successful.
Without reproducibility, it’s hard to trace whether a problem is caused by the pipeline or the model. But if you can reliably deploy your model and its data pipeline into production, packaging them as microservices that expose an event-driven API other systems can call, it’s easier to make components modular so they can be reused, or to abstract services so you can support multiple tools and libraries.
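To make the idea of a model packaged as a callable service concrete, here is a minimal sketch using only the Python standard library. It is a plain request/response HTTP endpoint rather than the event-driven API described above, the "model" is a placeholder that averages its inputs, and the names `MODEL_VERSION`, `predict`, `handle_request` and `serve` are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL_VERSION = "v1"  # hypothetical tag, e.g. baked into the container image


def predict(features):
    # Placeholder model: the mean of the inputs stands in for real inference.
    return sum(features) / len(features)


def handle_request(body):
    """Turn a JSON request body into an (HTTP status, JSON bytes) pair."""
    try:
        features = json.loads(body)["features"]
        score = predict([float(x) for x in features])
    except (ValueError, KeyError, TypeError, ZeroDivisionError):
        error = {"error": 'expected {"features": [numbers]}'}
        return 400, json.dumps(error).encode()
    result = {"model_version": MODEL_VERSION, "score": score}
    return 200, json.dumps(result).encode()


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    # Blocks forever; in a container this would be the entry point.
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Returning the model version with every response is one simple way to tell which version of a model answered a given request, which helps when different versions end up running in different environments.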
“We’re seeing a big movement towards thinking of individual models or sub-models deployed as a service as opposed to a complex monolith running all in one environment and more complex ensemble models could be calling those services and combining those results,” said Streamlio marketing vice president Jon Bock.
Bringing together different languages, libraries, databases and infrastructure in a microservices model needs a fabric that provides reliable messaging, deployment and orchestration. The team running that model in production will also need to orchestrate the production environment and allocate resources to different models and services, with demands that might change seasonally or even throughout the day.
This is an emerging trend; in our 2017 Kubernetes User Experience Survey, 23 percent of respondents were running big data and analytics on Kubernetes, and in Heptio’s 2018 report on The State of Kubernetes that rises to 53 percent running data analytics and 31 percent running machine learning. Bloomberg is building a machine learning platform for its analysts on Kubernetes. And when Microsoft wanted to deliver its real-time text-to-speech API fast enough for chatbots and virtual assistants to use it in live conversations, it hosted the API on the Azure Kubernetes Service.
Using Kubernetes for machine learning might not mean changing your pipeline as much as you think. You can already run Spark on Kubernetes using the native Kubernetes scheduler added in Spark 2.3 (which is what Cloudera is using for its new cloud service). The scheduler is still experimental but Spark 2.4 adds support for Python and R Spark applications on Kubernetes, and interactive client applications like Jupyter and Apache Zeppelin notebooks that give developers reproducible sandbox environments can run their computations on Kubernetes. The Google Cloud Platform (GCP) already has a (beta) Kubernetes Operator for Spark to manage and monitor the lifecycle of Spark applications on Kubernetes.
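For a sense of how little changes on the Spark side, the sketch below assembles a cluster-mode `spark-submit` command that targets a Kubernetes API server, using options from Spark’s "Running on Kubernetes" documentation. The API server address, container image and application path are placeholders, and the helper name `spark_submit_command` is hypothetical.

```python
import shlex


def spark_submit_command(api_server, image, app, executors=2,
                         name="spark-on-k8s-demo"):
    """Assemble a cluster-mode spark-submit argument list for Kubernetes."""
    return [
        "spark-submit",
        # The k8s:// prefix selects the native Kubernetes scheduler backend.
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        # A local:// URI refers to a file already inside the container image.
        app,
    ]


if __name__ == "__main__":
    args = spark_submit_command(
        api_server="https://kubernetes.example.com:6443",  # placeholder
        image="registry.example.com/spark-py:2.4.0",       # placeholder
        app="local:///opt/spark/examples/src/main/python/pi.py",  # placeholder
    )
    print(" ".join(shlex.quote(a) for a in args))
```

The driver and executors then run as pods scheduled by Kubernetes rather than by YARN or Mesos, while the application code itself stays the same.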
The Apache Hadoop community has been working on decoupling Hadoop from the Hadoop Distributed File System (HDFS) so that Hadoop can work with cloud object storage via Ozone, which is designed for containerized environments like Kubernetes. It has also been working to run HDFS on Kubernetes, to speed up Spark when that’s running on Kubernetes, and to run Hadoop itself on Kubernetes.
There are also a host of new tools and frameworks for machine learning that rely on Kubernetes for infrastructure and model deployment at scale. This is definitely more work than using a cloud machine learning service, but it means data science teams can pick from a wider range of languages and models than a particular cloud machine learning service might support while the organization gets more choice about where to deploy and run models so they can balance the requirements and cost of running them.
If that sounds familiar, it’s because machine learning pipelines involve the same kinds of continuous integration and deployment challenges that devops has tackled in other development areas. There’s a machine learning operations (“MLops”) movement producing tools to help with this, and many of them leverage Kubernetes.
Pachyderm is an end-to-end model versioning framework to help create reproducible pipeline definitions, with each processing step packaged in a Docker container. MLeap is a framework to help serialize pipelines from multiple machine learning libraries, so you could use Spark and TensorFlow against the same data layer through an MLeap bundle. Seldon orchestrates deployment and serving of machine learning models, packaging them in containers as microservices and creating the Kubernetes resource manifest for deployment. ParallelM MCenter is a machine learning orchestration and monitoring platform that uses Kubernetes to scale model deployment.
Platform approaches like Polyaxon, MLflow, Dataiku and the Domino Data Science Platform aim to cover the whole pipeline and lifecycle from experimentation to deployment and scaling, again with Kubernetes as a deployment option. Lightbend mixes Spark and SparkML with TensorFlow for making event-driven, real-time streaming and machine learning applications, with Kubernetes as one of the deployment options. Streamlio’s Community Edition for building real-time data analytics and machine learning is available as a Kubernetes application on GCP for fast deployment.
Despite how useful Streamlio finds Kubernetes, Bock cautions that because Kubernetes is a relatively young project that’s still developing fast, adopting it for machine learning may mean extra work to keep up with those changes. “The options you have for how storage is handled in Kubernetes are rapidly evolving for example, and that does have implications for how you build apps because you might find you have to rebuild them because now there’s a better way.”
For instance, if you’re storing your machine learning data set in cheap cloud storage, a new project called Nezha provides an on-demand cache for Kubernetes that speeds up training by prefetching data from Google Cloud Storage, S3, Azure Data Lake and Azure Blob Storage.
If you’re new to Kubernetes, you’ll also need to think differently about how to access data, he points out. “The way you get data in and out of a Kubernetes environment is definitely different if you’re used to logging into the machine and grabbing the logs.”
The MLops tools are also relatively new and not yet widely adopted, Bock points out. “We need more maturity and more innovation on the tooling side of how you deploy environments; today a lot of people are writing custom configuration files like Helm charts to define an environment for training but it should be easy to specify that environment and have it deployed in a repeatable way without having to go down to the level of changing lines of script.”
Best practices are still emerging, but Kubernetes is becoming established as one of the options for how you mature your practices for building data science and machine learning pipelines.