How Machine Learning Pipelines Work and What Needs Improving
Whether you’re developing a machine learning system or running a model in production, there is a growing number of data processing workflows to manage. The pipeline runs from ingesting and cleaning data, through feature engineering and model selection in an interactive workbench environment, to training and experiments (usually with the option to share results), to deploying the trained model, and finally to serving results like predictions and classifications.
The machine learning development and deployment pipelines are often separate, but unless the model is static, it will need to be retrained on new data or updated as the world changes, and then versioned and redeployed in production, which means going through several steps of the pipeline again and again.
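As a rough illustration of why retraining means repeating pipeline steps, the sketch below runs the same clean, train and version stages twice. Every name in it (`Model`, `clean`, `train`, `run_pipeline`) is hypothetical, and the "model" is just a mean used as a threshold; it is a toy under those assumptions, not any particular platform's workflow.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Model:
    """Hypothetical versioned model: predicts True above a learned threshold."""
    version: int
    threshold: float

    def predict(self, value: float) -> bool:
        return value > self.threshold


def clean(raw):
    """Cleaning stage: keep only numeric records, converted to float."""
    return [float(r) for r in raw
            if isinstance(r, (int, float)) and not isinstance(r, bool)]


def train(values, version):
    """Training stage: 'learn' the mean of the data as the threshold."""
    return Model(version=version, threshold=mean(values))


def run_pipeline(raw, previous=None):
    """Run clean -> train, bumping the version if a model already exists."""
    values = clean(raw)
    version = 1 if previous is None else previous.version + 1
    return train(values, version)


# The first run produces version 1; new data sends us through the same steps.
model_v1 = run_pipeline([1, 2, None, 3])
model_v2 = run_pipeline([10, 20, "bad", 30], previous=model_v1)
```

The same input can be scored differently by the two versions, which is why the deployed version has to be tracked alongside the model itself.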
Managing the complexity of these pipelines is getting harder, especially when you’re trying to use real-time data and update models frequently. There are dozens of different tools, libraries and frameworks for machine learning, and every data scientist has a preferred set of them, each of which integrates differently with data stores and with the platforms machine learning models run on.
When you ask data science and machine learning teams what tools they use the most, they mention Hadoop, Spark, Spark Streaming, Kafka, cloud services, R, SAS and a range of Python tools, even though they’re found at different stages of the pipeline. There are also cloud services like Google Cloud Dataflow and Azure Stream Analytics that cover multiple stages of the typical pipeline.
Capturing data needs a durable, scalable, high-throughput data ingestion system — like Apache Kafka or cloud services like Azure Event Hubs and AWS Kinesis — that accepts data from a variety of sources and distributes it to the right place.
The next step is a data transformation tier that processes the raw data (filtering, aggregating, augmenting or consolidating it) and then transfers it to a permanent data store. O’Reilly’s 2018 survey on the tools and technologies used for analytics and AI shows the popularity of Apache Spark for doing that data transformation, but it still needs a data store, whether that’s Hadoop Distributed File System (HDFS) and HBase, Apache Cassandra, cloud storage like Amazon S3 and Azure Blob Storage or other database storage.
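To make the filtering, augmenting and aggregating steps concrete, here is a minimal pure-Python sketch of a transformation tier. The event shape (`device`, `reading`), the `device_region` lookup table and the `transform` function are all assumptions invented for this example; a real deployment would more likely express the same steps in Spark or a similar engine.

```python
from collections import defaultdict


def transform(events, device_region):
    """Filter, augment and aggregate raw events (all field names hypothetical)."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for event in events:
        # Filter: drop events that carry no reading.
        if event.get("reading") is None:
            continue
        # Augment: attach a region from a lookup table, defaulting to "unknown".
        region = device_region.get(event["device"], "unknown")
        # Aggregate: accumulate a count and a sum per region.
        bucket = totals[region]
        bucket["count"] += 1
        bucket["sum"] += event["reading"]
    # Consolidate into one summary row per region, ready for a data store.
    return {
        region: {"count": b["count"], "mean": b["sum"] / b["count"]}
        for region, b in totals.items()
    }


events = [
    {"device": "a", "reading": 1.0},
    {"device": "b", "reading": None},
    {"device": "a", "reading": 3.0},
    {"device": "c", "reading": 5.0},
]
summary = transform(events, {"a": "eu", "b": "us"})
```

Here `summary` holds one consolidated row per region, which is the kind of output that would then be written to the permanent store.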
It’s possible to process data for machine learning in place, inside the database; databases like SQL Server and Azure SQL Database are adding specific machine learning functionality to support that. More common is streaming a subset of the data for processing; Spark has that built in with Spark Streaming, which can read data from HDFS, Kafka and other sources, but there are alternatives like Apache Storm and Apache Heron. Whatever else is in the pipeline, initial exploration of the data is often done in interactive Jupyter notebooks or RStudio.
For many, the big cloud providers have become the major data science platforms, Nishant Thacker, a senior product marketing manager at Microsoft, told the New Stack: the Azure ML service, AWS SageMaker and GCP Cloud ML Engine, alongside SAS, RapidMiner and KNIME. (If you’re not familiar with the last two, they consistently show up as leaders in Gartner’s Magic Quadrant for Data Science and Machine Learning Platforms.)
“Newer cloud platforms provide an open ecosystem and choice of frameworks like Spark, PyTorch, TensorFlow, Scikit-Learn and so on. Legacy platforms like SAS have a matured but sticky ecosystem around them,” Thacker said.
Data Lakes and Streams
One reason for the popularity of Spark is how much it includes, like the built-in MLlib machine learning library. That’s convenient and easy to get started with, because it includes common learning algorithms and utilities and runs on Spark Core, but it’s not always powerful enough for all needs, especially as you move from more traditional machine learning algorithms to deep learning. There are plenty of alternatives, like the H2O platform, which has in-memory machine learning algorithms including deep learning that can be called from Python and R (and can integrate with Spark via Sparkling Water), as well as the plethora of machine learning tools and libraries.
“We definitely see that Spark is very common,” Streamlio marketing vice president Jon Bock told the New Stack, “but there’s an increasing diversity. It’s not so much that they’re moving past Spark to something else but there are more and more tools that people are leveraging.”
Part of that is the increasing popularity of Python for data science, which needs a different set of tools, but Bock says there’s also a bigger movement towards systems that deal with data more in real time, whether that’s processing IoT data, understanding a potential buyer’s behavior while they’re still thinking about choosing your product or service or delivering insights to customer support staff while they’re talking to the customer.
“We’re seeing some aspects of analytics processing being pushed into streaming systems; so Spark is often still used for training and exploration as people are developing their models but when they want to execute those models we’re seeing people look beyond Spark, and we haven’t seen broad adoption of Spark Streaming [for that].” Streamlio’s platform is based on Heron, the Apache Pulsar pub-sub messaging system and persistent message storage in Apache BookKeeper.
“People doing more advanced machine learning are going beyond just taking a batch of historical data and trying to build models that look only at that; they’re building models that can look at data as it arrives and that requires a more sophisticated pipeline,” Bock said.
“We went through a phase of five to ten years where a data lake was considered to be the cornerstone of your data pipeline, so your job was to get data into the data lake as quickly as you could and then figure out what to do with it, but we ended up creating this massive bottleneck in data engineering that ultimately made data scientists pull their hair out. What we’re seeing now is a shift away from data lakes and more towards what if we now start doing the processing on the data as it arrives rather than first dumping it somewhere and having to come back and clean it up,” Bock said.
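One way to picture a model that looks at data as it arrives, rather than at a stored batch, is an incremental statistic that is updated per record. The sketch below uses Welford's online algorithm for mean and variance to flag outliers in a stream; it is an illustrative stand-in chosen for this example, not something the interviewees describe, and `RunningStats`, `flag_anomalies` and the threshold `k` are hypothetical names.

```python
import math


class RunningStats:
    """Mean and sample variance maintained one value at a time (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def flag_anomalies(stream, k=3.0):
    """Score each value against what has been seen so far, then learn from it."""
    stats = RunningStats()
    flagged = []
    for x in stream:
        std = math.sqrt(stats.variance)
        # Only score once there is enough history to estimate a spread.
        if stats.n >= 2 and std > 0 and abs(x - stats.mean) > k * std:
            flagged.append(x)
        stats.update(x)  # no stored batch: the model state is three numbers
    return flagged


flagged = flag_anomalies([10, 11, 9, 10, 12, 50])
```

Nothing is written to a data lake before processing here: each record is scored and folded into the model state in a single pass.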
Problems of Scale
“Even in 2019, most data scientists are doing their work on their laptops and are limited by the constraints of their hardware,” points out Thaise Skogstad, director of product marketing at Anaconda. “For datasets that do not fit on their laptop, they are still using traditional data lakes and running jobs, at great expense, on Spark or Hadoop which were ground-breaking a decade ago. However, most companies are now looking for a path to modern, potentially cloud-based solutions.”
The popularity of deep learning, where you’re more likely to be using a framework like TensorFlow or CNTK than a Spark workload, is making the bottleneck worse, notes Lachlan Evenson from Microsoft’s Azure Containers team, especially for large models like image recognition or problems in biochemistry. “Today they have purpose-provided hardware with GPUs that sit under their desk and doesn’t scale beyond that.”
“For legacy systems, scaling, flexibility of frameworks, choice of tools, leveraging newer infrastructure (GPUs, FPGAs) and ease of deployment are all challenges,” Thacker agrees. “For cloud platforms, orchestrating all of this into a single, easy to use form factor, and providing advanced capabilities like hyperparameter sweeping, DevOps capabilities, deployment flexibility and so on, without taking the data scientists away from their core job is a big challenge.”
The availability of IaaS (increasingly with GPUs and other accelerators) has been key for scaling machine learning so far, but this is now switching to cloud-native technologies like containerization and microservices for both scale and separation of concerns.
“For training and inferencing machine learning models you need very large compute. Kubernetes provides a very easy way to manage and scale this compute for ML models,” Thacker explains. “The predictability, reliability and affordability of Kubernetes makes it a no-brainer especially for inferencing needs, where the demand is unpredictable and auto-scaling is a requirement. As compared to the VM clusters or physical compute clusters, Kubernetes is easy to set up, flexible to scale and offers much more control.”
“Another reason why Kubernetes is important is that machine learning models usually execute in an environment with certain other assemblies and libraries or frameworks. To make this execution simpler, one creates a ‘container’ for the model and its dependencies to execute together. Containers are native to Kubernetes and can very easily be executed on Kubernetes clusters, thus providing the best of both worlds.”
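The auto-scaling Thacker mentions for unpredictable inference demand can be illustrated with the core rule that the Kubernetes Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current metric / target metric). The Python function below is a simplified sketch of that rule only; the function name and the `min_replicas`/`max_replicas` defaults are assumptions for this example, and the real autoscaler also applies a tolerance band and stabilization behavior that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Simplified HPA-style scaling rule (tolerance and stabilization omitted)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Scale the replica count in proportion to how far the metric is from target.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds (hypothetical defaults above).
    return max(min_replicas, min(max_replicas, raw))


# Load spikes: 4 pods at 90% average CPU against a 60% target.
scaled_up = desired_replicas(4, 90, 60)
# Load drops: 4 pods at 30% average CPU against the same target.
scaled_down = desired_replicas(4, 30, 60)
```

Because the replica count tracks the metric ratio, the same rule covers both scaling up for a burst of inference requests and scaling back down afterwards.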
As well as Kubernetes infrastructure management being an obvious fit for the operational stage of machine learning, the flexibility and scalability of Kubernetes are also becoming useful earlier in the pipeline, Bock says. “Kubernetes is a very common deployment environment for Streamlio because people want to be able to scale up the ability to ingest data at faster rates, they want to be able to process data at fast rates and they want to scale up the ability to retain streams of data for later replay and training, and then back down again.”