The field of machine learning continues to move fast in terms of research and what organizations with expertise can achieve, but is it maturing in the sense of becoming easier for the mainstream to adopt, operate and get value from?
The iterative process of experimenting with models to find the one that works best with your data is time-consuming and expensive: training runs on GPU-powered infrastructure that takes effort and expertise to set up and operate. The founders of startup Determined AI met at the AMP Lab at the University of California, Berkeley, where they worked on distributed machine learning for Apache Spark and created the Spark MLlib scalable machine learning library. The idea behind the company is to make it easier to create the ‘AI-native’ infrastructure organizations need to power the kind of deep learning development that’s currently the preserve of companies like Google, Apple, Amazon and Microsoft, co-founder Evan Sparks told The New Stack.
“They have super-smart technologists working for them; they have hired armies of PhDs who understand how machine learning works. They’ve invested heavily in hardware: Google is probably Nvidia’s number one customer on the data center side. And then they’ve built out these incredible internal software libraries and capabilities as tooling to enable internal developers to deliver these solutions really, really well,” Sparks said. “Contrast that with the experience of a developer in the global Fortune 2000 who has to build one of these machine learning-powered applications end-to-end and it’s like they’re banging stones together to create fire. They’re handling a lot of this stuff by hand; they don’t have good centralized services for scheduling GPUs or distributing training or other things that the engineers at those companies have.”
Even if those companies open source more of their internal tools, those tools were written for engineers with that expertise and those resources and infrastructure; they’ll also encode assumptions about how things are done that may not fit other organizations or may even complicate the situation. “Take TensorFlow Serving. You think, ‘great, I can deploy this in a container behind a web service and it should spit out some reasonable metrics about how my model is performing.’ Then you dig into what those actual metrics are and it’s things like inference latency and how long it takes to load the libraries but nothing about my machine learning performance, so that ends up being stuff people bolt on themselves and do in an ad-hoc way.”
An end-to-end platform can offer more performance as well as an integrated experience, Sparks believes. Useful as it is, he notes that Kubeflow is “a motley collection of tools that happen to be in the same space that are all open source that are put together in a distribution but are not designed with each other in mind”.
“This is a very new product area so there’s an opportunity to create a holistic design for how these various components are supposed to work together. If you design the scheduler separately from how you design your hyperparameter training service you’re going to leave a lot on the table in terms of performance. We’ve taken the approach of let’s think through what is the common set of services that every deep learning engineer needs, let’s build those with each other in mind, on top of a robust distributed system.”
While the business problems and the constraints in different domains may vary, Sparks believes data scientists tend to have very similar workflows that Determined AI can abstract in its tools, starting with model training and the iterative development of models. Some of that relies on work the founding team had already done in hyperparameter optimization that’s up to 15 times faster than basic methods — and five times faster than Google’s internal system on common problems like architecture search and neural network-tuning.
The Determined AI platform also handles scheduling, utilization and resource sharing: “How do you share your GPUs across researchers in a way that’s fault-tolerant, and where they can get immediate access to these devices? How do we enable rapid training, retraining and exploration of these models?” Using cloud GPUs can be ten times more expensive than buying and operating GPU hardware yourself, he estimates; there’s an even bigger differential using consumer-grade GPUs — “which I’m not recommending, but a lot of people are doing anyway.” Statically allocating resources even when you only have two or three data scientists isn’t efficient, especially when it’s done manually and informally.
“That’s causing [organizations] to dramatically underutilize resources. It’s causing a whole bunch of pain and fighting: Oh, I wanted to use that to run my experiment. Oh, a high priority thing came up, can you Control-C your job and let me get onto the cluster?”
Where organizations have moved on from manual scheduling, Sparks often sees them using queuing systems designed for high-performance computing (HPC) that run long jobs, one after another, which doesn’t fit the iterative development needed for machine learning models. Even more modern generic orchestration systems like Kubernetes or Mesos can’t use metrics about machine learning models that would allow more granular scheduling.
“If you’re trying to actively debug a TensorFlow model, and you have to wait six hours to see if you hit the same bug, that’s not very productive. Or it’s ‘here’s a big job that takes a week to run, we’ll reserve those resources for that entire week and let that job run, and then when it’s done, we’ll release them.’ We’re able to make scheduling decisions at a much higher granularity. As jobs come in, our scheduler handles time-division multiplexing some of these resources, it takes care of fault tolerance, it takes care of making it very easy to add in additional GPUs, including cloud GPUs, while jobs are in play and schedule them as additional resources.”
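The time-division multiplexing Sparks describes can be illustrated with a toy round-robin scheduler. This is only a sketch under stated assumptions, not Determined AI's scheduler: the function name `time_slice_schedule`, the job names and the notion of abstract "work units" are all hypothetical, and real schedulers also have to checkpoint and restore model state when a job is preempted.

```python
from collections import deque


def time_slice_schedule(jobs, num_gpus, slice_len=1):
    """Round-robin time slicing over a fixed pool of GPUs.

    `jobs` maps a (hypothetical) job name to its remaining work units.
    Returns a timeline: one list of job names per tick, showing what ran.
    """
    queue = deque(jobs.items())
    timeline = []
    while queue:
        # Give each free GPU the next waiting job for one time slice.
        running = [queue.popleft() for _ in range(min(num_gpus, len(queue)))]
        timeline.append([name for name, _ in running])
        for name, remaining in running:
            remaining -= slice_len
            if remaining > 0:
                # Preempt the unfinished job and send it to the back of the queue.
                queue.append((name, remaining))
    return timeline


if __name__ == "__main__":
    # A short debugging job arrives behind a long training job on one GPU.
    for tick, names in enumerate(time_slice_schedule({"week_long": 3, "debug": 1}, 1), 1):
        print(tick, names)
```

With one GPU, the short job runs in the second tick instead of waiting for the long job to finish, which is the contrast with the HPC-style queues described above.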
Resource allocation for distributed training is also automated and abstracted. “If I issue a SQL query in a database, I don’t think about the order JOINs happen in,” he points out. “I should just submit my model definition and its architecture, and the data set and the system should worry about how to parallelize it and allocate resources effectively to right-size it for this job.” Because of the way machine learning scales, he says that’s the best way to maximize throughput and minimize training latency.
“It turns out these jobs don’t just scale infinitely; you can’t throw more Xeons at them and hope the model training is going to go faster, because there are communication bandwidth issues and bottlenecks. The right number of GPUs for AlexNet is different than Inception which is different than ResNet. Right now, developers are figuring that out by trial and error, and our system can offer a bunch of guidance for them.”
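A toy cost model shows why adding devices stops helping. It assumes, purely for illustration, that per-step compute time divides evenly across GPUs while communication cost grows linearly with the number of GPUs; the function names and constants are hypothetical and are not measurements of AlexNet, Inception or ResNet.

```python
def step_time(num_gpus, compute_s, comm_s_per_gpu):
    """Modeled time per training step: compute shrinks, communication grows."""
    return compute_s / num_gpus + comm_s_per_gpu * (num_gpus - 1)


def best_gpu_count(compute_s, comm_s_per_gpu, max_gpus=64):
    """Return the GPU count that minimizes the modeled step time."""
    return min(
        range(1, max_gpus + 1),
        key=lambda n: step_time(n, compute_s, comm_s_per_gpu),
    )


if __name__ == "__main__":
    # Two hypothetical models with different compute-to-communication ratios.
    print(best_gpu_count(compute_s=16.0, comm_s_per_gpu=1.0))
    print(best_gpu_count(compute_s=64.0, comm_s_per_gpu=1.0))
```

Under this model, a job with four times the compute per step is best served by twice as many GPUs, not four times as many, so the right allocation depends on the model rather than on how much hardware is free.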
Keeping the pipeline filled with data is also key for training efficiency, but storage I/O can be another bottleneck, especially for enterprise networks using the Hadoop Distributed File System (HDFS) or cloud storage. “Today, people are hand-copying their data sets down to local SSD storage, right next to the GPUs. And they’re doing this on a one-off basis for every project that they’re training, and they’re managing that by hand. It’s a fairly simple caching problem we can solve for them.”
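The hand-copying Sparks mentions amounts to a read-through cache. A minimal sketch of that idea follows, assuming the data set is visible as an ordinary file path (for example through a mounted network share); `cached_path` and its arguments are hypothetical names, not part of any Determined AI API.

```python
import hashlib
import shutil
from pathlib import Path


def cached_path(source, cache_dir):
    """Return a local copy of `source`, copying into `cache_dir` only on a miss."""
    source = Path(source)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the entry on the full source path so same-named files don't collide.
    key = hashlib.sha256(str(source.resolve()).encode()).hexdigest()[:16]
    local = cache_dir / f"{key}-{source.name}"
    if not local.exists():
        tmp = local.with_name(local.name + ".tmp")
        shutil.copyfile(source, tmp)
        # Rename into place so readers never see a partially copied file.
        tmp.replace(local)
    return local
```

A training job would call `cached_path(remote_file, local_ssd_dir)` and read from the returned path. The sketch deliberately leaves out invalidation and eviction: if the source changes, the stale copy is still served, which is the kind of bookkeeping a managed service would need to take on.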
Orchestrating and scheduling jobs and resources means the Determined AI system is gathering a lot of metadata, from model metrics to the versions of code developers are using. “We build that into a system of record for model development,” Sparks said.
That’s part of the way the different pieces of the system feed into each other as you move through different iterations of a model, producing end-to-end efficiency. “It’s never the case that the first training job you run is a model that actually works: You’re doing this for months at a time in a deep loop and that’s where hyperparameter tuning and architecture search come in. It’s important for these services to be aware of the mechanics of how your scheduler works: you want them to be aware of the underlying hardware they have at their disposal, and the budget, and so on. And if you can make them resource-aware you can get real wins over just doing grid search or something like that by hand.”
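The article does not say which search method Determined AI uses. As one illustration of what "resource-aware" can mean, successive halving is a well-known technique that gives every configuration a small training budget, discards the worst performers, and spends the freed budget on the survivors. In the sketch below, `evaluate` is a hypothetical callback and the toy loss function is invented for the example.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of configs each round while multiplying the budget by eta.

    `evaluate(config, budget)` is a hypothetical callback that trains `config`
    for `budget` units and returns a loss (lower is better).
    Returns the winning config and the total budget spent.
    """
    survivors = list(configs)
    budget = min_budget
    total_spent = 0
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        total_spent += budget * len(survivors)
        survivors = ranked[: max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0], total_spent


if __name__ == "__main__":
    # Toy loss: best at a learning rate of 0.1, improving with more budget.
    def toy_loss(lr, budget):
        return abs(lr - 0.1) + 1.0 / budget

    print(successive_halving([0.001, 0.01, 0.1, 1.0], toy_loss))
```

In this toy run, four candidate learning rates are narrowed to one for a total of 8 budget units, because weak candidates are dropped after a single cheap round instead of being trained to completion as in a manual grid search.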
Deployment, especially to resource-constrained mobile devices, also needs to be part of the system. If you’re deploying models on the same hardware you train on, the scheduler can allocate a subset of GPUs for inferencing or batch rescoring. “I trained on a million instances, but now I want to score a billion instances to rewrite my search database and I want to make progress on that but at the same time keep making progress on model development.”
Compressing models for mobile inferencing is really a complex retraining workflow, he says. “It’s about going from a fancier model to something that actually works on these devices.”
“Researchers spend all their time worrying about how do I get this thing to the highest accuracy possible?” Sparks pointed out. ResNet 150 can deliver 98% accuracy. “That seems great, only then you find out it has 12 million-plus parameters and it takes a high-end Nvidia GPU to even render a prediction in under 500 milliseconds.” That’s a problem for a customer like Mythic AI [https://www.mythic-ai.com/technology/], which has an inference accelerator the size of a shirt button designed to run at very low power, or a gene sequencing hardware supplier that’s limited to the hardware already deployed in labs.
“They were running things like decision trees and logistic regression to decide what data to keep and what to throw away and by switching to deep learning in the lab, they got a 50% reduction in the error rate of a particular part of their pipeline, which is amazing for them. But the model was 10 million parameters and they needed to get it down to 4,000 to fit in the memory footprint requirements they had.”
Determined AI will automate techniques like quantization, pruning and distillation to reduce model sizes. It also has some experimental tools for architecture search to show the trade-offs of compressing a model to fit into a specific device. “You can give a description of the model architecture and the hardware you want to run on: the memory footprint is going to be this, the model latency is going to be this, the power consumption is going to be this. That can allow people to hone in on models that are small enough to meet their desires, and we give the trade-off: if I do get it small enough to fit on this device, where is the recall and precision?”
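Two of the techniques named here, pruning and quantization, can be shown in their simplest textbook forms. The sketch below is an illustration on a plain list of weights, not Determined AI's implementation; the function names are hypothetical, and in practice a model is usually retrained after compression to recover accuracy.

```python
def prune_by_magnitude(weights, keep_fraction):
    """Zero out all but the largest-magnitude `keep_fraction` of the weights.

    Ties at the threshold are all kept, so slightly more may survive.
    """
    keep = max(1, int(len(weights) * keep_fraction))
    threshold = sorted((abs(w) for w in weights), reverse=True)[keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize_int8(weights):
    """Symmetric linear quantization: map floats to integers in [-127, 127]."""
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in q_weights]


if __name__ == "__main__":
    pruned = prune_by_magnitude([0.5, -1.0, 0.05, 0.25], keep_fraction=0.5)
    q, scale = quantize_int8([1.0, -1.0, 0.0, 0.25])
    print(pruned, q, dequantize(q, scale))
```

Pruning reduces how many weights must be stored, while quantization reduces the bits per weight; the dequantized values differ slightly from the originals, which is where the accuracy trade-off Sparks describes comes from.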
In some scenarios, data scientists won’t need to change their workflow to take advantage of Determined AI. If you use frameworks like Keras, you might be able to submit jobs to the platform through its web UI without needing to modify them. To do distributed training and use the model analytics, though, you need to use a lightweight API and REST endpoints to specify some parameters about your model.
In future, the platform will do more storage optimization, to give faster data access for training, as well as offering feature cataloging for trained models, and potentially automatic model compression, as well as deploying, monitoring and retraining models.
“We also want to move up the stack towards collaboration and services,” Sparks explained, “supporting more sophisticated workflows, where the goal is not just ‘train me a model and tell me what its accuracy is’ but thinking holistically about that model development process and how that’s going to fit into the rest of the business. It’s about taking [machine learning development] from research insight to ‘I need to get this running 24-7 in my data centers and when I need to retrain my models it’s to an SLA that’s crucial to my business.’”