
Apache TVM: Portable Machine Learning Across Backends

28 Dec 2020 6:00am

The Apache Software Foundation’s newest top-level project, TVM, aims to bridge the gap between the creation of machine learning models and launching them into production. It automates the time-consuming work of tuning models to various backend hardware, specifically CPUs, GPUs, and specialized accelerators.

“There are a lot of applications nowadays that require putting artificial intelligence machine learning onto our back end including your mobile phones or data center devices, and so on. And then the challenge that we face is that it’s really hard to build a solution that’s portable,” explained Tianqi Chen, vice president of Apache TVM. Those applications include speech recognition, auto-correcting cameras and more.

Big companies like Google, he said, just hire a lot of engineers to optimize machine learning applications on different platforms, but most companies don’t have the resources to do that.

Apache TVM was a project to create “a single, clean path from the [model creation] framework to the target platform of choice in an automated fashion,” explained Luis Ceze, professor at the University of Washington and CEO of OctoML, during the recent 2020 TVM Virtual Conference.

TVM originated within the SAMPL interdisciplinary machine learning research group at the University of Washington in 2017 and entered the Apache Incubator in 2019. The company OctoML is the commercial arm of the open source project. The number of contributors to the project has grown by 50% in the past year, Ceze said at the conference. Those contributors come from places like Qualcomm, Google, Huawei, AMD, Cornell University, UC Berkeley, and Carnegie Mellon University, where Chen is now an assistant professor.

TVM works with deep learning frameworks, including Keras, MXNet, PyTorch, TensorFlow, CoreML and DarkNet, to provide end-to-end compilation to different backends, including browsers, microcontrollers, FPGAs (field-programmable gate arrays) and more.

It provides minimum deployable modules that can run on an array of systems and devices including mobile phones, wearables, specialized chips and embedded devices.

It’s used in production at companies like AWS, Facebook and Alibaba.

TVM takes inspiration from Halide, a programming language for writing high-performance image and array processing code for CPUs and GPUs; the integer set analysis and loop transformation primitives of Loopy; and the Python library Theano.

The TVM stack includes a differentiable programming IR (intermediate representation) for high-level optimization, a machine learning-driven program optimizer, and VTA (Versatile Tensor Accelerator), a fully open source deep learning accelerator.

The technology can begin with multiple front ends, including TensorFlow, ONNX and MXNet. The front end ingests a model into an IRModule (intermediate representation module), which TVM then transforms into a functionally equivalent version, depending on the target back end. That module is translated to an executable format specified by the target, then encapsulated as a runtime.Module that can be exported and run on the target hardware. The compiled artifact can be loaded and run from the language of choice, including Python, C++, Rust, Go, Java and JavaScript.
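To make that flow concrete, here is a deliberately tiny sketch in plain Python. It is not TVM's API: `ToyIRModule`, `from_frontend`, `fuse_scales` and `build` are hypothetical names invented for illustration, and the only "target" is a Python closure. It only mirrors the shape of the pipeline described above: ingest a model into an IR, rewrite the IR into a functionally equivalent form, then turn it into something runnable.

```python
# Toy illustration only: these classes and functions are hypothetical
# and are NOT part of TVM.
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One operator in the toy IR: a name plus a constant operand.
Op = Tuple[str, float]


@dataclass
class ToyIRModule:
    """Stand-in for an intermediate representation: a linear chain of ops."""
    ops: List[Op]


def from_frontend(model_description: List[Op]) -> ToyIRModule:
    """'Front end': ingest a model description into the toy IR."""
    return ToyIRModule(ops=list(model_description))


def fuse_scales(mod: ToyIRModule) -> ToyIRModule:
    """Transformation pass: merge adjacent 'mul' ops into a single 'mul'.

    The result computes the same function with fewer operators.
    """
    fused: List[Op] = []
    for name, value in mod.ops:
        if name == "mul" and fused and fused[-1][0] == "mul":
            fused[-1] = ("mul", fused[-1][1] * value)
        else:
            fused.append((name, value))
    return ToyIRModule(ops=fused)


def build(mod: ToyIRModule, target: str) -> Callable[[float], float]:
    """'Code generation': turn the IR into something executable.

    The only (hypothetical) target here is a plain Python closure.
    """
    if target != "python":
        raise ValueError(f"unsupported toy target: {target}")
    kernels = {"mul": lambda x, c: x * c, "add": lambda x, c: x + c}

    def run(x: float) -> float:
        for name, value in mod.ops:
            x = kernels[name](x, value)
        return x

    return run


model = [("mul", 2.0), ("mul", 3.0), ("add", 1.0)]
original = from_frontend(model)
optimized = fuse_scales(original)
run_original = build(original, target="python")
run_optimized = build(optimized, target="python")
```

Calling `run_original` and `run_optimized` on the same input gives the same result, even though the optimized module has one operator fewer. A real compiler applies far richer transformations and emits machine code for the chosen hardware, but the contract is the same: the rewritten program must behave like the original.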

“The idea is I want to bring more automation, instead of having a programmer go and optimize those libraries, all those platforms. In this case, we [use] machine learning-guided search to search possible candidates to find a good solution. So the whole idea is that we will be able to take ML models generated by common machine learning frameworks, and automatically find optimized [tensor] programs for the back end of interest,” Chen said. The technology also can look for optimizations to speed performance and predict the best course of action.

It uses an iterative loop for kernel optimization. After ingesting a model, typically in the form of a computational graph, it generates kernels for all operators in the network.

“The inner loop uses a scalable RPC runtime, machine learning-based tuners and a tensor compiler. In each round of the loop, the tuner picks a batch of promising candidate kernel implementations and profiles them on real hardware. Then the tuner gets the profiling results. These profiling results are used as training data to fit a prediction model. After fitting the prediction model, the tuner picks the next promising candidates according to the predictions, and the loop continues. This way, we search for fast kernels iteratively,” a blog post explains.
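The loop the blog post describes can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the "hardware" is a made-up latency function, the search space is a grid of hypothetical tile sizes, and the prediction model is a simple nearest-neighbor lookup rather than whatever model TVM's tuners actually fit. None of the names come from TVM.

```python
# Toy illustration only: simulated hardware and a nearest-neighbor cost model.
# None of these names come from TVM.
import math
from itertools import product
from typing import Dict, List, Tuple

Config = Tuple[int, int]  # hypothetical (tile_x, tile_y) schedule parameters


def profile_on_hardware(cfg: Config) -> float:
    """Stand-in for running a candidate kernel on a real device.

    Returns a made-up latency in milliseconds that is lowest at (16, 8).
    """
    tile_x, tile_y = cfg
    return 1.0 + abs(math.log2(tile_x) - 4) + 0.5 * abs(math.log2(tile_y) - 3)


def predict(cfg: Config, measured: Dict[Config, float]) -> float:
    """Cost model: predict the latency of the closest measured config."""
    def distance(other: Config) -> float:
        return (abs(math.log2(cfg[0]) - math.log2(other[0]))
                + abs(math.log2(cfg[1]) - math.log2(other[1])))
    nearest = min(measured, key=distance)
    return measured[nearest]


def tune(space: List[Config], rounds: int, batch_size: int) -> Dict[Config, float]:
    """Run the measure / fit / pick loop and return all measurements."""
    measured: Dict[Config, float] = {}
    batch = space[:batch_size]  # first round: no data yet, take the first few
    for _ in range(rounds):
        # Profile the current batch; the results become training data.
        for cfg in batch:
            measured[cfg] = profile_on_hardware(cfg)
        # "Fit" the model (nearest-neighbor just stores the data), then rank
        # the unmeasured candidates by predicted latency.
        remaining = [cfg for cfg in space if cfg not in measured]
        remaining.sort(key=lambda cfg: predict(cfg, measured))
        batch = remaining[:batch_size]
    return measured


space = list(product([1, 2, 4, 8, 16, 32, 64], repeat=2))
measured = tune(space, rounds=5, batch_size=4)
best_cfg = min(measured, key=measured.get)
print(best_cfg, measured[best_cfg])
```

Each round profiles only a small batch rather than the whole space, and the measurements steer which candidates are tried next. This sketch is purely greedy, so it can get stuck near its starting point; a practical tuner would typically balance following the model's predictions against exploring unmeasured regions.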

The project closest to TVM is Google’s MLIR (Multi-Level Intermediate Representation), a project for building reusable and extensible compiler infrastructure. Other projects for tensor computations include Halide and TACO (Tensor Algebra Compiler).

Auto-scheduling has been among the features under development while in the incubator. The path forward, of course, will be determined by the project’s contributors, Chen said, but he expects it to support more and different kinds of hardware and to uncover new areas of research.

Already TVM has been extended to feature a microcontroller backend, called µTVM (pronounced “MicroTVM”), which facilitates host-driven execution of tensor programs on bare-metal devices.

Image by J Garget from Pixabay.
