Mobile Machine Learning: AI Offload Engines

Intel is pushing the idea that its new CPUs are the most cost-effective hardware to run machine learning and other AI workloads on, because you can also use them for other computing — making them more flexible than the GPUs that are mostly used for high-performance machine learning. But from mobile phones to cloud services, we’re seeing a wide range of AI-specific hardware that’s neither CPU nor GPU.
As Microsoft distinguished engineer Doug Burger explained to The New Stack when talking about how Azure uses FPGAs to accelerate machine learning, “There’s this great debate now about what is the right architecture? Is it soft logic, is it hardened, what are the data types and operators, what should run on CPU, what should run on this set of accelerators, what’s the system architecture? That’s all up for grabs.”
Mobile Machine Learning
Accelerators are showing up in phones to speed up running trained models for everything from face recognition to improving photo quality. Apple’s Core ML framework for iOS includes a number of pre-trained models that run on the Neural Engine in the A11 Bionic chip, as well as tools for converting models from frameworks like TensorFlow and Apache MXNet.
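The conversion step happens offline, in Python, before the model ever ships in an app. As a minimal sketch (assuming the coremltools package and a hypothetical trained Keras model file; the exact converter entry point varies between coremltools versions), it looks something like this:

```python
# Minimal sketch: converting a trained Keras model to Core ML format.
# "face_classifier.h5" is a hypothetical model file; the converter shown
# is the unified coremltools.convert() entry point.
import coremltools as ct
import tensorflow as tf

keras_model = tf.keras.models.load_model("face_classifier.h5")

# Core ML decides at runtime whether a converted model runs on the CPU,
# the GPU or the Neural Engine; the conversion itself is hardware-agnostic.
mlmodel = ct.convert(keras_model)
mlmodel.save("FaceClassifier.mlmodel")
```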
ARM’s upcoming Machine Learning and Object Detection processors take a similar approach to Apple’s Neural Engine. Rather than being general-purpose processors or repurposed DSP chips, ARM designed them specifically for the reduced-precision integer and matrix multiply-accumulate arithmetic that makes up much of the machine learning workload, especially for inferencing: running a trained machine learning model rather than training it.
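To make that concrete, here is a small NumPy sketch of the multiply-accumulate pattern these designs implement in silicon: 8-bit operands are multiplied and the products are summed into a wider 32-bit accumulator so nothing overflows.

```python
# Sketch of int8 multiply-accumulate, the core operation of integer inference:
# narrow 8-bit activations and weights, with accumulation in 32 bits.
import numpy as np

rng = np.random.default_rng(0)
activations = rng.integers(-128, 127, size=(1, 64), dtype=np.int8)
weights = rng.integers(-128, 127, size=(64, 10), dtype=np.int8)

# Widen before multiplying so the sum of products cannot overflow;
# a hardware MAC unit does this widening implicitly.
acc = activations.astype(np.int32) @ weights.astype(np.int32)
print(acc.dtype, acc.shape)  # int32 (1, 10)
```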
Developers can use the ARM NN SDK on Linux to translate Caffe models (and eventually models from other machine learning frameworks) to run on devices with ARM hardware; if an ARM ML Processor is present, the code will be optimized for it.
Qualcomm’s Neural Processing Engine SDK covers rather more frameworks. Until Qualcomm ships a platform that includes the ARM ML Processor, it targets the Hexagon DSP for workloads like speech detection and the Adreno GPU for object detection and style transfer, using models trained in Caffe, Caffe2, TensorFlow, ONNX, CNTK and MXNet.
Google has a custom Neural Processing Unit that it uses instead of the Hexagon DSP for machine learning in the camera app on the Pixel 2, and Huawei has also created its own NPU (in fact, its latest phone boasts dual NPUs). Third-party applications will also be able to take advantage of those NPUs, but it will be worth investigating how much performance improvement they deliver before spending a lot of time supporting them.
ARM’s ML processors aren’t just for phones; they will show up in dedicated hardware like smart surveillance cameras and tablets. CEVA’s NeuPro AI processors are also designed for IoT devices, wearables, drones and AR/VR headsets; they have specialized matrix multiplication engines and a programmable vector DSP (which CEVA calls a Vector Processing Unit) that can be updated with new neural network algorithms. Imagination Technologies’ PowerVR Series2NX is a neural network accelerator designed for smartphones, smart cameras and self-driving cars that is optimized for tensor operations like convolution, activation, pooling and normalization. It works with the Android Neural Networks API (NNAPI), and developers can use the Imagination neural network SDK to convert models from frameworks like Caffe.
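In practice, the most common route to NNAPI is TensorFlow Lite: the model is converted to the .tflite format on a workstation, and the on-device runtime can then delegate supported operations to whatever NPU, GPU or DSP driver the phone exposes through NNAPI. A minimal conversion sketch (the model file name is hypothetical):

```python
# Minimal sketch: converting a Keras model to TensorFlow Lite.
# On Android, the TFLite runtime can hand supported ops to an NPU, GPU or DSP
# through the NNAPI delegate; the conversion step itself is hardware-agnostic.
import tensorflow as tf

model = tf.keras.models.load_model("detector.h5")  # hypothetical trained model

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # allow quantization where possible
tflite_model = converter.convert()

with open("detector.tflite", "wb") as f:
    f.write(tflite_model)
```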
Servers and Clouds
Nvidia has also been adding Tensor Cores to some of its graphics cards to accelerate deep learning. In the cloud, Google has its custom Tensor Processing Unit ASIC, which handles both floating-point and integer tensors, using 8-bit reduced precision for the integer side.
Reduced precision (or, as Microsoft calls it, narrow precision) arithmetic uses fewer transistors per operation, making it more efficient to calculate with lower-precision numbers, and deep neural networks can use lower-precision weights for speed. “You get a superlinear jump in efficiency when you do that for inference,” Burger told us.
For training neural networks, the industry is moving towards 16-bit precision for both integers and floating-point numbers; for inference, when you’re running a trained model, it’s moving from 32-bit or 16-bit floating-point representations to 8-bit integers, because that improves instruction throughput and reduces memory consumption without too much loss of accuracy.
As Naveen Rao, the general manager of Intel’s AI products group, explained, “Lower precision allows more parallelism on a chip because more of these operations happen at the same time, at lower power and do more without degrading the algorithmic performance like the classification rate or how well it’s translating speech to text. You’re not degrading that, while also achieving higher levels of parallelism at lower power.” It’s also becoming common to mix numeric precision to achieve performance and accuracy.
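A rough NumPy sketch shows what 8-bit quantization actually does to a tensor of float32 weights, and why the accuracy cost is usually small: the round trip through int8 introduces an error bounded by roughly half the quantization step.

```python
# Sketch of symmetric 8-bit quantization: map float32 weights to int8,
# dequantize, and measure the worst-case round-trip error.
import numpy as np

weights = np.random.randn(256, 256).astype(np.float32)

scale = np.abs(weights).max() / 127.0                      # one scale per tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale

max_err = np.abs(weights - dequantized).max()
print(f"scale={scale:.6f}, worst-case error={max_err:.6f}")  # error <= ~scale / 2
```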
Intel has already added matrix operations to the Advanced Vector Extensions that supplement the x86 instruction set in the latest Xeon CPUs. The Cascade Lake Xeon that ships this year adds a new vector neural network instruction, DL Boost, that can handle the 8-bit integer convolutions common in deep learning inferencing with fewer instructions. The 2019 Cooper Lake Xeon CPUs will have a new version of DL Boost that uses Google’s bfloat16 16-bit floating-point format for training workloads, along with the Vector Neural Network Instruction (VNNI) set extension that supports 8-bit multiplication with 32-bit accumulation.
Intel’s forthcoming Nervana NNP-L1000 neural processor, which will ship in late 2019, has its own mixed-precision numeric format called Flexpoint, which turns scalar computations into fixed-point multiplications and additions.
Microsoft has put an FPGA in every server in Azure to run its machine learning services; those FPGAs are now available for customers to run ResNet models on, and Microsoft has worked with HPE and Dell to offer FPGA-equipped servers that run its Cognitive Services at the edge. These Project Brainwave systems have a specialized instruction set optimized for dense matrix multiplications, vector operations and tensor operations like convolution, non-linear activation and embeddings, using Microsoft’s own narrow-precision floating-point formats, ms-fp8 and ms-fp9, wrapped in float16 interfaces.
But as with all of these AI accelerators and offload hardware, developers aren’t going to be working down at the level of the instruction set; they’ll carry on using frameworks like TensorFlow, MXNet and CNTK, which will take advantage of the more efficient numeric formats and instructions when they’re available.
Don’t Forget to Benchmark
Hardware vendors make claims about how much faster AI-specific hardware can be, but at such an early stage in the market, developers may want to do their own testing to find out how significant those gains really are, and which accelerators deliver the most benefit for the models they’re using.
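That testing doesn’t have to be elaborate. A crude timing loop over the plain CPU path already gives a baseline to compare any accelerator against; the sketch below uses the TensorFlow Lite interpreter in Python with a hypothetical model file, and the comment shows where a vendor-specific delegate would plug in.

```python
# Rough benchmark sketch: measure CPU-only inference latency as a baseline,
# then repeat with the vendor's accelerator delegate and compare.
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="detector.tflite")  # hypothetical model
# To test an NPU/DSP path, a vendor delegate would be passed in, e.g.
# experimental_delegates=[tf.lite.experimental.load_delegate("libvendor_npu.so")]
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])

# Warm up once, then time a batch of runs.
interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()

runs = 100
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()
print(f"mean latency: {(time.perf_counter() - start) / runs * 1000:.2f} ms")
```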
When working with one OEM, Microsoft was asked to port one of its Azure Cognitive Services APIs to run using the NPU on the OEM device. After doing the work, the team benchmarked the service on the device. “We literally found that we did just as well on the CPU as we did on their neural processor,” Microsoft Principal Group Program Manager Andy Hickl told us.
In the end, they left it up to the OEM to decide. “We said, our algorithm runs on whatever compute platform you’re going to provide,” Hickl said. That’s similar to Rao’s claim that for inferencing “the performance gap between general purpose computing and specific computing like GPU isn’t an enormous gap like 100 times, it’s more like three times”.
Increasingly, tools will attempt to optimize machine learning models and systems for the available hardware, but many of those tools come from hardware vendors like ARM and Intel rather than being cross-platform. Intel’s open-source nGraph compiler promises to optimize models from TensorFlow, MXNet, neon and ONNX (which adds support for CNTK, PyTorch and Caffe2) to run on Xeon, Nervana, Movidius (specialized vision processing chips), Stratix FPGAs and GPUs. That doesn’t help if you need to run the same machine learning model on a mobile device, or even on an AMD CPU, so we look forward to seeing it become more cross-platform to really help developers, or to seeing a common processing pipeline that can use nGraph or other optimizers depending on what hardware is being targeted.
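For TensorFlow, Intel’s documented integration at the time of writing is the ngraph-bridge package, which registers nGraph as a graph optimizer simply by being imported. Treat the package and module names below as assumptions that may have changed; the example uses TensorFlow 1.x-style sessions, which is what the bridge targeted.

```python
# Hedged sketch of Intel's nGraph TensorFlow bridge (TensorFlow 1.x style):
# importing the bridge is the integration point; unmodified TensorFlow graphs
# are then compiled for whichever nGraph backend is installed (CPU, Nervana, ...).
import tensorflow as tf
import ngraph_bridge  # assumed package: pip install ngraph-tensorflow-bridge

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

with tf.Session() as sess:          # TF 1.x session; the model code is unchanged
    print(sess.run(tf.matmul(a, b)))
```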
This is a fast-moving space. With such a wide range of AI offload hardware and acceleration taking so many different approaches, developers can expect to adopt a wider range of tools to take advantage of them, and it’s worth doing some benchmarking before investing time in supporting particular hardware.
Microsoft is a sponsor of The New Stack.