As devices get smarter and more tools use machine learning, mobile processors have started adding dedicated cores that run machine learning on the device, where the data and the application are. These cores run a pre-trained model deployed to the device, which is especially useful for time-sensitive AI tasks like classification and tracking.
This approach trades off the performance of a more powerful processor doing machine learning in the cloud against the latency of transferring data and decisions to and from the device, and against the battery impact of running machine learning on a mobile device, Jem Davies, general manager of the ARM Machine Learning Group, told The New Stack. It also offers more privacy for sensitive personal data, something that users and regulators are increasingly concerned about.
To make this more efficient, ARM is introducing specific processors designed “from the ground up” for machine learning, under a new initiative called Project Trillium. “We want high levels of machine learning performance with high levels of power efficiency. This is a design specifically targeting inference at the edge that gives us a massive uplift over traditional CPU and GPU architecture and over the sort of thing that’s being done on digital signal processors [DSPs],” he explained.
ARM ML includes a machine learning processor, an object detection processor and neural network software libraries “optimized to run on a range of ARM platforms up to the Cortex-A and Mali GPU.”
The object detection processor detects not just people but also the direction they’re facing, their trajectory if they’re moving (for tracking across different frames of video), and gestures and poses, in full HD at 60 frames per second; it’s the second generation of the technology already available in the Hive Camera. It can also be used to pre-process video going to the new machine learning processor and isolate areas of interest, like faces for face recognition or hands for gesture control.
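As a rough illustration of that pre-processing step, the sketch below crops regions of interest out of a frame so that only those pixels are handed to a later recognition stage. The frame layout (a list of pixel rows), the `(top, left, height, width)` box format and the function name `crop_regions` are assumptions made for this example, not part of any ARM interface.

```python
# Hypothetical sketch: names and data layout are illustrative, not an ARM API.

def crop_regions(frame, boxes):
    """Return one cropped sub-image per (top, left, height, width) box.

    `frame` is a list of rows, each row a list of pixel values.
    Boxes are clipped to the frame so out-of-range detections cannot fail.
    """
    rows = len(frame)
    cols = len(frame[0]) if rows else 0
    crops = []
    for top, left, height, width in boxes:
        r0, r1 = max(0, top), min(rows, top + height)
        c0, c1 = max(0, left), min(cols, left + width)
        crops.append([row[c0:c1] for row in frame[r0:r1]])
    return crops


# A tiny 4x4 "frame" whose pixel values encode their own position.
frame = [[10 * r + c for c in range(4)] for r in range(4)]
# Pretend the detector reported one region: rows 1-2, columns 2-3.
regions = crop_regions(frame, [(1, 2, 2, 2)])
```

Passing only the cropped regions downstream means the recognition stage touches far fewer pixels than it would on the full frame.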
The ML processor isn’t restricted to image processing; it can run any machine learning workload that relies on inference — so it’s designed to take data models trained offline and run them in real time, whether that’s keyword detection for a smart speaker, pattern training in a smart thermostat, voice recognition, autonomous driving or data center monitoring.
ARM promises the 4.6 trillion ops per second mobile processors already deliver, with a further 2-4 times improvement in performance because the processor is designed specifically for machine learning, at a power efficiency of 3 trillion ops per watt, a combination Davies claims is “considerably in excess of anything else in the market.” To achieve that, the processor is optimized for the reduced-precision integer and matrix multiply-accumulate arithmetic used in machine learning inference workloads, with special units for the operations in convolutional neural networks. It also uses what Davies calls an ‘intelligent memory system’ instead of a traditional fetch-and-decode memory architecture, which uses more power.
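To make the reduced-precision arithmetic concrete, here is a minimal sketch of a dot product computed with 8-bit integers and an integer accumulator, then scaled back to a floating-point result. It assumes a simple symmetric quantization scheme chosen for illustration; the article does not say which scheme ARM's processor uses, and every function name here is hypothetical.

```python
# Illustrative only: symmetric int8 quantization with integer multiply-accumulate.
# This is a generic technique, not a description of ARM's hardware.

def choose_scale(values):
    """Pick a scale so the largest magnitude maps to 127."""
    peak = max(abs(v) for v in values)
    return peak / 127 if peak else 1.0

def quantize(values, scale):
    """Map floats to integers in [-127, 127] by scaling, rounding and clamping."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def int_dot(a_q, b_q):
    """Multiply-accumulate over integers. The Python int accumulator stands in
    for the wider accumulator that integer hardware would typically use."""
    acc = 0
    for a, b in zip(a_q, b_q):
        acc += a * b
    return acc

def quantized_dot(a, b):
    """Approximate the float dot product using only integer multiplies and adds."""
    scale_a, scale_b = choose_scale(a), choose_scale(b)
    acc = int_dot(quantize(a, scale_a), quantize(b, scale_b))
    return acc * scale_a * scale_b  # one float rescale at the very end


a = [0.5, -1.0, 0.25, 0.75]
b = [1.0, 0.5, -0.5, 0.25]
approx = quantized_dot(a, b)
exact = sum(x * y for x, y in zip(a, b))
```

The result is close to the full-precision answer but not identical; inference workloads can often tolerate that small error in exchange for cheaper integer arithmetic.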
“What we need to do is reach those levels of power efficiency and higher performance using special optimized units for special functions but we also have to be clever about what data we load and use. A traditional architecture (CPU, GPU or DSP) is going to involve a lot of intermediate results storing and loading from memory, so we have produced a new form of operation that avoids that step. This is designed around the intelligent memory system that avoids temporary stores of intermediate results. We try to never reload a piece of data; if you have to reload an intermediate result you’ve failed. Once it’s loaded we try to use it as much as possible before storing it.”
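The general idea of avoiding stored intermediates can be shown in software with operator fusion. The sketch below is an analogy only: it contrasts three separate passes (multiply, add bias, ReLU), each writing a full intermediate buffer, with a single fused pass that writes only the final output. The article does not describe how ARM's memory system works internally, and the function names and store counter are assumptions for this example.

```python
# Generic operator-fusion analogy; not a model of ARM's intelligent memory system.

def unfused(inputs, weights, bias):
    """Three passes over the data; the first two write intermediate buffers."""
    stores = 0
    scaled = []
    for x, w in zip(inputs, weights):
        scaled.append(x * w)          # intermediate result written out
        stores += 1
    biased = []
    for s in scaled:                  # intermediate result reloaded
        biased.append(s + bias)
        stores += 1
    out = []
    for v in biased:                  # intermediate result reloaded again
        out.append(max(0.0, v))
        stores += 1
    return out, stores

def fused(inputs, weights, bias):
    """One pass; each element is loaded once and only the final value is stored."""
    stores = 0
    out = []
    for x, w in zip(inputs, weights):
        out.append(max(0.0, x * w + bias))
        stores += 1
    return out, stores


inputs = [1.0, -2.0, 3.0]
weights = [0.5, 0.5, -1.0]
unfused_out, unfused_stores = unfused(inputs, weights, 0.25)
fused_out, fused_stores = fused(inputs, weights, 0.25)
```

Both versions compute the same values, but the fused one performs a third of the writes, which is the kind of saving Davies is describing when he says reloading an intermediate result counts as a failure.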
Simplifying Machine Learning on Multiple Devices
ARM has also released ARM NN, a set of software optimized for the new processors that includes intermediate libraries to support a handful of the common machine learning frameworks (although not as wide a range as is supported by Qualcomm’s Neural Processing Engine SDK). “We augment the neural network frameworks that developers are comfortable and experienced with,” Davies told us.
“There are perhaps 20 frameworks in use today; the most common are TensorFlow, Caffe, Caffe2, MXNet and the Android NN API, and two of those account for 80 percent of the instances.” He predicted some “coalescing and merging” (although ARM isn’t yet supporting the Open Neural Network Exchange (ONNX), the format that allows developers to change easily between frameworks). “We will support those partners who want the frameworks they want,” he promised, so the list may expand.
Some smartphone makers have already created custom “neural processing units” for their devices, like Google’s Pixel Visual Core, the Huawei NPU in the Kirin 970 chipset, and the iPhone’s Neural Engine (one of the custom signal processors in the A11 Bionic SoC, which performs tracking and object detection). Third-party apps can use those, but most device manufacturers won’t have the expertise to design their own NPUs, and having an ML core that’s part of the platform means developers don’t have to adapt smart applications to the different NPU on each device.
“I like ARM’s approach as developers can write to one API and the workload will use the resources at its disposal, whether it be ARM CPU, GPU or discrete ML accelerator,” Patrick Moorhead, president of Moor Insights & Strategy, told us.
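The "one API, many backends" idea Moorhead describes can be sketched as a small dispatcher that picks the best compute resource present on a device. This is a hypothetical illustration: the backend names, the preference order and the functions `pick_backend` and `run_inference` are invented for the example and are not taken from ARM NN or any other real library.

```python
# Hypothetical dispatcher; none of these names belong to a real SDK.

# Assumed preference order: dedicated accelerator first, CPU as the fallback.
PREFERENCE = ("ml_accelerator", "gpu", "cpu")

def pick_backend(available, preference=PREFERENCE):
    """Return the most preferred backend that this device actually has."""
    for name in preference:
        if name in available:
            return name
    raise RuntimeError("no supported backend available")

def run_inference(model, inputs, available):
    """Run `model` on `inputs` and report which backend was chosen.

    In this sketch the model is an ordinary Python callable, so the chosen
    backend is only reported; a real runtime would execute on that hardware.
    """
    backend = pick_backend(available)
    return backend, model(inputs)


# The application code is identical on both devices; only the hardware differs.
toy_model = lambda xs: [x > 0 for x in xs]
premium_phone = run_inference(toy_model, [1, -1], {"cpu", "gpu", "ml_accelerator"})
entry_phone = run_inference(toy_model, [1, -1], {"cpu"})
```

The application calls the same function on every device, and the choice of hardware stays hidden behind it.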
Davies also suggested that this approach will let developers run machine learning workloads on classes of devices that haven’t previously had the capability to support them, meaning that they might be dealing with a much wider range of hardware, “but they don’t need to understand the complexity of differing hardware underneath.” We’ll see these processors first in premium smartphones with 7nm processor technology, but they’ll be added to more entry-level and mass-market processors in time.
Gartner is predicting over a billion smart surveillance cameras a year in ten years’ time, and machine learning will likely run on IoT devices as much as on smartphones. “This is a scalable architecture that over time will enable so many different segments down to always-on devices and up into cars and servers,” Davies said.
“We’re not looking at machine learning as a unique category; it will be native to everything that computers do,” Davies told us. “Wherever possible, machine learning will be done at the edge, on the device.”
Feature image: ARM.