Although the $40 billion deal still faces long scrutiny by regulators, the reasons behind Nvidia’s proposed acquisition of Arm are becoming clearer, as AMD announces it’s planning to buy FPGA maker Xilinx.
Nvidia doesn’t plan to change Arm’s IP licensing business model, or replace its Mali GPU with Nvidia technology, CEO Jensen Huang has stated repeatedly (Arm licensees are already free to mix and match different GPUs and accelerators in the SoCs they build). This is about targeting data centers, in the widest possible sense, but also capturing all the value Nvidia can bring to data centers with the Arm ecosystem.
Nvidia isn’t just a GPU company with a side-line in AI acceleration. Last year it bought Mellanox for its networking hardware, Cumulus for network virtualization and SwiftStack for data storage and management (especially in the cloud). The Arm acquisition will add CPUs to the mix, allowing it to deliver almost a full stack of hardware and software, but it also brings Nvidia a large ecosystem of partners and a brand new business model.
These acquisitions aren’t just about the hardware integration Nvidia can deliver itself; they could also make the company a one-stop-shop for data center hardware architecture, as a way of competing with Qualcomm and Intel, who both take platform approaches.
“The new unit of compute is the data center — whether that’s cloud native applications running across an entire data center or edge computing with a whole data center on a chip someday,” Huang told us at the GTC conference. “We want to go build a computing company for the age of AI.”
Nvidia is already a “full-stack company,” as Huang puts it, but it’s not vertically integrated. As well as selling GPUs and systems-on-a-chip (SoCs), Nvidia already designs DGX servers and EGX edge devices that you can lease, get as a service from cloud providers or buy from partners like Dell. It will sell those partners the GPUs, the motherboard or everything including the system software.
Today those use Intel and AMD processors; now Nvidia will add Arm processors to the lineup, and offer that expertise in creating data center systems and complete platforms rather than just components. “It starts with great chips but the stack is a lot more complicated than that, just as cloud computing platforms take more than a server,” Huang said.
According to Huang, the strength of the Arm ecosystem is that SoCs are bespoke and often application-specific, with thousands of customers producing billions of chips that Arm developers can address, but the strength of the x86 ecosystem is that it’s a configurable open platform. Data centers and edge computing environments require not just the x86 software ecosystem (which is increasingly available for Arm), but, Huang says, the rest of the platform.
“We know exactly what to do with the rest of the platform: we bring the networking, the storage, the security, all of the IO, all of the necessary system software for every single version of the operating system you want to think about, for the applications that we really care about which is accelerated computing and AI.”
Nvidia wants to offer as many of the pieces of that data center and edge platform as possible itself. Down the line, Intel and Nvidia will be competing on discrete GPUs, on data center CPUs, on AI acceleration, on networking hardware from NICs to SoC-level interconnects (and on IoT) as well as on software development APIs, especially for AI and machine learning. That leaves just storage and memory, which Huang confirmed are areas Nvidia won’t move into.
“We will only go into markets where the market needs us to, and if the market doesn’t need us to we prefer not to do. We only build things that we need,” he said. Nvidia is a computing platform company, not a computing appliance company, but it expects to sell chips to OEMs building storage servers.
One reason Nvidia bought SwiftStack was for its cloud connector, which is about getting cloud data flowing smoothly through machine learning and high-performance computing (HPC) pipelines without the need to move to all-flash storage for caching. That fits into Nvidia’s vision of AI at scale, without dragging them into the mostly commodity market of memory and storage or attempting to compete with Intel’s lengthy and significant investment in co-developing next-generation persistent memory solutions.
Arm in the Cloud
The parallelism and power efficiency Arm can offer have always been appealing but it’s only in recent years that it’s been able to offer the performance-per-thread required for data center servers. Arm’s Neoverse platform (and its Project Cassini standardization initiative) is making the architecture more capable of running standard data center workloads like Java, NGINX, MySQL, Redis and Kubernetes.
Cloudflare’s planned migration to Arm servers didn’t go ahead because the motherboard provider it planned to use pulled out of the market, but the company was ready to move because all the software it needs has been recompiled to run on Arm, and the price-performance improvement would have been significant.
“The cloud provider, who is typically power distribution constrained, can host more customers per rack leading to higher revenue and more compute cycles towards their business, and the customer gets better performance per core,” Chris Bergey, general manager of Arm’s infrastructure group told us. Azure already uses Arm servers for some internal workloads, Amazon Web Services offers Arm VMs and Ampere is already planning a 128-core Neoverse N1 processor at the end of 2020; “At 128 cores we believe our N1 CPU will outperform anything in the marketplace, both on socket throughput, and on performance per thread,” Bergey said.
The new Neoverse V1 platform is optimized for CPU performance, even if that means using a little more power or space for the processor. “We’re adding larger buffers, larger caches, windows and queues; all the microarchitecture structures that allow a single thread to execute quicker.” It will also run Scalable Vector Extensions, first implemented in the Fujitsu Arm processor used in Fugaku, currently the world’s fastest supercomputer. “Fundamentally, for HPC and machine learning workloads, wider vectors can offer more application performance,” Bergey claimed.
He called the Neoverse N2 coming in 2021 “an even higher performance option for scale-out performance class core” for cloud and edge devices. “They won’t quite have the performance per thread of V1, but will support more cores, in a constant TDP [thermal design power]. If your application is very CPU and bandwidth-demanding, then V1 will give you the best performance per thread. But if your application is more scale-out and needing more cores then N2 may be a better choice as you will find more instances, with higher core counts.”
Intel’s dominance in the data center market and the number of Arm server motherboard suppliers who have fallen by the wayside mean that even with these impressive processors, off-the-shelf Arm servers may not be a significant opportunity yet.
Semiconductor producer Marvell announced in the summer that it was shifting to custom Arm server development for its 96-core ThunderX3 processor (shipping later this year), particularly for hyperscale cloud providers. “The long-term opportunity is for Arm server processors customized to their specific use cases rather than the standard off-the-shelf products,” CEO Matt Murphy said on an earnings call.
While the Neoverse platform will make it easier for vendors (including, perhaps, Nvidia) to build Arm servers without doing the kind of processor development Marvell and Fujitsu have invested in, if the Arm server market is shaping up to be more about custom integrations, Nvidia is very well placed for that.
The Age of AI
It also has the hardware for acceleration that is increasingly important; running algorithms in hardware rather than in software is more efficient. The Arm architecture has a longer runway on fending off the end of Moore’s Law than Intel and AMD (Arm’s weak memory model makes it easier to get more parallelism at a lower cost, which improves performance for parallelizable workloads). But the same problems will eventually apply to them as well, Huang pointed out.
“It’s a foregone conclusion that data centers need to accelerate computing now.”
Although many startups are developing custom AI accelerators that will be more efficient than using a GPU (Microsoft, Google and Facebook all build their own hardware to do this for their own custom workloads) and Intel is adding instructions to CPUs for running AI tasks as well as building accelerators, Nvidia GPUs are currently the de facto data center GPU and AI hardware acceleration option for everything from VDI to machine learning on Kubernetes. They include “tensor cores” in addition to the usual shader cores that do graphics rendering (and can be repurposed for parallel computing with CUDA); tensor cores are optimized for the matrix operations that underlie machine learning training.
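As a rough illustration of the matrix operations mentioned above, the core primitive is a multiply-accumulate over small matrix tiles, D = A × B + C. The plain-Python sketch below only shows the arithmetic; the function name `matmul_accumulate` and the 2×2 values are made up for illustration, and real tensor cores perform this in hardware on fixed-size tiles rather than in a software loop.

```python
def matmul_accumulate(a, b, c):
    """Return D = A x B + C for matrices given as lists of rows.

    Illustrative only: this is the arithmetic a matrix multiply-accumulate
    unit performs, written as ordinary Python loops.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [
            c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Hypothetical 2x2 tiles (e.g. activations, weights and a running sum).
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 0], [0, 1]]

d = matmul_accumulate(a, b, c)
print(d)
```

Training a neural network repeats this operation across very large matrices, which is why dedicated hardware for it pays off.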
Arm and Nvidia also both have significant software efforts supporting AI development; Nvidia bringing CUDA (and its domain-specific acceleration libraries) to Arm substantially expands the workloads it will be suitable for.
But just dropping GPUs into a server rack doesn’t take advantage of all the integration options. Networking is part of that; as Kubernetes co-creator Brendan Burns pointed out to the New Stack recently, Azure has enough InfiniBand available for interconnecting bonded GPUs that AKS needs to be able to schedule workloads that can take advantage of the speed onto that specific hardware.
The “AI supercomputer” Microsoft built in Azure this year uses Nvidia A100 Tensor Core GPUs, which Azure CTO Mark Russinovich said at the recent Ignite conference are “the current state of the art for optimized DNN,” interconnected with high bandwidth, low-latency NVLink and NVSwitch and using InfiniBand between the GPU clusters to scale from the eight GPUs in one cluster to thousands of GPUs for training at scale. That’s the infrastructure that OpenAI ran its 175 billion parameter GPT-3 model on. It was built with AMD Rome CPUs, but this is a market Arm’s new generation of Neoverse CPUs is designed for.
Intel is ahead of Arm on machine learning workloads in the data center, Kevin Krewell, principal analyst at TIRIAS Research told the New Stack. “Intel has an array of accelerators and DLBoost instructions in the CPU. Arm’s accelerator strategy is still developing. Now the combination of Arm with Nvidia CUDA ecosystem will certainly change that equation. This is why the combination could be a very potent competitor to Intel (and AMD).”
The Mellanox acquisition doesn’t just give Nvidia expertise with InfiniBand (also useful for integrating high-speed storage); it brings a SmartNIC architecture built on Arm processors that Nvidia is branding as a new type of processor — the data processing unit (DPU), for moving data around.
The idea of a SmartNIC is that for some workloads, the network card can offload work that would otherwise slow down the CPU, whether that’s network function virtualization that’s been moved from network hardware to commodity servers or on-server functionality like running a firewall. AWS built its Nitro SmartNICs on Arm chips but when networking started consuming CPUs that Microsoft wanted to use for IaaS hosting in Azure, Microsoft turned to FPGAs to deliver smart networking without the CPU load and the same FPGA infrastructure now powers workloads from the Bing index to encryption, and especially AI acceleration.
Few enterprises will need or have the ability to use FPGAs and while programmable SmartNICs promise similar benefits, Mellanox had struggled to find customers sophisticated enough to take advantage of them beyond the storage server vendors and hyperscale clouds; that’s the reason Intel’s SmartNICs are less programmable, letting users load pre-written accelerations from a gallery.
BlueField-2 has both CPU and tensor cores for offloading various IO, storage and security functionality from the server CPU to the network card, like storage RDMA processing, 100Gbps IPsec, real-time traffic inspection and online analytics of video files.
“We bring the power of AI to the bandwidth and storage,” Manuvir Das, head of enterprise computing at Nvidia told us. “We can use the tensor cores to do intelligent analysis of what is going on on the network, like intrusion detection, where, what you want to do is identify what is normal versus abnormal behavior so you can proactively detect and block abnormal behavior.”
Doing security in the network card provides more isolation, Das pointed out (because malware can be analyzed and blocked without affecting CPU performance). “In the modern data center security has to be done not just at the perimeter but at the level of the server, so that every application can protect itself from other applications. This is why we believe, along with many in the industry, that going forward, every server regardless of the workload it performs will require one of these DPU devices inside of it.”
Nvidia’s deal with VMware to support BlueField-2 DPUs in vSphere (including for Kubernetes) by moving functionality from the ESXi hypervisor to the DPU makes the advanced functionality more accessible, Das told us. “Customers will be able to benefit from the DPU transparently; they don’t even have to think about it, they just upgrade to the latest version of vSphere.” Nvidia is also talking with mainstream server OEMs and offering a software framework, DOCA (Data Center Infrastructure-on-a-Chip Architecture), with an SDK and APIs for working with existing accelerated networking and storage drivers like DPDK and SPDK so that software vendors and open source projects can build in DPU support.
Softening Nvidia IP
Mellanox had been working on BlueField DPUs using Arm chips without ever needing to consider acquiring Arm. Nvidia already has its own Arm license for building the Tegra SoCs used in phones, tablets, cars and the Nintendo Switch. “We want to bring Nvidia accelerated computing to Arm, which is the world’s most popular CPU, and offer acceleration to the Arm ecosystem,” Huang said — but why did that mean buying Arm?
It’s because if Nvidia brought its IP and its partner and developer ecosystem to the Arm market without owning Arm, it wouldn’t get the benefit of owning the licensing and distribution channel, he told us.
“All successful companies have a distribution channel and their network to the customer is one of the most valuable parts. I can license Arm’s CPU core, but I can’t license the distribution channel. Their distribution channel, the word I use is ecosystem, because it’s thousands of hardware makers, it’s hundreds of systems makers. It’s millions of software developers. We don’t get the benefit of that by being a CPU licensee. We would like to put Nvidia architecture and Nvidia accelerated AI into that channel. That’s what I’m paying $40 billion for; that took 30 years to build. It’s not something we can get without buying the company.”
When Nvidia designs reference systems like the DGX and EGX, Huang says they sell a “handful” of them as full systems, but they always sell suppliers components. “We create the systems and then we atomize them into chips,” as Huang puts it. Now, Nvidia can adopt Arm’s licensing model for its own IP — with a ready-made audience — instead of only selling hardware (and giving CUDA away as a reason to buy it).
That could be easier said than done, Krewell warned. “How Nvidia delivers its IP (and how much it charges) will be an interesting challenge. Jensen is pitching that Nvidia is also an IP company, but delivers its IP in more hardened form. The real challenge is how freely Nvidia will open up its IP to the whole Arm ecosystem — including potential competitors. Nvidia will need to earn the trust of the Arm ecosystem.”
But Nvidia does have experience appealing to developers, he noted. “The most important part of the integration story will be the software stack. Nvidia has a great story here.”
Huang also believes that Arm needs Nvidia’s capabilities to make the move to servers. He compared the stack that Arm and Nvidia can offer to the innovation that Amazon brought to cloud computing. “When you look under the hood of AWS, one of the most valuable computing platform companies in the world today and they innovated cloud computing, you know what’s inside — a bunch of servers. A server is involved in cloud computing but a server is not what cloud computing is, it includes a lot more. The software stack is very novel. The way that it’s managed is very novel. Cloud computing is not just a bunch of servers sitting in a data center, otherwise, a whole bunch of server companies would be cloud computing companies and they’re not.”
Stacking the Future
The future of servers (including cloud computing) is increasingly heterogeneous, with a mix of cores and accelerators, not just inside a server but packaged together at the processor level, using open interconnects. Arm has been working on chip-level interfaces including the CCIX protocol for communications between cores; that’s moving from multi-socket computing to chiplet architectures that package multiple technologies together (which Intel has been investing in since 2018, for both CPUs and FPGAs).
“What we see coming after [chiplets] is tightly coupled heterogeneous compute,” Bergey said. “With the slowdown of Moore’s Law scaling, there is interest in chip to chip coupling of ARM CPU complexes with a variety of accelerators and memory.” Nvidia’s IP could be very useful there.
Arm CEO Simon Segars mentioned the importance of this at Arm DevSummit earlier this month. “Putting different die together that were manufactured on completely different processes in really sophisticated packaging, that all matters. Being able to stack die together, to true 3D. I see these as the innovations that are going to keep setting up the performance generation after generation.”
Arm has also been adding support for CXL, for memory pooling and expansion, which require cache-coherent interconnects, Bergey added. “It could involve sharing a large pool of memory across a set of connected nodes, or it could mean just attaching a large amount of emerging memory to a single node. CXL is proving to be the preferred way to attach accelerators where accelerators in the host can coherently access each other’s memory. The most obvious use cases here are ML training and inference, but we expect new use cases to emerge by the time this hits the market.”
That’s expertise Nvidia doesn’t have, Krewell said. “We have yet to see Nvidia embrace chiplets. The company tends to bet on its ability to build large chips. That will change as Moore’s Law slows and performance requirements still need to scale up. CCIX and CXL will become important in 2021 to allow more intelligent data movement/sharing between CPU and accelerators.”
But there are also emerging trends that Nvidia’s acquisitions haven’t yet prepared it for.
Hyperscale clouds are prone to the problem of fragmentation; sharing fixed resources between customers means carving up hardware in ways that can leave isolated pockets of unused resources. To avoid that, hyperscalers like Azure are looking at how to disaggregate hardware inside the data center: a single security root of trust for motherboard firmware and peripherals alike, computational storage such as separating the storage in an SSD from the controller to add inline acceleration for compression, or extending the loosely coupled architecture of GPU clusters by pooling and composing disaggregated GPUs and FPGAs over optical interconnect.
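To make the fragmentation problem concrete, here is a minimal Python sketch. The server sizes, the tenant requests and the first-fit placement rule are all invented for illustration; this is not a description of how Azure or any other cloud schedules hardware. It compares carving requests out of fixed servers with drawing the same requests from one disaggregated pool.

```python
def first_fit(servers, requests):
    """Place each request on the first server with enough free GPUs.

    Returns (free GPUs left on each server, requests that could not be placed).
    """
    free = list(servers)
    rejected = []
    for need in requests:
        for i, avail in enumerate(free):
            if avail >= need:
                free[i] -= need
                break
        else:
            # No single server had enough free GPUs for this request.
            rejected.append(need)
    return free, rejected


def pooled(total, requests):
    """Serve requests from one shared pool of GPUs.

    Returns (GPUs left in the pool, requests that could not be served).
    """
    remaining = total
    rejected = []
    for need in requests:
        if need <= remaining:
            remaining -= need
        else:
            rejected.append(need)
    return remaining, rejected


servers = [8, 8, 8]        # hypothetical: three servers with 8 GPUs each
requests = [5, 5, 5, 6]    # hypothetical tenant requests, in GPUs

free, rejected = first_fit(servers, requests)
pool_left, pool_rejected = pooled(sum(servers), requests)

print("fixed servers:", free, "rejected:", rejected)
print("pooled:", pool_left, "rejected:", pool_rejected)
```

With fixed servers, nine GPUs sit idle in three pockets of three, yet the request for six cannot be placed; the pooled version serves every request from the same total capacity.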
At this year’s OCP Summit, Microsoft distinguished engineer Kushagra Vaid suggested that disaggregation is the way to keep Moore’s Law going. “While many are arguing that we are reaching the limits of Moore’s Law, we believe that Moore’s Law can be applied to halving the cost every two years for the whole data center campus — not just the chip. This means that collectively the network, the hardware, and data center are optimized as a system to enable that goal.”
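The arithmetic behind “halving the cost every two years” can be written out as a short Python sketch. The function name `projected_cost` and the starting figure of 100 cost units are assumptions for illustration, not figures from Microsoft.

```python
def projected_cost(initial_cost, years, halving_period=2.0):
    """Cost after `years` if it halves once every `halving_period` years."""
    return initial_cost * 0.5 ** (years / halving_period)


# Hypothetical starting cost of 100 units for a fixed amount of capacity.
for year in (0, 2, 4, 6):
    print(year, projected_cost(100.0, year))
```

Under that assumption the same capacity costs an eighth as much after six years, which is the compounding effect Vaid wants the whole campus to deliver, not just the chip.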
Disaggregation will push so much traffic within data centers that it may require moving to “co-packaging” silicon and optical networking (rather than the current pluggable optical modules), which may need silicon photonics to reduce power and cooling needs as well as latency.
Microsoft and Facebook are collaborating on designs for co-packaged optics, Cisco bought silicon photonics company Luxtera in 2018 to help it develop next-generation ASICs using “co-packaged” optics and Intel and Xilinx are both working on packaging optics with FPGAs for fast interconnects. Intel has already shipped 3 million silicon photonics transceivers and is using silicon photonics in its Tofino 2 programmable Ethernet switch on a chip. Last year it acquired Barefoot Networks for its expertise building those, and it already builds SmartNICs that could have silicon photonics ports.
Mellanox had been working on silicon photonics, but shut that development down in 2018, so Nvidia may need to restart that or make further acquisitions to compete directly here.
But it certainly looks as if delivering the future of the data center will increasingly require a large portfolio of technologies.
AMD has now announced it’s in advanced talks to buy Xilinx, whose FPGAs are used in SmartNICs, computational storage and AI inferencing accelerators (sometimes with Arm cores). Xilinx CFO Brice Hill predicted recently that data center would be the company’s fastest-growing segment (as well as noting that Xilinx needed more experience working with hyperscalers to succeed — experience that AMD already has).
Again, that’s an acquisition that would make it easier for AMD to create heterogeneous compute packages using cache-coherent interconnects for CPUs and accelerators. AMD is also investing in the software front, from its open source ROCm (roughly equivalent to Intel’s oneAPI programming model) to new tools for working with CUDA-competitor OpenCL.
Assuming Intel can deliver on discrete GPUs this time (it’s failed to do so in the past), it will be able to offer a full-stack software and hardware platform, from CPU to GPU to network hardware to acceleration to oneAPI, that it believes customers want. This is the level Nvidia wants to compete at, and buying Arm is a way to get the most value out of the platform approach along the way.
Nvidia will have early access to Arm designs and raising licensing costs will be tempting, but to keep Arm’s industry-wide value, its IP will need to stay affordable and neutral. The question is whether it can successfully integrate Arm, build on top of it, shift its own licensing pattern and create full-stack solutions without killing Arm’s neutrality — because multiple suppliers building on the same IP is what’s driven so much innovation in the Arm ecosystem.
Amazon Web Services and Intel are sponsors of The New Stack.