Cloud Services / IoT Edge Computing / Machine Learning

Nvidia Offers Hosted Large-Scale Processing for AI

4 Aug 2021 11:15am, by

Chip manufacturer Nvidia is following through with promises to make artificial intelligence (AI) technologies more available to mainstream enterprise developers and data scientists through easier access to the company’s powerful supercomputing offerings.

The GPU maker this week is making its Base Command Platform generally available to North American companies. The platform is a hosted AI development offering that is based on Nvidia’s DGX SuperPod supercomputer and is part of its larger LaunchPad program for making its AI and machine learning infrastructure available via hosted platforms.

Nvidia has been building up to this over the past several months. Officials unveiled DGX SuperPod at its GTC event in April, describing it as a cloud native, multitenant AI supercomputer. In late May, the company unveiled Base Command and a month later introduced the LaunchPad program and Fleet Command, another hosted platform designed to enable organizations to deploy and manage highly distributed AI workloads.

The vendor is looking for ways to extend AI and machine learning capabilities beyond the realm of research institutes, high-performance computing (HPC) centers and government agencies and into mainstream enterprises for such use cases as natural language processing and video analysis, according to Stephan Fabel, senior director of product management for Base Command Platform.

Buying a DGX SuperPod and bringing it on premises can come with a hefty price tag: the supercomputer, armed with Nvidia’s BlueField data processing units (DPUs), starts at 20 nodes, scales in 20-node increments and comes with networking and storage tools. The cost and work needed to run and manage a SuperPod can put it out of reach for many enterprises, Fabel told The New Stack.

In addition, organizations may not need all of that compute power over an extended period of time. What they do need is better performance when running AI workloads.

“That concept resonated really well with a whole host of additional companies out of a market segment that we typically wouldn’t address with a DGX SuperPod,” he said. “In our push to democratize AI and go into the broader enterprise market, [Base Command Platform] in this hosted environment now is our first step towards expanding into a larger market segment than we traditionally would have been able to target.”

Base Command Platform enables organizations to lease the compute power they need. The DGX SuperPods are hosted in facilities run by data center provider Equinix. In addition, Nvidia is partnering with NetApp on an integrated data management solution that provides high-performance storage that can handle accelerated AI computing.


Another partner is Weights & Biases, a provider of machine learning developer tools that will bring MLOps software to the Base Command Platform. The software will enable enterprises to track experiments as well as handle data versioning and model visualization. Developers and data scientists also can leverage Nvidia’s NGC catalog of GPU-optimized software for HPC, AI and machine learning.

While Base Command will lift much of the cost burden from enterprises, shifting spending from capital to operational expenses, it isn’t cheap. Users must commit to at least a three-month lease for DGX SuperPod access at $90,000 a month, and they can extend the lease or scale up when needed. Enterprises also get dedicated all-flash storage from NetApp on a similar subscription basis. Still, Fabel said, the OpEx model and leasing option mean that developers and data scientists can pursue AI projects they might otherwise have skipped as too expensive, on high-performance infrastructure that will run them faster than other hardware.
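The lease terms cited above imply a sizable floor on spending. A quick back-of-the-envelope sketch, using only the figures in this article ($90,000 a month, three-month minimum):

```python
# Rough cost math for a Base Command Platform lease, based on the
# figures cited in this article; actual pricing terms may differ.
MONTHLY_RATE_USD = 90_000
MIN_LEASE_MONTHS = 3

def lease_cost(months: int) -> int:
    """Total lease cost in USD for the given number of months."""
    if months < MIN_LEASE_MONTHS:
        raise ValueError(f"minimum lease is {MIN_LEASE_MONTHS} months")
    return months * MONTHLY_RATE_USD

print(lease_cost(3))   # minimum commitment: 270000 ($270,000)
print(lease_cost(12))  # a full year: 1080000 ($1.08 million)
```

Even the minimum commitment, then, runs $270,000 before NetApp storage subscription costs are added.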

He also noted that the combination of the Base Command and Fleet Command platforms gives enterprises a complete AI solution, with Base Command being used for AI training and creating AI models leveraging cloud-native containers.

“With our sister program called Fleet Command, they can now deploy this in production,” he said. “This one-stop-shop of creating and training and researching models, coming up with better and better models to run their inference on jointly coupled with a platform to actually deploy those, that completes that story and relieves companies and enterprises from having to implement all of that management of software … by themselves. Having that immediate access for the data scientists and then being able to deploy the product of their work immediately into production is where the value is.”

DGX SuperPod is an AI supercomputer that can scale from 20 to 140 DGX A100 systems with BlueField DPUs, storage that can scale from one to 10 petabytes, a networking fabric that offers up to 200 Gb/s and Nvidia’s DGX and CUDA-X software stack. It delivers 100 to 700 petaflops of performance.
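The figures above imply roughly linear scaling: 100 petaflops across 20 systems and 700 across 140 works out to about 5 petaflops of AI performance per DGX A100 system. A minimal sketch of that estimate, assuming strictly linear scaling (real-world throughput will vary by workload):

```python
# Back-of-the-envelope SuperPod performance estimate, derived from the
# article's figures: 20 nodes ~ 100 PFLOPS, 140 nodes ~ 700 PFLOPS,
# i.e. roughly 5 PFLOPS per DGX A100 system, assuming linear scaling.
PFLOPS_PER_DGX_A100 = 5
MIN_NODES, MAX_NODES, NODE_STEP = 20, 140, 20

def superpod_pflops(nodes: int) -> int:
    """Estimated aggregate AI performance for a SuperPod of `nodes` systems."""
    if nodes < MIN_NODES or nodes > MAX_NODES or nodes % NODE_STEP != 0:
        raise ValueError("SuperPod scales from 20 to 140 systems in 20-node steps")
    return nodes * PFLOPS_PER_DGX_A100

print(superpod_pflops(20))   # 100
print(superpod_pflops(140))  # 700
```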

Nvidia is among a growing number of established hardware makers — including Hewlett Packard Enterprise, Dell Technologies, Cisco Systems and Lenovo — that are shifting quickly to offering more of their portfolios as a service for cloud-based environments. Through these as-a-service efforts, the vendors are enabling organizations to create cloud-like infrastructures for their on-premises environments — such as having flexible consumption models — as well as offerings that can be accessed via the public clouds.

The goals and models are similar, but Nvidia’s Base Command Platform is more of a rental offering — with leasing rules, baseline timeframes and a monthly fee — than an as-a-service program, Fabel said.

“This is really a dedicated cluster,” he said. “All the high-speed InfiniBand networking, following the straight reference architecture and then leveraging basically a platform as an entry point to consume that. If you wanted to scale that up, then you would have to give us notice and you’d get a fourth node or fifth node and so on and so forth.”