Nvidia is announcing today that its NeMo Megatron product, an open source full-stack framework for developing and managing large language models (LLMs), will ship with several improvements that reduce LLM training times. Since LLMs are colossal in size, often having hundreds of billions or even on the order of a trillion tunable parameters, even small improvements can be highly impactful. But these improvements are not small; Nvidia says they can trim training times by as much as 30%.
LLMs are a specific type of deep learning/neural network model, used for a variety of natural language use cases, including content generation, text summarization, chatbots and other conversational AI applications. LLMs are also quite versatile, with pre-trained models being generally applicable to numerous tasks, rather than custom-designed for particular ones, as is the case with other types of neural network models. LLMs’ complexity delivers a big benefit, but that only comes as a reward for a great deal of work.
The New Stack spoke with Ujval Kapasi, Nvidia’s VP of Deep Learning Software, who said that “a lot of the work we’ve been doing at Nvidia over the last few years has been to build hardware and software optimized to accelerate the training and the inference and deployment of these neural networks.”
That definitely seems to be the credo in place for these NeMo Megatron improvements, which come down to:
- Two novel approaches in training LLMs: selective activation recomputation and sequence parallelism.
- A new hyperparameter tool that optimizes training based on the desired model size and infrastructure resources available.
Kapasi explained each of these technological advancements in refreshingly plain English. In colloquial terms, they all come down to working smarter, not harder. I’ll attempt to convey how each of the NeMo Megatron improvements does this.
Go Back and Do It Again
Training deep learning models in general, and LLMs specifically, involves a process of iterative improvement. Kapasi explained that at first, a model produces naive predictions: “the basic approach is… it starts out with completely randomized data…and the neural network makes predictions [that are] completely wrong.” But as those predictions are compared to their actual ground truth values, weightings can be adjusted, and results get progressively better.
As the forward pass that generates the predictions is completed, a lot of memory may be required to retain the intermediate activation values for the backward pass, where the weightings are adjusted. To avoid the memory hit, the values can instead be recomputed, but that drives up the compute resources required. Neither choice seems pleasant, but simply recomputing everything has been the norm.
Turns out, there is a better way. Selective activation recomputation (SAR) offers a compromise. It prioritizes recomputation of values that take a significant amount of memory and whose calculations have relatively small compute needs. This leaves more memory that can be used to cache activation values that would involve more resource-intensive recomputation.
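To make the trade-off concrete, here is a toy sketch in Python. It is not Nvidia's implementation: the greedy rule, the `Activation` class, the activation names and every number are assumptions made up for illustration. It simply picks values to recompute by how much memory they free per unit of recompute cost, until what remains fits a memory budget.

```python
from dataclasses import dataclass


@dataclass
class Activation:
    name: str
    memory_mb: float       # memory needed to keep it for the backward pass
    recompute_cost: float  # relative compute needed to regenerate it


def choose_recompute(activations, memory_budget_mb):
    """Greedy toy heuristic: drop the activations that free the most
    memory per unit of recompute cost until the rest fits the budget."""
    stored = sum(a.memory_mb for a in activations)
    # Best candidates first: large memory footprint, cheap to recompute.
    ranked = sorted(activations,
                    key=lambda a: a.memory_mb / a.recompute_cost,
                    reverse=True)
    recompute = []
    for act in ranked:
        if stored <= memory_budget_mb:
            break
        recompute.append(act.name)
        stored -= act.memory_mb
    return recompute, stored


# Hypothetical names and made-up numbers, for illustration only.
layer = [
    Activation("attention_softmax", memory_mb=400, recompute_cost=1.0),
    Activation("attention_dropout", memory_mb=300, recompute_cost=1.0),
    Activation("mlp_matmul_output", memory_mb=200, recompute_cost=20.0),
    Activation("layernorm_output", memory_mb=100, recompute_cost=2.0),
]

recompute, stored_mb = choose_recompute(layer, memory_budget_mb=350)
print("recompute:", recompute)
print("kept in memory (MB):", stored_mb)
```

With these made-up figures, the two large but cheap values are recomputed, while the expensive matrix-multiply output stays cached.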
Parallelism and Heuristics
Another facet of LLM training involves parallelization within a model’s transformer layer. While many tasks can be tensor parallelized across multiple GPUs, others are simply replicated on each one. But the new sequence parallelism (SP) technology in NeMo Megatron parallelizes these tasks as well, along the sequence dimension, further reducing compute resource requirements and speeding the training process.
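Why can such work be split along the sequence at all? A small Python sketch may help. It assumes layer normalization as the example of a replicated task (the description above does not name one) and simulates "devices" with plain lists. Because the operation treats each token independently, normalizing chunks of the sequence separately and gathering them gives the same result as normalizing everything on every device.

```python
import math


def layer_norm(token, eps=1e-5):
    """Normalize one token's feature vector to zero mean, unit variance."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]


def split_sequence(tokens, num_devices):
    """Cut the sequence into contiguous chunks, one per simulated device."""
    chunk = math.ceil(len(tokens) / num_devices)
    return [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]


# A toy "sequence" of 8 tokens with 4 features each (made-up values).
sequence = [[float(t + f * f) for f in range(4)] for t in range(8)]

# Replicated: every device would normalize all 8 tokens.
replicated = [layer_norm(tok) for tok in sequence]

# Sequence parallel: each device normalizes only its own 2 tokens, then
# the chunks are gathered back in their original order.
shards = split_sequence(sequence, num_devices=4)
gathered = [layer_norm(tok) for shard in shards for tok in shard]

print("tokens per device:", [len(s) for s in shards])
print("same result:", gathered == replicated)
```

In this simulation each device handles a quarter of the tokens. A real system would also need communication steps to gather the chunks, which the sketch leaves out.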
Finally, there is the issue of moving past parameters and instead tuning hyperparameters, which govern the training approach taken. Rather than cycling through a range of values by brute force, NeMo Megatron’s hyperparameter tool (HP tool) sets these values based on the compute environment and requirements, for example, the number of GPUs/size of the GPU cluster and the desired size of the model. While some range testing is still involved, there’s much less of it, which speeds the hyperparameter tuning process and optimizes the training strategy, thereby speeding up the broader training process as well.
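The idea of narrowing the search before testing can be sketched in Python. This is not how Nvidia's HP tool works internally; the function `candidate_layouts`, its two pruning rules and its default values are all assumptions for illustration. It uses a rough rule of thumb of about 16 bytes of model state per parameter for mixed-precision training with the Adam optimizer, ignores activation memory and layer counts, and keeps only the ways of splitting a model across GPUs that pass both rules.

```python
def candidate_layouts(num_gpus, params_billion, gpu_memory_gb=80,
                      gpus_per_node=8, bytes_per_param=16):
    """List (tensor, pipeline, data) parallel sizes that pass two rough
    rules of thumb, so only these would need to be benchmarked."""
    # Billions of parameters times bytes per parameter gives gigabytes.
    model_gb = params_billion * bytes_per_param
    layouts = []
    tensor = 1
    while tensor <= gpus_per_node:          # keep tensor parallelism in-node
        for pipeline in range(1, num_gpus // tensor + 1):
            model_parallel = tensor * pipeline
            if num_gpus % model_parallel:
                continue                    # must divide the cluster evenly
            if model_gb / model_parallel > gpu_memory_gb:
                continue                    # model state would not fit
            layouts.append((tensor, pipeline, num_gpus // model_parallel))
        tensor *= 2
    return layouts


layouts = candidate_layouts(num_gpus=1024, params_billion=175)
print(len(layouts), "layouts left to test, for example", layouts[:3])
```

Only the surviving layouts would then go through the range testing mentioned above, rather than every possible combination.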
Again, these three advances together provide training speed-ups of up to 30%, according to Nvidia. The company says that training can now be done on 175-billion-parameter models using 1,024 Nvidia A100 GPUs in 24 days. That may still sound big, but it represents a time reduction of 10 days, which works out to saving about 250,000 hours of GPU compute over building such models without SAR, SP and the HP tool. Multiply that 250,000 number by your cloud provider’s hourly GPU compute cost and pretty soon it adds up to real money (with apologies to Senator Dirksen).
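For readers who want to check the math, the figures above can be reproduced in a few lines of Python. The hourly rate is a hypothetical placeholder, not a quoted price, and the percentage assumes the 30% figure refers to the reduction in total training time.

```python
gpus = 1024
days_now = 24
days_saved = 10
days_before = days_now + days_saved  # 34 days without the improvements

# GPU hours saved: every GPU runs 24 hours a day for 10 fewer days.
gpu_hours_saved = gpus * days_saved * 24

# Share of the original training time that is saved, as a percentage.
percent_saved = round(100 * days_saved / days_before, 1)

# Hypothetical price per GPU hour; substitute your provider's actual rate.
hourly_rate_usd = 2.50
cost_saved_usd = gpu_hours_saved * hourly_rate_usd

print(gpu_hours_saved, "GPU hours saved")
print(percent_saved, "percent less training time")
print(cost_saved_usd, "USD saved at the assumed rate")
```

The exact result is 245,760 GPU hours, which rounds to the roughly 250,000 cited, and 10 of 34 days is about 29.4%, in line with "up to 30%."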
While non-data scientists may find all of this a bit esoteric, the downstream benefits should be clear: a greater number of bigger, more accurate LLMs will be available more quickly, for mainstream developers to use in their own applications. And it all comes down to efficiency, parallelization and better training strategy.
Nvidia says the new NeMo Megatron capabilities are available to early access customers to run on Nvidia DGX SuperPODs and Nvidia DGX Foundry, as well as on the Microsoft Azure cloud. They’re also available on Nvidia LaunchPad, a free hands-on lab platform.