
Meeting the Operational Challenges of Training LLMs

To train a large language model, you must overcome three big challenges: data, hardware and legal. It helps to be a large organization, too.
Aug 14th, 2023 5:00am by
Image by Conny Schneider from Unsplash. 

Large language models (or LLMs) such as GPT-4, GPT-NeoX, PaLM, OPT and Macaw are a recent breakthrough in machine learning that has particularly captured the public imagination, with OpenAI’s GPT-4 garnering the bulk of the media attention.

Generally based on transformer architectures with potentially hundreds of billions of parameters, these models are pre-trained on an extensive corpus of text via self-supervised learning, and then aligned with human preferences via techniques such as reinforcement learning from human feedback (RLHF).
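To make “self-supervised” concrete: pre-training simply asks the model to predict the next token of raw text, over and over, at enormous scale; alignment via RLHF is a separate, later stage built on top of a model trained this way. The sketch below shows the core of that pre-training objective in PyTorch — the model, optimizer and batch of token IDs are placeholders, not any vendor’s actual training code.

```python
# A minimal sketch of the self-supervised pre-training objective: next-token
# prediction with a cross-entropy loss. The model, optimizer and token batch
# are placeholders, not any published training code.
import torch
import torch.nn.functional as F

def pretraining_step(model, token_ids, optimizer):
    """One causal language-modeling step on a batch of token IDs."""
    inputs = token_ids[:, :-1]        # every token except the last
    targets = token_ids[:, 1:]        # the same sequence shifted left by one
    logits = model(inputs)            # (batch, seq_len - 1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten positions
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```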

LLMs have demonstrated significant capabilities over a wide range of tasks, including summarization, content generation, coding, and translation. They also have limitations. “ChatGPT is not a knowledge engine, it just autocompletes text,” Roland Meertens, a machine learning scientist, told The New Stack.

As OpenAI itself notes when talking about GPT-4, the LLM “hallucinates” (or invents) facts, a phenomenon that is still not fully understood, and will also make reasoning errors.

Given that hallucinations can create made-up information, OpenAI stated that, “great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of a specific use-case.”

In other words, using an LLM in a situation where accuracy matters, at least without considerable human oversight, is unwise.

While the training methodology is seemingly straightforward, training a model from scratch isn’t a trivial undertaking. However, for corporate users, “if you have a large amount of data you can use, and you don’t want to share any of it with a third party like OpenAI, it can make sense to keep everything in-house,” Meertens said.

“But you do need to be a big player, like a tech giant, in order to make this work — unless your domain is in some way very constrained, and you can get away with a smaller model.”

Why You Need Significant Resources to Train LLMs

The reason that training an LLM is best suited to large organizations is that there are at least three major barriers to overcome: data, hardware and legal. We’ll begin by addressing the first two.

For starters, “there is a data moat in terms of who can actually get access to data on the scale that is required by these models,” Phil Winder, CEO of Winder.AI, an AI consultancy firm, told The New Stack. “That really limits you to the current data ‘gods’—firms like Google and Facebook.”

While there are publicly accessible data sets, such as Common Crawl, which was used in the training for GPT-3, ethical considerations can arise.

The issue is that datasets pulled from the internet require extensive cleanup before they can be used — and this comes at a considerable human cost, since much of the material that needs to be removed is graphic, disturbing NSFW content.
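Much of that cleanup is done with automated filters, with human reviewers handling whatever the filters cannot reliably catch. As a rough, hypothetical sketch (the blocklist and toxicity classifier here are stand-ins, not any specific pipeline), a single filtering pass might look like this:

```python
# A simplified, hypothetical sketch of a corpus-filtering pass for web-scraped
# text. Real pipelines (e.g., those built on Common Crawl) combine many such
# heuristics with deduplication, quality classifiers and, ultimately, human review.
def clean_corpus(documents, blocklist, toxicity_score, threshold=0.5):
    kept = []
    for doc in documents:
        text = doc.strip()
        if not text:
            continue                                  # drop empty records
        lowered = text.lower()
        if any(term in lowered for term in blocklist):
            continue                                  # crude keyword filter
        if toxicity_score(text) >= threshold:
            continue                                  # model-based content filter
        kept.append(text)
    return kept
```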


As Time magazine uncovered, OpenAI has attempted to solve this by taking a leaf out of the playbook of the social media companies and outsourcing the work to people in places such as Kenya who have fewer work options, paying the individual reviewers less than $2 per hour.

The outsourcing firm the company used for this, Sama, canceled its work for OpenAI in February 2022, eight months earlier than contracted, in part because of the traumatic nature of the work.

The second barrier comes from the fact that training an LLM requires access to high-performance hardware and large numbers of specialized accelerators such as GPUs or Google’s custom-developed Tensor Processing Units (TPUs).

Training Google’s PaLM, for example, required 6,144 TPU v4 chips spread across two TPU v4 pods connected over a data center network. Meta AI’s OPT was comparatively computationally efficient, but still used 992 80GB NVIDIA A100 GPUs.

As you would expect when operating at this sort of scale, hardware failures are common, requiring either manual or automatic restarts during the training process.

Meta’s OPT whitepaper describes how “in total, hardware failures contributed to at least 35 manual restarts and the cycling of over 100 hosts over the course of two months. During manual restarts, the training run was paused, and a series of diagnostic tests were conducted to detect problematic nodes. Flagged nodes were then cordoned off and training was resumed from the last saved checkpoint. Given the difference between the number of hosts cycled out and the number of manual restarts, we estimate 70+ automatic restarts due to hardware failures.”
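In practice, surviving that kind of churn comes down to aggressive checkpointing: save the training state frequently, and on any restart (manual or automatic) resume from the last good checkpoint rather than from step zero. A minimal sketch of that pattern follows; the checkpoint path and the per-step training function are illustrative placeholders, not Meta’s actual code.

```python
# A minimal checkpoint-and-resume sketch of the pattern the OPT logbook
# describes. The checkpoint directory and train_one_step() are placeholders.
import os
import torch

CKPT_DIR = "/checkpoints/run-01"   # hypothetical path on shared storage

def latest_checkpoint():
    files = sorted(f for f in os.listdir(CKPT_DIR) if f.endswith(".pt"))
    return os.path.join(CKPT_DIR, files[-1]) if files else None

def train(model, optimizer, train_one_step, total_steps, save_every=1000):
    step = 0
    ckpt = latest_checkpoint()
    if ckpt:                                       # restart after a failure
        state = torch.load(ckpt)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        step = state["step"]
    while step < total_steps:
        train_one_step(model, optimizer)           # caller-supplied training step
        step += 1
        if step % save_every == 0:                 # periodic checkpoint
            torch.save(
                {"model": model.state_dict(),
                 "optimizer": optimizer.state_dict(),
                 "step": step},
                os.path.join(CKPT_DIR, f"step_{step:08d}.pt"),
            )
```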

Training is also hugely time-consuming, typically taking hundreds of thousands of device-days of compute. To offset this, a mixture of parallelism techniques is used to partition the models into pieces that fit into the memory of individual devices and to use the compute across them efficiently, with the results combined at intervals to obtain the global model.

Training both the PaLM and OPT models was done using a combination of data parallelism, which shards the training data and distributes it across the various nodes, and tensor parallelism, which divides large matrix multiplications into smaller sub-matrix calculations and then executes them in parallel across multiple devices.
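To illustrate the tensor-parallel part of that, the toy sketch below splits a single weight matrix column-wise across several “devices” (here just separate NumPy arrays) and stitches the partial results back together; data parallelism, by contrast, would split the batch dimension instead. Real frameworks do this across GPUs or TPUs with collective communication, which is where the bandwidth requirement discussed next comes from.

```python
# Toy illustration of tensor parallelism: one large matrix multiplication is
# split column-wise across "devices" and the partial outputs are concatenated.
# Only the arithmetic is shown; real systems add cross-device communication.
import numpy as np

def tensor_parallel_matmul(x, weight, num_devices=4):
    shards = np.array_split(weight, num_devices, axis=1)   # one weight shard per device
    partial_outputs = [x @ shard for shard in shards]      # each device works independently
    return np.concatenate(partial_outputs, axis=1)         # "all-gather" of the results

x = np.random.randn(8, 512)      # a batch of activations
w = np.random.randn(512, 2048)   # a "large" weight matrix
assert np.allclose(tensor_parallel_matmul(x, w), x @ w)    # matches the unsplit result
```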

This is obviously more efficient, but does require high communication bandwidth between nodes, with InfiniBand often used to move the data, and it adds to the training costs. A rough, back-of-an-envelope calculation estimated that the cost of training the PaLM model might be as high as $23 million.
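That kind of estimate falls out of simple arithmetic: chip count, times training time, times a cloud price per chip-hour. In the sketch below, only the 6,144-chip figure comes from the PaLM paper; the wall-clock time and hourly price are illustrative assumptions, but they land in the same ballpark.

```python
# Back-of-the-envelope training cost: chips x hours x price per chip-hour.
# The wall-clock time and hourly price are assumptions, not published figures.
NUM_CHIPS = 6144              # TPU v4 chips, per the PaLM paper
TRAINING_DAYS = 50            # assumed wall-clock training time
PRICE_PER_CHIP_HOUR = 3.00    # assumed on-demand cloud price, USD

chip_hours = NUM_CHIPS * TRAINING_DAYS * 24
estimated_cost = chip_hours * PRICE_PER_CHIP_HOUR
print(f"{chip_hours:,} chip-hours ~= ${estimated_cost:,.0f}")
# With these assumptions: 7,372,800 chip-hours ~= $22,118,400 -- the same
# order of magnitude as the published $23 million estimate.
```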

How Green Are LLMs?

We can assume that training also generates a high carbon footprint, although actual numbers are hard to come by, and greater transparency is needed.

OpenAI’s GPT-4 whitepaper provides no information on architecture (including model size), hardware and so on. Its GPT-3 paper, however, notes that “practical large-scale pre-training requires large amounts of computation, which is energy-intensive: training the GPT-3 175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days for a 1.5B parameter GPT-2 model.”

A separate research paper published in 2021 estimated that training GPT-3 used 1,287 megawatt-hours of electricity, generating 502 tons of carbon emissions — equivalent to 120 years’ worth of a single U.S. family’s electricity use. Some other estimates have put it considerably higher.
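Those figures are easy to sanity-check. Assuming an average U.S. household uses roughly 10.7 megawatt-hours of electricity per year (the exact figure varies by source), the arithmetic works out as follows.

```python
# Quick arithmetic check on the 2021 estimate for GPT-3.
GPT3_ENERGY_MWH = 1287
CO2_TONS = 502
HOUSEHOLD_MWH_PER_YEAR = 10.7    # assumed average U.S. household consumption

household_years = GPT3_ENERGY_MWH / HOUSEHOLD_MWH_PER_YEAR          # ~120 years
grid_kg_co2_per_kwh = (CO2_TONS * 1000) / (GPT3_ENERGY_MWH * 1000)  # ~0.39 kg/kWh
print(f"~{household_years:.0f} household-years, ~{grid_kg_co2_per_kwh:.2f} kg CO2/kWh")
```

The implied grid intensity of roughly 0.39 kg of CO2 per kilowatt-hour is plausible, which is a useful consistency check when energy claims get repeated with shifting units.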

That said, some progress is being made in terms of hardware efficiency. Meta has stated that it “developed OPT-175B with energy efficiency in mind by successfully training a model of this size using only 1/7th the carbon footprint as that of GPT-3.”

But despite this, we’d argue that the carbon cost of creating an LLM should be a factor you seriously consider when deciding whether to train a model from scratch.


“If possible use something that already exists,” Anne Currie, industry veteran and co-author of the forthcoming O’Reilly book “Building Green Software,” told The New Stack. “But if you absolutely have to start from scratch, remember it is not a latency-sensitive workload, so you should never do it at a peak time.

“It is also a good idea to wait as long as you possibly can, because hardware gets better at doing particular jobs. So the longer you leave it, the more efficient the hardware will be and the less energy will be required to do the training.”

Also keep in mind that if a model is successful, training will account for only a small share of its lifetime energy consumption; serving queries against the model is where the bulk of the energy will go. Because of this, it makes sense to architect the software with sustainability in mind from the get-go.

“The good thing about AI projects is they are all greenfield,” Currie said. “And on a new project, you have no excuse to not be architecting to take account of things like time-shifting and demand shaping. But also start lean and don’t overegg it; that is, don’t burn huge amounts of energy training a model that no one is interested in using. Do it in a lean way, and establish if there is demand for the model first. Greenfield projects have to be green and lean.”
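Because training is not latency-sensitive, the simplest form of the time-shifting Currie describes is just waiting: don’t launch (or resume) a run until the grid is relatively clean. The sketch below assumes you have some source of grid carbon-intensity data; get_grid_carbon_intensity is a hypothetical stand-in for it, and the threshold is an arbitrary example value.

```python
# A sketch of carbon-aware time-shifting for a non-latency-sensitive job.
# get_grid_carbon_intensity is a hypothetical callable returning grams of
# CO2 per kWh from whatever data source (grid operator, commercial API) is
# available; the threshold is an arbitrary example value.
import time

CARBON_THRESHOLD_G_PER_KWH = 200
CHECK_INTERVAL_SECONDS = 15 * 60

def wait_for_clean_grid(get_grid_carbon_intensity):
    """Block until the grid's carbon intensity drops below the threshold."""
    while True:
        intensity = get_grid_carbon_intensity()
        if intensity < CARBON_THRESHOLD_G_PER_KWH:
            return intensity
        time.sleep(CHECK_INTERVAL_SECONDS)

# Usage sketch:
#   wait_for_clean_grid(my_carbon_data_source)
#   launch_training_run()   # or resume from the last checkpoint
```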

It is also worth keeping in mind that mistakes during training can be hard to fix, particularly late in the process, and you could find yourself having spent thousands or even millions of dollars only to end up with a suboptimal model.

Somewhat related, as a developer you have to be cognizant of the potential risks. According to Meertens, “You need to be thinking all the time what the implications of false negatives and false positives are, and what the consequences are to society if something goes wrong.”

The Messy Legal Issues of LLMs

Then there are legal considerations. This area is still messy; because the technology is so new, we don’t really know what the rules are or what the business models will look like.

One issue is copyright, with authors, comedians and programmers among those arguing that AI models were trained on and reproduce copyrighted material.

Whatever the eventual outcome, “OpenAI, Google, etc. have got a whole suite of lawsuits coming up that are basically saying, we never gave you permission to use this in your model,” Winder said. “And only those companies can afford the legal costs to actually take those cases through to fruition.”

With GPT-4, OpenAI has tried to keep the training set secret. But, at least according to the Washington Post, the U.S. Federal Trade Commission is demanding that the company open this information up to scrutiny, so it may not stay secret indefinitely.

It does seem likely that we will see companies that have secured the legal rights to use training data and will be able to provide you access to it at a cost, but that whole space is still nascent.

There are also complexities around privacy laws, a topic that researchers affiliated with Australia’s National Science Agency and the Australian National University have recently explored.

In essence, they argue, chatbots and associated machine learning applications will have to be capable of forgetting what they’ve learned, under the rules set out by the European Union’s GDPR legislation, as well as the California Consumer Privacy Act, Japan’s Act on the Protection of Personal Information and Canada’s proposed Consumer Privacy Protection Act.

It is not yet clear whether this is possible.

“Add to that there are laws being enacted around the world that place burdens on these companies to do certain things, like the EU AI law,” Winder said. “OpenAI CEO Sam Altman has been traveling the world saying how dangerous AI can be and how we need regulation.

“What he’s trying to do is to build yet another moat, a legislative moat, with the goal of getting the legislation so complicated and comprehensive that no small company could ever compete, and he’s actually succeeding.”

Winder pointed to the EU AI Act, which includes a section on foundation models that requires organizations to jump through bureaucratic hoops. “This also applies to open source foundation models,” he noted, “which is going to make it very challenging to have open source language models going forward.”

Given the environmental impact, cost, technical complexity, and ethical and legal considerations, we’d typically recommend that you consider using the API of a commercial or existing open-sourced LLM before embarking on the process of training an LLM yourself.
