Debugging Software Using Generative AI
Adoption of generative AI tools and large language models (LLMs) has only accelerated since OpenAI introduced its ChatGPT chatbot in late November 2022, reaching deep into organizations of all shapes, sizes, and sectors, and software developers have not been immune to the siren call.
Beyond general use cases such as content creation, conversational AI, and language translation, generative AI's applications in software development are varied and growing, touching on everything from code optimization and generation to bug fixing, documentation, and continuous integration.
Developers increasingly see generative AI as a useful tool, according to a post by AI experts on Carnegie Mellon University’s SEI blog in October 2023.
“The initial hype around using LLMs for software development has already started to cool down, and expectations are now more realistic,” the authors wrote. “The conversation has shifted from expecting LLMs to replace software developers (i.e., artificial intelligence) to considering LLMs as partners and focusing on where to best apply them (i.e., augmented intelligence).”
LLMs and Software Verification
A group of computer scientists led by the University of Massachusetts Amherst earlier this month said they were directing the power of generative AI and LLMs to address the gnarly challenge of verifying code to help prevent vulnerabilities in software.
Specifically, the team focused on using AI to develop a method that automatically generates whole proofs for verifying software code, a painstaking, expensive, and time-consuming process that has typically been done manually. An even more difficult process is machine checking: creating a mathematical proof to show the code does what’s expected, then using a theorem prover to ensure the proof is correct.
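To give a flavor of what machine checking looks like, here is a tiny machine-checked proof written in Lean. This example is illustrative only; the research itself targets Isabelle/HOL, a different proof assistant.

```lean
-- A property is stated as a theorem, and the proof term is checked
-- mechanically by the proof assistant; an incorrect proof is rejected.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Writing such proofs for a simple arithmetic fact is easy; doing it for real software, line by line, is what makes manual verification so costly.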
Emily First, who became a postdoctoral researcher at the University of California San Diego after completing her Ph.D. in computer science at UMass Amherst — the research into LLMs and software verification was part of her doctoral dissertation at UMass — noted that manually writing the proofs can take much more time than writing the software code itself.
As such, the set of systems that have actually been formally verified is fairly small, according to Yuriy Brun, professor in the Manning College of Information and Computer Sciences at UMass Amherst and the paper’s senior author.
“It’s hard,” Brun told The New Stack. “This is where we come in. The problem we’re trying to solve is we’re trying to automatically generate these proofs.”
Buggy Software Shouldn’t Just Be Accepted
Doing so will help address an even larger problem: the presence of flaws in software that can be annoying or — if exploited by cyber-attackers or present in complex systems whose failure could have broad negative impacts — dangerous.
“Software is an important part of our everyday lives,” Brun said. “You can’t do anything. You can’t drive a car, you can’t ride an elevator, without using software. Unfortunately, software today, as a rule, is buggy. Any software you buy in a store, we almost expect to have some bugs. It’s just it’s too hard a problem to solve, so there’s a bunch of different ways to try to improve quality of software.”
One way is to prove the software is correct. It’s an effective method, but also among the most difficult. It’s done in some fields — in healthcare for some medical devices or by NASA, “because if you have a bug on a spaceship and your software crashes, that’s going to be costly, so it’s worth the cost of actually having the developers and software engineers formally prove the correctness of the function,” he said.
Some researchers have created models that can write a proof a line at a time, writing the first 10 lines of the proof and then asking the model to essentially search, based on what’s been written and what’s trying to be proved, for what the next most likely line should be.
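The step-at-a-time approach described above can be sketched in a few lines of Python. This is a simplified illustration, not code from the research; `predict_next_step` and `check_prefix` are hypothetical stand-ins for the model call and the theorem prover.

```python
# Sketch of step-at-a-time proof search: the model proposes one proof line
# at a time, and the checker keeps only prefixes it accepts.

def prove_stepwise(goal, predict_next_step, check_prefix, max_steps=10):
    """Grow a proof one line at a time, validating each extension."""
    proof = []
    for _ in range(max_steps):
        candidate = predict_next_step(goal, proof)  # model proposes next line
        if candidate is None:  # model has nothing more to suggest
            break
        proof.append(candidate)
        if not check_prefix(goal, proof):  # prover rejects this prefix
            proof.pop()  # discard it (a real search would backtrack and retry)
            break
    return proof
```

Each step requires a fresh model query and a checker call, which is part of why, as Brun notes later, the one-step-at-a-time style is comparatively inefficient.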
The UMass Amherst researchers looked to LLMs as a possible solution for automatically generating proofs. Even that posed challenges, however, the biggest being that the models aren’t always correct. Rather than crashing — and thus letting the developer know something is wrong — they “fail silently,” producing an incorrect answer but presenting it as though it’s correct. Failing silently is often the worst thing that can happen, Brun said.
Enter Baldur, a new AI-based method for generating proofs that the UMass Amherst team created. The researchers started with Google’s Minerva LLM, which is trained on a large amount of natural-language text. It was then further trained on 118GB of mathematical scientific papers and web pages containing mathematical expressions, and trained even more on Isabelle/HOL, a language in which mathematical proofs are written.
From there, Baldur generates an entire proof, with Isabelle, a theorem prover, checking the work. If the checker found an error, information about the error was fed back into the LLM to let it learn from its mistake and then generate another proof with fewer or — hopefully — no errors.
Doing this gives Baldur “some context to say, ‘Here’s some more information about the state, about the question that you’re answering for me,’” Brun said. “When we gave it that extra information, it was able to answer the question more. We did this repair thing once, but you could imagine doing it a bunch of times, where it’s much harder with these one-at-a-time models, even when they use large-language models to predict the one step at a time. That’s a lot less efficient.”
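The generate-check-repair loop Brun describes can be sketched as follows. Again, this is an illustrative simplification, not the team’s implementation; `generate_proof` and `run_checker` are hypothetical stand-ins for the LLM and the Isabelle checker.

```python
# Sketch of whole-proof generation with error-feedback repair: generate a
# complete proof in one shot, and if the checker rejects it, feed the error
# message back to the model for another attempt.

def prove_with_repair(theorem, generate_proof, run_checker, max_repairs=1):
    """Generate an entire proof; on failure, retry with the checker's error."""
    error = None
    for _ in range(max_repairs + 1):
        proof = generate_proof(theorem, error)   # whole proof in one attempt
        ok, error = run_checker(theorem, proof)  # (success flag, error message)
        if ok:
            return proof
    return None  # unproved within the repair budget
```

The paper used a single repair round (`max_repairs=1`), but as Brun notes, the loop could in principle run many times, and each iteration costs only one model query rather than one per proof line.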
Brun and his team — which also included Markus Rabe, who worked at Google at the time, and Talia Ringer, an assistant professor at the University of Illinois Urbana-Champaign — looked at Thor, a framework for integrating language models and automated theorem provers. By itself, Thor could produce proofs 57% of the time, he said.
Combining it with Baldur — Thor’s brother in Norse mythology — they were able to create proofs 65.7% of the time. The two methods complemented each other.
Thor “uses a large-language model to try to predict what the next likely step is of the proof, but it also uses these things called ‘hammers,’” Brun said. “Hammers are these mathematical tools that say, ‘I know a bunch of mathematical tactics. Let me try them. Let me try this, try this, try this.’ It’s like hammering at the problem, just trying things differently and seeing if anything works. It was trying all of these things at once.”
The Baldur method is different in that it creates whole proofs rather than going line by line, though there is overlap, he said, adding that “there are some things that both of them can prove. But by trying to generate a whole proof at once, we’re able to prove a different set of things than this kind of iterative approach of trying to generate one thing at a time.”
Still More to Do
Brun admitted that the degree of error is still large, but said Baldur still represents the most effective and efficient way to verify that software code is correct. AI techniques are continuing to improve, so he expects Baldur’s capabilities will as well.
Going forward, the researchers are looking to improve the 65.7% number by fine-tuning the data the LLM is trained on. A hurdle for verification is that, right now, there isn’t a lot of data out there, so creating a dataset isn’t easy. They’re working on a dataset that will enable the fine-tuning of the model.
They also want to enable developers to give the model feedback, which will further help the model to grow as it works to generate a proof.
“The developer says, ‘Okay, so it worked great,’” Brun said. “But if it didn’t work, the developer can often look [and say], ‘I see that you tried induction here, but you’re trying it on the wrong thing.’ It can give some feedback to the model, and then the model can try again. This iterative approach is likely to really simplify the developer’s job but enable the model to prove things that it can’t on its own.”
It creates a semi-automated approach.
“The original iterative approach didn’t involve the developer,” he said. “It was iterating and doing one thing at a time because it was … doing it all itself, checking things itself. We’re trying to involve the software engineer back in the loop, so it’s a semi-automated approach where we’re doing a lot of the automation, but then we’re getting a little bit of feedback from the engineer to guide the process in cases where we can’t do the whole thing ourselves.”
The researchers have been working on the challenge since 2020, but work on Baldur itself started in May 2023 and took about five months, from building it within Google to evaluating it and ensuring the science was correct. The project was funded by the Defense Advanced Research Projects Agency (DARPA) and the National Science Foundation.