Test-Driven Development with LLMs: Never Trust, Always Verify
As community lead for Steampipe, I’d long wanted a better way to visualize project activity. Since I joined about two years ago, the suite of plugins has grown from 42 to 136, and existing plugins are constantly updated with new tables, enhancements, and bug fixes. All these updates appear in a community Slack channel and on social media, but I’d been wanting an automatic summary of the changes on a monthly or quarterly basis. The raw information is available in GitHub changelogs, and the logs are written in a consistent style, so in theory it would be straightforward to extract structured data from the logs but — as always — the devil’s in the details. Writing regexes to match patterns in the changelogs was an arduous chore that I’d been putting off. Since LLMs are fundamentally pattern matchers, I figured they could help me get it done faster and more easily.
For this exercise, I started with a detailed prompt that included sample data, specified the patterns to recognize in the data, and provided sample outputs that could be used in tests that would prove the script worked as expected. The prompt concludes with this ambitious goal:
Write a script to process the data in sample_data.py, and write tests to prove that it produces these outputs.
That was overly ambitious. Although I’m hearing stories of successful whole-program synthesis based on detailed specs, I’ve yet to make it happen. ChatGPT, Sourcegraph Cody, GitHub Copilot Chat, and smol developer (which I tried for the first time) all proposed solutions that were useful bootstraps; then the exercise turned into the now-familiar (and useful!) dialogue with rubber ducks. I wound up writing tests myself; the solution that emerged could pass them, and it came together more easily than it would have without LLM assistance. But I wasn’t happy with the code, and didn’t feel I’d made the best possible use of the LLMs, so I rebooted and tried again with a different strategy:
Write tests, and ask LLMs to write functions that pass the tests.
I’m not sure why we should even expect LLMs to take detailed specs as input and, in a single shot, emit whole programs as output. Human programmers don’t work that way. Even if LLMs could, would we want them to? The goal, after all, is to create software that not only works (provably), but can be understood, maintained, and evolved by the same human/machine partnership that creates it. What’s the right way to keep the human in the loop?¹
For the reboot, I focused on the trickiest part of the problem: the regexes. For each pattern (New tables added, Enhancements, Bug fixes, Contributors), I wanted a function that would match the pattern and pass a test proving it could do that against sample data. It’s long been my practice to decompose complex regexes into pipelines of simpler steps that I can understand and test individually. That’s a reliable method, but it’s slow and cumbersome. If machines can quickly write complex regexes that pass tests, I’m happy to outsource the task — especially if they can explain their work. Here’s one of the regexes, which matches the Enhancements or Bug fixes sections.
And here’s the chorus of explanations.
I wouldn’t want to dig into this regex, but if I needed to, I’d appreciate these explanations and would consider all of them.
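To make the discussion concrete, here’s a simplified pattern in the same spirit — matching a section header and capturing its bullet items. The changelog excerpt and the exact header/bullet layout are hypothetical; the real Steampipe changelogs differ in detail.

```python
import re

# Hypothetical changelog excerpt; the real changelogs differ in detail.
CHANGELOG = """\
## v0.5.0 [2023-06-15]

_Enhancements_

- Improved caching in the aws_s3_bucket table.
- Added column tags to several tables.

_Bug fixes_

- Fixed a panic when credentials are missing.
"""

# Match a section header ("_Enhancements_" or "_Bug fixes_") and capture
# the run of "- " bullet lines that follows it.
SECTION_RE = re.compile(
    r"_(Enhancements|Bug fixes)_\s*\n"  # section header
    r"((?:- .*\n?)+)"                   # one or more bullet lines
)

def extract_sections(changelog):
    """Return {section name: [bullet items]} for the matched sections."""
    return {
        name: [line[2:].strip() for line in body.splitlines()]
        for name, body in SECTION_RE.findall(changelog)
    }
```

A version like this is simple enough to hand-verify; the regexes the LLMs produced for the real changelog format were considerably denser.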
Could the LLMs produce a simpler regex that would be easier for me to understand and modify, while still passing the tests? I pushed them hard but none arrived at a simpler working version. So for now I’m willing to accept a tradeoff: faster development of regexes that are harder for me to understand, but that I can test. It’s always felt like grokking regexes was a job for alien intelligences, and now that we have them I’m glad to be able to direct my human intelligence elsewhere.
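For contrast, the pipeline-of-simpler-steps practice mentioned above can be sketched like this. The line format and the helper names are hypothetical, chosen only to illustrate the decomposition.

```python
import re

# Hypothetical "new table" changelog line, for illustration only.
line = "- [aws_s3_bucket](docs/tables/aws_s3_bucket.md) - New table added"

# Each step is trivial to understand and test on its own.
def is_bullet(s):
    return s.startswith("- ")

def strip_bullet(s):
    return s[2:]

def linked_name(s):
    """Pull the link text out of a leading [name](url) markdown link."""
    m = re.match(r"\[([^\]]+)\]", s)
    return m.group(1) if m else None

def table_from_line(s):
    """Compose the pipeline: bullet check, strip, extract."""
    return linked_name(strip_bullet(s)) if is_bullet(s) else None
```

Each stage can carry its own tiny test, which is exactly what makes the method reliable — and slow.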
Iterative Test-Driven Development
ChatGPT with the Code Interpreter plugin is the gold standard, right now, for iterative generation of functions that are constrained to pass tests. In How Large Language Models Assisted a Website Makeover, I reported a first successful use of Code Interpreter. My tone there was perhaps a bit too matter-of-fact; I’m sensitive to pushback about LLM hype and I’m aiming here for a neutral stance and critical objectivity. But let’s be real: the ability to run an LLM in an autonomous goal-directed loop is an astonishing breakthrough — still nascent, but a likely way to enable more reliable and reproducible uses of LLMs for programming.
I’ve tried to simulate this effect with Cody and Copilot, with no luck so far. I can ask them to write a function that passes tests, give them the tests to pass, and feed the test failures back to them, but I’ve yet to arrive at a successful result using this method. This is a shame, because Cody and Copilot share a key advantage over ChatGPT: they are local, they can see your files, and you can converse with them in a way that doesn’t require pasting everything into a prompt window. I expect both will acquire the ability to iterate in an autonomous loop and look forward to seeing how they perform on a level playing field.
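The loop I’ve been trying to simulate can be sketched in a few lines. Here `ask_llm` is a stand-in for a real model call (ChatGPT, Cody, Copilot); it’s stubbed so the sketch runs on its own, “fixing” the code once it sees a failure report.

```python
def ask_llm(prompt):
    """Stand-in for a real LLM call; returns candidate function source."""
    if "failed" in prompt:
        return "def add(a, b):\n    return a + b\n"   # second attempt: fixed
    return "def add(a, b):\n    return a - b\n"       # first attempt: buggy

def run_tests(ns):
    """Hand-written tests; return a failure report, or None if all pass."""
    try:
        assert ns["add"](2, 3) == 5
        assert ns["add"](-1, 1) == 0
    except AssertionError:
        return "test failed: add did not return the expected sums"
    return None

def iterate(max_rounds=5):
    prompt = "Write add(a, b) so that the tests pass."
    for _ in range(max_rounds):
        source = ask_llm(prompt)
        ns = {}
        exec(source, ns)        # safe here only because ask_llm is a stub
        failure = run_tests(ns)
        if failure is None:
            return source       # converged: tests pass
        prompt = failure + "\nPlease fix the code and try again."
    raise RuntimeError("did not converge")
```

The essential move is that the test harness, not the model, decides when the loop is done — which is the property I’ve yet to get reliably from the chat-style tools.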
As the SQLite documentation puts it: “Over 100 separate source files are concatenated into a single large file of C-code named “sqlite3.c” and referred to as “the amalgamation”. The amalgamation contains everything an application needs to embed SQLite.”
This bundling strategy is a good way to work with LLMs.
Although Code Interpreter can run in an autonomous loop that converges on a result that passes tests, it failed to do so more often than not, for a variety of reasons. Here are some of its many apologies.
It appears that the code execution environment was reset, which means the state of the script, including function definitions and variables, has been cleared.
It seems I made a mistake by not redefining the run_tests() function before attempting to run it, which is why the error indicates that run_tests is not defined. My apologies for the oversight.
I made an oversight and accidentally truncated the changelog again. Let me correct that and run the tests once more.
I have no insight into what’s happening under the covers, but it feels like pieces of code are being swapped out to stay within the context limit, and there’s constant juggling to maintain the necessary context. That’s fine if the autonomous loop eventually converges on a result that passes the tests — though it can take a while — but here’s a more troubling issue.
GPT: The tests ran successfully this time. The adjusted regex patterns correctly extract the desired information from the changelog, and the tests validate that this extraction is accurate.
Jon: You claim it passes the tests, but it doesn’t. Why did you say it does?
This happened a few times; I never got a satisfactory answer, so I resorted to capturing the LLM’s proposed code change, placing it into my copy of the code, and running the tests myself. That was no great hardship. When the autonomous loop does iterate to a correct result, describes the intermediate steps as it performs them, and correctly reports that the result passes the tests, it’s utterly magical. I expect that magic will grow stronger as platforms gain experience running LLMs in this mode. But meanwhile, I recommend a variant of “trust but verify”: never trust, always verify. Just as ChatGPT can make up facts, it’s apparently willing to lie about whether the code it writes passes the tests you give it. It can also behave like a recalcitrant child who knows the rules but must constantly be reminded to follow them. But if you hold its feet to the fire, tests can be a great way to focus its attention on the code you’re asking it to write.
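In that spirit, verifying a claimed-green patch yourself can be as lightweight as loading the proposed code into a scratch namespace and running your own tests against it. The `contributor` function and its pattern below are hypothetical stand-ins for a real proposed change.

```python
# Code the LLM claimed passes the tests (a hypothetical example).
proposed = """
import re

def contributor(line):
    m = re.search(r"@([A-Za-z0-9-]+)", line)
    return m.group(1) if m else None
"""

ns = {}
exec(proposed, ns)  # load the proposed code into a scratch namespace

# Run the same hand-written tests the LLM was asked to satisfy.
assert ns["contributor"]("Thanks to @judell for the fix!") == "judell"
assert ns["contributor"]("No mentions here") is None
print("verified: the proposed code really does pass")
```

Thirty seconds of local verification beats any amount of the model’s own assurance.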
¹ I’m not actually a fan of the phrase “human in the loop” because it cedes agency to the machine. I’d much prefer “machine in the loop” but am not going to die on that hill.