Can real-world programming problems be solved with state-of-the-art AI? This month DeepMind explored that question, confronting the world with a fresh perspective on programming, and on the capabilities and limits of artificial intelligence.
But equally interesting are the lessons the researchers learned along the way: about what can and can’t be automated, and about the errors in our current datasets.
And while the AI-generated solutions weren’t better than those of human programmers, the project has already raised questions about what this means for the future.
[🔬 Interesting Stuff] You have to see AlphaCode doing competitive programming!
Here is an awesome visualization of how it works (press the “play” button) https://t.co/oLJA0zvB9H pic.twitter.com/P0Jh2BepNT
— DagsHub (@TheRealDAGsHub) February 15, 2022
‘A Promising New Competitor’
London-based DeepMind, the AI subsidiary of Google’s parent company Alphabet, has already racked up historic milestones, outperforming humans playing chess and Go, and also proving itself better at predicting the ways that proteins fold.
This month, DeepMind announced that it has also developed a system named AlphaCode to compete in programming competitions, evaluating its performance in 10 different programming contests run by competitive programming site CodeForces — each with at least 5,000 different participants.
The results? AlphaCode “placed at about the level of the median competitor,” reported a DeepMind blog post, “marking the first time an AI code generation system has reached a competitive level of performance in programming competitions.”
DeepMind pointed out that real-world companies use these competitions in recruiting — and present similar problems to job candidates in coding interviews.
In the blog post, Mike Mirzayanov, founder of CodeForces, was quoted as saying that AlphaCode’s results exceeded his expectations. He added, “I was skeptical because even in simple competitive problems it is often required not only to implement the algorithm but also (and this is the most difficult part) to invent it.
“AlphaCode managed to perform at the level of a promising new competitor. I can’t wait to see what lies ahead!”
A paper by DeepMind’s researchers acknowledges that this took a tremendous amount of computing power: “Sampling and training from our model required hundreds of petaFLOPS days.”
One petaFLOPS is a whopping 1,000,000,000,000,000 floating-point operations per second. A petaFLOPS day maintains that pace for every second of a 24-hour day, for a total of 86,400,000,000,000,000,000 operations.
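The arithmetic behind that unit can be checked in a few lines. This is only a sketch of the unit conversion; the function name is made up for illustration, and the figure of 100 days is an assumed stand-in for “hundreds,” since no exact number is given here.

```python
# One petaFLOPS is 10**15 floating-point operations per second.
PETAFLOPS = 10**15
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def petaflops_days_to_operations(days):
    """Total floating-point operations in `days` petaFLOPS days."""
    return days * PETAFLOPS * SECONDS_PER_DAY


one_day = petaflops_days_to_operations(1)
print(f"{one_day:,}")  # 86,400,000,000,000,000,000

# Assumption: 100 is used only as an illustrative lower bound for "hundreds".
print(f"{petaflops_days_to_operations(100):.2e}")  # 8.64e+21
```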
A footnote added that the Google data centers running those operations “purchase renewable energy equal to the amount consumed.”
How AlphaCode Works
The researchers explain their results in a 73-page paper (not yet formally published or peer-reviewed). The authors write that their system was first “pre-trained” on code in public GitHub repositories, just like the earlier AI-powered code-suggestion tool Copilot. (To avoid some of the controversies that arose around Copilot’s methodology, the researchers filtered the datasets AlphaCode trained on, selecting code that was released under permissive licenses.)
The researchers then “fine-tuned” their system on a small dataset of competitive programming problems, solutions and even test cases, many of which were scraped directly from the CodeForces platform.
One thing they discovered? There’s a problem with the currently available datasets of problems and solutions from programming competitions. At least 30% of those programs pass all the test cases — but are not actually correct.
So the researchers created a dataset that includes more test cases to rigorously check for correctness, and they believe it substantially reduces the number of incorrect programs that would still pass all the tests — from 30% to just 4%.
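To see how an incorrect program can pass every test, consider a toy case. The problem, the two solutions, and the test lists below are invented for illustration and are not drawn from the researchers’ dataset; the point is only that a thin test suite can let a wrong program through, while a few extra cases catch it.

```python
def reference_max(nums):
    """Correct solution to the toy problem: return the largest value."""
    return max(nums)


def buggy_max(nums):
    """Wrong in general: only correct when the largest value happens to be last."""
    return nums[-1]


def passes_all(solution, tests):
    """True if the solution returns the expected output for every test case."""
    return all(solution(list(inp)) == expected for inp, expected in tests)


# A thin test suite: every input happens to be sorted in ascending order.
weak_tests = [([1, 2, 3], 3), ([5], 5)]
# The same suite plus cases where the largest value is not last.
strong_tests = weak_tests + [([3, 1, 2], 3), ([-1, -7], -1)]

print(passes_all(buggy_max, weak_tests))        # True: a false positive
print(passes_all(buggy_max, strong_tests))      # False: the extra cases expose the bug
print(passes_all(reference_max, strong_tests))  # True
```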
When the time comes to finally compete on programming challenges, “we create a massive amount of C++ and Python programs for each problem,” the DeepMind blog post stated. “Then we filter, cluster, and rerank those solutions to a small set of 10 candidate programs that we submit for external assessment.”
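One way to picture the “filter, cluster, and rerank” step is sketched below. This is not DeepMind’s code. It assumes that filtering means running each candidate against the problem’s example tests, that clustering means grouping candidates that behave identically on some extra inputs, and that reranking means preferring the largest groups. Plain Python functions stand in for generated programs, and every name is hypothetical.

```python
from collections import defaultdict


def select_candidates(programs, example_tests, probe_inputs, limit=10):
    """Filter, cluster, and rerank candidate programs (hypothetical sketch).

    programs: callables standing in for generated programs.
    example_tests: (input, expected_output) pairs from the problem statement.
    probe_inputs: extra inputs without known answers, used only to compare behavior.
    """
    # 1. Filter: drop any program that crashes or fails an example test.
    survivors = []
    for prog in programs:
        try:
            if all(prog(x) == want for x, want in example_tests):
                survivors.append(prog)
        except Exception:
            continue

    # 2. Cluster: group survivors that produce identical outputs on the probes.
    clusters = defaultdict(list)
    for prog in survivors:
        signature = []
        for x in probe_inputs:
            try:
                signature.append(repr(prog(x)))
            except Exception:
                signature.append("<error>")
        clusters[tuple(signature)].append(prog)

    # 3. Rerank: largest clusters first, one representative from each.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:limit]]


# Toy problem (an assumption for illustration): double the input.
def double_a(x): return x * 2
def double_b(x): return x + x
def square(x): return x ** 2      # agrees with doubling only on 0 and 2
def constant(x): return 4         # just echoes the example answer
def off_by_one(x): return x + 1   # fails the example test

picks = select_candidates(
    [double_a, double_b, square, constant, off_by_one],
    example_tests=[(2, 4)],
    probe_inputs=[0, 3, 10],
    limit=2,
)
print([p.__name__ for p in picks])  # ['double_a', 'square']
```

In this toy run the two doubling functions behave identically on the probe inputs, so they form the largest cluster and its representative is ranked first.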
“The problem-solving abilities required to excel at these competitions are beyond the capabilities of existing AI systems,” argued DeepMind’s blog post, crediting “advances in large-scale transformer models (that have recently shown promising abilities to generate code)” combined with “large-scale sampling and filtering.”
The blog post makes the case that the researchers’ results demonstrate deep learning’s potential even for tasks that require critical thinking, in this case expressing solutions to problems in the form of code. DeepMind’s blog post described the system as part of the company’s mission “to solve intelligence” (which its website described as “developing more general and capable problem-solving systems,” also known as artificial general intelligence).
The blog post added, “[W]e hope that our results will inspire the competitive programming community.”
Human Programmers React
DeepMind’s blog post also includes comments from Petr Mitrichev, identified as both a Google software engineer and a “world-class” competitive programmer, who was impressed that AlphaCode could even make progress in this area.
“Solving competitive programming problems is a really hard thing to do, requiring both good coding skills and problem-solving creativity,” Mitrichev said.
Mitrichev also supplied commentary for six of the solutions, noting several submissions had also included “useless-but-harmless” chunks of code.
In one submission AlphaCode declared an integer-type variable named x, then never used it. In another submission, which traverses a graph, AlphaCode needlessly sorted all the adjacent vertices first (by how deep into the graph they’ll lead). For another problem (requiring a computation-intensive “brute force” solution), AlphaCode’s extra code made its solution 32 times slower.
In fact, AlphaCode often simply implemented a massive brute-force solution, Mitrichev wrote.
But the AI system even fails like a programmer, Mitrichev noted, citing one submission in which, when the solution eluded it, AlphaCode “behaves a bit like a desperate human.” It actually wrote code that always delivers the same answer provided in the problem’s example scenario, he wrote, “hoping that it works in all other cases.”
“Humans do this as well, and such hope is almost always wrong — as it is in this case.”
AlphaCode as a dog speaking mediocre English https://t.co/WMq7oHNZ5s
— Hacker News (@newsycombinator) February 6, 2022
So just how good were AlphaCode’s results? CodeForces calculates a programmer’s rating (using the standard Elo rating system also used to rank chess players), and AlphaCode achieved a rating of 1,238.
But what’s more interesting is where that rating appears on a graph of all programmers competing on CodeForces over the last six months. The researchers’ paper noted that AlphaCode’s estimated rating “is in the top 28% among these users.”
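For a sense of what an Elo-style number implies, the classic two-player formula converts a rating gap into an expected score. The sketch below uses that textbook chess formula only. How a contest site adapts it to rounds with thousands of participants is not covered here, and the 1,638 opponent is an assumed value chosen to show a 400-point gap.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the classic Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


alphacode = 1238  # the rating reported for AlphaCode

# Against an equally rated opponent the expected score is one half.
print(round(elo_expected_score(alphacode, 1238), 2))  # 0.5

# Assumption: a hypothetical opponent rated 400 points higher.
print(round(elo_expected_score(alphacode, 1638), 2))  # 0.09
```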
Not everyone was impressed. Dzmitry Bahdanau, an AI researcher and adjunct professor at McGill University in Montreal, pointed out on Twitter that many CodeForces participants are high-school or college students, and that the time constraints on their problem-solving have less impact on a pre-trained AI system.
But most importantly, AlphaCode’s process involves filtering a torrent of AI-generated code to find one that actually solves the problem at hand, so “the vast majority of the programs that AlphaCode generates are wrong.”
So while it’s a promising direction to explore, Bahdanau doesn’t feel it’s a programming milestone: “This is not AlphaGo in terms of beating humans and not AlphaFold in terms of revolutionizing an entire field of science. We’ve got work to do.”
AI isn’t coming for your dev job https://t.co/DCIkvqRfdL
— TNW (@thenextweb) February 14, 2022
But where does this lead? Right before their paper’s conclusion, the AlphaCode researchers added two sentences noting the dystopian possibility that code-generating capabilities “could lead to systems that can recursively write and improve themselves, rapidly leading to more and more advanced systems.”
Their paper also calls out another dire possibility: “increased supply and decreased demand for programmers.”
Fortunately, there are already some historical precedents for how this will play out, and the paper argues that “previous instances of partially automating programming (e.g. compilers and IDEs) have only moved programmers to higher levels of abstraction and opened up the field to more people.”
Among at least a few programmers, this has already provoked some concern. Recently a programming student on Hacker News complained of “AlphaCode Anxiety” (as well as worries about GitHub’s Copilot). “Now it feels like I’m running against a clock until the career I am working very hard for will automate itself away,” the student wrote.
When a blog post at CodeForces declared “The future has arrived,” one worried programmer even argued that “there is a limit to what humans should automate.” The programmer added pointedly that the DeepMind developers who built AlphaCode “think that they are irreplaceable, but they would be the first ones to get replaced.”
But the fact that AlphaCode finished in the bottom half was also greeted with a very human disparagement.
“AI is such a noob,” the first commenter responded.