
This Week in Programming: GitHub Copilot, Copyright Infringement and Open Source Licensing

3 Jul 2021 6:00am

Earlier this week, GitHub introduced GitHub Copilot, a new feature that it is referring to as “your AI pair programmer” but might also be appropriately called “IntelliSense on steroids.” Built using OpenAI Codex, a new system that the company says is “significantly more capable than GPT-3 in code generation,” the tool not only autocompletes lines of code but will offer entire blocks of code in response to both code that you type and natural language.

Because Copilot was “trained on billions of lines of public code,” one of the first questions to come up concerns copyright, and specifically the viral nature of the GPL, which requires that all derivative works carry that same license.

Now, while there is plenty of conversation floating around on Twitter and a few Hacker News threads, most of it, as you might expect, falls under the “I am not a lawyer” disclaimer. There is one Hacker News comment, from GitHub CEO Nat Friedman, however, that offers a bit of a response to questions along these same lines.

“In general,” writes Friedman, “(1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler.” He then offers a link to OpenAI’s position on training machine learning models, which argues that “training AI systems constitutes fair use” and furthermore that “policy considerations underlying fair use doctrine support the finding that training AI systems constitute fair use.”

Well, of course, we thought you might say something like that, Nat.

But Friedman is not alone — a couple of actual lawyers and experts in intellectual property law took up the issue and, at least in their preliminary analysis, tended to agree with Friedman. First, Neil Brown examines the idea from an English law perspective and, while he’s not so sure about the idea of “fair use” if the idea is taken outside of the U.S., he points simply to GitHub’s terms of service as evidence enough that the company can likely do what it’s doing. Brown points to passage D4, which grants GitHub “the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time.”

“The license is broadly worded, and I’m confident that there is scope for argument, but if it turns out that Github does not require a license for its activities then, in respect of the code hosted on Github, I suspect it could make a reasonable case that the mandatory license grant in its terms covers this as against the uploader,” writes Brown. Overall, though, Brown says that he has “more questions than answers.”

In a more definitive take, Andres Guadamuz, a senior lecturer in intellectual property law at the University of Sussex and the Editor in Chief of the Journal of World Intellectual Property, takes up the question of whether or not GitHub Copilot is infringing copyright, concluding that “this is neither copyright infringement nor license breach, but I’m happy to be convinced of the contrary.”

On the idea of copyright infringement, Guadamuz first points to a research paper by Albert Ziegler, published by GitHub, which looks at situations where Copilot reproduces exact text and finds those instances to be exceedingly rare. In the original paper, Ziegler notes that “when a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from,” as a safeguard against infringement claims.
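To make the idea of flagging copied snippets concrete, here is a minimal sketch of how a suggestion might be checked against known sources. It is not how Copilot or Ziegler's study works; the function names, the eight-token window, and the tiny in-memory `corpus` are all assumptions made for illustration.

```python
import re


def tokens(code):
    # Split code into identifiers, numbers, and single punctuation
    # characters, ignoring whitespace so indentation changes do not matter.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def shingles(toks, n):
    # Every run of n consecutive tokens, as a set of tuples.
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def find_recitations(suggestion, corpus, n=8):
    """Return names of corpus entries that share at least one run of
    n consecutive tokens with the suggestion (hypothetical helper)."""
    wanted = shingles(tokens(suggestion), n)
    return [
        name
        for name, source in corpus.items()
        if wanted & shingles(tokens(source), n)
    ]


# Hypothetical corpus: file names mapped to their contents.
corpus = {
    "repo_a/util.py": "def add(a, b):\n    return a + b\n",
    "repo_b/other.py": "print('hello')",
}
suggestion = "def add(a, b):\n        return a + b"
matches = find_recitations(suggestion, corpus)
```

A real tool would need an index over a vastly larger body of code and a more careful notion of similarity. The sketch only shows the basic shape of the check: find long verbatim runs, then report where they came from.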

On the idea of the GPL license and “derivative” works, Guadamuz again disagrees, arguing that the issue at hand comes down to how the GPL defines modified works, and that “derivation, modification, or adaptation (depending on your jurisdiction) has a specific meaning within the law and the license.”

“You only need to comply with the license if you modify the work, and this is done only if your code is based on the original to the extent that it would require a copyright permission, otherwise it would not require a license,” writes Guadamuz. “As I have explained, I find it extremely unlikely that similar code copied in this manner would meet the threshold of copyright infringement, there is not enough code copied, and even if there is, it appears to be mostly very basic code that is common to other projects.”

While Copilot does appear to spit out verbatim code once in a while, it is the infrequency of that occurrence that seems to assure Guadamuz that the tool is in little danger of being successfully litigated against. In one comment on his article, he writes that “this is all going to be solved eventually by Codex and Copilot offering a similarity tool where programmers can check whether there is any recitation in their code,” which might help in the rare cases where Copilot does reproduce code verbatim.

And while we’re here, if copyright infringement and open source licensing are less of a concern for you, and you’re more interested in just how cool and useful a tool like GitHub Copilot might be, make sure to head on over and read Darryl Taft’s analysis of Copilot, which he calls “A Powerful, Controversial Autocomplete for Developers.”

This Week in Programming

  • “Uncomfortable Questions” on Windows’ Android Apps: Last week, we wrote about how Microsoft was yet again causing the netherworld to plunge below freezing with its addition of Android apps to Windows. Something about Microsoft’s use of the Amazon AppStore for Android, and not Google’s Play Store, felt a bit… suspicious… and Android developer and author Mark Murphy points out why in his blog post on “Windows 11, Amazon, and Uncomfortable Questions.” In his post, Murphy writes that “Amazon pioneered the ‘replace the developer signature’ approach that Google uses with App Signing. And, Amazon does so specifically to be able to modify every Android app that they distribute,” pointing out that this might go beyond collecting analytics, which is troublesome enough, and extend to modifying apps to bypass end-to-end encryption, for example. Uncomfortable questions, indeed.

  • Docker Desktop 3.5: The latest version of Docker Desktop, Docker Desktop 3.5, has arrived with improved volume management, Docker Dev Environments and more. In addition to the Dev Environments, which we reviewed last week, Docker Desktop 3.5 gives users the ability to better manage files in their volumes, and continues the roll-out of Docker Compose V2 Beta, which brings the compose command to the Docker CLI. Docker Desktop 3.5 also adds a warning for images that are incompatible with Apple Silicon machines and, taking a hint from user feedback, makes its requests for feedback just a little “less disruptive.” To check it all out, download or update to Docker Desktop 3.5.

  • Quarkus 2.0 Drops: Red Hat has launched Quarkus 2.0, highlighting among the new features continuous testing, DevServices, a new Developer UI and Developer CLI, as well as improved performance with “lightning-fast” RESTEasy Reactive. Quarkus is Red Hat’s Kubernetes Native Java framework tailored for OpenJDK HotSpot and GraalVM, or what it refers to as “Supersonic Subatomic Java,” and this move to 2.0 “signals a new level of maturity for the project,” they write. To find out all that’s new, head on over to the Quarkus 2.0 launch site or to code.quarkus.io to give it a try.

Feature photo by Westwind Air Service on Unsplash
