This is a slightly modified version of my original German-language article first published on heise.de under a CC BY 4.0 license.
GitHub is currently causing a lot of commotion in the Free Software scene with its release of Copilot. Copilot is an artificial intelligence trained on publicly available source code and texts. It offers programmers code suggestions in real time. Since Copilot also uses the numerous GitHub repositories under copyleft licences such as the GPL as training material, some commentators accuse GitHub of copyright infringement, because Copilot itself is not released under a copyleft licence, but is to be offered as a paid service after a test phase. The controversy touches on several thorny copyright issues at once. What is astonishing about the current debate is that the calls for the broadest possible interpretation of copyright are now coming from within the Free Software community.
Copyleft does not benefit from tighter copyright laws
Copyleft licences are an ingenious invention with which the Free Software scene has used copyright, the sharp sword of the content industry, to promote the free exchange of culture and innovation. Works licensed under copyleft may be copied, modified and distributed by all, as long as any copies or derivative works may in turn be re-used under the same licence conditions. This creates a virtuous circle, thanks to which more and more innovations are open to the general public. Copyright, which was designed to guarantee exclusivity over creations, is used here to prevent access to derivative works from being restricted.
However, it is also clear that there would be no need for copyleft licences to govern the exercise of copyright in software code by third-party developers at all if copyright did not guarantee rightsholders such a high degree of exclusive control over intellectual creations in the first place. If it were not possible to prohibit the use and modification of software code by means of copyright, then there would be no need for licences that prevent developers from making use of those prohibition rights (of course, free software licences would still fulfil the important function of contractually requiring the publication of modified source code). That is why it is so absurd when copyleft enthusiasts argue for an extension of copyright. Any extension of prohibition rights strengthens not only the enforcement of copyleft licences, but also that of the much more widespread copyright licences which aim to achieve exactly the opposite results.
But this is exactly what is happening in the current debate about GitHub’s Copilot. Because a large company – namely GitHub’s parent company Microsoft – profits from analyzing free software and builds a commercial service on it, the idea of using copyright law to prohibit Microsoft from doing so may seem obvious to copyleft enthusiasts. However, by doing so, the copyleft scene is essentially demanding an extension of copyright to actions that have for good reason not been covered by copyright. These extensions would have fatal consequences for the very open culture which copyleft licences seek to promote.
There are two main versions of the criticism levelled at GitHub for starting Copilot. Some are criticising the very use of free software as source material for a commercial AI application. Others focus on Copilot’s ability to generate outputs based on the training data. One may find both ethically reprehensible, but copyright is not violated in the process.
Text & data mining is not copyright infringement
To the extent that merely the scraping of code without the permission of the authors is criticised, it is worth noting that simply reading and processing information is not a copyright-relevant act that requires permission: If I go to a bookshop, take a book off the shelf and start reading it, I am not infringing any copyright. The fact that scraping content to train an artificial intelligence enters the realm of copyright at all is because digital technology requires making copies of content in order to process it. Copying is fundamentally a copyright-relevant act. Many of the conflicts between copyright and digital technology result from this fact. Fortunately, policymakers and courts have long recognised that digital technology would be completely unusable if every technical copy required permission. Otherwise, people who listen to music with digital hearing aids would first have to acquire a licence for it. Internet providers would have to license every conceivable copyright-protected work that their customers exchange with each other.
As early as 2001, the EU allowed such temporary, ephemeral acts of copying, which are part of a technical process, without restriction – despite the protests of the entertainment industry at the time. Unfortunately, this copyright exception of 2001 initially only allowed temporary, i.e. transient, copying of copyright-protected content. However, many technical processes first require the creation of a reference corpus in which content is permanently stored for further processing. This necessity has long been used by academic publishers to prevent researchers from downloading large quantities of copyrighted articles for automated analysis. Although these scholars had legal access to the content, for example through a subscription from their university, the publishers tried to contractually or technically exclude the creation of reference corpora. According to the publishers, researchers were only supposed to read the articles with their own eyes, not with technical aids. Machine-based research methods such as the digital humanities suffered enormously from this practice.
Under the slogan “The Right to Read is the Right to Mine”, EU-based research associations therefore demanded explicit permission in European copyright law for so-called text & data mining, that is, the permanent storage of copyrighted works for the purpose of automated analysis. The campaign was successful, to the chagrin of academic publishers. Since the EU Copyright Directive of 2019, text & data mining has been permitted. Even where commercial uses are concerned, rightsholders who do not want their copyright-protected works to be scraped for data mining must opt out in machine-readable form, such as robots.txt. Under European copyright law, scraping GPL-licensed code, or any other copyrighted work, is legal, regardless of the licence used. In the US, scraping falls under fair use; this has been clear at least since the Google Books case.
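To illustrate what a machine-readable opt-out can look like in practice, here is a minimal sketch in Python of how a well-behaved crawler might check a site's robots.txt before collecting material for text & data mining. The robots.txt content, the crawler name "example-tdm-bot" and the example.org URL are hypothetical, chosen only for illustration. The check shows the technical mechanism only. Whether a given robots.txt rule amounts to a legally valid reservation is a legal question the code cannot answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site reserves its content against one
# particular mining crawler but allows every other user agent.
ROBOTS_TXT = """\
User-agent: example-tdm-bot
Disallow: /

User-agent: *
Allow: /
"""


def may_mine(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.org/repo/file.py"
    for agent in ("example-tdm-bot", "some-other-agent"):
        print(agent, "may fetch:", may_mine(ROBOTS_TXT, agent, url))
```

A crawler that respects the opt-out would skip any URL for which this check returns False.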
Machine-generated code is not a derivative work
Some commentators see GitHub Copilot as a copyright infringement because the programme not only uses copyright-protected software code, a lot of which is published under GPL, as training material, but also generates software code as output. According to critics, this output code is a derivative work of the training data sets because the AI would not be able to generate the code without the training data. In a few cases, Copilot also reproduces short snippets from the training datasets, according to GitHub’s FAQ.
This line of reasoning is dangerous in two respects: On the one hand, it suggests that even reproducing the smallest excerpts of protected works constitutes copyright infringement. This is not the case. Such use is only relevant under copyright law if the excerpt used is in turn original and unique enough to reach the threshold of originality. Otherwise, copyright conflicts would constantly arise when two authors use the same trivial statement independently of each other, such as “Bucks beat Hawks and advance to the NBA finals”, or “i = i+1”. The short code snippets that Copilot reproduces from training data are unlikely to reach the threshold of originality. Precisely because copyright only protects original excerpts, press publishers in the EU have successfully lobbied for their own ancillary copyright that does not require originality as a precondition for protection. Their aim is to prohibit the display of individual sentences from press articles by search engines. It is precisely this problematic demand that the Free Software community endorses when it demands absolute control over the smallest excerpts of software code.
On the other hand, the argument that the outputs of GitHub Copilot are derivative works of the training data is based on the assumption that a machine can produce works. This assumption is wrong and counterproductive. Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work. This means that machine-generated code like that of GitHub Copilot is not a work under copyright law at all, so it is not a derivative work either. The output of a machine simply does not qualify for copyright protection – it is in the public domain. That is good news for the open movement and not something that needs fixing.
Those who argue that Copilot’s output is a derivative work of the training data may do so because they hope it will place those outputs under the licensing terms of the GPL. But the unpleasant side effect of such an extension of copyright would be that all other AI-generated content would henceforth also be protected by copyright. What would then stop a music label from training an AI with its music catalogue to automatically generate every tune imaginable and prohibit its use by third parties? What would stop publishers from generating millions of sentences and privatising language in the process?
At the World Intellectual Property Organization (WIPO), companies are already lobbying for an extension of copyright to machine-generated works. According to WIPO: “The main focus of those questions is whether the existing IP system needs to be modified to provide balanced protection for machine created works and inventions.” The main beneficiaries of such an extension of copyright would be the major technology corporations that are best placed to develop and scale AI applications. Such as Microsoft. Critics of GitHub’s business practices would do well not to play into their hands.
This work is licensed under a Creative Commons Attribution 4.0 International License.
This is all fine and dandy, and I agree with everything, but Copilot doesn’t work the way you think or the way it is advertised.
Lots of people have demonstrated that it pretty much regurgitates code verbatim from codebases with abandon. Putting GPL code inside a neural network does not remove the license if the output is the same as the input.
A large portion of what Copilot outputs is already full of copyright/license violations, even without extensions.
Hi, nice article.
In the article you often use phrases like “training of the AI” or some variation of it. This hides one big misunderstanding: this is not actually “intelligent”, this is statistically programmed software.
It is not intelligent; it is only programmed to output the statistically best results based on the data used to program it.
I invite you to further research this explanation of statistically programmed software. When we really have AI in the future, it will be another story.
Have a nice day
Francesco
Hi Julia,
While I concur with the pyrrhic victory argument for copyleft entities, I’m rather confused by your public domain interpretation of the output of a machine-learning algorithm.
Saying it would be public domain implies “strong copyleftness”; thus any code written with the help of modern IDEs (Integrated Development Environments) and modern text editors would also need to (at least) be legally classified, probably sometimes as public domain. Since most code written today is still closed-source (at least to my knowledge), how would “public domainness” be established after code has been made public WITH a more restrictive licence?
Another point I don’t see addressed is the use of machine-learning algorithms for enhancing older (audio/video) content. Wouldn’t every licensed work lose its property of “derivativeness” and thus be public domain? Even worse, wouldn’t I be able to “filter” every known copyrighted work through an “interpretive AI” and wash out the copyright? What would be the grounds to address this, except the obvious ones of judicial leniency?
If I were able to cleverly use GitHub Copilot to reproduce a complete GPL program, would I thereby free it from its licence? I see parallels to the “enhancing AI” for copyrighted audio/video content here, am I mistaken?
Sorry for my apparent lack of in-depth EU legislative knowledge. ;-)
Thanks for your great work!
Cheers,
Oliver
While for the most part I agree with what you said here, I think that you missed one very important aspect of this whole affair. GitHub Copilot can produce large [1] verbatim [2] non-transformed [3] chunks of code that, if written by hand, would be copyrightable. Still, I agree with you that computer-generated code is not a work in terms of intellectual property.
But then a very important question arises: where is the line of “computer generated”? If I create some kind of database that collects loose functions from publicly available repositories (even with permissive licenses) all around the world, and then create some kind of software that pastes such a function when it is referenced or matched, would that be computer generated or not? I would like to hear what a judge would say to my testimony that “it was computer generated” if someone sued me for copyright infringement (rightfully so, imo). I don’t believe that the judge would say “Ah, if it was ‘computer generated’ then it’s fine”. I’m not a lawyer, however, so this is just some speculation ;) Now, how is Copilot different in essence from such a database? Well, we can argue that code generated by Copilot is highly modified and easily distinguishable from the original piece, so that you hopefully cannot tell which fragment it was based on at all. But the problem is that, as mentioned earlier, sometimes it’s not. Sometimes it is literally the same, even down to the famous comments. And the worst thing is that, when using it, we don’t even know how much of the output is verbatim. I know one thing for sure: I would not want to be sued by Oracle because they found some of their (open source, but licensed) code in my repository that was “generated” by Copilot.
And we have other software that is based on the same principles as GitHub’s Copilot. Let’s take Tabnine [4] for example; there are others, but this is the only one I have used. I don’t really know how they created and trained their model; I know that it relies mainly on transfer learning, and I know that it works and is helpful. However, I don’t think that it shares the same problems, because the chunks of code generated by Tabnine are so small that, as you correctly noted in this post, they cannot be viewed as something copyrightable.
And so, this is a thing that in my opinion is very important, and was not really addressed in this article. I’d love to hear your response – maybe I had missed something when reading?
[1]: https://docs.github.com/en/github/copilot/research-recitation#github-copilot-mostly-quotes-in-generic-contexts
[2]: https://twitter.com/mitsuhiko/status/1410886329924194309
[3]: https://twitter.com/kylpeacock/status/1410749018183933952
[4]: https://www.tabnine.com/
Copyleft does not just invert copyrights. It attacks trade secrets, using copyrights as leverage. Hence the requirements to disclose source when providing binaries (GPL) or services (AGPL, OSL), in addition to requirements to license alike.
Stronger copyright affords copyleft more leverage on trade secrecy. If the copyright laws did not apply to software, we would expect many more companies to withhold source code as trade secrets and barter access for contractual obligations of secrecy and limited use, under nondisclosure agreements.
That is largely what has happened with hosted software services. You cannot see the source code to Facebook or Google or Twitter or the like, so it doesn’t matter whether copyright protects it. That’s possible now in part because early copyleft licenses were written on the assumption of a weak copyright law, and weren’t forcefully strengthened by advocates when that gap became apparent.
So, I can create a “slightly change stuff under the hood” tool and advertise it as AI, then use all the juicy free GPL code? Nice. Does GitHub show the code of its Copilot?
I think your article is slightly missing the point as the internet has already pointed out that GitHub’s Copilot is indeed originally reproducing a new manifestation of Quake III’s fast inverse square root algorithm: https://twitter.com/mitsuhiko/status/1410886329924194309
This is not a fragmented reproduction. It’s the original algorithm that was released and licensed under the GPL by id Software. Permalink here: https://github.com/id-Software/Quake-III-Arena/blob/dbe4ddb10315479fc00086f08e25d968b4b43c49/code/game/q_math.c#L552
I don’t disagree with most of your statements. I’m not a lawyer. But it seems that this example was not addressed in your post.
I mean, if this is now legal and there’s no qualitative test for when machines produce “non-copyrightable” content for the public domain, then what’s stopping me from overfitting a machine learning algorithm on the Beatles catalog and reproducing all the songs with 99% overlap with my machine for the public domain?
Please educate yourself!
https://blog.hrithwik.me/the-good-and-the-limitations-of-github-copilot
Read this section very carefully:
Weird Copyright messages
Hi,
I don’t mind software, like Copilot, reading any kind of open source code (copyleft or not). That includes “learning” from it, as in understanding what’s “good” code.
It doesn’t include “learning” how to freely copy and share any parts of open source though.
What I find very critical is that it outputs code snippets which are indistinguishable from a human copying lines from an open source repository with a copyright statement.
If that’s ok, why would anyone respect any source code copyright anymore? You can just claim that a machine copied the pieces and it’s fine.
To my understanding, you are implying that copyright only applies to the repository as a whole. Not parts of it. Right?
Meaning, copyright has no effect for any repository containing a set of utilities which don’t need to be used as a whole, indirectly declaring all of those public domain, without any need to attribute or to follow other license restrictions like copyleft.
Or what am I missing here?
Best,
Micha
Micha wrote:
> If that’s ok, why would anyone respect any source code copyright anymore? You can just claim that a machine copied the pieces and it’s fine.
While Copilot produces the code, the developer who uses Copilot is ultimately responsible when (s)he commits and publishes it.
Your argument is flawed because you do not understand how current neural networks work.
The neural network weights encode the “scraped” information. Sometimes they are a generalization of what was scraped (and you might be able to speak of “computer generated code”). But often they are not. They are only a different encoding of the original information. The Quake inverse square root code that Copilot reproduced verbatim is clearly unique and original enough to deserve copyright protection since very few people could write it. Even with generalizations, the question is not as simple as you make it out to be. A compressed video is also a generalization of the original source material (details have been dropped to compress it). Yet no copyright protection is lost in that case.
Hello, @eevee here. To be clear, I’m not arguing for an expansion of copyright; I’m merely arguing that what GitHub is doing (and facilitating) sucks ass. Whether it actually infringes my copyright is almost immaterial, because even if it did without a shadow of a doubt, what am I going to do about it? Sue Microsoft? Sue every developer who uses Copilot, somehow?
Regarding your core point: I am also not arguing that reproducing short snippets is copyright infringement. In fact I think that’s a huge red herring and I don’t understand why everyone is so fixated on verbatim reproductions. If I take two novels and interleave their sentences, the resulting output (which could be machine-generated, no less!) would never have more than a few consecutive words in common with either of the originals, but you couldn’t argue with a straight face that it’s not derived from them.
I would honestly prefer if ML were considered fair use, though there’s some serious line-drawing there that everyone is frantically avoiding talking about. My objection in this particular case is that GitHub has taken a large body of code explicitly licensed to prevent its exploitation by commercial and proprietary interests, chucked it in a blender, and offered the resulting smoothie for exploitation by commercial and proprietary interests. “But a computer did it” is irrelevant.
Could the generator produce metadata about where the content it produces originates from, in the sense of giving credit where credit is due? For automated scraping, it should not be difficult to attach a link and the author’s name to any scraped snippet of code.
Way back when copyright policies were created, machine learning wasn’t a concern. It was targeted at people, but this is on another scale. What we really need is a new set of policies for machine learning, something like “learningright”. Trying to stretch copyright laws for machine learning is wrong. There are so many nuances and precedents that copyright is no longer applicable. Moreover, it is not the “theft” that makes everyone so anxious, but the nightmarish thought that Copilot is just a stepping stone for “Cortana, make an app that does X”.
It may not be illegal, but Copilot, being owned, closed source, only available for Visual Studio Code, and without any way to opt your code out, is immoral.
If Copilot were open source, with the ML corpus and models publicly available, not tied to Visual Studio Code, and under community governance, it could be a good thing.
Finally, there is a GitHub Copilot, but for music. https://fairuseify.ml
> What would then stop a music label from training an AI with its music catalogue to automatically generate every tune imaginable and prohibit its use by third parties?
The example makes no sense because the generated content is generated from already copyrighted content, so it’s a derivative work, and basically you’d need an agreement with every author in the original dataset if you wanted to publish music made this way.
But under your interpretation that’s exactly what one might be able to do.
Nobody wants to make copyright stronger; the problem is ignoring current copyright to the advantage of large companies when the opposite is not going to happen.