Generative AI – Prolific Copyright Infringer?

poor man's copyright.  Original music mailed to myself (Jason Flaks) via certified mail.

So, you might be wondering, “What makes this guy fit to pen an article about generative AI and copyright infringement?” I mean, I’m no copyright lawyer, nor do I moonlight as one on television. But I do bring a unique viewpoint to the table.  After all, I’ve dabbled in the music industry and have been up to my elbows in machine-learning projects for a good chunk of my career. But perhaps my best qualification is my long-standing fascination with copyright law, which started when I was just a kid. That image up top isn’t some AI-generated piece from Midjourney; they’re my earnest attempts at copyrighting my original music over three decades ago using the Poor Man’s copyright approach.

So, What’s Copyright, and How Do I Get One?

Before we dive into the meaty debate of whether Generative AI infringes on anyone’s copyright, let’s clarify what copyright means. According to the US Copyright Office, copyright is a “type of intellectual property that protects original works of authorship as soon as an author fixes the work in a tangible form of expression.” In simpler terms, copyright asserts your ownership of any intangible creations of human intellect (e.g., music, art, writing, etc).  The moment you fix your creation to a physical form (e.g., MP3 file, canvas, video recording, piece of paper, etc.), you have a copyright.

Amazingly, you don’t need to register for an official copyright to have one. So, why register a copyright? The Supreme Court has decided that to sue for copyright infringement, you must have registered your copyright with the US copyright office. However, they’ve also clarified that registering to sue is separate from the date of creation. This means I could register for a copyright for this blog post years from now but still sue for any infringement prior, as long as I can prove the date of creation.

How Can Generative AI Infringe on a Copyright?

There’s been a lot of talk about generative AI infringing on copyright protections, but many of these discussions oversimplify the issue. There are actually three different ways Generative AI can infringe on your copyright, some favoring artists/creators and some more favorable to the generative AI companies.

  • Theft (copying) of copyrighted material
  • Distribution of copyrighted material
  • Use of copyrighted material in derivative or transformative works

On Theft (Copying) of Copyrighted Material

Let’s get real; while there are numerous court cases establishing the legitimate right to duplicate copyrighted material under the fair use doctrine, the default assumption is and should be that it is illegal to do so.  Therefore, if we can determine that generative AI companies are using copies of content they did not pay for or get permission to use and using that content in a way that falls outside fair use, then we can assume they are stealing copyrighted material.

There is little or no debate that generative AI companies are using copyrighted material.  After all, OpenAI basically admits to lifting its training data from content “that is publicly available on the internet” on its website. And as I discussed earlier just about anything newly written on the internet has an inherent copyright.  But beyond possible scraping of my blog post, there is sufficient evidence that generative AI companies have ingested copyrighted books, images, and more. 

And if you need any more proof, look at the image below where I attempted to elicit lyrics from Bob Dylan’s Blowin’ in the Wind from ChatGPT.  ChatGPT both knew the lyrics I provided were from the song, and ChatGPT was able to quote a portion of the lyrics I did not provide.  It can only do that because it has seen the lyrics before in its training data set.

ChatGPT prompted to generate lyrics from Bob Dylan's "Blowin in teh Wind"

If there is no question that copyrighted material was used in the training process, then we only need to assess whether the copying should be considered fair use. There are multiple justifications for fair use in copyright law.  Some are easy to interpret, and others more difficult.  Items like research or scholarly use are reasonably easy to assess, and I can find no fair argument that generative AI companies are using copyrighted material in either capacity.

So, that leaves the last question in fair use: does the copying materially impact the monetization of the content?   And I think the answer here again is quite simple: YES!  The easiest example I can give is the artwork I regularly use in my blog posts.  I’ve traditionally paid for the art I use via services like Dreamstime.  If Midjourney or Stable Diffusion trained on this type of art and I subsequently generate my blog post art via their services, I may never pay for art via Dreeamstime or other similar services again.  And in doing so, those artists have lost a way to monetize their art, and they are not equally compensated by the generative AI companies.

On Distribution of Copyrighted Material

If you’re old like me, you may remember those FBI copyright warnings that regularly made an appearance on DVDs and VHS tapes.

The unauthorized reproduction or distribution of this copyrighted work is illegal …

The issue of whether these systems distribute the content in its original form with little transformation is a big one. This distribution can occur in two ways: to end customers and to data annotators.

To end customers

Generative AI models are basically next-word (pixel, etc.) predictors. They aim to provide the most statistically likely next word based on a previous sequence of words. As a result, these models will, without any special adaptations, spit back exact copies of text, images, etc., especially in low-density areas.  As you can see from the image in the previous section, while OpenAI has been proactively trying to adapt the system not to distribute copyrighted material, I was still able to get it to do so with very little effort on my part.

So while these generative AI systems will continue to try and put mitigations in place to prevent the distribution of copyrighted content, there is little or no debate that they have been doing so all along.  And they are likely to continue doing so as it is impossible to close every hole in the system.

To data annotators

OpenAI and others use reinforcement learning from human feedback (RLHF) to improve their models. RLHF requires that outputs from an original model are shown to human annotators to help build a reward model that leads to better outputs from the generative model. If these human annotators were shown copyrighted material, in an effort to reward the model for not doing so in the future, OpenAI and other generative AI companies would clearly be distributing copyrighted content.

You might ask, “Shouldn’t copyright holders be happy that OpenAI is trying to train their models not to distribute copyrighted content?”  Well, maybe, but if I started traveling the country tomorrow, giving a for-profit seminar on how to detect illegal copies of the Super Bowl, and in these seminars, I played previous Super Bowl recordings to the attendees without the NFL’s permission … I think the NFL would have a problem with that.

On Use of Copyrighted Material in Derivative or Transformative Works

The question of whether the output generated by Generative AI models, when not a direct reproduction, counts as copyright infringement is a murky one. There are many examples where courts have determined that “style” is not copyrightable. There are further questions on whether any output created by generative AI based on copyrighted material is derivative or transformative.   Truth be told, it can likely be either, depending on how the model is prompted.  So it’s actually quite difficult to say for sure if the resulting output from generative AI models is fair use or copyright infringement.

We’re left then with questions about who is really violating copyright in any of these cases. Is it the model or the company that owns it? Or Is it the user who prompted the model to generate the content? And does any of it really matter unless that generated content is published?

The Road Ahead

It seems to me the issue of generative AI and copyright has been complicated more than necessary. Generative AI companies must find a way to pay for the content they use to train their models. If they distribute the content, they may need to find a way to pay royalties.  Otherwise, these generative AI companies are profiting off the works of creators without properly compensating them.  And that just isn’t fair.

For artists, don’t let the thought of generative AI copying your style without compensation scare you. These models can’t generate new content and are limited to what they’ve seen in their training set. So, keep making new art, keep pushing boundaries, and if we solve the first problem of content theft and distribution, you’ll continue to be paid for the amazing work you create.