AI companies have all kinds of arguments against paying for copyrighted content

Illustration by Alex Castro / The Verge

Nov 5, 2023 - 21:01

The US Copyright Office is taking public comment on potential new rules around generative AI’s use of copyrighted materials, and the biggest AI companies in the world had plenty to say. We’ve collected the arguments from Meta, Google, Microsoft, Adobe, Hugging Face, StabilityAI, and Anthropic below, as well as a response from Apple that focused on copyrighting AI-written code.

There are some differences in their approaches, but the overall message for most is the same: They don’t think they should have to pay to train AI models on copyrighted work.

The Copyright Office opened the comment period on August 30th, with an October 18th due date for written comments regarding changes it was considering around the use of copyrighted data for AI model training, whether AI-generated material can be copyrighted without human involvement, and AI copyright liability. There’s been no shortage of copyright lawsuits in the last year, with artists, authors, developers, and companies alike alleging violations in different cases.

Here are some snippets from each company’s response.

Imposing a first-of-its-kind licensing regime now, well after the fact, will cause chaos as developers seek to identify millions and millions of rightsholders, for very little benefit, given that any fair royalty due would be incredibly small in light of the insignificance of any one work among an AI training set.

If training could be accomplished without the creation of copies, there would be no copyright questions here. Indeed that act of “knowledge harvesting,” to use the Court’s metaphor from Harper & Row, like the act of reading a book and learning the facts and ideas within it, would not only be non-infringing, it would further the very purpose of copyright law. The mere fact that, as a technological matter, copies need to be made to extract those ideas and facts from copyrighted works should not alter that result.

Any requirement to obtain consent for accessible works to be used for training would chill AI innovation. It is not feasible to achieve the scale of data necessary to develop responsible AI models even when the identity of a work and its owner is known. Such licensing schemes will also impede innovation from start-ups and entrants who don’t have the resources to obtain licenses, leaving AI development to a small set of companies with the resources to run large-scale licensing programs or to developers in countries that have decided that use of copyrighted works to train AI models is not infringement.

Sound policy has always recognized the need for appropriate limits to copyright in order to support creativity, innovation, and other values, and we believe that existing law and continued collaboration among all stakeholders can harmonize the diverse interests at stake, unlocking AI’s benefits while addressing concerns.

In Sega v. Accolade, the Ninth Circuit held that intermediate copying of Sega’s software was fair use. The defendant made copies while reverse engineering to discover the functional requirements—unprotected information—for making games compatible with Sega’s gaming console. Such intermediate copying also benefited the public: it led to an increase in the number of independently designed video games (which contain a mix of functional and creative aspects) available for Sega’s console. This growth in creative expression was precisely what the Copyright Act was intended to promote.

For Claude, as discussed above, the training process makes copies of information for the purposes of performing a statistical analysis of the data. The copying is merely an intermediate step, extracting unprotectable elements about the entire corpus of works, in order to create new outputs. In this way, the use of the original copyrighted work is non-expressive; that is, it is not re-using the copyrighted expression to communicate it to users.

Over the last decade or more, there has been an enormous amount of investment—billions and billions of dollars—in the development of AI technologies, premised on an understanding that, under current copyright law, any copying necessary to extract statistical facts is permitted. A change in this regime will significantly disrupt settled expectations in this area. Those expectations have been a critical factor in the enormous investment of private capital into U.S.-based AI companies which, in turn, has made the U.S. a global leader in AI. Undermining those expectations will jeopardize future investment, along with U.S. economic competitiveness and national security.

The use of a given work in training is of a broadly beneficial purpose: the creation of a distinctive and productive AI model. Rather than replacing the specific communicative expression of the initial work, the model is capable of creating a wide variety of different sorts of outputs wholly unrelated to that underlying, copyrightable expression. For those and other reasons, generative AI models are generally fair use when they train on large numbers of copyrighted works. We use “generally” deliberately, however, as one can imagine patterns of facts that would raise tougher calls.

A range of jurisdictions including Singapore, Japan, the European Union, the Republic of Korea, Taiwan, Malaysia, and Israel have reformed their copyright laws to create safe harbors for AI training that achieve similar effects to fair use. In the United Kingdom, the Government Chief Scientific Advisor has recommended that “if the government’s aim is to promote an innovative AI industry in the UK, it should enable mining of available data, text, and images (the input) and utilise [sic] existing protections of copyright and IP law on the output of AI.”

In circumstances where a human developer controls the expressive elements of output and the decisions to modify, add to, enhance, or even reject suggested code, the final code that results from the developer’s interactions with the tools will have sufficient human authorship to be copyrightable.