Meta Faces Another Lawsuit for Using Copyright-Protected Works To Train Its AI Models

The company reportedly pushed ahead with its LLM development despite knowing it was breaking copyright law.

For a company with over 3 billion active users, and the never-ending stream of data they generate, it’s a wonder that Meta needs to rely on such massive troves of external data to power its AI tools.

In any event, while already facing a significant legal challenge in the U.S. over the unauthorized use of copyright-protected material to train its Llama models, Meta has now been hit with another copyright challenge, this time in France, where publishers and authors have launched legal action for copyright infringement.

As reported by Bloomberg:

“French publishers and authors are suing Meta for copyright infringement, accusing the tech giant of using their books to train its generative artificial intelligence model without authorization. SNE, the trade association representing major French publishers including Hachette and Editis, along with authors’ association SGDL and writers’ union SNAC, filed a complaint this week in a Paris court dedicated to intellectual property, the group said at a press conference on Wednesday.”

It seems that, much like the American collective seeking to hold Meta to account for illegally using their works, French publishers have found that Meta’s AI models can produce highly accurate replicas of their authors’ work, signaling likely scraping and theft of their intellectual property.

Which likely stems from the same AI development push at the company.

According to reports, following the rise of OpenAI back in 2022, Meta CEO Mark Zuckerberg was desperate to catch up, and build a rival AI model that would ensure that Meta remained the leader in the AI race.

Within this, Zuckerberg reportedly approved the use of what Meta knew was copyright-protected material in order to build out its language model.

As reported by The New York Times:

“Meta could not match ChatGPT unless it got more data. Some debated paying $10 a book for the full licensing rights to new titles. They discussed buying Simon & Schuster, which publishes authors like Stephen King, according to the recordings. They also talked about how they had summarized books, essays and other works from the internet without permission and discussed sucking up more, even if that meant facing lawsuits. One lawyer warned of ‘ethical’ concerns around taking intellectual property from artists but was met with silence, according to the recordings.”

Meta then reportedly went ahead and integrated illegally sourced, copyright-protected material, scraped from platforms that it knew were operating in violation of the law.

The problem, according to NYT, was that despite Meta having so many users across its apps, most of the content that they produce isn’t overly helpful in building an AI model: people delete older posts, they don’t generally share longer-form content, and their writing style doesn’t align with the conversational nature of chatbots.

As such, for Meta to compete, it needed new data sources, and it found them in pirated books, which publishers have now detected via their own means.

Which could see Meta face a parade of lawsuits around the world, especially if these initial cases lead to compensation deals for the impacted authors.

Indeed, if legal precedent can be established, you can bet that every publishing house in the world will smell the cash, and will be trawling through any info they can find to sniff out traces of their own works.

Which could lead to major penalties for Meta moving forward.

But hang on, how could OpenAI, a much smaller start-up, with no access to billions of users’ info, build out its own database in the same way without the same copyright issues?

Well, OpenAI is also facing various legal challenges over the same practice.

Indeed, in all of these cases, you can expect to see OpenAI being investigated for the same violations, as authors and publishers seek recourse for unauthorized use of their work.

Data is the lifeblood of large language models, and the company with the best data sources will eventually win out, because its system will produce better, more accurate, more usable results, based on that reference set. Without that initial data source, the systems have nothing to go on, which is seemingly why Meta and OpenAI, among others, were willing to take such risks in building their LLMs.

At the same time, once these models are built, they exist, and they can then be trained with supplementary data from there. So Meta may have viewed this as a necessary risk in the set-up stage, which will now enable it to make more use of its own data trove to refine its models.

That’s similar to how xAI is approaching its LLM, building the foundation, then using X posts to refine and revise the model to provide real-time informational updates.

As such, while this may end up costing them, it could be worth it, offset by the benefits they’ll glean from selling their models.

Either way, it could take years for the courts to litigate each case, and by then, there may be a new legal approach to LLM training and the use of such works.

You can bet that Meta’s exploring every angle on this front.