Meta’s LibGen Use Exposes Legal Peril for AI Training on Copyrighted Content
The recent unredaction of court documents in the Kadrey et al. v. Meta Platforms case has revealed a bombshell: Meta, the parent company of Facebook, Instagram, and WhatsApp, allegedly used pirated books from the shadow library Library Genesis (LibGen) to train its AI models.
This revelation not only puts Meta in legal hot water but also raises serious questions about the legality of AI training practices across the tech industry.
Why Did Meta Use Pirated Content?
While Meta has not officially commented on the matter, the unredacted documents suggest that the company was aware of the pirated nature of the LibGen dataset. Internal communications reveal that Meta employees, including CEO Mark Zuckerberg, were involved in discussions about using LibGen, raising concerns about potential legal and reputational risks.
Despite these concerns, Meta seemingly proceeded with using the pirated data, likely due to the vast amount of text it provided for training its large language models.
Analysis
Meta’s alleged use of pirated content highlights a critical legal gray area in the field of AI. The company argues that using publicly available data, even if obtained illegally, falls under the “fair use” doctrine. However, the plaintiffs contend that Meta’s actions constitute copyright infringement, especially given the company’s alleged awareness of the data’s illicit origin.
The outcome of this case could have far-reaching implications for the AI industry, potentially setting a precedent for how AI models can be trained and what constitutes acceptable use of copyrighted material.
What Should Enterprises Do?
This case serves as a wake-up call for enterprises developing or utilizing AI. It is crucial to ensure that the data used to train AI models is obtained legally and ethically. Relying on pirated or copyrighted content without proper authorization could lead to legal challenges and reputational damage.
Enterprises should prioritize responsible AI development, investing in legitimate data sources and obtaining necessary licenses or permissions.
Enterprises should also verify that any generative AI tools they license, whether built on large or small language models, include an indemnification clause before those tools are released for internal use.
For vendors, GenAI should not be embedded into products and services until indemnification clauses are incorporated into the Master Services Agreement.
Impact on the Market
The unredacted documents have sent shockwaves through the AI community. Other AI providers, including OpenAI, the creator of ChatGPT, and DeepSeek, could be vulnerable to similar legal challenges if it is proven that they trained on copyrighted material without permission.
This case could force a reassessment of AI training practices across the industry, potentially leading to increased scrutiny and regulation.
Bottom Line
Meta’s alleged use of pirated content for AI training underscores the legal and ethical complexities surrounding AI development. This case could be a watershed moment for the industry, prompting a shift towards more responsible and transparent AI practices. Enterprises must prioritize ethical data sourcing and ensure compliance with copyright laws to avoid potential legal pitfalls.