Aragon Research

New York Times Limits Use of Data for Generative AI

By Adam Pease

 


The New York Times (NYT) recently updated its terms of use to prohibit the use of the newspaper’s content in the training of generative AI models. This blog discusses the news and its implications for the generative AI market.

Generative AI and Data Ownership

The legal situation surrounding generative AI is still in a state of limbo, with many uncertainties regarding industry regulation and copyright law. Several leading generative AI providers have faced criticism for incorporating data into their model training without consulting the original authors or organizations that might own that data. Because large language models require so much text data for training, web scraping tools are often utilized to harvest data without explicit permission from the original source.

This decision by the New York Times represents a landmark in the fight between generative AI providers and the owners of data. It comes in the wake of Google’s announcement that it would be using public web data to train its AI models. Some generative AI providers are also moving in the other direction, offering ways for organizations to opt out of data harvesting for model training.

What’s Next for Generative AI Data?

It’s unclear what the long-term implications of the New York Times decision will be. To be sure, the material in the New York Times constitutes a negligible fraction of training data for LLMs like GPT-4. At the same time, the decision could cause other large media organizations to follow suit. If organizations like Reddit, whose content was heavily used to train state-of-the-art models, were to implement similar restrictions, it could represent an obstacle to model training.

And yet, it is difficult for any organization to strictly enforce control over its data. With the rise of open source model development, many users may disregard terms of use such as those stipulated by the New York Times and scrape data anyway. Another consideration for generative AI providers is the use of synthetic data. Now that models like GPT-4 are capable of producing considerable amounts of high-quality text, that output could be used to train new models in place of real, human-written data.

Bottom Line

In summary, the legal landscape surrounding generative AI copyright and data ownership is still muddled. The New York Times’s choice to limit data usage represents one move by the organizations that own content to assert their stake in the development of generative AI.




 

This blog is part of the Content AI blog series by Aragon Research’s Analyst, Adam Pease.

Missed the previous installments? Catch up here:

Blog 1: RunwayML Foreshadows the Future of Content Creation

Blog 2: NVIDIA Enters the Text-to-Image Fray

Blog 3: Will OpenAI’s New Chatbot Challenge Legacy Search Engines?

Blog 4: Adobe Stock Accepts Generative Content and Meets Backlash

Blog 5: OpenAI Makes a Move for 3D Generative Content with Point-E

Blog 6: ChatGPT and the Problem of Detecting AI-Generated Content

Blog 7: Content AI: Voice AI Takes a Step Forward

Blog 8: AI in the Courtroom: Are Robot Lawyers the Future of Law?

Blog 9: GitHub Copilot and the Legality of Generative Content

Blog 10: Google Steps into the Chat AI Ring with Bard, Anthropic Investment

Blog 11: Exploring Google Bard’s Botched Demo

Blog 12: Meta AI Is Working at the Intersection of Robotics and Generative AI

Blog 13: Meta’s New AI Model Leaks

Blog 14: Students in China Use ChatGPT from Behind the Firewall

Blog 15: OpenAI’s ChatGPT API Will Transform Application Experiences

Blog 16: Microsoft Announces Copilot X, GPT-4 Integration

Blog 17: BloombergGPT Brings Generative AI to Finance

Blog 18: Stability AI Releases Its First Large Language Model: StableLM

Blog 19: OpenAI to Patent ‘GPT’

Blog 20: Pinecone and the Power of Vector Databases for AI

Blog 21: Alphabet Plans New Generative AI Announcements for Google I/O

Blog 22: Europe Moves to Regulate Generative AI

Blog 23: OpenAI Introduces Code Interpreter Plugin for ChatGPT

Blog 24: Generative AI and the Labor Market: Is It Causing Job Loss?

Blog 25: OpenAI Announces Function Calling for Its GPT-4 API

Blog 26: The State of Open-Source Language Models

Blog 27: The State of Generative Video

Blog 28: Google’s “Genesis”: A News Writing AI Shocking Journalists

Blog 29: OpenAI Brings Custom Instructions to ChatGPT
