New York Times Limits Use of Data for Generative AI
By Adam Pease
The New York Times (NYT) recently updated its terms of use to prohibit the use of the newspaper’s content to train generative AI models. This blog discusses the news and its implications for the generative AI market.
Generative AI and Data Ownership
The legal situation surrounding generative AI is still in a state of limbo, with many uncertainties regarding industry regulation and copyright law. Several leading generative AI providers have faced criticism for incorporating data into their model training without consulting the original authors or organizations that might own that data. Because large language models require so much text data for training, web scraping tools are often utilized to harvest data without explicit permission from the original source.
This decision by the New York Times represents a landmark in the fight between generative AI providers and the owners of data. It comes in the wake of Google’s announcement that it would be using public web data to train its AI models. Some generative AI providers are also moving in the other direction, offering ways for organizations to opt out of data harvesting for model training.
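In practice, the opt-out mechanisms mentioned above often take the form of crawler rules in a site’s robots.txt file. As one illustration (a sketch, not the NYT’s actual configuration), a publisher could block OpenAI’s GPTBot crawler, which OpenAI has documented as respecting robots.txt directives:

```text
# robots.txt — example of opting a site out of AI training crawls.
# GPTBot is OpenAI's published web-crawler user agent; blocking it
# signals that site content should not be collected for model training.
User-agent: GPTBot
Disallow: /
```

Note that robots.txt is an honor-system convention: compliant crawlers respect it, but it does not technically prevent scraping, which is part of why publishers like the NYT are also turning to terms-of-use restrictions.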
What’s Next for Generative AI Data?
It’s unclear what the long-term implications of the New York Times decision will be. To be sure, material from the New York Times constitutes a negligible fraction of the training data for LLMs like GPT-4. At the same time, the decision could prompt other large media organizations to follow suit. If organizations like Reddit, whose data was heavily used to train state-of-the-art models, were to implement similar restrictions, it could represent a real obstacle to model training.
And yet, it is difficult for any organization to strictly enforce control over its data. With the rise of open source model development, many users may disobey terms of use such as those stipulated by the New York Times and scrape data anyway. Another consideration for generative AI providers is the use of synthetic data—now that models like GPT-4 can produce considerable amounts of high-quality text, that output could be used to train new models in place of real, human-generated data.
Bottom Line
In summary, the legal landscape surrounding generative AI copyright and data ownership is still muddled. The New York Times’s choice to limit data usage represents one move by the organizations that own content to assert their stake in the development of generative AI.