Generative AI datasets could face a reckoning | The AI Beat

Over the weekend, a bombshell story from The Atlantic found that Stephen King, Zadie Smith and Michael Pollan are among thousands of authors whose copyrighted works were used to train Meta’s generative AI model, LLaMA, as well as other large language models, using a dataset called “Books3.” The future of AI, the report claimed, is “written with stolen words.”

The truth is, the issue of whether the works were “stolen” is far from settled, at least when it comes to the messy world of copyright law. But the datasets used to train generative AI could face a reckoning — not just in American courts, but in the court of public opinion.

Datasets with copyrighted materials: an open secret

It’s an open secret that LLMs rely on the ingestion of large amounts of copyrighted material for the purpose of “training.” Proponents and some legal experts insist this falls under what is known as “fair use” of the data, often pointing to the federal ruling in 2015 that Google’s scanning of library books and display of “snippets” online did not violate copyright, though others see an equally persuasive counterargument.

Still, until recently, few outside the AI community had deeply considered how the hundreds of datasets that enabled LLMs to process vast amounts of data and generate text or image output — a practice that arguably began with the release of ImageNet in 2009 by Fei-Fei Li, an assistant professor at Princeton University — would impact many of those whose creative work was included in the datasets. That is, until ChatGPT was launched in November 2022, rocketing generative AI into the cultural zeitgeist in just a few short months.

The AI-generated cat is out of the bag

After ChatGPT emerged, LLMs were no longer simply interesting scientific research experiments but commercial enterprises with massive investment and profit potential. Creators of online content (artists, authors, bloggers, journalists, Reddit posters, people posting on social media) are now waking up to the fact that their work has already been hoovered up into massive datasets that trained AI models that could, eventually, put them out of business. The AI-generated cat, it turns out, is out of the bag, and lawsuits and Hollywood strikes have followed.

At the same time, LLM companies such as OpenAI, Anthropic, Cohere and even Meta (traditionally the most open source-focused of the Big Tech companies, but which declined to release the details of how LLaMA 2 was trained) have become more secretive about what datasets are used to train their models.

“Few people outside of companies such as Meta and OpenAI know the full extent of the texts these programs have been trained on,” according to The Atlantic. “Some training text comes from Wikipedia and other online writing, but high-quality generative AI requires higher-quality input than is usually found on the internet — that is, it requires the kind found in books.” In a lawsuit filed in California last month, the writers Sarah Silverman, Richard Kadrey, and Christopher Golden allege that Meta violated copyright laws by using their books to train LLaMA.

The Atlantic obtained and analyzed Books3, which was used to train LLaMA as well as Bloomberg’s BloombergGPT, EleutherAI’s GPT-J — a popular open-source model — and likely other generative-AI programs now embedded in websites across the internet. The article's author identified more than 170,000 books that were used — including five by Jennifer Egan, seven by Jonathan Franzen, nine by bell hooks, five by David Grann and 33 by Margaret Atwood.

In an email to The Atlantic, Stella Biderman of EleutherAI, which created the Pile, the larger dataset that includes Books3, wrote: “We work closely with creators and rights holders to understand and support their perspectives and needs. We are currently in the process of creating a version of the Pile that exclusively contains documents licensed for that use.”

Data collection has a long history

Data collection has a long history, mostly for marketing and advertising. In the mid-20th century, mailing list brokers “boasted that they could rent out lists of potentially interested consumers for a litany of goods and services.”

With the advent of the internet over the past quarter-century, marketers moved into creating vast databases to analyze everything from social-media posts to website cookies and GPS locations in order to personally target ads and marketing communications to consumers. Phone calls “recorded for quality assurance” have long been used for sentiment analysis.

In response to issues related to privacy, bias and safety, there have been decades of lawsuits and efforts to regulate data collection, including the EU’s GDPR law, which went into effect in 2018. The U.S., however, which historically has allowed businesses and institutions to collect personal information without express consent except in certain sectors, has not yet gotten comparable regulation to the finish line.

But the issue now is not only related to privacy, bias or safety — generative AI models affect the workplace and society at large. Many no doubt believe that generative AI issues related to labor and copyright are just a retread of previous societal changes around employment, and that consumers will accept what is happening as not much different than the way Big Tech has gathered their data for years. But millions of people believe their data has been stolen — and they will likely not go quietly.

A day of reckoning may be coming for generative AI datasets

That doesn’t mean, of course, that they won’t ultimately have to give up the fight. But it also doesn’t mean that Big Tech will win big. So far, most legal experts I’ve spoken to have made it clear that the courts will decide (the issue could go as far as the Supreme Court) and that there are strong arguments on both sides of the debate over the datasets used to train generative AI.

Enterprises and AI companies would do well, I think, to consider transparency to be the best option. After all, what does it mean if experts can only speculate as to what is in powerful, sophisticated, massive AI models like GPT-4 or Claude or Pi?

Datasets used to train LLMs are no longer simply benefitting researchers searching for the next breakthrough. While some may argue that generative AI will benefit the world, there is no longer any doubt that copyright infringement is rampant. As companies seeking commercial success get ever-hungrier for data to feed their models, there may be ongoing temptation to grab all the data they can. It is not certain that this will end well: A day of reckoning may be coming.
