What Sarah Silverman's lawsuit against OpenAI and Meta really means | The AI Beat

Litigation targeting the data scraping practices of AI companies developing large language models (LLMs) continued to heat up today, with the news that comedian and author Sarah Silverman is suing OpenAI and Meta for copyright infringement of her humorous memoir, The Bedwetter: Stories of Courage, Redemption, and Pee, published in 2010.

The lawsuit, filed by the San Francisco-based Joseph Saveri Law Firm — which also filed a suit against GitHub in 2022 — claims that Silverman and two other plaintiffs did not consent to the use of their copyrighted books as training material for OpenAI's ChatGPT and Meta's LLaMA, and that when ChatGPT or LLaMA is prompted, the tool generates summaries of the copyrighted works, something only possible if the models were trained on them.

Legal AI issues around copyright and 'fair use' growing louder

These legal issues around copyright and "fair use" are not going away. In fact, they go to the heart of what today's LLMs are made of: the training data. As I discussed last week, web scraping for massive amounts of data can arguably be described as the secret sauce of generative AI. AI chatbots like ChatGPT, LLaMA, Claude (from Anthropic) and Bard (from Google) can spit out coherent text because they were trained on massive corpora of data, mostly scraped from the internet. And as today’s LLMs like GPT-4 have ballooned to hundreds of billions of parameters, so has the hunger for data.

Data scraping practices in the name of training AI have recently come under attack. For example, OpenAI was hit with two other new lawsuits. One, filed on June 28, also by the Joseph Saveri Law Firm, claims that OpenAI unlawfully copied book text without getting consent from copyright holders or offering them credit and compensation. The other, filed the same day by the Clarkson Law Firm on behalf of more than a dozen anonymous plaintiffs, claims OpenAI’s ChatGPT and DALL-E collect people’s personal data from across the internet in violation of privacy laws.

Those lawsuits, in turn, come on the heels of a class action suit filed in January, Andersen et al. v. Stability AI, in which artist plaintiffs raised claims including copyright infringement. Getty Images also filed suit against Stability AI in February, alleging copyright and trademark infringement, as well as trademark dilution.

Sarah Silverman, of course, adds a new celebrity layer to the issues around AI and copyright — but what does this new lawsuit really mean for AI? Here are my predictions:

1. There are many more lawsuits coming.

In my article last week, Margaret Mitchell, researcher and chief ethics scientist at Hugging Face, called the AI data scraping issues “a pendulum swing,” adding that she had previously predicted that by the end of the year, OpenAI may be forced to delete at least one model because of these data issues.

Certainly, we should expect many more lawsuits to come. Way back in April 2022, when DALL-E 2 first came out, Mark Davies, partner at San Francisco-based law firm Orrick, agreed there are many open legal questions when it comes to AI and "fair use," a legal doctrine that promotes freedom of expression by permitting the unlicensed use of copyright-protected works in certain circumstances.

“What happens in reality is when there are big stakes, you litigate it,” he said. “And then you get the answers in a case-specific way.”

And now, renewed debate around data scraping has “been percolating,” Gregory Leighton, a privacy law specialist at law firm Polsinelli, told me last week. The OpenAI lawsuits alone, he said, are enough of a flashpoint to make other pushback inevitable. “We’re not even a year into the large language model era — it was going to happen at some point,” he said.

The legal battles around copyright and fair use could ultimately end up in the Supreme Court, Bradford Newman, who leads the machine learning and AI practice of global law firm Baker McKenzie, told me last October.

“Legally, right now, there is little guidance,” he said, around whether copyrighted input going into LLM training data is "fair use." Different courts, he predicted, will come to different conclusions: “Ultimately, I believe this is going to go to the Supreme Court.”

2. Datasets will be increasingly scrutinized, but it will be hard to enforce.

In Silverman's lawsuit, the authors claim that OpenAI and Meta intentionally removed copyright-management information such as copyright notices and titles.

“Meta knew or had reasonable grounds to know that this removal of [copyright management information] would facilitate copyright infringement by concealing the fact that every output from the LLaMA language models is an infringing derivative work,” the authors alleged in their complaint against Meta.

The authors’ complaints also speculated that ChatGPT and LLaMA were trained on massive datasets of books that skirt copyright laws, including "shadow libraries" like Library Genesis and ZLibrary.

“These shadow libraries have long been of interest to the AI-training community because of the large quantity of copyrighted material they host,” reads the authors’ complaint against Meta. “For that reason, these shadow libraries are also flagrantly illegal.”

But a Bloomberg Law article last October pointed out that there are many legal hurdles to overcome when it comes to enforcing copyright against a shadow library. For example, many of the site operators are based in countries outside of the U.S., according to Jonathan Band, an intellectual property attorney and founder of Jonathan Band PLLC.

“They’re beyond the reach of U.S. copyright law,” he wrote in the article. “In theory, one could go to the country where the database is hosted. But that’s expensive and sometimes there are all kinds of issues with how effective the courts there are, or if they have a good judicial system or a functional judicial system that can enforce orders.”

In addition, the onus is often on the creator to prove that the use of copyrighted work for AI training resulted in a "derivative" work. In an article in The Verge last November, Daniel Gervais, a professor at Vanderbilt Law School, said training a generative AI on copyright-protected data is likely legal, but the same cannot necessarily be said for generating content — that is, what you do with that model might be infringing.

And Katie Gardner, a partner at international law firm Gunderson Dettmer, told me last week that fair use is “a defense to copyright infringement and not a legal right.” It can also be very difficult to predict how courts will come out in any given fair use case, she said. “There is a score of precedent where two cases with seemingly similar facts were decided differently.”

But she emphasized that there is Supreme Court precedent that leads many to infer that using copyrighted materials to train AI can be fair use based on the transformative nature of such use; that is, it doesn’t supplant the market for the original work.

3. Enterprises will want their own models or indemnification.

Enterprise businesses have already made it clear that they don't want to deal with the risk of lawsuits related to AI training data — they want safe access to create generative AI content that is risk-free for commercial use.

That's where indemnification has moved front and center: Last week, Shutterstock announced that it will offer enterprise customers full indemnification for the license and use of generative AI images on its platform to protect them against potential claims related to their use of the images. The company said it would fulfill requests for indemnification on demand through a human review of the images.

That news came just a month after Adobe announced a similar offering: “If a customer is sued for infringement, Adobe would take over legal defense and provide some monetary coverage for those claims,” a company spokesperson said.

And new poll data from enterprise MLOps platform Domino Data Lab found that data scientists believe generative AI will significantly impact enterprises over the next few years, but its capabilities cannot be outsourced — that is, enterprises need to fine-tune or control their own gen AI models.

Besides data security, IP protection is another issue, said Kjell Carlsson, head of data science strategy at Domino Data Lab. “If it’s important and really driving value, then they want to own it and have a much greater degree of control,” he said.
