Generative AI’s secret sauce, data scraping, comes under attack

Web scraping for massive amounts of data can arguably be described as the secret sauce of generative AI. After all, AI chatbots like ChatGPT, Claude, Bard and LLaMA can spit out coherent text because they were trained on massive corpora of data, mostly scraped from the internet. And as today's LLMs like GPT-4 have ballooned in size, with some models reaching hundreds of billions of parameters, so has the hunger for data.

Data scraping practices in the name of training AI have come under attack over the past week on several fronts. OpenAI was hit with two lawsuits. One, filed in federal court in San Francisco, alleges that OpenAI unlawfully copied book text by not getting consent from copyright holders or offering them credit and compensation. The other claims OpenAI’s ChatGPT and DALL·E collect people’s personal data from across the internet in violation of privacy laws.

Twitter also made news around data scraping, but this time it sought to protect its data by limiting access to it. In an effort to curb the effects of AI data scraping, Twitter temporarily prevented individuals who were not logged in from viewing tweets on the social media platform and also set rate limits for how many tweets can be viewed.

For its part, Google doubled down, confirming that it scrapes data for AI training. Last weekend, it quietly updated its privacy policy to include Bard and Cloud AI alongside Google Translate in the list of services where collected data may be used.

A leap in public understanding of generative AI models

All of this news around scraping the web for AI training is not a coincidence, Margaret Mitchell, researcher and chief ethics scientist at Hugging Face, told VentureBeat by email.

“I think it's a pendulum swing,” she said, adding that she had previously predicted that by the end of the year, OpenAI may be forced to delete at least one model because of these data issues. The recent news, she said, made it clear that a path to that future is visible, but she admits that “it is optimistic to think something like that would happen while OpenAI is cozying up to regulators so much.”

But she says the public is learning more about generative AI models, so the pendulum has swung from rapt fascination with ChatGPT to wondering where the data for these models comes from.

"The public first had to learn that ChatGPT is based on a machine learning model," Mitchell explained, and that there are similar models everywhere and that these models "learn" from training data. "All of that is a massive leap forward in public understanding over just the past year," she emphasized.

Renewed debate around data scraping has "been percolating," agreed Gregory Leighton, a privacy law specialist at law firm Polsinelli. The OpenAI lawsuits alone, he said, are enough of a flashpoint to make other pushback inevitable. "We're not even a year into the large language model era — it was going to happen at some point," he said. "And [companies like] Google and Twitter are bringing some of these things to a head in their own contexts."

For companies, the competitive moat is the data

Katie Gardner, a partner at international law firm Gunderson Dettmer, told VentureBeat by email that for companies like Twitter and Reddit, the "competitive moat is in the data" — so they don't want anyone scraping it for free.

"It will be unsurprising if companies continue to take more actions to find ways to restrict access, maximize use rights and retain monetization opportunities for themselves," she said. "Companies with significant amounts of user-generated content who may have traditionally relied on advertising revenue could benefit significantly by finding new ways to monetize their user data for AI model training," whether for their own proprietary models or by licensing data to third parties.

Polsinelli's Leighton agreed, saying that organizations need to shift their thinking about data. "I've been saying to my clients for some time now that we shouldn't be thinking about ownership about data anymore, but about access to data and data usage," he said. "I think Reddit and Twitter are saying, well, we're going to put technical controls in place, and you're going to have to pay us for access, which I do think puts them in a slightly better position than other [companies]."

Different privacy issues around data scraping for AI training

While data scraping has been flagged for privacy issues in other contexts, including digital advertising, Gardner said the use of personal data in AI models presents unique privacy issues as compared to general collection and use of personal data by companies.

One, she said, is the lack of transparency. "It’s very difficult to know if personal data was used, and if so, how it is being used and what the potential harms are from that use, whether those harms are to an individual or society in general," she said, adding that the second issue is that once a model is trained on data, it may be impossible to “untrain it” or delete or remove data. "This factor is contrary to many of the themes of recent privacy regulations which vest more rights in individuals to be able to request access to and deletion of their personal data," she explained.

Mitchell agreed, adding that with generative AI systems there is a risk of private information being reproduced and regenerated by the system. "That information [risks] being further amplified and proliferated, including to bad actors who otherwise would not have had access or known about it," she said.

Is this a moot point where models that are already trained are concerned? Could a company like OpenAI be off the hook for GPT-3 and GPT-4, for example? According to Gardner, the answer is no: "Companies who have previously trained models will not be exempt from future judicial decisions and regulation."

That said, how companies will comply with stringent requirements is an open issue. "Absent technical solutions, I suspect at least some companies may need to completely retrain their models — which could be an enormously expensive endeavor," Gardner said. "Courts and governments will need to balance the practical harms and risks in their decision-making against those costs and the benefits this technology can provide society. We are seeing a lot of lobbying and discussions on all sides to facilitate sufficiently informed rule-making."

'Fair use' of scraped data continues to drive discussion

For creators, much of the discussion around data scraping for AI training revolves around whether the use of copyrighted works qualifies as "fair use" under U.S. copyright law, which "permits limited use of copyrighted material without having to first acquire permission from the copyright holder," as many companies like OpenAI claim.

But Gardner points out that fair use is "a defense to copyright infringement and not a legal right." In addition, it can be very difficult to predict how courts will come out in any given fair use case, she said: "There is a score of precedent where two cases with seemingly similar facts were decided differently."

But she emphasized that there is Supreme Court precedent that leads many to infer that use of copyrighted materials to train AI can be fair use based on the transformative nature of such use, i.e., it doesn’t supplant the market for the original work.

"However, there are scenarios where it may not be fair use — including, for example, if the output of the AI model is similar to the copyrighted work," she said. "It will be interesting to see how this plays out in the courts and legislative process — especially because we’ve already seen many cases where user prompting can generate output that very plainly appears to be a derivative of a copyrighted work, and thus infringing."

Scraped data in today's proprietary models remains unknown

The problem, however, is that no one knows what is in the datasets used to train today's sophisticated proprietary generative AI models like OpenAI's GPT-4 and Anthropic's Claude.

In a recent Washington Post report, researchers at the Allen Institute for AI helped analyze one large dataset to show "what types of proprietary, personal, and often offensive websites ... go into an AI’s training data." But while the dataset, Google’s C4, included sites known for pirated e-books, content from artist websites like Kickstarter and Patreon, and a trove of personal blogs, it's just one example of a massive dataset; a large language model may use several. The recently released open-source RedPajama, which replicated the LLaMA dataset to build open-source, state-of-the-art LLMs, includes slices of data from Common Crawl, arXiv, GitHub, Wikipedia and a corpus of open books.

But OpenAI's 98-page technical report released in March about the development of GPT-4 was notable mostly for what it did not include. In a section called "Scope and Limitations of this Technical Report," it says: “Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”

Data scraping discussion is a 'good sign' for generative AI ethics

Debates around datasets and AI have been going on for years, Mitchell pointed out. In a 2018 paper, "Datasheets for Datasets," AI researcher Timnit Gebru and her co-authors wrote that "currently there is no standard way to identify how a dataset was created, and what characteristics, motivations, and potential skews it represents."

The paper proposed the concept of a datasheet for datasets, a short document to accompany public datasets, commercial APIs and pretrained models. "The goal of this proposal is to enable better communication between dataset creators and users, and help the AI community move toward greater transparency and accountability."

While this may seem unlikely given the current trend toward proprietary "black box" models, Mitchell said she considered the fact that data scraping is under discussion right now to be a "good sign that AI ethics discourse is further enriching public understanding."

"This kind of thing is old news to people who have AI ethics careers, and something many of us have discussed for years," she added. "But it's starting to have a public breakthrough moment — similar to fairness/bias a few years ago — so that's heartening to see."