The data that trains AI is under the spotlight — and even I'm weirded out | The AI Beat

It is widely understood that today's AI is hungry for data and that large language models (LLMs) are trained on massive unlabeled data sets. But last week, the general public got a revealing peek under the hood of one of them, when the Washington Post published a deep dive into Google's C4 data set, or the English Colossal Clean Crawled Corpus.

Working with researchers from the Allen Institute for AI, the publication uncovered the 15 million websites, including proprietary, personal, and offensive websites, that went into the training data, which was used to train high-profile models like Google's T5 and Meta's LLaMA.

According to the article, the dataset was "dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence."

The nonprofit CommonCrawl did a scrape for C4 in April 2019. CommonCrawl told The Washington Post that it "tries to prioritize the most important and reputable sites, but does not try to avoid licensed or copyrighted content."

VentureBeat is well-represented in the corpus of data

It shouldn't come as a surprise, then, that a quick search of the websites in the dataset (offered in the article through a simple search box) showed that VentureBeat was well represented, with 10 million tokens (small bits of text used to process disorganized information — typically a word or phrase). But it was disconcerting to find that nearly every publication I've ever written for is, too — even the ones where I tried to sign favorable freelance contracts — and even my personal music website is part of the dataset.

Keep in mind, I've developed a thick skin when it comes to icky data-digging. I started writing about data analytics over 10 years ago for a magazine covering the direct marketing industry — a business that for decades had relied on mailing list brokers that sold or rented access to valuable datasets. I spent years covering the wild and woolly world of digital advertising technology, with its creepy "cookies" that allow brands to follow you all around the web. And it's felt like eons since I discovered that the GPS in my car and my phone was gathering data to share with brands.

So I had to ask myself: Why did I feel so weirded out that my creative output has been sucked into the vacuum of AI datasets when so much of my life is already up for grabs?

Training AI models with massive datasets isn't new

Training AI models with massive datasets is not new, of course. The Google C4 dataset was published in 2020, as was The Pile, another large, diverse, open-source language modeling dataset developed by EleutherAI that draws on everything from PubMed to Wikipedia to GitHub. Stability AI's new language model, StableLM, was trained on a new experimental dataset built on The Pile containing 1.5 trillion tokens.

In fact, The Pile has been so widely shared at this point that Eleuther argued in a recent Guardian article that it “does not constitute significantly increased harm.” That said, back in 2021 Stella Rose Biderman, executive director of EleutherAI, pointed out on Twitter that she considered the C4 dataset to be "lower-quality than the Pile, or any other dataset that is curated and selectively produced." In addition, she said at that time that she was "thrilled this dataset is public ... a major reason #EleutherAI made the Pile was a lack of publicly available (and therefore publicly criticizable) datasets for training LLMs."

Certainly part of the "yuck" factor is that it is so hard to wrap my mind around the scale of data that we're talking about here and the lack of clarity around how, exactly, the data is being used.

In the Guardian article, Michael Wooldridge, a professor of computer science at the University of Oxford, said that LLMs, such as those that underpin OpenAI’s ChatGPT and Google’s Bard, hoover up colossal amounts of data.

“This includes the whole of the world wide web — everything. Every link is followed in every page, and every link in those pages is followed … In that unimaginable amount of data there is probably a lot of data about you and me,” he said. “And it isn’t stored in a big database somewhere — we can’t look to see exactly what information it has on me. It is all buried away in enormous, opaque neural networks.”

The human side of AI training data

At the heart of what bothers me are, I think, questions about the human side of AI training data. It's not that I think my job as senior writer at VentureBeat is imminently at risk because of large language models like ChatGPT, but it is nevertheless disconcerting to know that my articles are part of the dataset training them. It feels kind of like I helped train the ambitious intern who pretends to be the Goose to my Maverick but plans to kick me out of the plane altogether. And as a writer who covers the world of AI, it feels especially meta.

AI researchers don't necessarily agree. For example, last week I spoke to Vipul Ved Prakash, founder and CEO of Together, which announced that its RedPajama project had replicated Meta's LLaMA dataset with the goal of building open-source, state-of-the-art LLMs.

Prakash told me that he thinks “these models capture in some ways the output of human society and there is a sort of obligation to make them open and usable by everyone," adding that “most of the magic” of these models comes from the fact that they are trained on “really broad and vast” data.

He also pointed out that the original data is compressed significantly in the actual models that result. The RedPajama dataset is 5 terabytes, but the models created can be as small as 14 GB, ~500 times smaller than the original data they are modeling.
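
For a rough sense of that ratio, here is a back-of-envelope sketch in Python. It assumes decimal units (1 TB = 1,000 GB) and uses only the round figures quoted above; the function name is hypothetical and chosen for illustration. With these numbers the ratio comes out nearer 360 than 500, so the quoted figure probably reads best as an order-of-magnitude estimate.

```python
# Back-of-envelope check of the dataset-to-model size ratio.
# Assumption: decimal units (1 TB = 1,000 GB); inputs are the round
# figures quoted in the article, not measured file sizes.
GB_PER_TB = 1_000


def compression_ratio(dataset_tb: float, model_gb: float) -> float:
    """Return how many times smaller the model is than the dataset."""
    return dataset_tb * GB_PER_TB / model_gb


ratio = compression_ratio(dataset_tb=5, model_gb=14)
print(f"The model is about {ratio:.0f} times smaller than the dataset")
```

Either way, the point stands: the model is hundreds of times smaller than the text it was trained on.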

“This means that knowledge from the data is abstracted, transformed and modeled in a very different representation of weights and biases of parameters in the neural network model, and not stored and used in its original form,” said Prakash. So, it is “not reproducing the training data — it is derivative work on top of that. From our understanding, it is considered fair use as long as the model is not reproducing the data — it’s learning from it.”

Pushing back against the tokenization of data

I can understand Prakash's point of view as an AI researcher. But as a human creator, I can also understand that no matter how our data is "abstracted, transformed and modeled," it comes from human output, which means there are consequences. I mean, if you're vegetarian, just because the animal parts have been boiled into oblivion, it doesn't mean that foods containing gelatin aren't off-limits.

There are massive copyright issues around large language models, with more and more lawsuits coming down the pike. There are significant concerns around misinformation, with discussions about regulation moving front and center. Companies like OpenAI have almost entirely closed up about what datasets they use to build their models. They certainly know that the more publicity these massive datasets get, the more pushback there will be from the public, which is just beginning to understand the ramifications of sharing their lives and livelihoods with the internet.

I don't know what the solutions are to these challenges. But I'll continue to report on the possibilities. Starting next week, however, I'll be taking a brief pause on adding to the web's datasets — I'm heading out on a two-week vacation starting April 30. I'll return with a new AI Beat in mid-May!
