Google, Apple, and others show large language models trained on public data expose personal information

Large language models like OpenAI's GPT-3 and Google's GShard learn to write humanlike text by internalizing billions of examples from the public web. Drawing on sources like ebooks, Wikipedia, and social media platforms like Reddit, they make inferences to complete sentences and even whole paragraphs. But a new study jointly published by Google, Apple, Stanford University, OpenAI, the University of California, Berkeley, and Northeastern University demonstrates a pitfall of this training approach. In it, the coauthors show that large language models can be prompted to reveal sensitive, private information when fed certain words and phrases.

It's well established that models can "leak" details from the data on which they're trained. (This is distinct from data leakage or target leakage, which is the use of information in the training process that couldn't be expected to be available when the model makes predictions.) Memorization of this kind is of particular concern for large language models, because their training datasets can contain names, phone numbers, addresses, and more.

In the new study, the researchers experimented with GPT-2, which predates OpenAI's powerful GPT-3 language model. They claim that they chose to focus on GPT-2 to avoid "harmful consequences" that might result from conducting research on a more recent, popular language model. To further minimize harm, the researchers developed their training data extraction attack using publicly available data and followed up with people whose information was extracted, obtaining their blessing before including redacted references in the study.

By design, language models make it easy to generate an abundance of output. Seeded with random phrases, a model can be prompted to generate millions of continuations, or phrases that complete a sentence. Most of the time, these continuations are benign strings of text, like the word "lamb" following "Mary had a little…" But if the training data happens to repeat the string "Mary had a little wombat" very often, for instance, the model might predict that phrase instead.

The coauthors of the paper sifted through millions of output sequences from the language model and predicted which text was memorized. They leveraged the fact that models tend to be more confident in results captured from training data; by checking the confidence of GPT-2 on a snippet, they could predict if the snippet appeared in the training data.
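To make the confidence check concrete, here is a minimal sketch that ranks candidate snippets by how confident a model is in them. It rests on several assumptions that are not from the study: a toy bigram model with add-one smoothing stands in for GPT-2, perplexity stands in for "confidence" (lower perplexity means higher confidence), and the corpus, candidates, and function names are all hypothetical.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count bigrams and their left contexts over whitespace tokens (toy stand-in for a language model)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for line in corpus:
        tokens = ["<s>"] + line.lower().split()
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab

def perplexity(text, model):
    """Perplexity of text under the add-one-smoothed bigram model; lower means more confident."""
    contexts, bigrams, vocab = model
    tokens = ["<s>"] + text.lower().split()
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

# Hypothetical training data in which one string is repeated often.
corpus = ["mary had a little wombat"] * 3 + [
    "mary had a little lamb",
    "the quick brown fox jumps over the lazy dog",
]
model = train_bigram(corpus)

# Hypothetical model outputs to screen for memorization.
candidates = [
    "a lazy wombat jumps over mary",
    "mary had a little lamb",
    "mary had a little wombat",
]

# Snippets the model is most confident in come first; these are the
# likeliest to have been seen (and repeated) during training.
ranked = sorted(candidates, key=lambda s: perplexity(s, model))
for snippet in ranked:
    print(f"{perplexity(snippet, model):6.2f}  {snippet}")
```

In this toy setup the repeated "wombat" string ranks first, and the novel sentence built from known words ranks last. A low score only flags a candidate; confirming that a snippet was actually memorized still requires checking it against the training data.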

The researchers report that, of 1,800 snippets from GPT-2, they extracted more than 600 that were memorized from the training data. The examples covered a range of content including news headlines, log messages, JavaScript code, personally identifiable information, and more. Many appeared only infrequently in the training dataset, but the model learned them anyway, perhaps because the originating documents contained multiple instances of the examples.

The coauthors also found that larger language models more easily memorize training data compared with smaller models. For example, in one experiment, they report that GPT-2 XL, which contains 1.5 billion parameters -- the variables internal to the model that influence its predictions -- memorizes 10 times more information than the 124-million-parameter GPT-2.

While it's beyond the scope of the work, this second finding has implications for models like the 175-billion-parameter GPT-3, which is publicly accessible via an API. Microsoft's Turing Natural Language Generation Model, a model that powers a number of services on Azure, contains 17 billion parameters. And Facebook is using a model for translation with over 12 billion parameters.

The coauthors of the study note that it might be possible to mitigate memorization somewhat through the use of differential privacy, which allows training on a dataset without revealing any details of individual training examples. But even differential privacy has limitations and won't prevent memorization of content that's repeated often enough.
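One common way to train with differential privacy is a DP-SGD-style update: clip each example's gradient so no single example can dominate, then add Gaussian noise before averaging. The sketch below illustrates that general idea only. It is not taken from the study, the function and parameter names are hypothetical, and it omits the privacy accounting a real system would need.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip per-example gradients, sum them, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scaled to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(clipped) for x in noisy]

# Hypothetical per-example gradients; the first is an outlier that gets clipped.
grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)
print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Clipping bounds the influence of any one example, but a string that appears in many examples contributes many clipped gradients pointing the same way. That is consistent with the limitation noted above for content that is repeated often enough.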

"Language models continue to demonstrate great utility and flexibility -- yet, like all innovations, they can also pose risks. Developing them responsibly means proactively identifying those risks and developing ways to mitigate them," Google research scientist Nicholas Carlini wrote in a blog post. "Given that the research community has already trained models 10 to 100 times larger, this means that as time goes by, more work will be required to monitor and mitigate this problem in increasingly large language models ... The fact that these attacks are possible has important consequences for the future of machine learning research using these types of models."

Beyond leaking sensitive information, language models remain problematic in that they amplify the biases in the data on which they were trained. Often, a portion of the training data is sourced from communities with pervasive gender, race, and religious prejudices. AI research firm OpenAI notes that this can lead to placing words like "naughty" or "sucked" near female pronouns and "Islam" near words like "terrorism." Other studies, like one published in April by researchers at Intel, MIT, and the Canadian AI initiative CIFAR, have found high levels of stereotypical bias in some of the most popular models, including Google's BERT and XLNet, OpenAI's GPT-2, and Facebook's RoBERTa. This bias could be leveraged by malicious actors to foment discord by spreading misinformation, disinformation, and outright lies that "radicalize individuals into violent far-right extremist ideologies and behaviors," according to the Middlebury Institute of International Studies.

OpenAI previously said it's experimenting with safeguards at the API level including "toxicity filters" to limit harmful language from GPT-3. For instance, it hopes to deploy filters that pick up anti-Semitic content while still letting through neutral content talking about Judaism.

It remains unclear what steps might eliminate the threat of memorization, much less toxicity, sexism, and racism. But Google, for one, has shown a willingness to brush aside these ethical concerns when convenient. Last week, leading AI researcher Timnit Gebru was fired from her position on an AI ethics team at Google in what she claims was retaliation for sending colleagues an email critical of the company's managerial practices. The flashpoint was reportedly a paper Gebru coauthored that questioned the wisdom of building large language models and examined who benefits from them and who is disadvantaged.

In the draft paper, Gebru and colleagues reasonably suggest that large language models have the potential to mislead AI researchers and prompt the general public to mistake their text for something meaningful. Popular natural language benchmarks don't measure AI models' general knowledge well, studies show.

It's no secret that Google has commercial interests in conflict with the viewpoints expressed in the paper. Many of the large language models it develops power customer-facing products, including Cloud Translation API and Natural Language API. While Google CEO Sundar Pichai has apologized for the handling of Gebru's firing, the episode bodes poorly for Google's willingness to address critical issues around large language models. Time will tell if rivals, including Microsoft and Facebook, react any better.
