Researchers detail systemic issues and risk to society in language models

Researchers at Google's DeepMind have discovered major flaws in the output of large language models like GPT-3 and warn these could have serious consequences for society, like enabling deception and reinforcing bias. Notably, coauthors of a paper on the study say large language models can spread harms without malicious intent on the creators' part. In other words, these harms can be spread accidentally, through incorrect specification of either the data an agent should learn from or the model training process.

"We believe that language agents carry a high risk of harm, as discrimination is easily perpetuated through language. In particular, they may influence society in a way that produces value lock-in, making it harder to challenge problematic existing norms," the paper reads. "We currently don't have many approaches for fixing these forms of misspecification and the resulting behavioral issues."

The paper supposes that language agents could also enable "incitement to violence" and other forms of societal harm, particularly by people with political motives. The agents could also be used to spread dangerous information, like how to make weapons or avoid paying taxes. In a prime example from work published last fall, GPT-3 tells a person to commit suicide.

The DeepMind paper is the most recent study to raise concerns about the consequences of deploying large language models made with datasets scraped from the web. The best known paper on this subject is titled "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" and was published last month at the Fairness, Accountability, and Transparency conference by authors that include former Google Ethical AI team co-leads Margaret Mitchell and Timnit Gebru. That paper asserts that language models, which keep growing in size, perpetuate stereotypes and carry environmental costs most likely to be borne by marginalized groups.

While Google fired both of its researchers who chose to keep their names on the paper and required other Google research scientists to remove their names from a paper that reached a similar conclusion, the DeepMind research cites the stochastic parrots paper among related works.

Earlier this year, a paper from OpenAI and Stanford University researchers detailed a meeting between experts from fields like computer science, political science, and philosophy. The group concluded that companies like Google and OpenAI, which control the largest known language models in the world, have only a matter of months to set standards around the ethical use of the technology before it's too late.

The DeepMind paper joins a series of works that highlight NLP shortcomings. In late March, nearly 30 businesses and universities from around the world found major issues in an audit of five popular multilingual datasets used for machine translation.

A paper written about that audit found that in a significant fraction of the major dataset portions evaluated, less than 50% of the sentences were of acceptable quality, according to more than 50 volunteers from the NLP community.

Businesses and organizations listed as coauthors of that paper include Google and Intel Labs, and the coauthors come from China, Europe, the United States, and multiple nations in Africa. Universities among them include Sorbonne University (France), the University of Waterloo (Canada), and the University of Zambia. Major open source advocates also participated, like EleutherAI, which is working to replicate GPT-3; Hugging Face; and the Masakhane project to produce machine translation for African languages.

Consistent issues with mislabeled data arose during the audit, and volunteers found that a scan of 100 sentences in many languages could reveal serious quality issues, even to people who aren't proficient in the language.

"We rated samples of 205 languages and found that 87 of them had under 50% usable data," the paper reads. "As the scale of ML research grows, it becomes increasingly difficult to validate automatically collected and curated datasets."

The paper also finds that building NLP models with datasets automatically drawn from the internet holds promise, especially in resolving issues encountered by low-resource languages, but there's very little research today about data collected automatically for low-resource languages. The authors suggest a number of solutions, like the kind of documentation recommended in the stochastic parrots paper or standard forms of review, like the datasheets and model cards Gebru prescribed or the dataset nutrition label framework.

In other news, researchers from Amazon, ChipBrain, and MIT found that test sets of the 10 most frequently cited datasets used by AI researchers have an average label error rate of 3.4%, impacting benchmark results.

This week, organizers of NeurIPS, the world's largest machine learning conference, announced plans to create a new track devoted to benchmarks and datasets. A blog post announcing the news begins with the simple declaration that "There are no good models without good data."

Last month, the 2021 AI Index, an annual report that attempts to define trends in academia, business, policy, and system performance, found that AI is industrializing rapidly. But it named a lack of benchmarks and testing methods as major impediments to progress for the artificial intelligence community.
