AI ethics pioneer's exit from Google involved research into risks and inequality in large language models

Following a dispute over several emails and a research paper on Wednesday, AI ethics pioneer and research scientist Timnit Gebru no longer works at Google. According to a draft copy obtained by VentureBeat, the research paper surrounding her exit questions the wisdom of building large language models and examines who benefits from them, who is impacted by negative consequences of their deployment, and whether there is such a thing as a language model that's too big.

Gebru's research has been hugely influential on the subjects of algorithmic fairness, bias, and facial recognition. In an email to Google researchers on Thursday, Google AI chief Jeff Dean said he accepted Gebru's resignation following a disagreement about the paper, but Gebru said she never offered to resign.

"Most language technology is in fact built first and foremost to serve the needs of those who already have the most privilege in society," the paper reads. "A methodology that relies on datasets too large to document is therefore inherently risky. While documentation allows for potential accountability, similar to how we can hold authors accountable for their produced text, undocumented training data perpetuates harm without recourse. If the training data is considered too large to document, one cannot try to understand its characteristics in order to mitigate some of these documented issues or even unknown ones."

In the paper, titled "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?," the authors say the risks of deploying large language models range from environmental racism, as AI's carbon footprint impacts marginalized communities more than others, to the way models absorb a "hegemonic world view from the training data." There's also the risk that AI can perpetuate abusive language, hate speech, microaggressions, stereotypes, and other dehumanizing forms of language aimed at specific groups of people.

Another consequence is that costs associated with training large language models can create a barrier to entry for deep learning research. The scale also increases the chance that people will trust predictions made by language models without questioning the results.

Authors include Google AI co-lead Meg Mitchell and Google researchers Ben Hutchinson, Mark Diaz, and Vinodkumar Prabhakaran, as well as University of Washington Ph.D. student Angelina McMillan-Major.

Gebru is listed first among the paper's authors, alongside University of Washington linguist Emily Bender. A teacher of an NLP ethics course, Bender coauthored a paper that won an award from the Association for Computational Linguistics. That paper urged NLP researchers to question the hype around the idea that large language models are capable of understanding. In an interview with VentureBeat, she stressed the need for better testing methods and lamented a culture in language model research that overfits models to benchmark tasks, a pursuit she says can stand in the way of "good science."

On Thursday, more than 230 Googlers and over 200 supporters from academia, industry, and civil society signed a letter with a series of demands. These include a transparent evaluation, for the general public and Google users, of who was involved in determining that Bender and Gebru should withdraw their research.

"This has become a matter of public concern, and there needs to be public accountability to ensure any trust in Google Research going forward," the letter reads.

By Friday morning, nearly 800 Googlers and more than 1,100 supporters from academia, industry, and civil society had signed the letter.

Dean was critical of the paper in an email to Google researchers Thursday and said a review process found that the paper "ignored too much relevant research" about large language models and did not take into account recent research into mitigation of bias in language models.

The trend toward language models with more parameters and more training data was triggered by a move toward the Transformer architecture and the use of massive amounts of text scraped from the web or from sites like Reddit or Wikipedia.

Google's BERT and variations like ALBERT and XLNet led the way in that trend, alongside models like Nvidia's Megatron and OpenAI's GPT-2 and GPT-3. Whereas Google's BERT had 340 million parameters, Megatron has 8.3 billion parameters; Microsoft's T-NLG has 17 billion parameters; and GPT-3, which was introduced in May by OpenAI and is the largest language model to date, has 175 billion parameters. With increased size, large models achieved higher scores in tasks like question answering and reading comprehension.

But numerous studies have found forms of bias in large pretrained language models. This spring, for example, NLP researchers introduced the StereoSet dataset, benchmark, and leaderboard, and found that virtually all popular pretrained language models today exhibit bias based on ethnicity, race, and gender.

Coauthors suggest language models be evaluated based on other metrics -- including energy efficiency and the estimated CO2 emissions involved with training a model -- rather than evaluating performance on a series of tasks using benchmarks like GLUE.

The researchers argue that large pretrained language models also have the potential to mislead AI researchers and to prompt the general public to mistake text generated by language models like OpenAI's GPT-3 for meaningful language.

"If a large language model, endowed with hundreds of billions of parameters and trained on a very large dataset, can manipulate linguistic form well enough to cheat its way through tests meant to require language understanding, have we learned anything of value about how to build machine language understanding or have we been led down the garden path?" the paper reads. "In summary, we advocate for an approach to research that centers the people who stand to be affected by the resulting technology, with a broad view on the possible ways that technology can affect people."

The paper recommends solutions like working with impacted communities, value sensitive design, improved data documentation, and adoption of frameworks such as Bender's data statements for NLP or the datasheets for datasets approach Gebru coauthored while at Microsoft Research.

A McKinsey survey of business leaders conducted earlier this year found that little progress has been made in mitigating 10 major risks associated with deploying AI models.

Criticism of large models trained using massive datasets scraped from the web has been a marked AI research trend in 2020.

In computer vision, an audit released this summer of 80 Million Tiny Images, a large image dataset, revealed the inclusion of racist, sexist, and pornographic content. Instead of taking recommended steps to change the dataset, creators from MIT and NYU opted to stop using it and delete existing copies.

Last month, researchers analyzed papers published at conferences and found that elite universities and Big Tech companies enjoy a competitive advantage in the age of deep learning that has created a compute divide concentrating power in the hands of a few and accelerating inequality.

Roughly one year ago, Stanford professor emeritus of computer science Yoav Shoham questioned the brittle nature of language models that demonstrate quick advancements in benchmark tests.

"The thing is these are highly specialized tasks and domains, and as soon as you go out of domain, the performance drops dramatically and the committee knows it," Shoham told VentureBeat in December 2019. "There's a lot to be excited about genuinely, including all these systems that I mentioned, but we're quite far away from human level understanding of language right now."

Update Dec. 4 at 8:23 a.m.

Correction: This story initially stated that Emily Denton was a coauthor of this paper. However, Emily Bender was a coauthor. We regret any confusion this error may have caused.
