The limitations of scaling up AI language models

Large language models like OpenAI's GPT-3 show an aptitude for generating humanlike text and code, automatically writing emails and articles, composing poetry, and fixing bugs in software. But the dominant approach to developing these models involves leveraging massive computational resources, which has consequences. Beyond the fact that training and deploying large language models can incur high technical costs, the requirements put the models beyond the reach of many organizations and institutions. Scaling also doesn't resolve the major problem of model bias and toxicity, which often creeps in from the data used to train the models.

In a panel during the Conference on Neural Information Processing Systems (NeurIPS) 2021, experts from the field discussed how the research community should adapt as progress in language models continues to be driven by scaled-up algorithms. The panelists explored how to ensure that smaller institutions can meaningfully research and audit large-scale systems, as well as ways that they can help to ensure that the systems behave as intended.

Melanie Mitchell, a professor of computer science at Santa Fe Institute, raised the point that it's difficult to hold large language models to the same norms of reproducibility as other, smaller types of AI systems. AI already had a reproducibility problem -- studies often provide benchmark results in lieu of source code, which becomes problematic when the thoroughness of the benchmarks is called into question. But the vast computation required to test large language models threatens to exacerbate the problem, particularly as the models in question double, triple, or even quadruple in size.

In an illustration of the challenge of working with large language models, Nvidia recently open-sourced Megatron-Turing Natural Language Generation (MT-NLG), one of the world's largest language models with 530 billion parameters. In machine learning, parameters are the parts of the model that are learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well. The model was originally trained across 560 Nvidia DGX A100 servers, each hosting 8 Nvidia A100 80GB GPUs. Microsoft and Nvidia say that they observed between 113 and 126 teraflops (a measure of performance) per GPU while training MT-NLG, which would put the training cost in the millions of dollars.
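To give a sense of the scale involved, the hardware figures quoted above can be multiplied out. The short Python sketch below is only back-of-the-envelope arithmetic on the numbers in this article. The function and variable names are illustrative, and it makes no attempt to estimate training time or dollar cost, which would depend on details not given here.

```python
# Back-of-the-envelope arithmetic using only the figures quoted in the article.
# Names are illustrative; nothing here comes from Nvidia or Microsoft tooling.

SERVERS = 560              # Nvidia DGX A100 servers
GPUS_PER_SERVER = 8        # A100 80GB GPUs per server
TFLOPS_PER_GPU_LOW = 113   # lower bound of observed per-GPU throughput
TFLOPS_PER_GPU_HIGH = 126  # upper bound of observed per-GPU throughput


def cluster_throughput_pflops(servers: int, gpus_per_server: int, tflops_per_gpu: float) -> float:
    """Aggregate throughput in petaflops, assuming every GPU sustains the same rate."""
    total_tflops = servers * gpus_per_server * tflops_per_gpu
    return total_tflops / 1000  # 1 petaflop = 1,000 teraflops


total_gpus = SERVERS * GPUS_PER_SERVER
low = cluster_throughput_pflops(SERVERS, GPUS_PER_SERVER, TFLOPS_PER_GPU_LOW)
high = cluster_throughput_pflops(SERVERS, GPUS_PER_SERVER, TFLOPS_PER_GPU_HIGH)

print(f"GPUs in the cluster: {total_gpus}")
print(f"Aggregate throughput: {low:.2f} to {high:.2f} petaflops")
```

Under the simplifying assumption that every GPU sustains the quoted rate, that works out to 4,480 GPUs delivering roughly 506 to 564 petaflops in aggregate, well beyond what most academic labs can access.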

Even OpenAI -- which has hundreds of millions of dollars in funding from Microsoft -- struggles with this. The company didn't fix a mistake it made when implementing GPT-3, a language model with fewer than half as many parameters as MT-NLG, because the cost of training made retraining the model infeasible.

"Often, people at machine learning conferences will give results like, 'new numbers of parameters in our system yielded this new performance on this benchmark,' but it's really hard to understand exactly why [the system achieves this]," Mitchell said. "It brings up the difficulty of doing science with these systems ... Most people in academia don't have the compute resources to do the kind of science that's needed."

However, even with the necessary compute resources, benchmarking large language models isn't a solved problem. Some experts assert that popular benchmarks do a poor job of estimating real-world performance and fail to take into account the broader ethical, technical, and societal implications. For example, one recent study found that 60% to 70% of answers given by natural language processing models were embedded somewhere in the benchmark training sets, suggesting that the models were memorizing answers rather than generalizing.
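One simple way to picture this kind of overlap is to check how many of a benchmark's test answers appear verbatim in its training text. The Python sketch below is a toy illustration of that idea, not the method used in the study mentioned above. The function names and the sample data are hypothetical, and real contamination audits use more careful matching.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def answer_overlap_rate(test_answers, training_texts) -> float:
    """Fraction of non-empty test answers that appear verbatim in the training text."""
    # Pad with spaces so that matches respect word boundaries.
    corpus = " " + " ".join(normalize(t) for t in training_texts) + " "
    answers = [normalize(a) for a in test_answers]
    answers = [a for a in answers if a]  # ignore empty answers
    if not answers:
        return 0.0
    hits = sum(1 for a in answers if f" {a} " in corpus)
    return hits / len(answers)


# Hypothetical toy data, for illustration only.
training_texts = [
    "The Eiffel Tower is in Paris, France.",
    "Water boils at 100 degrees Celsius.",
]
test_answers = ["Paris", "100 degrees Celsius", "Mount Everest"]

rate = answer_overlap_rate(test_answers, training_texts)
print(f"{rate:.0%} of test answers appear in the training text")
```

A high overlap rate on a real benchmark would not prove memorization on its own, but it would signal that a strong score might reflect recall of training data more than generalization.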

"[The] ways that we measure performance of these systems needs to be expanded ... When the benchmarks are changed a little bit, they [often] don't generalize well," Mitchell continued. "So I think the ways that we probe the systems and the ways that we measure their performance has to be a big issue in this entire field, and that we have to spend more time on that."

Constraints breed creativity

Joelle Pineau, co-managing director at Meta AI Research, Meta's (formerly Facebook) AI research division, questioned what kind of scientific knowledge can be gained from simply scaling large language models. To her point, the successor to GPT-3 will reportedly contain around 100 trillion parameters, but in a research paper published this week, Alphabet's DeepMind detailed a language model -- RETRO -- that it claims can beat others 25 times its size by using "external memory" techniques.

In fact, being resource-constrained can lead to novel solutions with implications beyond the problem they were originally created to solve. DeepMind research scientist Oriol Vinyals made the point that the Transformer, an AI architecture that has gained considerable attention within the last several years, came about in search of a more resource-efficient way to develop natural language systems. Since its introduction in 2017, the Transformer has become the architecture of choice for natural language tasks and has demonstrated an aptitude for summarizing documents, composing music, translating between languages, analyzing DNA sequences, and more.

These solutions could potentially touch on bias, a perennial concern in natural language processing. As another DeepMind work spotlights, large language models can perpetuate stereotypes and harm disadvantaged groups by performing poorly for them. Moreover, these models can provide false or misleading information, or outright disinformation, undermining trust.

"I would add that one of the dangers of these models is that people give them too much credit," Mitchell said. "They sound really human and they can do all these things, and so people -- not just general public, but also AI researchers themselves -- sort of anthropomorphize them too much ... and perhaps are allowing people to use them in ways that they shouldn't necessarily be used. [W]e should emphasize not only [the] capabilities [of large language models], but their limits."