AI Weekly: Novel architectures could make large language models more scalable

Beginning in earnest with OpenAI's GPT-3, the focus in the field of natural language processing has turned to large language models (LLMs). LLMs -- so named for the amount of data, compute, and storage required to develop them -- are capable of impressive feats of language understanding, like generating code and writing rhyming poetry. But as an increasing number of studies point out, LLMs are impractically large for most researchers and organizations to take advantage of. Not only that, but they consume an amount of power that calls into question whether they're sustainable to use over the long run.

New research suggests that this needn't be the case forever, though. In a recent paper, Google introduced the Generalist Language Model (GLaM), which the company claims is one of the most efficient LLMs of its size and type. Although GLaM contains 1.2 trillion parameters -- nearly seven times the number in GPT-3 (175 billion) -- Google says that it improves across popular language benchmarks while using "significantly" less computation during inference.

"Our large-scale ... language model, GLaM, achieves competitive results on zero-shot and one-shot learning and is a more efficient model than prior monolithic dense counterparts," the Google researchers behind GLaM wrote in a blog post. "We hope that our work will spark more research into compute-efficient language models."

Sparsity vs. density

In machine learning, parameters are the part of the model that's learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well. DeepMind's recently detailed Gopher model has 280 billion parameters, while Microsoft and Nvidia's Megatron 530B boasts 530 billion. Both are among the top -- if not the top -- performers on key natural language benchmark tasks, including text generation.

But training a model like Megatron 530B requires hundreds of GPU- or accelerator-equipped servers and millions of dollars. It's also bad for the environment. GPT-3 alone used 1,287 megawatt-hours during training and produced 552 metric tons of carbon dioxide emissions, a Google study found. That's roughly equivalent to the yearly emissions of 58 homes in the U.S.

What makes GLaM different from most LLMs to date is its "mixture of experts" (MoE) architecture. An MoE can be thought of as having different layers of "submodels," or experts, specialized for different text. The experts in each layer are controlled by a "gating" component that taps the experts based on the text. For a given word or part of a word, the gating component selects the two most appropriate experts to process the word or word part and make a prediction (e.g., generate text).
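The routing step described above can be sketched in a few lines of Python. This is a toy illustration, not Google's implementation: the function names, the four stand-in experts, the hand-picked gate weights, and the choice to renormalize the two selected weights so they sum to 1 are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(gate_scores):
    """Pick the two highest-scoring experts and renormalize their weights.

    Renormalizing over the selected pair is an assumption of this sketch.
    """
    probs = softmax(gate_scores)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]

def moe_layer(token, experts, gate_weights):
    """Run one token vector through a toy mixture-of-experts layer."""
    # One gate score per expert: dot product of that expert's gate row and the token.
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    routes = top2_route(scores)
    output = [0.0] * len(token)
    for index, weight in routes:
        expert_out = experts[index](token)  # only the two chosen experts run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, routes

# Hypothetical setup: 4 experts instead of the 64 per layer the article cites.
NUM_EXPERTS = 4
experts = [lambda v, k=k: [(k + 1) * x for x in v] for k in range(NUM_EXPERTS)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

token = [2.0, 1.0]
output, routes = moe_layer(token, experts, gate_weights)
print("experts used:", [index for index, _ in routes])
print("output:", output)
```

The point of the sketch is that the two unselected experts are never called, which is where a sparse model saves computation relative to a dense one.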

The full version of GLaM has 64 experts per MoE layer with 32 MoE layers in total, but uses only a subnetwork of 97 billion (8% of 1.2 trillion) parameters per word or word part during processing. "Dense" models like GPT-3 use all of their parameters for processing, significantly increasing the computational -- and financial -- requirements. For example, Nvidia says that processing with Megatron 530B can take over a minute on a CPU-based on-premises server. It takes half a second on two Nvidia-designed DGX systems, but just one of those systems can cost $7 million to $60 million.
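As a quick check on those figures, the arithmetic can be worked through directly. The snippet below uses only the parameter counts quoted in this article; the variable names are made up for illustration.

```python
# Parameter counts as quoted in the article.
GLAM_TOTAL_PARAMS = 1.2e12   # 1.2 trillion
GLAM_ACTIVE_PARAMS = 97e9    # 97 billion used per word or word part
GPT3_PARAMS = 175e9          # 175 billion, all used (dense model)

# Share of GLaM's parameters that are active for a given word or word part.
active_fraction = GLAM_ACTIVE_PARAMS / GLAM_TOTAL_PARAMS

# GLaM's active subnetwork compared with GPT-3's full parameter count.
active_vs_gpt3 = GLAM_ACTIVE_PARAMS / GPT3_PARAMS

print(f"active share of GLaM: {active_fraction:.1%}")
print(f"active GLaM params relative to GPT-3: {active_vs_gpt3:.2f}x")
```

In other words, although GLaM is far larger than GPT-3 overall, the slice of it doing work on any one word is smaller than GPT-3's full parameter set.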

GLaM isn't perfect -- it exceeds or is on par with the performance of a dense LLM in between 80% and 90% (but not all) of tasks. And GLaM uses more computation during training, because it trains on a dataset with more words and word parts than most LLMs. (Versus the billions of words from which GPT-3 learned language, GLaM ingested a dataset that was initially over 1.6 trillion words in size.) But Google claims that GLaM uses less than half the power needed to train GPT-3, at 456 megawatt-hours (MWh) versus 1,287 MWh. For context, a single megawatt is enough to power around 796 homes for a year.
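The "less than half" claim is easy to verify from the reported numbers. The snippet below is a simple check that takes GPT-3's training energy from the Google study cited earlier in this article; the variable names are illustrative.

```python
# Training energy in megawatt-hours, as reported in the article.
GLAM_TRAINING_MWH = 456
GPT3_TRAINING_MWH = 1287  # figure from the Google study cited earlier

ratio = GLAM_TRAINING_MWH / GPT3_TRAINING_MWH
savings_mwh = GPT3_TRAINING_MWH - GLAM_TRAINING_MWH

print(f"GLaM used {ratio:.0%} of GPT-3's training energy")
print(f"difference: {savings_mwh} MWh")
```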

"GLaM is yet another step in the industrialization of large language models. The team applies and refines many modern tweaks and advancements to improve the performance and inference cost of this latest model, and comes away with an impressive feat of engineering," Connor Leahy, a data scientist at EleutherAI, an open AI research collective, told VentureBeat. "Even if there is nothing scientifically groundbreaking in this latest model iteration, it shows just how much engineering effort companies like Google are throwing behind LLMs."

Future work

GLaM, which builds on Google's own Switch Transformer, a trillion-parameter MoE detailed in January, follows on the heels of other techniques to improve the efficiency of LLMs. A separate team of Google researchers has proposed fine-tuned language net (FLAN), a model that bests GPT-3 "by a large margin" on a number of challenging benchmarks despite being smaller (and more energy-efficient). DeepMind claims that another of its language models, Retro, can beat LLMs 25 times its size, thanks to an external memory that allows it to look up passages of text on the fly.

Of course, efficiency is just one hurdle to overcome where LLMs are concerned. Following similar investigations by AI ethicists Timnit Gebru and Margaret Mitchell, among others, DeepMind last week highlighted a few of the problematic tendencies of LLMs, which include perpetuating stereotypes, using toxic language, leaking sensitive information, providing false or misleading information, and performing poorly for minority groups.

Solutions to these problems aren't immediately forthcoming. But the hope is that architectures like MoE (and perhaps GLaM-like models) will make LLMs more accessible to researchers, enabling them to investigate potential ways to fix -- or at the least, mitigate -- the worst of the issues.

For AI coverage, send news tips to Kyle Wiggers — and be sure to subscribe to the AI Weekly newsletter and bookmark our AI channel, The Machine.

Thanks for reading,

Kyle Wiggers

AI Staff Writer
