Microsoft and Nvidia team up to train one of the world's largest language models

Microsoft and Nvidia today announced that they trained what they claim is the largest and most capable AI-powered language model to date: Megatron-Turing Natural Language Generation (MT-NLG). The successor to the companies' Turing NLG 17B and Megatron-LM models, MT-NLG contains 530 billion parameters and achieves "unmatched" accuracy in a broad set of natural language tasks, Microsoft and Nvidia say — including reading comprehension, commonsense reasoning, and natural language inferences.

"The quality and results that we have obtained today are a big step forward in the journey towards unlocking the full promise of AI in natural language. The innovations of DeepSpeed and Megatron-LM will benefit existing and future AI model development and make large AI models cheaper and faster to train," Nvidia's senior director of product management and marketing for accelerated computing, Paresh Kharya, and group program manager for the Microsoft Turing team, Ali Alvi wrote in a blog post. "We look forward to how MT-NLG will shape tomorrow’s products and motivate the community to push the boundaries of natural language processing (NLP) even further. The journey is long and far from complete, but we are excited by what is possible and what lies ahead."

Training massive language models

In machine learning, parameters are the part of the model that’s learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well. Language models with large numbers of parameters, more data, and more training time have been shown to acquire a richer, more nuanced understanding of language, for example gaining the ability to summarize books and even complete programming code.

To train MT-NLG, Microsoft and Nvidia say that they created a training dataset with 270 billion tokens from English-language websites. Tokens, a way of separating pieces of text into smaller units in natural language, can either be words, characters, or parts of words. Like all AI models, MT-NLG had to "train" by ingesting a set of examples to learn patterns among data points, like grammatical and syntactical rules.

The dataset largely came from The Pile, an 835GB collection of 22 smaller datasets created by the open source AI research effort EleutherAI. The Pile spans academic sources (e.g., Arxiv, PubMed), communities (StackExchange, Wikipedia), code repositories (Github), and more, which Microsoft and Nvidia say they curated and combined with filtered snapshots of the Common Crawl, a large collection of webpages including news stories and social media posts.

Training took place across 560 Nvidia DGX A100 servers, each containing 8 Nvidia A100 80GB GPUs.

When benchmarked, Microsoft says that MT-NLG can infer basic mathematical operations even when the symbols are "badly obfuscated." While not extremely accurate, the model seems to go beyond memorization for arithmetic and manages to complete tasks containing questions that prompt it for an answer, a major challenge in NLP.

It’s well-established that models like MT-NLG can amplify the biases in data on which they were trained, and indeed, Microsoft and Nvidia acknowledge that the model "picks up stereotypes and biases from the [training] data." That's likely because a portion of the dataset was sourced from communities with pervasive gender, race, physical, and religious prejudices, which curation can't completely address.

In a paper, the Middlebury Institute of International Studies’ Center on Terrorism, Extremism, and Counterterrorism claim that GPT-3 and similar models can generate "informational" and "influential" text that might radicalize people into far-right extremist ideologies and behaviors. A group at Georgetown University has used GPT-3 to generate misinformation, including stories around a false narrative, articles altered to push a bogus perspective, and tweets riffing on particular points of disinformation. Other studies, like one published by Intel, MIT, and Canadian AI initiative CIFAR researchers in April, have found high levels of stereotypical bias from some of the most popular open source models, including Google’s BERT, XLNet, and Facebook’s RoBERTa.

Microsoft and Nvidia claim that they're "committed to working on addressing [the] problem" and encourage "continued research to help in quantifying the bias of the model." They also say that any use of Megatron-Turing in production "must ensure that proper measures are put in place to mitigate and minimize potential harm to users," and follow tenets such as those outlined in Microsoft's Responsible AI Principles.

"We live in a time [when] AI advancements are far outpacing Moore’s law. We continue to see more computation power being made available with newer generations of GPUs, interconnected at lightning speeds. At the same time, we continue to see hyper-scaling of AI models leading to better performance, with seemingly no end in sight," Kharya and Alvi continued. "Marrying these two trends together are software innovations that push the boundaries of optimization and efficiency."

The cost of large models

Projects like MT-NLG, AI21 Labs' Jurassic-1, Huawei's PanGu-Alpha, Naver's HyperCLOVA, and the Beijing Academy of Artificial Intelligence's Wu Dao 2.0 are impressive from an academic standpoint, but building them doesn't come cheap. For example, the training dataset for OpenAI's GPT-3 -- one of the world's largest language models — was 45 terabytes in size, enough to fill 90 500GB hard drives.

AI training costs dropped 100-fold between 2017 and 2019, according to one source, but the totals still exceed the compute budgets of most startups. The inequity favors corporations with extraordinary access to resources at the expense of small-time entrepreneurs, cementing incumbent advantages.

For example, OpenAI's GPT-3 required an estimated 3.1423^23 floating-point operations per second (FLOPS) of compute during training. In computer science, FLOPS is a measure of raw processing performance, typically used to compare different types of hardware. Assuming OpenAI reserved 28 teraflops -- 28 trillion floating-point operations per second — of compute across a bank of Nvidia V100 GPUs, a common GPU available through cloud services, it'd take $4.6 million for a single training run. One Nvidia RTX 8000 GPU with 15 teraflops of compute would be substantially cheaper — but it'd take 665 years to finish the training.

Microsoft and Nvidia says that it observed between 113 to 126 teraflops per second per GPU while training MT-NLG. The cost is likely to have been in the millions of dollars.

A Synced report estimated that a fake news detection model developed by researchers at the University of Washington cost $25,000 to train, and Google spent around $6,912 to train a language model called BERT that it used to improve the quality of Google Search results. Storage costs also quickly mount when dealing with datasets at the terabyte — or petabyte — scale. To take an extreme example, one of the datasets accumulated by Tesla's self-driving team — 1.5 petabytes of video footage — would cost over $67,500 to store in Azure for three months, according to CrowdStorage.

The effects of AI and machine learning model training on the environment have also been brought into relief. In June 2020, researchers at the University of Massachusetts at Amherst released a report estimating that the amount of power required for training and searching a certain model involves the emissions of roughly 626,000 pounds of carbon dioxide, equivalent to nearly five times the lifetime emissions of the average U.S. car. OpenAI itself has conceded that models like Codex require significant amounts of compute — on the order of hundreds of petaflops per day — which contributes to carbon emissions.

In a sliver of good news, the cost for FLOPS and basic machine learning operations has been falling over the past few years. A 2020 OpenAI survey found that since 2012, the amount of compute needed to train a model to the same performance on classifying images in a popular benchmark — ImageNet — has been decreasing by a factor of two every 16 months. Other recent research suggests that large language models aren’t always more complex than smaller models, depending on the techniques used to train them.

Maria Antoniak, a natural language processing researcher and data scientist at Cornell University, says when it comes to natural language, it’s an open question whether larger models are the right approach. While some of the best benchmark performance scores today come from large datasets and models, the payoff from dumping enormous amounts of data into models is uncertain.

"The current structure of the field is task-focused, where the community gathers together to try to solve specific problems on specific datasets," Antoniak told VentureBeat in a previous interview. "These tasks are usually very structured and can have their own weaknesses, so while they help our field move forward in some ways, they can also constrain us. Large models perform well on these tasks, but whether these tasks can ultimately lead us to any true language understanding is up for debate."

Training massive language models

The cost of large models

More