At its fall 2021 GPU Technology Conference (GTC) today, Nvidia announced that it’s making Megatron 530B, one of the world’s largest language models, available to enterprises for training to serve new domains and languages. First detailed in early October, Megatron 530B — also known as Megatron-Turing Natural Language Generation (MT-NLG) — contains 530 billion parameters and achieves high accuracy in a broad set of natural language tasks, including reading comprehension, commonsense reasoning, and natural language inference.
“Today, we provide recipes for customers to build, train, and customize large language models, including Megatron 530B. This includes scripts, code, and 530B untrained model. Customers can start from smaller models and scale up to larger models as they see fit,” Nvidia VP of AI software product management Kari Briski told VentureBeat via email. “Our researchers [worked] together with Microsoft [to train] the Megatron 530B model in six weeks.”
In machine learning, parameters are the part of the model that’s learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well. Language models with large numbers of parameters, more data, and more training time have been shown to acquire a richer, more nuanced understanding of language, for example gaining the ability to summarize books and even complete programming code.
To train Megatron 530B, Nvidia, in partnership with Microsoft, created a training dataset with 270 billion tokens taken from English-language websites. Tokens are the smaller units into which text is split for processing; they can be words, characters, or parts of words. Like all AI models, Megatron 530B had to “train” by ingesting a set of examples to learn patterns among data points, such as basic grammatical and syntactical rules.
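To make the three token granularities concrete, here is a minimal Python sketch. It is an illustration only, not the tokenizer used for Megatron 530B: the function names and the tiny `TOY_VOCAB` are hypothetical, and production tokenizers typically learn their vocabulary of word pieces from data rather than hard-coding it.

```python
def word_tokens(text):
    """Split on whitespace: one token per word."""
    return text.split()


def char_tokens(text):
    """One token per character (spaces included)."""
    return list(text)


# Hypothetical toy vocabulary of word pieces, for illustration only.
TOY_VOCAB = {"token", "iz", "ation", "train", "ing"}


def subword_tokens(word, vocab=TOY_VOCAB):
    """Greedy longest-match split of a word into known pieces.

    Any character not covered by the vocabulary becomes its own token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary
        # or only a single character is left.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


if __name__ == "__main__":
    print(word_tokens("training large models"))
    print(char_tokens("AI"))
    print(subword_tokens("tokenization"))
```

Splitting into parts of words lets a fixed vocabulary cover words it has never seen whole, which is one reason this granularity is common in large language models.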
The dataset largely came from The Pile, an 825GB collection of 22 smaller datasets created by the open source AI research effort EleutherAI. The Pile spans academic sources (e.g., arXiv, PubMed), communities (StackExchange, Wikipedia), code repositories (GitHub), and more, which Microsoft and Nvidia say they curated and combined with filtered snapshots of the Common Crawl, a large collection of webpages including news stories and social media posts.
When benchmarked, Nvidia says that Megatron 530B can infer basic mathematical operations even when the symbols are “badly obfuscated.” While not extremely accurate, the model seems to go beyond memorization for arithmetic and manages to complete tasks containing questions that prompt it for an answer — a major challenge in NLP.
“Customers are eager to invest in large language models for their capabilities on generalized AI with few-shot learning and the ability to excel in many tasks at the same time,” Briski said. “When it comes to conversational AI, this general approach is very exciting for use cases like open domain chat bots, document summarization, text generation, and so on … Megatron 530B is being used internally by Nvidia.”
Training and usage challenges
Given the sheer size of Megatron 530B, training and deploying it into production aren’t easy feats, even for enterprises with massive resources. The model was originally trained across 560 Nvidia DGX A100 servers, each hosting 8 Nvidia A100 80GB GPUs. Microsoft and Nvidia say that they observed between 113 and 126 teraflops per GPU while training Megatron 530B, which would put the training cost in the millions of dollars. (A teraflop is one trillion floating-point operations per second, a common measure of the performance of hardware including GPUs.)
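The scale implied by those figures can be worked out with simple arithmetic. The Python sketch below multiplies out the server and GPU counts and the reported per-GPU throughput; it assumes every GPU sustained the reported rate at the same time, which is a simplification, and the constant and function names are illustrative.

```python
# Figures quoted in the article.
SERVERS = 560              # DGX A100 servers
GPUS_PER_SERVER = 8        # A100 80GB GPUs per server
TFLOPS_PER_GPU_LOW = 113   # lower bound of observed per-GPU throughput
TFLOPS_PER_GPU_HIGH = 126  # upper bound of observed per-GPU throughput


def total_gpus(servers=SERVERS, gpus_per_server=GPUS_PER_SERVER):
    """Total number of GPUs across all servers."""
    return servers * gpus_per_server


def aggregate_petaflops(tflops_per_gpu, gpus):
    """Combined throughput in petaflops, assuming every GPU
    sustains the same rate simultaneously (a simplification)."""
    return tflops_per_gpu * gpus / 1000


if __name__ == "__main__":
    gpus = total_gpus()
    low = aggregate_petaflops(TFLOPS_PER_GPU_LOW, gpus)
    high = aggregate_petaflops(TFLOPS_PER_GPU_HIGH, gpus)
    print(f"{gpus} GPUs, roughly {low:.2f} to {high:.2f} petaflops")
```

Under that assumption, the cluster works out to 4,480 GPUs and roughly 506 to 564 petaflops of sustained throughput.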
Nvidia is pitching its DGX SuperPOD as the preferred solution. A line of servers and workstations, SuperPODs are preconfigured DGX A100 systems built using A100 GPUs and Nvidia Mellanox InfiniBand for the compute and storage fabric.
But a single SuperPOD can cost anywhere from $7 million to $60 million, depending on the size of deployment. (A single DGX A100 starts at $199,000.) Nvidia’s SuperPOD subscription service is substantially cheaper: a SuperPOD runs $90,000 per month. Considering that Megatron 530B was trained on Nvidia’s Selene supercomputer, however, which comprises four SuperPODs with 560 DGX A100 systems, the expense is beyond what most companies can afford.
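A back-of-the-envelope calculation shows the size of the bill. The Python sketch below prices the 560 DGX A100 servers cited above at the quoted starting price and totals the quoted monthly subscription rate for four SuperPODs. It is a rough estimate under stated assumptions: it ignores networking, storage, power, staffing, and any discounts, and it assumes a subscription SuperPOD is comparable in size to those in Selene. The names are illustrative.

```python
# Prices and counts quoted in the article.
DGX_A100_LIST_PRICE_USD = 199_000             # starting price of one DGX A100
DGX_SERVERS = 560                             # servers used in training
SUPERPOD_SUBSCRIPTION_USD_PER_MONTH = 90_000  # quoted subscription rate
SUPERPODS = 4                                 # SuperPODs in Selene


def hardware_list_cost(servers=DGX_SERVERS,
                       unit_price=DGX_A100_LIST_PRICE_USD):
    """Server cost at the starting list price; excludes networking,
    storage, power, and any volume discounts."""
    return servers * unit_price


def monthly_subscription_cost(superpods=SUPERPODS,
                              monthly_rate=SUPERPOD_SUBSCRIPTION_USD_PER_MONTH):
    """Monthly cost of subscribing to the given number of SuperPODs,
    assuming the quoted rate applies to each one."""
    return superpods * monthly_rate


if __name__ == "__main__":
    print(f"Servers at list price: ${hardware_list_cost():,}")
    print(f"Four SuperPODs by subscription: ${monthly_subscription_cost():,} per month")
```

At the starting list price, the servers alone come to $111.44 million, while four subscription SuperPODs would run $360,000 per month at the quoted rate.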
Even tech giants like Google parent company Alphabet have run up against budget constraints while training AI models. When Google subsidiary DeepMind’s researchers designed a model to play StarCraft II, they purposefully didn’t try multiple ways of architecting a key component because the training costs would’ve been too high. Similarly, OpenAI didn’t fix a mistake when it implemented GPT-3 — a language model with less than half as many parameters as Megatron 530B — because the cost of training made retraining the model infeasible.
Still, in a recent interview with Next Platform, Bryan Catanzaro, Nvidia’s vice president of applied deep learning research, said he thinks that it’s entirely possible a company will invest a billion dollars in compute time to train a model within the next five years. A University of Massachusetts Amherst study showed that using 2019-era approaches, training an image recognition model with a 5% error rate would cost $100 billion.
While no enterprise has yet come close, DeepMind reportedly set aside $35 million to train an AI system to learn Go. OpenAI is estimated to have spent $4.6 million to $12 million training GPT-3. And AI21 Labs, which developed a language model roughly the size of GPT-3, raised $34.5 million in venture capital before launching its commercial service.
“With models like [OpenAI’s GPT-3,] we are starting to see models that can go beyond, that can actually become more general purpose tools for solving real-world problems. It’s a step toward a more general form of AI and that justifies the investment in training these enormous language models on clusters like Selene,” Catanzaro said. “These models are so adaptable and flexible and their capabilities have been so correlated with scale we may actually see them providing several billions of dollars worth of value from a single model, so in the next five years, spending a billion in compute to train those could make sense.”
Inference — actually running the trained model — is another challenge. On two DGX systems, Nvidia claims that inference (e.g., autocompleting a sentence) with Megatron 530B only takes half a second. But it can take over a minute on a CPU-based on-premises server. While cloud alternatives might be cheaper, they’re not dramatically so — one estimate pegs the cost of running GPT-3 on a single Amazon Web Services instance at a minimum of $87,000 per year.
That’s perhaps why, with the exception of OpenAI, Microsoft, and AI21 Labs, few companies have made large trained language models available to customers via APIs. Systems like Huawei’s PanGu-Alpha, Naver’s HyperCLOVA, and the Beijing Academy of Artificial Intelligence’s Wu Dao 2.0 remain inaccessible beyond research papers and (in PanGu-Alpha’s case) GitHub repositories.
Maria Antoniak, a natural language processing researcher and data scientist at Cornell University, says when it comes to natural language, it’s an open question whether larger models are the right approach. While some of the best benchmark performance scores today come from large datasets and models, the payoff from dumping near-impractically large amounts of data into massive language models is uncertain.
“The current structure of the field is task-focused, where the community gathers together to try to solve specific problems on specific datasets,” Antoniak told VentureBeat in a previous interview. “These tasks are usually very structured and can have their own weaknesses, so while they help our field move forward in some ways, they can also constrain us. Large models perform well on these tasks, but whether these tasks can ultimately lead us to any true language understanding is up for debate.”
It’s well-established that models like Megatron 530B can amplify the biases in data on which they were trained, and indeed, Microsoft and Nvidia acknowledge that the model “picks up stereotypes and biases from the [training] data.” That’s likely because a portion of the dataset was sourced from communities with pervasive gender, race, physical, and religious prejudices, which curation can’t completely address.
Microsoft and Nvidia claim that they’re “committed to working on addressing [the] problem” and encourage “continued research to help in quantifying the bias of the model.” They also say that any use of Megatron-Turing in production “must ensure that proper measures are put in place to mitigate and minimize potential harm to users,” and follow tenets such as those outlined in Microsoft’s Responsible AI Principles.
“While giant language models are advancing the state of the art on language generation, they also suffer from issues such as bias and toxicity,” Briski added. “Understanding and removing these problems in language models is under active research by the AI community, including at Nvidia. Nvidia is committed to working on addressing this problem. We encourage continued research to help in quantifying the bias of the model.”
Among others, leading AI researcher Timnit Gebru has questioned the wisdom of building large language models — examining who benefits from them and who is harmed. While bias remains an open challenge, in a sliver of good news, the cost of basic machine learning operations has been falling over the past few years. A 2020 OpenAI survey found that since 2012, the amount of compute needed to train a model to the same performance on classifying images in a popular benchmark — ImageNet — has been decreasing by a factor of two every 16 months. Approaches like network pruning prior to training could lead to further gains.
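To see what a halving every 16 months adds up to, the Python sketch below compounds that rate over a given number of months. It simply extrapolates the stated trend and is not a reproduction of OpenAI’s analysis; the function name and the 96-month example (2012 to 2020) are illustrative assumptions.

```python
def compute_reduction_factor(months_elapsed, halving_period_months=16):
    """Factor by which required compute shrinks if it halves
    once every `halving_period_months` months."""
    return 2 ** (months_elapsed / halving_period_months)


if __name__ == "__main__":
    # 2012 to 2020 is 96 months, or six halving periods.
    factor = compute_reduction_factor(96)
    print(f"Compute needed falls by a factor of {factor:.0f}")
```

If the trend held steadily across those eight years, reaching the same ImageNet performance would take about 64 times less compute at the end of the period than at the start.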
Whether through pruning, novel hardware, or techniques like meta-learning and neural architecture search, the need for solutions for (or alternatives to) large language models is quickly becoming clear, at least if startups without the resources of large enterprises are to have a fighting chance.