AI21 Labs trains a massive language model to rival OpenAI's GPT-3

For the better part of a year, OpenAI's GPT-3 has remained among the largest AI language models ever created, if not the largest of its kind. Via an API, people have used it to automatically write emails and articles, summarize text, compose poetry and recipes, create website layouts, and generate code for deep learning in Python. But an AI lab based in Tel Aviv, Israel -- AI21 Labs -- says it's planning to release a larger model and make it available via a service, with the idea being to challenge OpenAI's dominance in the "natural language processing-as-a-service" field.

AI21 Labs, which is advised by Udacity founder Sebastian Thrun, was cofounded in 2017 by Crowdx founder Ori Goshen, Stanford University professor Yoav Shoham, and Mobileye CEO Amnon Shashua. The startup says that the largest version of its model -- called Jurassic-1 Jumbo -- contains 178 billion parameters, or 3 billion more than GPT-3 (but not more than PanGu-Alpha, HyperCLOVA, or Wu Dao 2.0). In machine learning, parameters are the part of the model that's learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well.

AI21 Labs claims that Jurassic-1 can recognize 250,000 lexical items including expressions, words, and phrases, making its vocabulary larger than that of most existing models, including GPT-3, which has a 50,000-item vocabulary. The company also claims that Jurassic-1 Jumbo's vocabulary is among the first to span "multi-word" items like named entities -- "The Empire State Building," for example -- meaning that the model might have a richer semantic representation of concepts that make sense to humans.

"AI21 Labs was founded to fundamentally change and improve the way people read and write. Pushing the frontier of language-based AI requires more than just pattern recognition of the sort offered by current deep language models," CEO Shoham told VentureBeat via email.

Scaling up

The Jurassic-1 models will be available via AI21 Labs' Studio platform, which lets developers experiment with the model in open beta to prototype applications like virtual agents and chatbots. Should developers wish to go live with their apps and serve "production-scale" traffic, they'll be able to apply for access to custom models and get their own private fine-tuned model, which they'll be able to scale in a "pay-as-you-go" cloud services model.

"Studio can serve small and medium businesses, freelancers, individuals, and researchers on a consumption-based ... business model. For clients with enterprise-scale volume, we offer a subscription-based model. Customization is built into the offering. [The platform] allows any user to train their own custom model that's based on Jurassic-1 Jumbo, but fine-tuned to better perform a specific task," Shoham said. "AI21 Labs handles the deployment, serving, and scaling of the custom models."

AI21 Labs' first product was Wordtune, an AI-powered writing aid that suggests rephrasings of text wherever users type. Meant to compete with platforms like Grammarly, Wordtune offers "freemium" pricing as well as a team offering and partner integration. But the Jurassic-1 models and Studio are much more ambitious.

Shoham says that the Jurassic-1 models were trained in the cloud with "hundreds" of distributed GPUs on an unspecified public service. Simply storing 178 billion parameters requires more than 350GB of memory -- far more than even the highest-end GPUs offer -- which required the development team to combine several strategies to make the process as efficient as possible.
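The 350GB figure can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes each parameter is stored as a 16-bit value (2 bytes) and that a single high-end GPU offers 80GB of memory; neither number comes from AI21 Labs, and the function names are illustrative only. It counts the weights alone, so the memory needed during training (gradients, optimizer state, activations) would likely be considerably higher.

```python
import math

# Assumptions (not stated by AI21 Labs): 2 bytes per parameter (half
# precision), 1 GB = 10**9 bytes, and 80 GB of memory per GPU.
PARAMETERS = 178_000_000_000
BYTES_PER_PARAMETER = 2
GPU_MEMORY_GB = 80


def weights_size_gb(parameters: int, bytes_per_parameter: int) -> float:
    """Memory needed to hold the model weights alone, in gigabytes."""
    return parameters * bytes_per_parameter / 1e9


def min_devices(total_gb: float, per_device_gb: float) -> int:
    """Smallest number of devices whose combined memory fits total_gb."""
    return math.ceil(total_gb / per_device_gb)


if __name__ == "__main__":
    size = weights_size_gb(PARAMETERS, BYTES_PER_PARAMETER)
    print(f"Weights alone: {size:.0f} GB")
    print(f"GPUs needed just to hold them: {min_devices(size, GPU_MEMORY_GB)}")
```

Under these assumptions the weights come to 356GB, which is consistent with "more than 350GB" and explains why the model has to be split across several devices.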

The training dataset for Jurassic-1 Jumbo, which contains 300 billion tokens, was compiled from English-language websites including Wikipedia, news publications, StackExchange, and OpenSubtitles. Tokens, the smaller units into which text is split for natural language processing, can be words, characters, or parts of words.

In a test on a benchmark suite that it created, AI21 Labs says that the Jurassic-1 models perform on a par with or better than GPT-3 across a range of tasks, including answering academic and legal questions. By going beyond traditional language model vocabularies, which include words and word pieces like "potato," "make," "e-," "gal-," and "itarian," Jurassic-1 also covers less common nouns and turns of phrase like "run of the mill," "New York Yankees," and "Xi Jinping." It's also ostensibly more sample-efficient -- while the sentence "Once in a while I like to visit New York City" would be represented by 11 tokens for GPT-3 ("Once," "in," "a," "while," and so on), it would be represented by just 4 tokens for the Jurassic-1 models.
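The 11-versus-4 comparison can be illustrated with a toy tokenizer. This is a minimal sketch, not AI21 Labs' or OpenAI's actual method: whitespace splitting stands in for a word-level tokenizer (GPT-3 really uses byte-pair encoding), and `MULTI_WORD_VOCAB` is a hypothetical three-entry vocabulary invented here so that a greedy longest-match pass can merge multi-word items.

```python
# Hypothetical multi-word vocabulary; the entries are illustrative only and
# are not taken from the real Jurassic-1 vocabulary.
MULTI_WORD_VOCAB = {
    ("once", "in", "a", "while"),
    ("i", "like", "to"),
    ("new", "york", "city"),
}
MAX_ITEM_WORDS = max(len(item) for item in MULTI_WORD_VOCAB)


def word_level_tokens(sentence: str) -> list[str]:
    """One token per whitespace-separated word."""
    return sentence.split()


def multi_word_tokens(sentence: str) -> list[str]:
    """Greedy longest-match: merge known multi-word items into one token."""
    words = sentence.split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate span first, falling back to one word.
        for span in range(min(MAX_ITEM_WORDS, len(words) - i), 0, -1):
            candidate = words[i:i + span]
            key = tuple(word.lower() for word in candidate)
            if span == 1 or key in MULTI_WORD_VOCAB:
                tokens.append(" ".join(candidate))
                i += span
                break
    return tokens


if __name__ == "__main__":
    sentence = "Once in a while I like to visit New York City"
    print(len(word_level_tokens(sentence)), word_level_tokens(sentence))
    print(len(multi_word_tokens(sentence)), multi_word_tokens(sentence))
```

With this toy vocabulary the sentence splits into 11 word-level tokens but only 4 multi-word tokens, so the same text consumes fewer positions in the model's input.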

"Logic and math problems are notoriously hard even for the most powerful language models. Jurassic-1 Jumbo can solve very simple arithmetic problems, like adding two large numbers," Shoham said. "There's a bit of a secret sauce in how we customize our language models to new tasks, which makes the process more robust than standard fine-tuning techniques. As a result, custom models built in Studio are less likely to suffer from catastrophic forgetting, [or] when fine-tuning a model on a new task causes it to lose core knowledge or capabilities that were previously encoded in it."

Connor Leahy, a member of the open source research group EleutherAI, told VentureBeat via email that while he believes there's nothing fundamentally novel about the Jurassic-1 Jumbo model, it's an impressive feat of engineering, and he has "little doubt" it will perform on a par with GPT-3. "It will be interesting to observe how the ecosystem around these models develops in the coming years, especially what kinds of downstream applications emerge as robustly useful," he added. "[The question is] whether such services can be run profitably with fierce competition, and how the inevitable security concerns will be handled."

Open questions

Beyond chatbots, Shoham sees the Jurassic-1 models and Studio being used for paraphrasing and summarization, like generating short product names from product descriptions. The tools could also be used to extract entities, events, and facts from texts and to label whole libraries of emails, articles, and notes by topic or category.

But troublingly, AI21 Labs has left key questions about the Jurassic-1 models and their possible shortcomings unaddressed. For example, when asked what steps had been taken to mitigate potential gender, race, and religious biases as well as other forms of toxicity in the models, the company declined to comment. It also refused to say whether it would allow third parties to audit or study the models' outputs prior to launch.

This is cause for concern, as it's well-established that models amplify the biases in data on which they were trained. A portion of the training data is often sourced from communities with pervasive gender, race, physical, and religious prejudices. In a paper, the Middlebury Institute of International Studies' Center on Terrorism, Extremism, and Counterterrorism claims that GPT-3 and similar models can generate "informational" and "influential" text that might radicalize people into far-right extremist ideologies and behaviors. A group at Georgetown University has used GPT-3 to generate misinformation, including stories around a false narrative, articles altered to push a bogus perspective, and tweets riffing on particular points of disinformation. Other studies, like one published by Intel, MIT, and Canadian AI initiative CIFAR researchers in April, have found high levels of stereotypical bias from some of the most popular open source models, including Google's BERT and XLNet and Facebook's RoBERTa.

More recent research suggests that toxic language models deployed into production might struggle to understand aspects of minority languages and dialects. This could force people using the models to switch to "white-aligned English" to ensure the models work better for them, or discourage minority speakers from engaging with the models at all.

It's unclear to what extent the Jurassic-1 models exhibit these kinds of biases, in part because AI21 Labs hasn't released -- and doesn't intend to release -- the source code. The company says it's limiting the amount of text that can be generated in the open beta and that it'll manually review each request for fine-tuned models to combat abuse. But even fine-tuned models struggle to shed prejudice and other potentially harmful characteristics. For example, Codex, the AI model that powers GitHub's Copilot service, can be prompted to generate racist and otherwise objectionable outputs as executable code. When writing code comments with the prompt "Islam," Codex includes the words "terrorist" and "violent" at a greater rate than with other religious groups.

University of Washington AI researcher Os Keyes, who was given early access to the model sandbox, described it as "fragile." While the Jurassic-1 models didn't expose any private data -- a growing problem in the large language model domain -- using preset scenarios, Keyes was able to prompt the models to imply that "people who love Jews are closed-minded, people who hate Jews are extremely open-minded, and a kike is simultaneously a disreputable money-lender and 'any Jew.'"

"Obviously: all models are wrong sometimes. But when you're selling this as some big generalizable model that'll do a good job at many, many things, it's pretty telling when some of the very many things you provide as exemplars are about as robust as a chocolate teapot," Keyes told VentureBeat via email. "What it suggests is that what you are selling is nowhere near as generalizable as you're claiming. And this could be fine -- products often start off with one big idea and end up discovering a smaller thing along the way they're really, really good at and refocusing."

AI21 Labs demurred when asked whether it conducted a thorough bias analysis on the Jurassic-1 models' training datasets. In an email, a spokesperson said that when measured against StereoSet, a benchmark to evaluate bias related to gender, profession, race, and religion in language systems, the Jurassic-1 models were found by the company's engineers to be "marginally less biased" than GPT-3.

Still, that's in contrast to groups like EleutherAI, which have worked to exclude data sources determined to be "unacceptably negatively biased" toward certain groups or views. Beyond limiting text inputs, AI21 Labs isn't adopting additional countermeasures, like toxicity filters or fine-tuning the Jurassic-1 models on "value-aligned" datasets like OpenAI's PALMS.

Among others, leading AI researcher Timnit Gebru has questioned the wisdom of building large language models, examining who benefits from them and who's disadvantaged. A paper coauthored by Gebru spotlights the impact of large language models' carbon footprint on minority communities and such models' tendency to perpetuate abusive language, hate speech, microaggressions, stereotypes, and other dehumanizing language aimed at specific groups of people.

The effects of AI and machine learning model training on the environment have also been brought into relief. In June 2019, researchers at the University of Massachusetts at Amherst released a report estimating that training and searching a certain model can emit roughly 626,000 pounds of carbon dioxide, equivalent to nearly 5 times the lifetime emissions of the average U.S. car. OpenAI itself has conceded that models like Codex require significant amounts of compute -- on the order of hundreds of petaflop/s-days -- which contributes to carbon emissions.

The way forward

The coauthors of a paper from OpenAI and Stanford suggest ways to address the negative consequences of large language models, such as enacting laws that require companies to acknowledge when text is generated by AI -- possibly along the lines of California's bot law.

Other recommendations include:

Training a separate model that acts as a filter for content generated by a language model
Deploying a suite of bias tests to run models through before allowing people to use them
Avoiding some specific use cases

AI21 Labs hasn't committed to these principles, but Shoham stresses that the Jurassic-1 models are only the first in a line of language models that it's working on, to be followed by more sophisticated variants. The company also says that it's adopting approaches to reduce both the cost of training models and their environmental impact, as well as working on a suite of natural language processing products of which Wordtune, Studio, and the Jurassic-1 models are only the first.

"We take misuse extremely seriously and have put measures in place to limit the potential harms that have plagued others," Shoham said. "We have to combine brain and brawn: enriching huge statistical models with semantic elements, while leveraging computational power and data at unprecedented scale."

AI21 Labs, which emerged from stealth in October 2019, has raised $34.5 million in venture capital to date from investors including Pitango and TPY Capital. The company has around 40 employees currently, and it plans to hire more in the months ahead.