EleutherAI claims new NLP model approaches GPT-3-level performance

AI-powered language systems have transformative potential, particularly in the enterprise. They're already being used to drive chatbots, translate natural language into structured query language, create application layouts and spreadsheets, and improve the accuracy of web search products. OpenAI's GPT-3, which may be the best-known AI text generator, is currently used in more than 300 apps by tens of thousands of developers and produces 4.5 billion words per day.

As business interest in AI rises, advisory firm Mordor Intelligence forecasts that the natural language processing (NLP) market will more than triple its revenue by 2025. But noncommercial, open source efforts are concurrently gaining steam, as evidenced by the progress made by EleutherAI. A grassroots collection of AI researchers, EleutherAI this week released GPT-J-6B (GPT-J), a model the group claims performs nearly on par with an equivalent-sized GPT-3 model on various tasks. Contributor Ben Wang led the work.

"We think it's probably fair to say this is currently the best open source autoregressive language model you can get by a pretty wide margin," Connor Leahy, one of the founding members of EleutherAI, told VentureBeat.

GPT-J is what's known as a Transformer model, which means it weighs the influence of different parts of input data rather than treating all the input data the same. Transformers don't need to process the beginning of a sentence before the end. Instead, they identify the context that confers meaning on a word in the sentence, enabling them to process input data in parallel.
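To make the idea of "weighing the influence of different parts of the input" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration only, not GPT-J's implementation: the three 2-dimensional embeddings are hypothetical, and the sketch leaves out the learned projections, multiple attention heads, and causal masking that a real autoregressive Transformer would use.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head self-attention with no learned projections.

    Each output vector is a weighted average of all input vectors. The
    weights come from the scaled dot product between the current position
    (the query) and every position in the input (the keys).
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional embeddings for a three-word input.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(embeddings)
for row in weights:
    print([round(w, 3) for w in row])
```

Each pass through the outer loop depends only on the original inputs, not on the result for any other position, which is why the work for every position can be done in parallel.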

The Transformer architecture forms the backbone of language models that include GPT-3 and Google's BERT, but EleutherAI claims GPT-J took less time to train than other large-scale models. The researchers attribute this to the use of JAX, Google's Python library designed for machine learning research, as well as training on Google's tensor processing units (TPUs), application-specific integrated circuits (ASICs) developed specifically to accelerate AI.

Training GPT-J

EleutherAI says GPT-J contains roughly 6 billion parameters, the parts of the machine learning model learned from historical training data. It was trained over the course of five weeks on 400 billion tokens from a dataset created by EleutherAI called The Pile, an 825GB collection of 22 smaller datasets -- including academic sources (e.g., arXiv, PubMed), communities (StackExchange, Wikipedia), code repositories (GitHub), and more. (Tokens are the smaller units into which text is split before a model processes it; they can be words, characters, or parts of words.)
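The three kinds of tokens mentioned above can be illustrated with a toy example. This is a sketch under stated assumptions, not the tokenizer GPT-J uses: the function names and the tiny `TOY_VOCAB` vocabulary are hypothetical, and the subword splitter uses a simple greedy longest-match rule chosen for readability.

```python
def word_tokens(text):
    """Split text on whitespace, so each token is a whole word."""
    return text.split()

def char_tokens(text):
    """Split text into single characters."""
    return list(text)

def subword_tokens(word, vocab):
    """Greedy longest-match split of one word into pieces from `vocab`.

    Falls back to a single character when no vocabulary piece matches,
    so the loop always makes progress.
    """
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end == start + 1:
                pieces.append(piece)
                start = end
                break
    return pieces

# Hypothetical vocabulary of word parts, for illustration only.
TOY_VOCAB = {"token", "iz", "ation", "un", "believ", "able"}

print(word_tokens("open source models"))
print(char_tokens("GPT"))
print(subword_tokens("tokenization", TOY_VOCAB))
print(subword_tokens("unbelievable", TOY_VOCAB))
```

Subword splitting sits between the other two: common fragments stay intact, while rare or unseen words can still be represented by falling back to smaller pieces.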

For compute, EleutherAI was able to leverage the TPU Research Cloud, a Google Cloud initiative that supports projects with the expectation that the results of the research will be shared via code and models. GPT-J's code and the trained model are open-sourced under the Apache 2.0 license and can be used for free via EleutherAI's website.

GPT-J is more capable than the two previously released EleutherAI models: GPT-Neo 1.3B and GPT-Neo 2.7B. For example, it can perform addition and subtraction and prove simple mathematical theorems, like "Any cyclic group is abelian." It can also answer yes/no reading-comprehension questions from a popular test dataset (BoolQ) and generate pseudocode.

"[OpenAI's] GPT-2 was about 1.5 billion parameters and doesn't have the best performance since it's a bit old. GPT-Neo was about 2.7 billion parameters but somewhat underperforms equal-sized GPT-3 models. GPT-J, the new one, is now 6B -- sized similar to the Curie model of OpenAI, we believe," Leahy said.

Looking ahead

EleutherAI plans to eventually deliver the code and weights needed to run a model similar, though not identical, to the full "DaVinci" GPT-3. (Weights are parameters within a neural network that transform input data.) By comparison with GPT-J's roughly 6 billion parameters, the full GPT-3 contains 175 billion parameters and was trained on 499 billion tokens from a 45TB dataset.
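For a rough sense of the gap in scale, the figures quoted in this article can be compared directly. The snippet below is simple arithmetic on those reported, rounded numbers (6 billion parameters and 400 billion tokens for GPT-J; 175 billion parameters and 499 billion tokens for GPT-3); the dictionary layout and variable names are illustrative only.

```python
# Figures as reported in this article; both are rounded approximations.
models = {
    "GPT-J": {"parameters": 6e9, "training_tokens": 400e9},
    "GPT-3": {"parameters": 175e9, "training_tokens": 499e9},
}

param_ratio = models["GPT-3"]["parameters"] / models["GPT-J"]["parameters"]
token_ratio = (
    models["GPT-3"]["training_tokens"] / models["GPT-J"]["training_tokens"]
)

print(f"GPT-3 has about {param_ratio:.1f}x the parameters of GPT-J")
print(f"GPT-3 was trained on about {token_ratio:.2f}x as many tokens")

# Training tokens seen per parameter, a crude measure of data per unit of model.
tokens_per_param = {
    name: stats["training_tokens"] / stats["parameters"]
    for name, stats in models.items()
}
for name, value in tokens_per_param.items():
    print(f"{name}: about {value:.1f} training tokens per parameter")
```

On these numbers, the parameter gap is far larger than the gap in training data, so GPT-J saw many more tokens per parameter than the full GPT-3 did.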

Language models like GPT-3 often amplify biases encoded in data. A portion of the training data is often sourced from communities with pervasive gender, race, and religious prejudices. OpenAI notes that this can lead to placing words like "naughty" or "sucked" near female pronouns and "Islam" near words like "terrorism." Other studies, like one published in April by researchers at Intel, MIT, and the Canadian Institute for Advanced Research (CIFAR), have found high levels of stereotypical bias in some of the most popular models.

But EleutherAI claims to have performed "extensive bias analysis" on The Pile and made "tough editorial decisions" to exclude datasets they felt were "unacceptably negatively biased" toward certain groups or views.

While EleutherAI's model might not be cutting edge in terms of its capabilities, it could go a long way toward solving a common tech problem: the disconnect between research and engineering teams. As Hugging Face CEO Clément Delangue told VentureBeat in a recent interview, tech giants provide black-box NLP APIs while also releasing open source repositories that can be hard to use or aren't well-maintained. EleutherAI's efforts could help enterprises realize the business value of NLP without having to do much of the legwork themselves.
