AI Weekly: Meet the people trying to replicate and open-source OpenAI's GPT-3

In May 2020, OpenAI published a paper detailing GPT-3, a machine learning model that achieves strong results on a number of natural language benchmarks. At 175 billion parameters -- the part of the model that has learned from historical training data -- it's one of the largest of its kind. It's also among the most sophisticated, with the ability to make primitive analogies, write in the style of Chaucer, and even complete basic code.

In contrast to GPT-3's predecessors, GPT-2 and GPT-1, OpenAI chose not to open-source the model or training dataset, opting instead to make the former available through a commercial API. The company further curtailed access by choosing to exclusively license GPT-3 to Microsoft, with which OpenAI has a business relationship. Microsoft has invested $1 billion in OpenAI and built an Azure-hosted supercomputer designed to further OpenAI's research.

Several efforts to recreate GPT-3 in open source have emerged, but perhaps the furthest along is GPT-Neo, a project spearheaded by EleutherAI. A grassroots collection of researchers working to open-source machine learning research, EleutherAI and its founding members -- Connor Leahy, Leo Gao, and Sid Black -- aim to deliver the code and weights needed to run a model similar, though not identical, to GPT-3 as soon as August. (Weights are parameters within a neural network that transform input data.)

EleutherAI

According to Leahy, EleutherAI began as "something of a joke" on TPU Podcast, a machine learning Discord server, where he playfully suggested someone should try to replicate GPT-3. Leahy, Gao, and Black took this to its logical extreme and founded the EleutherAI Discord server, which became the base of the organization's operations.

"I consider GPT-3 and other similar results to be strong evidence that it may indeed be possible to create [powerful models] with nothing more than our current techniques," Leahy told VentureBeat in an interview. "It turns out to be in fact very, very hard, but not impossible with a group of smart people, as EleutherAI has shown, and of course with access to unreasonable amounts of computer hardware."

As part of a personal project, Leahy previously attempted to replicate GPT-2, leveraging access to compute through Google's TensorFlow Research Cloud (TFRC) program. The original codebase, which became GPT-Neo, was built to run on tensor processing units (TPUs), Google's custom AI accelerator chips. But the EleutherAI team concluded that even the generous number of TPUs provided through TFRC wouldn't be sufficient to train the GPT-3-like version of GPT-Neo in under two years.

EleutherAI's fortunes changed when the group was approached by CoreWeave, a U.S.-based cryptocurrency miner that provides cloud services for CGI rendering and machine learning workloads. Last month, CoreWeave offered the EleutherAI team access to its hardware in exchange for an open source GPT-3-like model its customers could use and serve.

Leahy insists that the work, which began around Christmas, won't involve money or other compensation going in either direction. "CoreWeave gives us access to their hardware, we make an open source GPT-3 for everyone to use (and thank them very loudly), and that's all," he said.

Training datasets

EleutherAI concedes that because of OpenAI's decision not to release some key details of GPT-3's architecture, GPT-Neo will deviate from it in at least those ways. Other differences might arise from the training dataset EleutherAI plans to use, which was curated by a team of 10 people at EleutherAI, including Leahy, Gao, and Black.

Language models like GPT-3 often amplify biases encoded in data. A portion of the training data is often sourced from communities with pervasive gender, race, and religious prejudices. OpenAI notes that this can lead to placing words like "naughty" or "sucked" near female pronouns and "Islam" near words like "terrorism." Other studies, like one published in April by Intel, MIT, and the Canadian Institute for Advanced Research (CIFAR) researchers, have found high levels of stereotypical bias in some of the most popular models, including Google's BERT and XLNet, OpenAI's GPT-2, and Facebook's RoBERTa. Malicious actors could leverage this bias to foment discord by spreading misinformation, disinformation, and outright lies that "radicalize individuals into violent far-right extremist ideologies and behaviors," according to the Middlebury Institute of International Studies.

For their part, the EleutherAI team says they've performed "extensive bias analysis" on the GPT-Neo training dataset and made "tough editorial decisions" to exclude some datasets they felt were "unacceptably negatively biased" toward certain groups or views. The Pile, as it's called, is an 825GiB corpus consisting of 22 smaller datasets combined to ensure broad generalization abilities.

"We continue to carefully study how our models act in various circumstances and how we can make them more safe," Leahy said.

Leahy personally disagrees with the idea that releasing a model like GPT-3 would have a direct negative impact on polarization. An adversary seeking to generate extremist views would find it much cheaper and easier to hire a troll farm, he argues, as autocratic governments have already done. Furthermore, Leahy asserts that discussions of discrimination and bias point to a real issue but don't offer a complete solution. Rather than censoring the input data of a model, he says the AI research community must work toward systems that can "learn all that can be learned about evil and then use that knowledge to fight evil and become good."

"I think the commoditization of GPT-3 type models is part of an inevitable trend in the falling price of the production of convincing digital content that will not be meaningfully derailed whether we release a model or not," Leahy continued. "The biggest influence we can have here is to allow more low-resource users, especially academics, to gain access to these technologies to hopefully better study them, and also perform our own brand of safety-focused research on it, instead of having everything locked inside industry labs. After all, this is still ongoing, cutting-edge research. Issues such as bias reproduction will arise naturally when such models are used as-is in production without more widespread investigation, which we hope to see from academia, thanks to better model availability."

Google recently fired AI ethicist Timnit Gebru, reportedly in part over a research paper on large language models that discussed risks such as the impact of their carbon footprint on marginalized communities. Asked about the environmental impact of training GPT-Neo, Leahy characterized the argument as a "red herring," saying he believes it's a matter of whether the ends justify the means -- that is, whether the output of the training is worth the energy put into it.

"The amount of energy that goes into training such a model is much less than, say, the energy that goes into serving any medium-sized website, or a single trans-Atlantic flight to present a paper about the carbon emissions of AI models at a conference, or, God forbid, Bitcoin mining," Leahy said. "No one complains about the energy bill of CERN (The European Organization for Nuclear Research), and I don't think they should, either."

Future work

EleutherAI plans to use architectural tweaks the team has found to be useful to train GPT-Neo, which they expect will enable the model to achieve performance "similar" to GPT-3 at roughly the same size (around 350GB to 700GB of weights). In the future, they plan to distill the final model down "an order of magnitude or so smaller" for easier inference. And while they're not planning to provide any kind of commercial API, they expect CoreWeave and others to set up services to make GPT-Neo accessible to users.
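The 350GB-to-700GB range quoted above is consistent with storing 175 billion parameters at two or four bytes each. The short sketch below works through that arithmetic. The precisions (16-bit and 32-bit floats) are an assumption made for illustration, not something EleutherAI or OpenAI states in this article, and the function name is hypothetical.

```python
# Rough weight-storage estimate for a GPT-3-sized model.
PARAMS = 175_000_000_000  # parameter count the article gives for GPT-3

# ASSUMPTION: bytes per parameter for two common float formats.
# The article does not say which precision the 350GB-700GB range refers to.
BYTES_PER_PARAM = {"fp16": 2, "fp32": 4}


def weights_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Return raw weight storage in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, width in BYTES_PER_PARAM.items():
    print(f"{name}: {weights_size_gb(PARAMS, width):.0f} GB")
```

Under the same assumptions, a distilled model "an order of magnitude or so smaller" would come to something like 35GB to 70GB of weights, though the team's actual target may differ.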

As for the next iteration of GPT and similarly large, complex models, like Google's trillion-parameter Switch-C, Leahy thinks they'll likely be more challenging to replicate. But there's evidence that efficiency improvements might offset the mounting compute requirements. An OpenAI survey found that since 2012, the amount of compute needed to train an AI model to the same performance classifying images in a popular benchmark (ImageNet) has been decreasing by a factor of two every 16 months. But the extent to which compute contributes to performance compared with novel algorithmic approaches remains an open question.
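To make the "factor of two every 16 months" figure concrete, the sketch below treats it as a simple exponential trend and computes how much less compute the same ImageNet result would need after a given number of months. The function name and the sample time spans are illustrative assumptions, and a smooth exponential is a simplification of whatever trend OpenAI actually measured.

```python
def compute_reduction_factor(months_elapsed: float,
                             halving_months: float = 16.0) -> float:
    """How many times less compute is needed after `months_elapsed`,
    assuming the requirement halves every `halving_months` months."""
    return 2.0 ** (months_elapsed / halving_months)


# Hypothetical horizons chosen only to illustrate the trend.
for months in (16, 48, 96):
    factor = compute_reduction_factor(months)
    print(f"after {months} months: about {factor:.0f}x less compute")
```

On this reading, efficiency gains compound quickly, but they lower the cost of matching an old result; they don't by themselves say how much compute the next, larger model will require.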

"It seems inevitable that models will continue to increase in size as long as increases in performance follow," Leahy said. "Sufficiently large models will, of course, be out of reach for smaller actors, but this seems to me to just be a fact of life. There seems to me to be no viable alternative. If bigger models equals better performance, whoever has the biggest computer will make the biggest model and therefore have the best performance, easy as that. I wish this wasn't so, but there isn't really anything that can be done about it."

For AI coverage, send news tips to Khari Johnson, Kyle Wiggers, and AI editor Seth Colaner -- and be sure to subscribe to the AI Weekly newsletter and bookmark our AI channel, The Machine.

Thanks for reading,

Kyle Wiggers

AI Staff Writer
