OpenAI rolls out new text-generating models that it claims are less toxic

Large language models (LLMs) such as OpenAI's GPT-3, which can "write" sentences that read nearly like they were written by a human, can be prompted to perform a range of writing tasks given only a few examples of the tasks. For example, LLMs have been used to create marketing materials and video game levels in addition to recipes, poetry, and movie scripts. But because LLMs learn to write from examples taken from sometimes toxic communities, they can fall victim to parroting misinformation, sexism, ageism, racism, and conspiracies.

Efforts have been made to combat toxicity in LLMs -- with mixed results. But OpenAI claims that it's developed a new family of models, InstructGPT, that are less likely to generate problematic text while more closely aligning with a user's intent. After piloting InstructGPT last year with select customers of the OpenAI API, the company's AI-as-a-service offering, OpenAI is making the new models the default on the API for text generation.

"Although [AI systems are] quite smart today, they don't always do what we want them to do. The goal of alignment is to produce AI systems that do [achieve] what we want them to," OpenAI cofounder and chief scientist Ilya Sutskever told VentureBeat in a phone interview. "[T]hat becomes more important as AI systems become more powerful."

InstructGPT

LLMs don't write in the same way that humans do. Rather, they learn how likely words are to occur in a body of text -- usually a sentence -- based on examples of text. Simpler models look at the context of a sequence of words whereas larger models work at the level of whole sentences or paragraphs. Examples come in the form of documents within training datasets, which contain terabytes to petabytes of data scraped from social media, Wikipedia, books, software hosting platforms like GitHub, and other sources on the public web.
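To make the idea of "learning how likely words are to occur" concrete, here is a minimal sketch of the simplest kind of language model, one that only looks at the previous word (a bigram model). It is an illustration, not how GPT-3 works internally: the toy corpus and the function names train_bigram_model and next_word_probabilities are invented for this example.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count how often each word follows each other word in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the word never appeared with a successor in the training text
    return {w: c / total for w, c in followers.items()}


# Hypothetical three-sentence "training dataset".
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigram_model(corpus)
probs = next_word_probabilities(model, "the")

# Print candidate next words, most likely first.
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.2f}")
```

The model can only reproduce associations present in its examples, which is why whatever patterns appear in the training text, including toxic ones, show up in what it predicts.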

Researchers including those at OpenAI attempt to scrub the training datasets of problematic content. But some inevitably slips through, leading the models to produce toxic text. For example, OpenAI itself and others have noted that GPT-3 places words like "naughty" or "sucked" near female pronouns, "Islam" near words like "terrorism," and "Jews" near "money." In tests of a chatbot built using GPT-3, the model responded to a hypothetical suicidal patient by encouraging them to kill themselves. GPT-3 has racist and sexist tendencies, for example associating men with higher-earning occupations. And the model can aptly generate disinformation, misinformation, and falsehoods.

OpenAI -- which claims it closely monitors usage of GPT-3 through its API -- made efforts to address the problems ahead of GPT-3's general availability in November 2021. A content filter aims to detect generated text that could be "sensitive" coming from the API, while new endpoints allow developers to provide additional context for apps that require "high accuracy" generations based on sources of truth (e.g., documentation and knowledge bases). Recently, OpenAI also began testing a way to improve the behavior of GPT-3 by fine-tuning it on a "values-targeted" dataset designed to dictate the tone and personality of the text the model generates.

But InstructGPT goes further.

Examples of text generated by GPT-3 versus InstructGPT, given the same prompt.

The researchers behind InstructGPT used what they call "reinforcement learning from human feedback," or RLHF, to make GPT-3 more accurate and potentially less toxic in its output. Originally developed to train AI to control robots and beat human players at video games, RLHF has recently been applied to fine-tuning LLMs for tasks like summarizing essays and news articles. For example, in September, OpenAI unveiled a model trained using reinforcement learning that can summarize books of any length. While the model was limited in the genres it could understand and wasn't always accurate in its summarizations, the researchers behind it claimed at the time that it was among the best-performing approaches to date.

In developing the InstructGPT models, OpenAI researchers collected a dataset of human-written "demonstrations" on prompts submitted to the OpenAI API and some prompts written by a team of human labelers. They used the dataset to train baseline GPT-3 models and then created a second dataset of human-labeled comparisons between outputs from the GPT-3 models on a larger set of OpenAI API prompts. After training a "reward model" (RM) on this second dataset to predict which GPT-3 outputs the labelers would prefer, the researchers used the RM to fine-tune the GPT-3 models to maximize the reward it predicts.
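The reward-model step can be sketched in a few lines. The example below is an assumption-laden toy, not OpenAI's implementation: it uses a linear model over two hand-made features (hypothetical flags for "follows the instruction" and "contains a toxic term") and a pairwise logistic loss, which is one common way to learn from comparisons. A real RM is a neural network that scores raw text, and the final reinforcement learning step that fine-tunes GPT-3 against the RM is omitted here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    """Score an output; higher means the model expects labelers to prefer it."""
    return sum(w * f for w, f in zip(weights, features))


def pairwise_loss(weights, preferred, rejected):
    """Loss is small when the preferred output scores above the rejected one."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    return -math.log(sigmoid(margin))


def train_reward_model(comparisons, n_features, lr=0.5, epochs=200):
    """Fit weights by gradient descent on the pairwise loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # d(-log sigmoid(margin))/dw = -(1 - sigmoid(margin)) * (preferred - rejected)
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (preferred[i] - rejected[i])
    return weights


# Hypothetical features: [follows_instruction, contains_toxic_term].
# Each pair is (output the labeler preferred, output the labeler rejected).
comparisons = [
    ([1, 0], [0, 0]),  # following the instruction beats ignoring it
    ([1, 0], [1, 1]),  # a clean answer beats a toxic one
    ([0, 0], [0, 1]),  # even off-task, clean beats toxic
]
weights = train_reward_model(comparisons, n_features=2)

# The trained RM can now rank outputs it has not seen compared directly.
clean_score = reward(weights, [1, 0])
toxic_score = reward(weights, [1, 1])
print(clean_score > toxic_score)
```

In the process the article describes, scores like these serve as the training signal: the language model is adjusted so that its outputs earn higher predicted reward.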

To measure the differences between InstructGPT and vanilla GPT-3, the researchers had labelers rate the quality of the models' outputs on a test set of prompts as well as prompts submitted to GPT-3 models on the OpenAI API. The labelers "significantly" preferred InstructGPT over GPT-3, OpenAI claims -- specifically because InstructGPT tended to write fewer untrue statements (a phenomenon in AI known as "hallucination") and better followed constraints in instructions (e.g., "Write your answer in 2 paragraphs or less").

InstructGPT can also generalize to tasks it wasn't explicitly trained to do, like following instructions in other languages (though it sometimes generates outputs in English) and answering questions about computer code. It also shows small improvements in toxicity over GPT-3. And perhaps most impressively, small InstructGPT models -- 100 times smaller than GPT-3 -- generate better-aligned text than GPT-3.

Not perfect

But while InstructGPT improves on GPT-3 in certain areas, it's by no stretch of the imagination perfect. The researchers acknowledge that it can be overly deferential, confused by instructions that assume false premises, and struggle to follow instructions with multiple constraints (e.g., "List 10 movies made in the 1930’s set in France"). It's also no less biased than GPT-3, and, because it's better at following intentions, it's also theoretically easier to misuse.

"There's some important limitations to this work, but we've successfully reduced the frequency with which the models ... make silly mistakes and generate toxic outputs," OpenAI researcher Jan Leike told VentureBeat. "[Of course, InstructGPT] can still be susceptible to misuse especially if you instruct [it] to give you, for example, bad advice or harmful advice."

Connor Leahy, a cofounder of the open AI research organization EleutherAI, who wasn't involved with OpenAI's research, said that the performance is "impressive" and "really shows how far RLHF has come" as a technique. "[InstructGPT is] interesting because it's one of the first applications of RLHF to an actual consumer-facing thing. It seems likely that RLHF will become a standard method for many tasks in the future," he added.

According to Leike, the OpenAI API customers who've been given access to InstructGPT already prefer it to GPT-3. Prior to InstructGPT becoming the default model, about half of the API's traffic was through InstructGPT.

The process of developing the InstructGPT models.

"What we've shown is that you can use very simple techniques to greatly increase the alignment of GPT-3, and, as a result, it follows the intention of whoever uses it much more closely," Sutskever said. "The summary here is that, from a simple market perspective, the safer, more aligned models are doing much better ... Moving forward into the future, working on alignment will only keep increasing the capability, intelligence, and scope of our neural networks."

Leike believes that RLHF can be applied to mitigate toxicity in other types of models, not just pure language models. "You can apply them to any kind of language model [and to] all sorts of tasks [including] copywriting, classification, summarization, question answering, and so on," he said. "[That's why it's] so exciting."

The enthusiasm around RLHF might be justified. But many questions remain, given how some detoxification techniques in the past have ultimately fallen short of expectations. For example, the OpenAI researchers didn't investigate how the labelers might've introduced bias into the InstructGPT training and evaluation process. Research has shown that annotators with different backgrounds, experiences, and native languages classify toxicity differently, with the average annotator being likely to label phrases in African-American Vernacular English (AAVE) -- the informal grammar, vocabulary, and accent used by some Black Americans -- as toxic.

RLHF is also limited to language models for now, leaving the problem of toxicity in multimodal models -- models that can understand images, videos, and audio in addition to text -- unaddressed. OpenAI's CLIP, a model trained to associate visual imagery with text, at times horrifyingly misclassifies images of Black people as "non-human" and teenagers as "criminals" and "thieves." It also shows prejudice toward certain genders, associating appearance-related terms (e.g., "brown hair," "blonde") and jobs like "nanny" with pictures of women.

That's all to say that toxicity in AI -- whether involving language or otherwise -- is far from a solved problem. As OpenAI concedes, there is much work to be done.
