AI Weekly: The promise and shortcomings of OpenAI's GPT-3

I usually think of the dog days of summer as a time when news slows down. It's typically when a lot of people take time off work and the lull leads local news stations to cover inconsequential things like cat shows or a little baby squirrel on a little baby Jet Ski. But these are not typical times.

Facebook continues to face fallout from bias and discrimination issues, with multiple news outlets reporting that Instagram's content moderation algorithm was 50% more likely to flag and disable the accounts of Black users than White users. Facebook and Instagram are now creating teams to examine how algorithms impact the experiences of Black and Latinx users, as well as users from other specific groups.

Also this week: Executives from Amazon, Google, and Microsoft gave leaders in Washington more than 30 recommendations to help the U.S. maintain an edge over other nations in AI. Recommendations include recruiting AI practitioners for a reserve corps that would do part-time government work and creating an accredited academy for the U.S. government to train AI talent.

But arguably the biggest story this week was the beta release of GPT-3, a language model capable of a wide range of tasks, like summarization, translation, and generating text to write articles. Tests designed specifically to analyze GPT-3 found it can also complete many other tasks, like unscrambling words and using a word in a sentence after seeing it defined only once.

In recent weeks, OpenAI extended access to an API and to the language model, which has 175 billion parameters and was trained on a corpus of text from the web that includes about a trillion words. Apps like a layout generator that creates code from natural language descriptions got a lot of attention, as did apps for answering people's questions or creating U.S. history test questions and answers. A generator that identifies the relationship between real-world objects offered a potential application to help robots or other forms of AI better understand the world. One early GPT-3 user felt a chat he had about God and existence and the universe was so profound, he said "You will become another person after reading it." A particularly gushing Bloomberg story titled "Artificial intelligence is the hope 2020 needs" suggested GPT-3 could end up becoming one of the biggest news stories of 2020.

Some discussion around the release of GPT-3 also raised the question of why OpenAI seems less concerned about sharing the much larger GPT-3 than it was about GPT-2, a model OpenAI controversially chose not to share publicly at first due to its potentially negative impact on things like the spread of fake news.

OpenAI's release timing has been in line with its broader business plan. For context, the GPT-2 release came a month before OpenAI changed its business structure and created a for-profit company. GPT-3 was released less than two weeks before the introduction of the OpenAI API to commercialize its AI.

Emily Bender is a professor, a linguist, and a member of the University of Washington's NLP group. Last month, a paper she coauthored about large language models like GPT-3 argued the hype around such models shouldn't mislead people into believing the language models are capable of understanding or meaning. The paper won an award at the Association for Computational Linguistics conference.

"While large neural language models may well end up being important components of an eventual full-scale solution to human-analogous natural language understanding, they are not nearly-there solutions to this grand challenge," the paper reads.

Bender hasn't tested GPT-3 personally, but she said that from what she's seen, it is impressive, though its architecture is roughly the same as GPT-2's. The main difference is its massive scale.

"It's shiny and big and flashy, and it's not different in kind, either in the overall approach or in the risks that it brings along," she said. "I think that there's a fundamental problem in an approach to what gets called artificial intelligence that relies on data sets that are larger than humans can actually manually verify."

Circulating among the free publicity for OpenAI generated by early access users are some examples that demonstrate GPT-3's predictable bias. Facebook AI head Jerome Pesenti found a rash of negative statements targeting Black people, Jewish people, and women from an AI created to generate humanlike tweets. Of course, that's not a surprise. Tests included in a paper released in late May found that GPT-3 demonstrates gender and racial bias and is most likely to give Asian people a high sentiment analysis score and Black people a low sentiment analysis score, particularly among smaller versions of the model. OpenAI analysis also demonstrated shortcomings in specific tasks, like word-in-context (WiC) analysis and RACE, a set of middle school and high school exam questions.

Tests earlier this year found that many popular language models trained with a large data corpus, like Google's BERT and GPT-2, demonstrate several forms of bias. Bender, who teaches an NLP ethics course at the University of Washington, said there's no such thing as an unbiased data set or a bias-free model and that even carefully created language data sets can carry subtler forms of bias. But she maintains some best practices could reduce bias in large data sets.

OpenAI is implementing testing in beta as a safeguard, which may help unearth issues, a spokesperson said, adding that the company is applying toxicity filters to GPT-3. The spokesperson declined to share additional information about what the filters might accomplish but said more details will be shared in the weeks ahead.

GPT-3 understandably inspires awe in some people, as it appears to draw closer to the idea of a general model that can do virtually anything with just a few samples of training data. OpenAI CEO Sam Altman tweeted that a 10-year-old boy he showed GPT-3 to said in a matter of seconds that he wanted to enter the AI field.

But Altman also said in a tweet Sunday that "The GPT-3 hype is way too much. It's impressive (thanks for the nice compliments!), but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out."

The OpenAI paper said the approach taken to characterize some attributes of the model was inspired by the "model cards for model reporting" method created by Google AI ethics researchers.

Alongside the need to adopt data sheets or data statements to better understand the contents of data sets, Bender emphasized that more testing is needed in the NLP field to be able to really understand when models are demonstrating an understanding or approaching other grand challenges.

"What's happened culturally recently ... within NLP in the last maybe 10-15 years, there's been a lot of emphasis on valuing models and model building, and the only value assigned to work around evaluation metrics and task design and annotation is as [a] subsidiary to the model building to allow the model builders to show how good their models are," she said. "And that's an imbalanced situation, where we can't do good science. I hope that we're going to see an increased value placed on the other parts of the science, which isn't to say that we're done building models. I'm sure there's more research to be done there, but we can't make meaningful progress in model building if we can't do meaningful testing of the models, and we can't do meaningful testing of the models if it's not valued."

For AI coverage, send news tips to Khari Johnson and Kyle Wiggers and AI editor Seth Colaner -- and be sure to subscribe to the AI Weekly newsletter and bookmark our AI Channel.

Thanks for reading,

Khari Johnson

Senior AI Staff Writer
