OpenAI let us try its state-of-the-art NLP text generator

Language is power, and engineering a system that can comprehend it as well as any human is a grand challenge in AI research. Recent contributions like Google's BERT, a framework that can train state-of-the-art natural language processing (NLP) models in a few hours on a single graphics card, and Facebook's PyText, which produces over a billion daily predictions for the social network's apps and services, have nudged the needle forward. But robots capable of speaking naturally, unaided by handcrafted grammar rules and carefully labeled datasets, remain elusive.

That hasn't discouraged OpenAI, an AI research organization backed by tech luminaries Reid Hoffman and Peter Thiel. Over the course of its roughly four-year history, the San Francisco-based nonprofit has investigated autonomous systems that can achieve superhuman performance in Pong and Montezuma's Revenge and defeat professional Dota players -- not to mention imbuing mechanical hands with humanlike dexterity. OpenAI has also published its fair share of work in NLP, and today it is previewing a collection of AI models that can not only generate coherent text given words or sentences, but achieve state-of-the-art (or near-state-of-the-art) performance on a range of NLP tests.

The pioneering models build on OpenAI's prior research, which suggests that unsupervised learning -- an AI training technique in which machine learning algorithms learn patterns from unclassified, unannotated data -- can be used to orient generic models to specific language tasks. The group's newly published paper posits that sufficiently large language models -- in some cases 12 times the size of OpenAI's previous model -- can learn NLP tasks without domain-specific datasets or modifications.

The models achieve this in part with Transformers, a relatively novel type of neural architecture introduced in a 2017 paper ("Attention Is All You Need") coauthored by scientists at Google Brain, Google's AI research division. The neural networks at the heart of OpenAI's models comprise neurons, or mathematical functions loosely modeled after biological neurons. These neurons are connected with "synapses" that transmit signals to other neurons, and they're arranged in layers. Those signals -- the product of data, or inputs, fed into the neural network -- travel from layer to layer and slowly "tune" the network by adjusting the synaptic strength -- weights -- of each connection. Over time, the network extracts features from the dataset and identifies trends across samples, eventually learning to make predictions.
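To make the idea of "tuning" weights concrete, here is a minimal sketch that is not drawn from OpenAI's code. It assumes a single artificial neuron with one weight and one bias, trained by plain gradient descent on four made-up data points; the function name and the numbers are illustrative only.

```python
def train_neuron(samples, learning_rate=0.05, epochs=200):
    """Fit one linear neuron, prediction = weight * x + bias, by gradient descent."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = (weight * x + bias) - target
            # Nudge each parameter in the direction that shrinks the squared error.
            weight -= learning_rate * error * x
            bias -= learning_rate * error
    return weight, bias


# Made-up training data drawn from the line y = 2x + 1 (an assumption for illustration).
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
weight, bias = train_neuron(samples)
print(round(weight, 2), round(bias, 2))
```

After enough passes over the data, the weight and bias settle close to the slope and intercept of the line the samples came from. Real networks repeat the same kind of adjustment across millions or billions of connections.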

Transformers add self-attention to the mix: some groups of neurons transmit signals at different rates than others and route them to subsequent layers somewhat intelligently. This architectural tweak -- combined with a 40GB corpus of 8 million internet reviews, social media posts, and articles, equivalent to about 10 billion words -- enables OpenAI's models to query past learnings for relevant information, like the names of the people mentioned in a story or objects in a room.
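For readers who want to see the mechanics, the following is a rough sketch of causal self-attention, not OpenAI's implementation. It assumes hypothetical two-dimensional word vectors and, as a simplification, uses the vectors themselves as queries, keys, and values, whereas a real Transformer learns separate projections for each.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(vectors):
    """For each position, blend the vectors at that position and all earlier ones."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        past = vectors[: i + 1]  # a position may look back, never ahead
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in past
        ]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, past)) for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical embeddings for a three-word context (values invented for illustration).
tokens = ["Mr.", "Smith", "said"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = causal_self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

Each printed row shows how strongly a word attends to itself and to every earlier word. Because all earlier vectors stay available, nothing has to be squeezed into a single summary before the model looks back.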

"[Some] AI models [are] forced to summarize, or compress, all information about ... [a] context into a single fixing-length [mathematical representation], and this is a very difficult task," Alec Radford, head of research at Indico Data Solutions and language research lead at OpenAI, told VentureBeat in a phone briefing. "[That's in contrast to] a self-attention-based model that keeps ... representations of all the different words in context, and learns [the] functionality of how to query over the past. [If] it sees a word like 'mister' or 'missus', the model can learn how to look back at the documents and all the previous words and figure out which names are likely to follow the title."

OpenAI provided VentureBeat access to two trained models: one that had been fed generic web text, and a second that had been fine-tuned to Amazon product reviews.

We tested the first model with this phrase: "The quick brown fox jumps over the lazy dog."

Here's what it wrote: "The quick brown fox jumps over the lazy dog. This dog looks lazy! 'What's so great about that?' I asked, and the quick brown fox laughed, and the lazy dog licked his lips. 'I don't know. But you won't have to know, because once you're in... ' 'I know,' I said."

Then we entered: "John woke up this morning feeling very hungry."

Here was the response: "John woke up this morning feeling very hungry. This was due in part to the fact that while the restaurant is open through July, the rest of the business was closed for the Thanksgiving break. The meal started off with a salad of fresh vegetables, mixed with cheese and dressing. The potatoes had a nice brown finish and the salad was full of flavor."

Finally, we tried: "This is the story of a little dog named Boo."

Here's what it came up with: "This is the story of a little dog named Boo. You see, Boo is a Labrador, a dog that loves people more than anything else. A little over a year ago, Boo became a part of The Dog Rescue of Arlington, Virginia. Boo was a puppy (now 2 years old) that had been abandoned. Boo's story was heartbreaking."

We then supplied the Amazon model -- which had fields and drop-down menus for specifying the product category, product name, review score, and review summary -- with this: "A great phone."

With the product toggle set to "Cell Phones and Accessories" and the rating set to "5" (i.e., five stars out of five), it generated: "The phone is great! I've had it since July of 2013. I will be purchasing a second one in the near future. Love the ability to text, play music and talk to others via the mic. Voice quality is great. The ear buds are a little uncomfortable after a few hours of wearing them. The only problem I've had are the battery life. I know, that's just a problem with technology, and it will go away after the phone is used for a while. Other than that, I love the phone. It has been the best phone for me. It's an excellent phone. The case is a nice touch, but the plastic part of the phone is the best part of the phone. It gets dirty easily. If you have a camera on your phone, you can use that phone for pictures. I've uploaded pictures of what I've shown my family what i've done since i got the phone."

OpenAI says the models come up with "interesting" and "coherent" text on the first go about half of the time.

"It tries to always start predicting [the next word] given as little information as possible," Radford said. "[The] more context you can give it -- for example, capitalization -- the better it'll ... perform."

During experiments involving zero-shot domain transfer, in which the model hadn't been trained beforehand on any dataset specific to the tests, OpenAI says that the largest of its four language systems -- OpenAI GPT-2 -- managed to obtain state-of-the-art scores in seven of eight benchmarks, including LAMBADA (a test of models' ability to model long-range dependencies in text), the Winograd Schema Challenge (a measure of capacity to resolve ambiguities in text), and the Penn Treebank (a collection of millions of words of part-of-speech tagged text). In some tests, it even approached human-level accuracy.

Evaluated on the Children's Book Test, for example, which examines how well systems can capture the meaning of different categories of words, GPT-2 was 93.3 percent accurate in predicting nouns compared with human subjects' 96 percent, and 89.05 percent accurate at anticipating named entities (compared with humans' 92 percent).

It also demonstrated an aptitude for unsupervised learning tasks. In question-answering tests where it was provided a context and prompted with queries ("Who wrote the book the origin of species?"; "What is the most common blood type in Sweden?"; "Who came up with the theory of relativity?"), it supplied answers with up to 83.4 percent probability.

"[It's] able to leverage a much larger model and a lot more data across all of these domains to kind of be a generalist, where it's pretty ... good in any general language prediction task. And in very targeted functionality like summarization or translation, [it's] showing promising preliminary results," Radford said. "[T]hat's super exciting, because [these are] method[s] where we [didn't] explicitly train on these tasks."

Still, it's far from the be-all and end-all of NLP, Radford and Jeffrey Wu, a member of OpenAI's technical staff, caution. None of the models can see more than a page of data at a time, and they're not entirely consistent when it comes to reasoning -- they sometimes fudge numbers, or switch topics in a nonsensical way. Wu, Radford, and the rest of OpenAI's language team leave those shortcomings to future work.

"There are a lot of things to investigate," Wu said. "[W]e're very interested in seeing what the remainder of [the performance] curve looks like. [It] could be that [it] starts leveling out and we need some new research advances, and it could be that just increasing scale keeps giving us gains. [We're] still working on that."

Deepfake news

In a break from tradition, OpenAI says it's choosing not to release the dataset used to train its NLP models, nor three of the four language models or the training code. It won't withhold the text generator frontend -- it plans to make it available publicly as a tool people can interact with directly -- but it believes that publishing the rest might open the door to abusive behavior by bad actors.

"The generality of large language models highlights the omni-use nature of AI," OpenAI wrote in a blog post. "The same tool that an artist could use to help them write a short fiction story ... can also be used to do things like generate synthetic financial news about specific companies ... screeds of racist, sexist, or uninclusive text ... create fake reviews on well-known sites like Amazon or Yelp ... or augment political information influence operations ... For that reason, we're attempting a form of responsible disclosure with this release, where we want to communicate about what we've done in a responsible manner that empowers other important stakeholders, like journalists and policymakers, to also understand and verify what we've done."

OpenAI has a point -- AI systems that can be used to generate misleading content have come under increased scrutiny in recent times. In September, members of Congress sent a letter to National Intelligence director Dan Coats requesting a report from intelligence agencies about the potential impact on democracy and national security of deepfakes -- videos created using AI that digitally grafts faces onto other people’s bodies. During a congressional hearing in late 2018, members of Congress speaking with Facebook COO Sheryl Sandberg and Twitter CEO Jack Dorsey also expressed concerns about the potential impact of manipulative deepfake videos.

There is certainly a risk that tools like OpenAI's cutting-edge language models might be used to generate untrue or misleading stories, contributing to the enormous volume already published daily. In March 2018, half of the U.S. population reported seeing deliberately misleading articles on news websites. And Gartner predicts that by 2022, if current trends hold, a majority of people in the developed world will see more false than true information.

MIT researchers -- along with startups like MetaFact and AdVerify.ai -- have attempted to fight the spread of both human- and machine-written fake news with automated tools that can determine whether a source is accurate or politically prejudiced. But some experts aren't convinced that AI is up to the task of fighting AI. Dean Pomerleau, a Carnegie Mellon University Robotics Institute scientist who helped organize the Fake News Challenge, a competition to crowdsource bias detection algorithms, told the Verge in an interview that AI lacked the nuanced understanding of language necessary to suss out untruths and false statements.

"We actually started out with a more ambitious goal of creating a system that could answer the question 'Is this fake news, yes or no?'" he said. "We quickly realized machine learning just wasn't up to the task."

Human fact-checkers aren't necessarily better. This year, Google suspended Fact Check, a tag that appeared next to stories in Google News that “include information fact-checked by news publishers and fact-checking organizations,” after conservative outlets accused it of exhibiting bias against them.

It's clear that there's work to be done in the policy arena -- and with today's announcement, OpenAI hopes not only to demonstrate the impressive gains it has made in NLP, but also to spark debate among researchers and regulators.

"We see some restraint on publication as a healthy characteristic of technological fields with transformative societal consequences," OpenAI said. "In this case, we were guided initially by a rough consensus within the organization that these results were qualitatively different from prior ones, and that the misuse potential was more pronounced than with prior projects we have been involved in. We eventually hope to create a global community of AI practitioners that think about the information hazards of particular types of releases."
