Sophisticated natural language processing systems like OpenAI’s GPT-2 can craft text that’s impressively humanlike, but those same systems often struggle with cogency and coherence. In particular, they don’t pen compelling conclusions: AI-generated story endings tend to be generic and lacking in context.

This shortcoming motivated scientists at Carnegie Mellon University’s School of Computer Science to devise a method that creates more “diverse” endings for a given story. The key, they say, was training models to focus attention on important phrases in the story and to promote the generation of non-generic words.

“A story context is a sequence of sentences connecting characters and events. This task is challenging as it requires modeling the characters, events, and objects in the context, and then generating a coherent and sensible ending based on them. Generalizing the semantics of the events and entities and their relationships across stories is a non-trivial task,” wrote the coauthors. “We show that the combination of the two leads to more diverse and interesting endings.”


Above: A few of the proposed model’s outputs.

The team tapped seq2seq, a sequence-to-sequence architecture built from long short-term memory (LSTM) recurrent neural networks capable of learning long-range dependencies, together with an attention mechanism, to create mathematical representations of the words in the target story’s context, learn the relationships among those words, and translate them back into human-readable text. To incorporate key phrases from the story context, the researchers used RAKE (Rapid Automatic Keyword Extraction), an algorithm that picks out candidate phrases and scores them based on word frequency and co-occurrence; they then sorted the phrases by their scores and discarded those below a certain threshold.
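
To make the keyword step concrete, here’s a minimal Python sketch of RAKE-style phrase scoring. The stopword list, threshold, and sample sentence are illustrative placeholders, not the researchers’ actual configuration.

```python
import re
from collections import defaultdict

# Illustrative stopword list; a real setup would use a much larger one.
STOPWORDS = {"a", "an", "the", "and", "or", "but", "to", "of", "in",
             "on", "at", "was", "is", "her", "his", "she", "he", "they", "it"}

def rake_scores(text, threshold=1.0):
    """Score candidate phrases RAKE-style: split the text at stopwords and
    punctuation, score each word by degree / frequency, and sum over the phrase."""
    words = re.findall(r"[a-z']+", text.lower())

    # Build candidate phrases by breaking the word stream at stopwords.
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    # Word frequency and degree (co-occurrence within candidate phrases).
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)

    # Phrase score = sum of member-word scores; keep phrases above threshold.
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(((s, p) for p, s in scored.items() if s >= threshold),
                  reverse=True)

print(rake_scores("katie met her boyfriend at the coffee shop and they argued about the school play"))
```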

To generate endings, the scientists trained their model on the ROCStories corpus, which contains over 50,000 five-sentence stories. To evaluate the model, they used DIST (Distinct), a metric that divides the number of distinct unigrams (single words), bigrams (pairs of consecutive words), and trigrams (triples of consecutive words) in the generated responses by the total numbers of unigrams, bigrams, and trigrams.
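
As a rough sketch of how such a metric is computed (not the authors’ exact implementation), Distinct-n can be calculated like this:

```python
from collections import Counter

def distinct_n(responses, n):
    """Distinct-n: number of unique n-grams across all generated responses
    divided by the total number of n-grams."""
    ngrams = Counter()
    for response in responses:
        tokens = response.lower().split()
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Hypothetical generated endings, just to show the call.
endings = ["katie dumped her boyfriend", "katie was happy with her boyfriend"]
for n in (1, 2, 3):
    print(f"Distinct-{n}: {distinct_n(endings, n):.3f}")
```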

In a separate test, they trained Google’s BERT on the open source Story Cloze task, in which a model must select the correct ending of a story given two choices, to compare their model against a baseline.
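
Here’s a hedged sketch of how a BERT-based two-choice ending classifier might be set up with the Hugging Face Transformers library; this is not the authors’ code, the story and endings are made up, and the pretrained checkpoint would still need fine-tuning on Story Cloze data before its predictions mean anything.

```python
import torch
from transformers import BertTokenizer, BertForMultipleChoice

# Placeholder checkpoint; the real evaluator would be fine-tuned on Story Cloze pairs.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")

context = "katie met her boyfriend at the coffee shop. they argued about the school play."
endings = ["katie dumped her boyfriend.", "katie adopted a puppy named max."]

# Pair the story context with each candidate ending, then reshape to
# (batch_size, num_choices, seq_len), which BertForMultipleChoice expects.
encoding = tokenizer([context] * len(endings), endings,
                     return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_choices)

print("predicted ending:", endings[logits.argmax(dim=-1).item()])
```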

So how’d the AI perform? Let’s just say a Pulitzer isn’t in the cards. While it was the top performer on DIST and achieved 72% accuracy on the Story Cloze test, it occasionally generated nonsensical endings like “katie was devastated by himself and dumped her boyfriend” or referred to nouns with incorrect pronouns (“katie,” “himself”).

The researchers concede that further work is needed to ensure the outputs “entail the story context at both semantic and token level,” and that they’re logically sound and consistent. Still, they assert that they’ve “quantitatively” and “qualitatively” shown that their model can achieve “meaningful” improvements over the baselines.