Condensing paragraphs into sentences isn’t easy for artificial intelligence (AI). That’s because it requires a semantic understanding of the text that’s beyond the capabilities of most off-the-shelf natural language processing models. But it’s not impossible, as researchers at Microsoft recently demonstrated.

In a paper published on the preprint server Arxiv.org (“Structured Neural Summarization”), scientists at Microsoft Research in Cambridge, England, describe an AI framework that can reason about relationships in “weakly structured” text, enabling it to outperform conventional NLP models on a range of text summarization tasks.

When trained on articles from CNN and the Daily Mail (along with sentences summarizing each article), it was able to generate summaries like:

n’golo kante is attracting interest from a host of premier league clubs . marseille have been in constant contact with caen over signing the 24-year-old . the 24-year-old has similarities with lassana diarra and claude makelele in terms of stature .

It calls to mind systems like Primer, which use AI to parse and collate a large number of documents. But Microsoft’s AI is much more generalizable.

“Summarization, the task of condensing a large and complex input into a smaller representation that retains the core semantics of the input, is a classical task for natural language processing systems,” the researchers wrote. “Automatic summarization requires a machine learning component to identify important entities and relationships between them, while ignoring redundancies and common concepts … However, while standard [models] theoretically have the ability to handle arbitrary long-distance relationships, in practice they often fail to correctly handle long texts and are easily distracted by simple noise.”

Their two-part solution pairs an extended sequence encoder, an AI model that reads an input sequence and produces a representation of each element in it, with a neural network that learns directly from graph representations of annotated natural language.

The hybrid system tapped a sequence encoder (one extended to leverage known relationships among elements in the input data) to feed a graph network with “rich input.” It paired a bidirectional long short-term memory (LSTM) encoder and a sequence GNN extension with an LSTM decoder that had a pointer network extension. (LSTMs are a category of recurrent neural network capable of learning long-term dependencies; they combine their memory and inputs to improve prediction accuracy, and bidirectional variants read the input both forward and backward.)
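The idea of feeding per-token encoder states into a graph network can be illustrated with a toy example. The sketch below is not the paper's model: it assumes fixed, made-up state vectors in place of learned LSTM outputs, and it replaces the learned graph update with plain averaging over neighbors. The function name `message_passing_step` and the sample edge are hypothetical.

```python
def message_passing_step(states, edges):
    """One round of mean-aggregation message passing (illustrative only).

    states: list of equal-length float vectors, one per node
    edges:  list of (source, target) index pairs, treated as undirected
    """
    neighbors = {i: [] for i in range(len(states))}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)

    new_states = []
    for i, state in enumerate(states):
        # Average each node's own state with the states of its neighbors.
        group = [state] + [states[j] for j in neighbors[i]]
        new_states.append([sum(vals) / len(group) for vals in zip(*group)])
    return new_states


# Hypothetical encoder outputs for three tokens. The edge (0, 2) stands in
# for a long-distance relationship, such as two mentions of one entity.
states = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
updated = message_passing_step(states, [(0, 2)])
print(updated)
```

After one step, the two linked tokens share information directly, however far apart they sit in the sequence, while the unlinked token in between is unchanged.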

[Image: Summarizations produced from article snippets. Image Credit: Microsoft]

The team set the model — Sequence GNNs — loose on three summarization tasks: Method naming, or inferring the name of a code function (or method) given the source code; Method doc, predicting a description of the functionality of a method; and NL summarization, the creation of a novel natural language summary given some text input.

Two datasets were selected for the first task: a small Java dataset, which was split up for training, validation, and testing; and a second dataset generated from 23 open source projects in C# mined from GitHub. For the second task, Method doc, the researchers reused the dataset of 23 open source C# projects, and for the third (NL summarization), they scraped the aforementioned news articles from CNN and the Daily Mail, each paired with its summary sentences.

To produce graphs from which the AI model could extract information, the team first broke data into identifier tokens (and subtokens) and then constructed graphs by connecting the tokens. The code was tokenized into variables, methods, classes, and other types, while text from the articles corpus was run through Stanford’s CoreNLP open source tokenization tool.
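A rough sketch of that tokens-to-graph step might look like the following. It is an assumption-laden illustration rather than the researchers' pipeline: the helper names `split_subtokens` and `build_graph`, the camelCase/snake_case splitting rule, and the edge labels `NEXT` and `SUBTOKEN_OF` are all invented here for clarity.

```python
import re


def split_subtokens(identifier):
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]


def build_graph(tokens):
    """Return (nodes, edges) for a token sequence.

    Edge labels are illustrative, not taken from the paper:
      NEXT        links each token to the token that follows it
      SUBTOKEN_OF links a subtoken node to its parent token
    """
    nodes, edges = [], []
    prev = None
    for tok in tokens:
        tok_id = len(nodes)
        nodes.append(("token", tok))
        if prev is not None:
            edges.append((prev, "NEXT", tok_id))
        prev = tok_id
        subs = split_subtokens(tok)
        if len(subs) > 1:  # only compound identifiers get subtoken nodes
            for sub in subs:
                sub_id = len(nodes)
                nodes.append(("subtoken", sub))
                edges.append((sub_id, "SUBTOKEN_OF", tok_id))
    return nodes, edges


nodes, edges = build_graph(["return", "fileName", ";"])
for edge in edges:
    print(edge)
```

Here `fileName` yields two extra nodes, `file` and `name`, each tied back to the token it came from, so a model could relate identifiers that share a subtoken.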

So how well did the AI system perform?

The Sequence GNNs achieved state-of-the-art performance in the Method naming task on both the Java and C# datasets, with F-scores (a metric that combines precision and recall, reported here as percentages) of 51.4 and 63.4, respectively. The model performed slightly worse in Method doc, which the researchers chalk up to the length of the predictions. (The ground truth had an average of 19 tokens compared to the model’s 16.) And on NL summarization, it fell short of recent work; that said, the researchers believe it’s “due to … [the] simplistic decoder” and “training objective,” and that it can be improved in future work.
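For readers unfamiliar with the metric, here is a minimal sketch of how an F-score could be computed for one predicted method name. It assumes the prediction and the true name are compared subtoken by subtoken, which is an assumption about the evaluation setup; the function name `subtoken_f1` and the sample names are hypothetical.

```python
from collections import Counter


def subtoken_f1(predicted, reference):
    """F1 between two lists of subtokens, counting duplicates via multisets."""
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)  # share of predicted subtokens that are right
    recall = overlap / len(reference)     # share of true subtokens that were found
    return 2 * precision * recall / (precision + recall)


# Predicted name "getName" versus true name "getFileName".
score = subtoken_f1(["get", "name"], ["get", "file", "name"])
print(round(100 * score, 1))  # 80.0
```

Both predicted subtokens are correct (precision 1.0), but one of the three true subtokens is missing (recall 2/3), which gives an F-score of 0.8, or 80 when expressed as a percentage.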

“We are excited about this initial progress and look forward to deeper integration of mixed sequence-graph modeling in a wide range of tasks across both formal and natural languages,” they wrote. “The key insight, which we believe to be widely applicable, is that inductive biases induced by explicit relationship modeling are a simple way to boost the practical performance of existing deep learning systems.”