Language model pretraining, a technique that “teaches” machine learning systems contextualized text representations by having them predict words based on their contexts, has advanced the state of the art across a range of natural language processing tasks. However, models like Google’s BERT, which are bidirectional by design (meaning they draw on both left-of-word and right-of-word context to form predictions), aren’t well-suited to natural language generation without substantial modification.

That’s why scientists at Microsoft Research investigated an alternative approach dubbed UNIfied pre-trained Language Model (UniLM), which is trained on unidirectional, sequence-to-sequence, and bidirectional prediction tasks and which can be fine-tuned for both natural language understanding and generation. They claim it compares favorably to BERT on popular benchmarks, achieving state-of-the-art results on a sampling of abstractive summarization, generative question answering, and language generation data sets.

UniLM is a multi-layer network at its core, made up of Transformer AI models jointly pretrained on large amounts of text and optimized for language modeling. For the uninitiated, Transformers contain interconnected neurons (functions) that transmit signals from input data, with the strength (weight) of each connection adjusted during training. That’s how neural networks in general extract features and learn to make predictions, but Transformers add attention, in which every output element is connected to every input element and the weightings between them are calculated dynamically.
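To make the idea of dynamically calculated weightings concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration only, not UniLM's code: the function names and the toy vectors are assumptions made for this example, and it leaves out the learned projections and multiple heads that real Transformers use.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Illustrative scaled dot-product attention (hypothetical helper).

    Every output row is a weighted average of all value rows, and the
    weights are recomputed from the inputs each time the function runs.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy inputs (made up for illustration): two tokens, two dimensions each.
tokens = [[1.0, 0.0], [0.0, 1.0]]
payload = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = attention(tokens, tokens, payload)
print(weights)   # each row sums to 1
print(outputs)
```

Because the weights come from the inputs themselves rather than from fixed connections, changing any token changes how much every other token contributes to each output.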

According to the researchers, the pretrained UniLM is similar to BERT in that it can be fine-tuned (with additional task-specific layers if necessary) to adapt to various downstream tasks. But unlike BERT, UniLM can be configured using different self-attention masks to aggregate context for different types of language models. Additionally, owing to the unified nature of their pretraining, the Transformer networks can share parameters (data learned from historical training), which makes learned text representations more general and thus mitigates overfitting (when a system models training data too well) to any single task.
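The self-attention masks mentioned above can be pictured as grids of ones and zeros that say which tokens each position is allowed to look at. The sketch below is a hedged illustration rather than the researchers' implementation: `build_mask`, its mode names, and the `src_len`/`tgt_len` parameters are hypothetical, and the sequence-to-sequence rule assumes that source tokens see the whole source while target tokens see the source plus the target tokens up to and including themselves.

```python
def build_mask(mode, src_len, tgt_len=0):
    """Return an n x n grid where 1 means "row token may attend to column token".

    Hypothetical helper for illustration; mode names are assumptions.
    """
    n = src_len + tgt_len
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            if mode == "bidirectional":
                allowed = True            # every token sees every token
            elif mode == "left_to_right":
                allowed = j <= i          # only itself and earlier tokens
            elif mode == "seq2seq":
                if i < src_len:
                    allowed = j < src_len  # source sees the full source only
                else:
                    allowed = j <= i       # target sees source + earlier target
            else:
                raise ValueError(f"unknown mode: {mode}")
            row.append(1 if allowed else 0)
        mask.append(row)
    return mask


# Two source tokens followed by two target tokens.
for row in build_mask("seq2seq", src_len=2, tgt_len=2):
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

In a setup like this, the same network weights can serve all three modes, since only the mask changes from one objective to the next.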

The researchers report that UniLM, which was pretrained using articles from English Wikipedia and the open source BookCorpus with a vocabulary size of 28,996, performed impressively across language tasks. Specifically, they say it achieved results on par with those of BERT on the GLUE benchmark (which evaluates general language understanding) and two question-answering data sets, and that it outperformed previous state-of-the-art models on five natural language generation data sets: CNN/DailyMail (which tests summarization), Gigaword (abstractive summarization), SQuAD (question generation), CoQA (generative question answering), and DSTC7 (dialog response generation).

The team leaves for future work the task of pushing the limits of the current method by training larger models on “web-scale” text corpora. They also hope to investigate extending UniLM to support cross-lingual tasks.

The code and pretrained models are available on GitHub.