Google achieves state-of-the-art NLP performance with an enormous language model and data set

Transfer learning, a technique that entails pretraining an AI model on a data-rich task before fine-tuning it on another task, has been successfully applied in domains from robotics to object classification. But it holds particular promise in the subfield of natural language processing (NLP), where it has given rise to a diversity of benchmark-besting approaches. To advance it further, researchers at Google developed a new data set, the Colossal Clean Crawled Corpus, and a unified framework and model dubbed the Text-to-Text Transfer Transformer that converts language problems into a text-to-text format. They say that in experiments with one of the largest models ever submitted to the General Language Understanding Evaluation (GLUE) benchmark, they achieved state-of-the-art results on benchmarks covering question answering, text classification, and more.

Generally speaking, training a model to perform NLP tasks involves ensuring the model develops knowledge enabling it to "understand" text -- knowledge that might range from low-level (for example, the spelling or meaning of words) to high-level (that a tuba is too large to fit in most backpacks). The team at Google investigated an approach that took text as input and produced new text as output, applying the same objective, training procedure, and decoding process to every task considered.
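As a rough illustration of what casting every task as text in, text out can look like, the sketch below prepends a short task prefix to each input and treats the answer as a plain string. The prefix strings and the helper `to_text_pair` are assumptions made for this example; the article does not describe the exact format the researchers used.

```python
# Hypothetical task prefixes, chosen for illustration only.
TASK_PREFIXES = {
    "translation": "translate English to German: ",
    "summarization": "summarize: ",
    "classification": "classify sentiment: ",
}


def to_text_pair(task, source_text, target_text):
    """Cast one labeled example into an (input string, output string) pair."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task}")
    # Every task shares one format: prefixed text in, plain text out.
    return TASK_PREFIXES[task] + source_text, target_text


examples = [
    to_text_pair("translation", "That is good.", "Das ist gut."),
    to_text_pair("classification", "A delightful film.", "positive"),
]

for model_input, model_target in examples:
    print(repr(model_input), "->", repr(model_target))
```

Because even a class label becomes an ordinary target string, a single model, objective, and decoding procedure can in principle serve every task.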

The general-knowledge training corpus they compiled, the aforementioned Colossal Clean Crawled Corpus, was sourced from the Common Crawl project, which scrapes roughly 20 terabytes of English text from the web each month. To filter out gibberish, boilerplate menus, and error messages, they retained only lines of text that ended in a terminal punctuation mark (a period, exclamation mark, question mark, or end quotation mark) and deleted pages with obvious filler text as well as duplicates. The resulting collection, at around 750GB, is said to be an order of magnitude larger than most data sets used for pretraining.
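The filtering heuristics described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the researchers' actual pipeline: the function names, the "lorem ipsum" filler marker, and the exact-match deduplication of whole pages are all choices made for this example.

```python
# Terminal marks named in the article; a straight double quote stands in
# for "end quotation mark" (an assumption of this sketch).
TERMINAL_MARKS = (".", "!", "?", '"')


def clean_page(page_text):
    """Keep only the lines that end in a terminal punctuation mark."""
    kept = []
    for line in page_text.splitlines():
        line = line.strip()
        if line.endswith(TERMINAL_MARKS):
            kept.append(line)
    return "\n".join(kept)


def build_corpus(pages, filler_markers=("lorem ipsum",)):
    """Filter pages, dropping filler pages and exact duplicates."""
    seen = set()
    corpus = []
    for page in pages:
        # Hypothetical filler check: skip pages containing a marker phrase.
        if any(marker in page.lower() for marker in filler_markers):
            continue
        cleaned = clean_page(page)
        if not cleaned or cleaned in seen:
            continue  # nothing survived, or an exact duplicate
        seen.add(cleaned)
        corpus.append(cleaned)
    return corpus


pages = [
    "Home | About | Contact\nTransfer learning is widely used.\n404 not found",
    "Home | About | Contact\nTransfer learning is widely used.\n404 not found",
    "Lorem ipsum dolor sit amet.",
]
print(build_corpus(pages))
```

In this toy input, the menu line and the error message are dropped because they lack terminal punctuation, the second page is discarded as a duplicate, and the third is discarded as filler.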

The researchers trained several Transformer-based models on the corpus to evaluate the effectiveness of their text-to-text approach. For the uninitiated, Transformers are a type of neural architecture introduced in a 2017 paper coauthored by scientists at Google Brain, Google's AI research division. Like all deep neural networks, they contain neurons (mathematical functions) arranged in interconnected layers that transmit signals from input data and slowly adjust the synaptic strength (weights) of each connection. That's how all AI models extract features and learn to make predictions, but Transformers uniquely have attention such that every output element is connected to every input element. The weightings between them are calculated dynamically.
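To make the idea of attention more concrete, here is a minimal sketch of scaled dot-product attention, the core operation in Transformers, written in plain Python. It is a toy under stated assumptions: real models first apply learned projections to produce queries, keys, and values and run many attention heads in parallel, all of which is omitted here, and the function names are this example's own.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Score this query against every input position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)  # computed from the data, not stored
        # Each output is a weighted mix of all value vectors.
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
    return outputs


# Toy self-attention: three positions, two dimensions each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
for row in out:
    print([round(v, 3) for v in row])
```

Every output row depends on every input row, and the mixing weights are recomputed for each input rather than fixed in advance, which is the sense in which the weightings are calculated dynamically.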

The largest model contained 11 billion parameters, or configuration variables internal to the model that are required when making predictions. Fine-tuned on various language tasks, it achieved a state-of-the-art average score of 89.7 on GLUE, the team says, along with state-of-the-art results on the SQuAD reading comprehension benchmark and the CNN/Daily Mail summarization benchmark. And tested on SuperGLUE, which comprises tasks beyond the scope of current NLP systems but solvable by college-educated speakers, it nearly matched human performance, scoring 88.9 against a human baseline of 89.8.

The team concedes their model fell short in linguistic tasks like translation, which they blame on a relative dearth of task-specific data and insufficient training scale. As a result, they advocate for research on methods that achieve stronger performance with smaller models so that transfer learning can be applied where it will have the most impact.

"An unsurprising but important result from our study is that larger models tend to perform better," the paper's coauthors wrote. "The fact that the hardware used for running these models is continually getting cheaper and more powerful suggests that scaling up may continue to be a promising way to achieve better performance [Sutton, 2019]. However, it will always be the case that there are applications and scenarios where using a smaller or less expensive model is helpful, for example when performing client-side inference or federated learning."
