Facebook researchers propose 'pre-finetuning' to improve language model performance

Machine learning researchers have achieved remarkable success with language model pretraining, which uses self-supervision, a training technique that doesn't require labeled data. Pretraining refers to training a model on one task to help it recognize patterns that can be applied to a range of other tasks. In this way, pretraining imitates the way human beings process new knowledge. That is, by reusing parameters learned on earlier tasks, models learn to adapt to new and unfamiliar tasks.

For many natural language tasks, however, training examples for related problems exist. In an attempt to leverage these, researchers at Facebook propose "pre-finetuning," a method for training language models that adds a learning step between pretraining and fine-tuning. This step uses over 4.8 million training examples drawn from around 50 classification, summarization, question-answering, and commonsense reasoning datasets. They claim that pre-finetuning consistently improves performance for pretrained models while also significantly improving sample efficiency during fine-tuning.

It's an approach that has been attempted before, often with success. In a 2019 study, researchers at the Allen Institute noticed that pre-finetuning a BERT model on a multiple choice question dataset appeared to teach the model something about multiple choice questions in general. A subsequent study found that pre-finetuning increased a model's robustness to name swaps, in which the names of different people are swapped in a passage that the model must answer questions about.

In order to ensure that their pre-finetuning stage incorporated general language representations, the researchers included tasks in four different domains: classification, commonsense reasoning, machine reading comprehension, and summarization. They call their pre-finetuned models MUPPET, which roughly stands for "Massive Multi-task Representation with Pre-finetuning."
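
As a rough illustration of what combining tasks from several domains into a single training stage could look like, the Python sketch below pools toy examples from four domains and shuffles them into mixed batches. The dataset names, sizes, and the proportional shuffling scheme are assumptions made for illustration; they are not taken from the researchers' paper, and no model is trained here.

```python
import random
from collections import Counter

# Hypothetical toy datasets, one per domain named in the article.
# The sizes are invented for illustration only.
TASKS = {
    "classification": [("classification", i) for i in range(8)],
    "commonsense_reasoning": [("commonsense_reasoning", i) for i in range(4)],
    "reading_comprehension": [("reading_comprehension", i) for i in range(6)],
    "summarization": [("summarization", i) for i in range(2)],
}


def mixed_batches(tasks, batch_size, seed=0):
    """Yield batches that mix examples from every task.

    All examples are pooled, shuffled once, and sliced into batches, so
    each task appears in proportion to its dataset size. This sampling
    scheme is an assumption, not the method described in the paper.
    """
    rng = random.Random(seed)
    pool = [example for examples in tasks.values() for example in examples]
    rng.shuffle(pool)
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]


if __name__ == "__main__":
    batches = list(mixed_batches(TASKS, batch_size=5))
    counts = Counter(name for batch in batches for name, _ in batch)
    print(f"{len(batches)} batches; examples per task: {dict(counts)}")
```

In a real setup, each batch would be passed to a shared model with a task-specific output head; the sketch only shows how examples from different domains might end up side by side in the same training stream.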

After pre-finetuning RoBERTa and BART, two popular pretrained models for natural language understanding, the researchers tested their performance on widely used benchmarks including RTE, BoolQ, RACE, SQuAD, and MNLI. Interestingly, the results show that pre-finetuning can hurt performance when few tasks are used, up to a critical point that is usually above 15 tasks. Beyond this point, pre-finetuning leads to performance improvements that correlate with the number of language tasks. MUPPET models outperform their vanilla pretrained counterparts, and leveraging representations learned from 34 to 40 tasks enables the models to reach even higher accuracies with less data than a baseline RoBERTa model.

"These [performance] gains are particularly strong in the low resource regime, where there is relatively little labeled data for fine-tuning," the researchers wrote in a paper describing their work. "We show that we can effectively learn more robust representations through multitask learning at scale. ... Our work shows how even seemingly very different datasets, for example, summarization and extractive QA, can help each other by improving the model's representations."
