Researchers propose using 'rare word' dictionaries to bolster unsupervised language model training

In a preprint study, researchers at Microsoft, Peking University, and Nankai University say they've developed an approach -- Taking Notes on the Fly (TNF) -- that makes unsupervised language model pretraining more efficient by noting rare words to help models understand when (and where) they occur. They claim experimental results show TNF "significantly" accelerates pretraining of Google's BERT, cutting training time by 60%, while also improving the model's performance.

One of the advantages of unsupervised pretraining is that it doesn't require annotated data sets. Instead, models train on massive corpora from the web, which improves performance on various natural language tasks but tends to be computationally expensive. Training a BERT-based model on Wikipedia data requires more than five days using 16 Nvidia Tesla V100 graphics cards; even small models like ELECTRA take upwards of four days on a single card.

The researchers' work aims to improve efficiency through better data utilization, taking advantage of the fact that many words appear only a few times in training corpora, even though such rare words turn up in around 20% of sentences, according to the team. The embeddings of those words -- i.e., the numerical representations from which the models learn -- are usually poorly optimized. The researchers argue that because these embeddings don't carry enough semantic information for models to understand what the words mean, they could also slow down the training of other model parameters.
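To make the heavy-tail point concrete, here is a minimal Python sketch that counts word frequencies in a toy corpus and measures what share of sentences contain at least one rare word. The function name `rare_word_stats`, the whitespace tokenization, and the `max_count` cutoff are illustrative assumptions, not the definition of "rare" used in the study.

```python
from collections import Counter


def rare_word_stats(sentences, max_count=1):
    """Return (rare_words, share_of_sentences_containing_a_rare_word).

    A word counts as rare if it appears at most `max_count` times in the
    whole corpus. The cutoff is a hypothetical choice for illustration.
    """
    # Naive whitespace tokenization; a real pipeline would use a proper tokenizer.
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(tok for sent in tokenized for tok in sent)
    rare = {word for word, c in counts.items() if c <= max_count}

    # Count sentences that contain at least one rare word.
    hits = sum(1 for sent in tokenized if any(tok in rare for tok in sent))
    share = hits / len(tokenized) if tokenized else 0.0
    return rare, share


corpus = ["the cat sat", "the cat ran", "the cat sat", "the okapi sat"]
rare, share = rare_word_stats(corpus, max_count=1)
print(sorted(rare), share)
```

On this toy corpus only "ran" and "okapi" fall under the cutoff, yet half of the sentences contain one of them, which is the kind of imbalance the researchers describe at a much larger scale.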

TNF was inspired by how humans grasp information. Note-taking is a useful skill that can help recall tidbits that would otherwise be lost; if people take notes after encountering a rare word that they don't know, the next time the rare word appears, they can refer to the notes to better understand the sentence. Similarly, TNF maintains a note dictionary and saves a rare word's context information when the rare word occurs. If the same rare word occurs again in training, TNF employs the note information to enhance the semantics of the current sentence.

The researchers say TNF introduces little computational overhead at pretraining since the note dictionary is updated on the fly. Moreover, they assert it's only used to improve the training efficiency of the model and isn't served as part of the model; when the pretraining is finished, the note dictionary is discarded.
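The note-taking idea described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the class name `NoteDictionary`, the moving-average update controlled by `decay`, and the blending weight `mix` are all hypothetical choices, and plain Python lists stand in for the model's vectors.

```python
class NoteDictionary:
    """Toy note dictionary: one vector per rare word, updated during training."""

    def __init__(self, rare_words, decay=0.9, mix=0.5):
        self.rare_words = set(rare_words)
        self.decay = decay  # assumption: share of the old note kept on update
        self.mix = mix      # assumption: weight of the note in the blended input
        self.notes = {}

    def take_note(self, word, context_vectors):
        """Save the mean of the surrounding context vectors as the word's note."""
        if word not in self.rare_words or not context_vectors:
            return
        n = len(context_vectors)
        dim = len(context_vectors[0])
        mean = [sum(vec[i] for vec in context_vectors) / n for i in range(dim)]
        old = self.notes.get(word)
        if old is None:
            self.notes[word] = mean
        else:
            # Blend the new context into the existing note (moving average).
            self.notes[word] = [
                self.decay * o + (1 - self.decay) * m for o, m in zip(old, mean)
            ]

    def enhance(self, word, embedding):
        """Mix the saved note into the word's embedding, if a note exists."""
        note = self.notes.get(word)
        if note is None:
            return list(embedding)
        return [(1 - self.mix) * e + self.mix * n for e, n in zip(embedding, note)]


notes = NoteDictionary({"okapi"}, decay=0.5, mix=0.5)
notes.take_note("okapi", [[0.0, 2.0], [2.0, 4.0]])  # first sighting
print(notes.enhance("okapi", [3.0, 1.0]))           # later sighting uses the note
```

In this sketch the dictionary lives only alongside the training loop: once pretraining ends, the object can simply be dropped, mirroring the researchers' point that the notes are not shipped with the model.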

To evaluate TNF's efficacy, the coauthors concatenated a Wikipedia corpus and the open source BookCorpus into a single 16GB data set, which they preprocessed, segmented, and normalized. They used it to pretrain several BERT-based models, which they then fine-tuned on the popular General Language Understanding Evaluation (GLUE) benchmark.

The researchers report that TNF accelerates the BERT-based models throughout the entire pretraining process. The average GLUE scores were higher than the baseline's through most of the pretraining, with one model reaching BERT's performance within two days, whereas a TNF-free BERT model took nearly six days. And the BERT-based models with TNF outperformed the baseline model by "considerable margins" on the majority of GLUE sub-tasks (eight tasks in total).

"TNF alleviates the heavy-tail word distribution problem by taking temporary notes for rare words during pre-training," the coauthors wrote. "If trained with the same number of updates, TNF outperforms original BERT pre-training by a large margin in downstream tasks. Through this way, when rare words appear again, we can leverage the cross-sentence signals saved in their notes to enhance semantics to help pre-training."