Google's AI trains state-of-the-art language models using less compute and data

In a recent study, researchers at Google proposed Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA), an AI language training technique that outperforms existing methods given the same amount of computing resources. This week, months after its publication, the coauthors released the codebase (and pretrained models) for TensorFlow, laying the groundwork for powerful models capable of performing language tasks with state-of-the-art accuracy. The models could someday make their way into customer service chatbots, or they might be incorporated into a tool that summarizes reports for executive teams.

Pretraining methods generally fall under two categories: language models (e.g., OpenAI's GPT), which process input text left-to-right and predict the next word given the previous context, and masked language models (e.g., Google's BERT and ALBERT, and Facebook's RoBERTa), which predict the identities of a small number of words that have been masked out of the input. Masked language models have an advantage in that they "see" the text to both the left and right of the token (i.e., word) being predicted, but their predictions are limited to a small subset of input tokens, reducing the amount they can learn from each sentence.
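To make the trade-off concrete, here is a minimal sketch of how few positions a masked language model is scored on compared with a left-to-right model. The 15% masking rate, the `[MASK]` placeholder, the example sentence, and the `mask_tokens` helper are all illustrative assumptions, not details taken from the ELECTRA codebase.

```python
import random

MASK = "[MASK]"  # placeholder symbol, assumed for illustration


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, target_positions) for a toy masked-LM example."""
    rng = random.Random(seed)
    # Mask at least one token so the example always has a prediction target.
    n_masked = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, positions


tokens = "the chef cooked the meal for the guests".split()
masked, positions = mask_tokens(tokens)

# A left-to-right language model predicts a next token after every prefix...
lm_targets = len(tokens) - 1
# ...while the masked model is only scored on the masked positions.
mlm_targets = len(positions)

print(masked)
print("left-to-right targets:", lm_targets, "| masked targets:", mlm_targets)
```

In this toy sentence the masked model sees context on both sides but learns from a single position, while the left-to-right model learns from seven positions but only ever sees the text to the left.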

ELECTRA's secret sauce is a pretraining task called replaced token detection, which trains a bidirectional model (as a masked language model does) while learning from all input positions, much like a language model. The model being trained, called the discriminator, is tasked with distinguishing between "real" and "fake" input data; ELECTRA "corrupts" the input by replacing some tokens with incorrect -- but somewhat plausible -- fakes, after which it requires the discriminator to determine, for every token, whether it has been replaced or kept the same.
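The training targets for replaced token detection can be sketched in a few lines. The example sentence and the `rtd_labels` helper below are hypothetical and only illustrate the labeling scheme; the real model predicts these labels with a neural network rather than computing them by comparison.

```python
def rtd_labels(original, corrupted):
    """Label each position 1 if the token was replaced, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]


original = "the chef cooked the meal".split()
corrupted = "the chef ate the meal".split()  # "cooked" swapped for a plausible fake

labels = rtd_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```

Every position carries a label, so the model receives a training signal from all five tokens rather than only from the one that was altered.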

The replacement tokens are sourced from another AI model referred to as the generator. The generator can be any model that produces an output distribution over tokens, but the Google researchers used a small masked language model trained jointly with the discriminator. The generator and discriminator share the same input word embeddings. After the pretraining phase, the generator is dropped and the discriminator (the ELECTRA model) is fine-tuned on various downstream tasks.
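The corruption step can be illustrated with a toy stand-in for the generator. In this sketch, `toy_generator`, its fixed probabilities, and the `corrupt` helper are assumptions made up for illustration; the researchers' generator is a small masked language model that conditions on the surrounding text.

```python
import random


def toy_generator(tokens, position):
    """Stand-in for a small masked LM: returns {candidate_token: probability}.

    Hypothetical fixed distribution; a real generator would condition on
    the surrounding tokens.
    """
    return {"cooked": 0.5, "ate": 0.3, "served": 0.2}


def corrupt(tokens, masked_positions, generator, rng):
    """Replace each masked position with a token sampled from the generator."""
    corrupted = list(tokens)
    for pos in masked_positions:
        dist = generator(tokens, pos)
        candidates, weights = zip(*dist.items())
        corrupted[pos] = rng.choices(candidates, weights=weights, k=1)[0]
    return corrupted


rng = random.Random(0)
tokens = "the chef cooked the meal".split()
corrupted = corrupt(tokens, [2], toy_generator, rng)

# Labels compare against the original text, so a sample that happens to
# match the original token counts as "kept", not "replaced".
labels = [int(o != c) for o, c in zip(tokens, corrupted)]
print(corrupted, labels)
```

Because the generator outputs a distribution over plausible tokens, the fakes it produces are harder to spot than random words, which is what makes the detection task informative.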

The team reports that in experiments, ELECTRA "substantially" improved over previous methods, with performance comparable to RoBERTa and XLNet using less than 25% of the compute. The researchers managed to outperform GPT after training a small ELECTRA model on a single graphics card (1/30th of the compute) in 4 days. And with a large ELECTRA model trained using far more compute, they attained state-of-the-art performance on the SQuAD 2.0 question-answering data set and the GLUE leaderboard of language understanding tasks. (ELECTRA didn't beat Google's own T5-11b on GLUE, but the researchers note that it's 1/30th the size and uses 10% of the compute to train.)

ELECTRA matches the performance of RoBERTa and XLNet on the GLUE natural language understanding benchmark when using less than 1/4 of their compute and achieves state-of-the-art results on the SQuAD question-answering benchmark. ELECTRA's excellent efficiency means it works well even at small scale -- it can be trained in a few days on a single GPU to better accuracy than GPT, a model that uses over 30x more compute. ELECTRA is being released as an open source model on top of TensorFlow and includes a number of ready-to-use pretrained language representation models.

"ELECTRA needs to see fewer examples to achieve the same performance because it receives more training signal per example," wrote student researcher Kevin Clark and Google Brain senior research scientist Thang Luong in a blog post. "At the same time, RTD results in powerful representation learning, because the model must learn an accurate representation of the data distribution in order to solve the task."
