Multilingual masked language modeling, in which a single AI model is trained to predict hidden words in text from several languages, has been used to great effect. In March, a team introduced an architecture that can jointly learn sentence representations for 93 languages belonging to more than 30 different families. But most previous work in language modeling has investigated cross-lingual transfer with a shared vocabulary across monolingual data sets. By contrast, Facebook researchers recently set out to explore whether linguistic knowledge transfer could be achieved with text from very different domains.
In a paper published on the preprint server arXiv.org this week, scientists at Facebook AI and Johns Hopkins University detail the effects of different masked language modeling pretraining approaches on cross-lingual transfer. They say they’ve uncovered evidence that universal representations emerge in pretrained models without any shared vocabulary or domain similarity, even when only a small subset of the parameters (the values a model learns during training) is shared. In fact, they claim that by sharing parameters alone, pretraining learns to map similar words and sentences to similar hidden representations.
Over the course of several experiments, the researchers sought to evaluate performance on several different cross-lingual transfer tasks, and to suss out the factors that play outsize roles in making models multilingual. Additionally, they attempted to determine whether independently trained monolingual models like Google’s BERT learn similar representations across languages.
The team reports that parameter sharing is the most important factor in performance, and that word-level, contextual word-level, and sentence-level representations from separately trained models can indeed be aligned with a simple mapping. This last finding, they say, provides insight into why parameter sharing alone is sufficient for multilingual representations to emerge in multilingual masked language models.
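To make the idea of aligning representations with a simple mapping concrete, here is a minimal sketch using orthogonal Procrustes, one common way to fit a linear map between two embedding spaces. The article does not say which mapping the researchers used, so the method, the function names, and the toy data below are all assumptions for illustration, not the paper's implementation.

```python
import numpy as np


def fit_orthogonal_map(src, tgt):
    """Return the orthogonal matrix W minimizing ||src @ W - tgt||_F.

    This is the classic orthogonal Procrustes solution (an assumed choice of
    "simple mapping"; the article does not name the method the paper uses).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def nearest_neighbor_accuracy(mapped, tgt):
    """Fraction of mapped source vectors whose closest target (by cosine) is their pair."""
    a = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    b = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    predictions = (a @ b.T).argmax(axis=1)
    return float((predictions == np.arange(len(tgt))).mean())


# Toy stand-ins for two independently trained models: the "target" space is a
# hidden rotation of the "source" space plus a little noise (hypothetical data).
rng = np.random.default_rng(0)
dim, n_pairs = 8, 200
hidden_rotation = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
src_vecs = rng.normal(size=(n_pairs, dim))
tgt_vecs = src_vecs @ hidden_rotation + 0.01 * rng.normal(size=(n_pairs, dim))

W = fit_orthogonal_map(src_vecs, tgt_vecs)
accuracy = nearest_neighbor_accuracy(src_vecs @ W, tgt_vecs)
print(f"nearest-neighbor retrieval accuracy after alignment: {accuracy:.2f}")
```

In this toy setup the two spaces differ only by a rotation, so a single matrix recovers the correspondence. With real models, the interesting question is how close the fit comes to that ideal.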
“We find that monolingual models trained in different languages learn representations that align with each other surprisingly well, as compared to the same language upper bound, even though they have no shared parameters,” wrote the paper’s coauthors, who plan to investigate more distant language pairs in future work. “[This suggests] it should be possible to adapt pretrained [models] to new languages with little additional training and it may be possible to better align independently trained representations without having to jointly train on all of the (very large) unlabeled data that could be gathered.”
The research builds on Facebook’s extensive work in natural language processing, some of which it detailed in a blog post last month. The tech giant’s wav2vec model uses raw audio to improve speech recognition, while its self-supervision model, ConvLM, recognizes words outside of its training lexicon with high accuracy. In a related development, Facebook recently demonstrated a machine learning system, Polyglot, that when given voice data is able to produce new speech samples in multiple languages, and researchers at the company devised ways to enhance Google’s BERT language model and achieve performance exceeding state-of-the-art results across popular benchmark data sets.