Cross-lingual learning is an AI technique in which a natural language processing model is trained in one language and then adapted to another. Transferred models have been shown to outperform models trained from scratch in the second language, which is likely why researchers at Amazon’s Alexa division are investing considerable time in investigating them.
In a paper scheduled to be presented at this year’s Conference on Empirical Methods in Natural Language Processing, two scientists at the Alexa AI natural understanding group — Quynh Do and Judith Gaspers — and colleagues propose a data selection technique that halves the amount of required training data. They claim that it surprisingly improves rather than compromises the model’s overall performance in the target language.
“Sometimes the data in the source language is so abundant that using all of it to train a transfer model would be impractically time consuming,” wrote Do and Gaspers in a blog post. “Moreover, linguistic differences between source and target languages mean that pruning the training data in the source language, so that its statistical patterns better match those of the target language, can actually improve the performance of the transferred model.”
In the course of experiments, Do, Gaspers, and the team employed two methods to cut the source-language data set in half: the aforementioned data selection technique and random sampling. They pretrained separate models on the two halved data sets and on the full data set, after which they fine-tuned the models on a small data set in the target language.
Do and Gaspers note that all of the models were trained simultaneously to recognize intents (requested actions) and fill slots (variables on which the intent acts), and that they took as inputs multilingual embeddings (representations that map words or sequences of words with similar meanings across languages to nearby points in a shared multidimensional space) to bolster model accuracy. The team combined the multilingual embedding of each input word with a character-level embedding that encoded information about words’ prefixes, suffixes, and roots, and they tapped language models trained on large text corpora to select the source-language data that would be fed to the transfer model.
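The embedding combination described here — a word-level multilingual vector concatenated with a character-level vector that captures prefix, suffix, and root information — can be sketched roughly as follows. The lookup tables, dimensions, and trigram averaging below are illustrative stand-ins; in the actual model these representations are learned, and the paper does not specify this exact scheme.

```python
# Illustrative sketch: concatenate a word-level multilingual embedding with
# a character-level embedding built from character trigrams, so that
# subword information (prefixes, suffixes, roots) is captured.
# All tables and dimensions here are toy stand-ins for learned parameters.

def char_trigrams(word):
    """Character trigrams over a padded word, e.g. 'play' -> '<pl', 'pla', 'lay', 'ay>'."""
    padded = f"<{word}>"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def embed(word, word_table, trigram_table, dim=4):
    """Word vector concatenated with the average of its trigram vectors."""
    word_vec = word_table.get(word, [0.0] * dim)
    tris = char_trigrams(word)
    sums = [0.0] * dim
    for tri in tris:
        vec = trigram_table.get(tri, [0.0] * dim)  # unseen trigrams -> zero vector
        sums = [s + v for s, v in zip(sums, vec)]
    char_vec = [s / len(tris) for s in sums]
    return word_vec + char_vec  # concatenation: 2 * dim values

# Toy lookup tables for demonstration only.
word_table = {"playing": [0.1, 0.2, 0.3, 0.4]}
trigram_table = {"<pl": [1.0, 0.0, 0.0, 0.0], "ing": [0.0, 0.0, 0.0, 1.0]}
print(len(embed("playing", word_table, trigram_table)))  # 8
```

The character-level component lets the model produce a useful vector even for words missing from the word table, since shared trigrams like "ing" still contribute signal.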
Within the system the researchers engineered, a bilingual dictionary translated each utterance in the source data set into a string of words in the target language. Four language models were applied to the resulting strings, while a trigram model handled character embeddings. For each translated utterance, the sum of the probabilities computed by the four language models was normalized, and only the utterances yielding the highest normalized scores were selected.
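A minimal sketch of this kind of selection step, in plain Python: toy unigram tables stand in for the trained language models, a word-by-word dictionary translation stands in for the bilingual dictionary, and scores are length-normalized log probabilities. Every name, table, and scoring detail here is illustrative, not taken from the paper.

```python
import math

# Toy "language models" (word -> probability), standing in for the trained
# target-language models described above. Real models are trained on large corpora.
lm_a = {"play": 0.3, "music": 0.2, "the": 0.4, "weather": 0.05}
lm_b = {"play": 0.25, "music": 0.25, "the": 0.35, "weather": 0.1}

# Toy bilingual dictionary mapping source-language words to target-language words.
dictionary = {"spiele": "play", "musik": "music", "das": "the", "wetter": "weather"}

def translate(utterance):
    """Word-by-word dictionary translation of a source utterance."""
    return [dictionary.get(w, w) for w in utterance.split()]

def score(words, lms, floor=1e-6):
    """Sum of per-LM log probabilities, normalized by utterance length."""
    total = 0.0
    for lm in lms:
        total += sum(math.log(lm.get(w, floor)) for w in words)
    return total / len(words)

def select_half(utterances, lms):
    """Keep the half of the source utterances whose translations score best."""
    ranked = sorted(utterances, key=lambda u: score(translate(u), lms), reverse=True)
    return ranked[: len(ranked) // 2]

source = ["spiele musik", "das wetter", "xyzzy qwert", "spiele das"]
print(select_half(source, [lm_a, lm_b]))
```

Utterances whose translations look improbable under the target-language models (like the nonsense "xyzzy qwert") rank last and are pruned, which is the intuition behind matching the source data's statistical patterns to the target language.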
To evaluate their approach, the team first transferred a model from English to German with different amounts of training data in the target language (10,000 and 20,000 utterances, versus millions of utterances in the full source-language data set). Then they trained the transfer model on three different languages — English, German, and Spanish — before transferring it to French (again with 10,000 and 20,000 utterances in the target language). They claim that the transfer models outperformed a baseline model trained only on data in the target language, and that relative to that baseline, the model trained using the novel data selection technique showed improvements of 3% to 5% on slot filling and about 1% to 2% on intent classification.