Alexa scientists teach AI language models new tongues with transfer learning

Adding support for a new language to a voice assistant like Alexa isn't as easy as you might think, but researchers at Amazon believe they've developed a method that'll expedite and simplify the process. In a newly published paper ("Cross-Lingual Transfer Learning for Spoken Language Understanding") and accompanying blog post, they describe a technique that adapts a machine learning model trained in one tongue to another with minimal training data.

The method, which the paper's coauthors are scheduled to present at the International Conference on Acoustics, Speech, and Signal Processing in Barcelona, Spain next month, relies on transfer learning (specifically cross-lingual transfer learning) to bootstrap new functions. They report that, in experiments, it reduced the data requirements for new languages by up to 50 percent.

"We believe that this is the first time that cross-lingual transfer learning has been used to translate a joint intent-slot classifier into a new language," Alexa AI Natural Understanding scientists Quynh Do and Judith Gaspers said.

As they explain, spoken language understanding (SLU) systems typically involve two subtasks -- intent classification and slot tagging -- where an intent is the task a user wants performed and a slot is an entity on which the intent acts. (For example, in the voice command "Alexa, play 'High Hopes' by Panic! at the Disco," the intent is PlayMusic, and "High Hopes" and "Panic! at the Disco" fill the SongName and ArtistName slots.)
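To make the intent and slot distinction concrete, here is a minimal Python sketch of how such an annotated utterance might be represented. The `Utterance` class, the `extract_slots` helper, and the B-/I-/O tagging convention are illustrative assumptions, not details taken from the paper or from Alexa's systems.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Utterance:
    """Hypothetical container for one annotated voice command."""
    tokens: List[str]
    intent: str              # the task the user wants performed
    slot_tags: List[str]     # one tag per token (assumed B-/I-/O scheme)


def extract_slots(utterance: Utterance) -> Dict[str, str]:
    """Group consecutive tagged tokens into slot name -> slot value."""
    slots: Dict[str, str] = {}
    name, words = None, []
    for token, tag in zip(utterance.tokens, utterance.slot_tags):
        if tag.startswith("I-") and tag[2:] == name:
            words.append(token)          # continue the current slot
            continue
        if name is not None:
            slots[name] = " ".join(words)  # close the slot in progress
        if tag.startswith("B-"):
            name, words = tag[2:], [token]  # open a new slot
        else:
            name, words = None, []          # token is outside any slot
    if name is not None:
        slots[name] = " ".join(words)
    return slots


example = Utterance(
    tokens=["Alexa", "play", "High", "Hopes", "by",
            "Panic!", "at", "the", "Disco"],
    intent="PlayMusic",
    slot_tags=["O", "O", "B-SongName", "I-SongName", "O",
               "B-ArtistName", "I-ArtistName", "I-ArtistName",
               "I-ArtistName"],
)
print(example.intent, extract_slots(example))
```

In this representation, intent classification assigns one label to the whole utterance, while slot tagging assigns one label to each token.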

Training intent and slot classifiers jointly improves performance, Do and Gaspers note, so they and colleagues explored six different jointly trained AI systems. After comparing the systems' performance on an open-sourced benchmark data set of English-language SLU examples, the team identified three that outperformed their predecessors on both classification tasks.

Next, they experimented with word embeddings (fixed-length vectors that represent words as points in a multidimensional space, such that words with similar meanings cluster together) and character embeddings (which capture information about words' component parts). They fed these into six different neural networks in total, including a type of recurrent network called a long short-term memory (LSTM) network, which processes sequenced inputs in order so that each output factors in the inputs that preceded it. And they used data from the source language (in this case English) to improve SLU performance in the target (German), chiefly by pretraining the SLU model on the source data and then fine-tuning it on a target data set.
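The pretrain-then-fine-tune pattern can be illustrated with a deliberately tiny Python sketch. It is not the researchers' model: it swaps their neural networks for a plain softmax intent classifier over averaged word vectors, and the embedding table, intent names, training sentences, and hyperparameters are all made up for illustration. The point is only that the weights learned on English are reused as the starting point for training on a small amount of German.

```python
import math

# Hypothetical 2-dimensional embeddings shared by both languages (assumption).
EMBEDDINGS = {
    "play": (1.0, 0.0), "spiele": (0.9, 0.1),
    "music": (0.8, 0.2), "musik": (0.8, 0.2),
    "weather": (0.0, 1.0), "wetter": (0.1, 0.9),
    "forecast": (0.2, 0.8), "vorhersage": (0.2, 0.8),
}
INTENTS = ["PlayMusic", "GetWeather"]


def featurize(utterance):
    """Average the embeddings of the known words in the utterance."""
    vecs = [EMBEDDINGS[w] for w in utterance.lower().split() if w in EMBEDDINGS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(2)]


def predict_probs(weights, features):
    """Softmax over one linear score per intent; each row is [w0, w1, bias]."""
    scores = [w[0] * features[0] + w[1] * features[1] + w[2] for w in weights]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(weights, data, epochs, lr):
    """Run SGD on (utterance, intent) pairs, starting from the given weights."""
    weights = [list(w) for w in weights]  # copy so the input is left untouched
    for _ in range(epochs):
        for utterance, intent in data:
            x = featurize(utterance)
            probs = predict_probs(weights, x)
            for k, name in enumerate(INTENTS):
                grad = probs[k] - (1.0 if name == intent else 0.0)
                weights[k][0] -= lr * grad * x[0]
                weights[k][1] -= lr * grad * x[1]
                weights[k][2] -= lr * grad
    return weights


def classify(weights, utterance):
    probs = predict_probs(weights, featurize(utterance))
    return INTENTS[probs.index(max(probs))]


source_data = [("play music", "PlayMusic"), ("weather forecast", "GetWeather")]
target_data = [("spiele musik", "PlayMusic"), ("wetter vorhersage", "GetWeather")]

initial = [[0.0, 0.0, 0.0] for _ in INTENTS]
# Step 1: pretrain on the (large) source-language data.
pretrained = train(initial, source_data, epochs=200, lr=0.5)
# Step 2: fine-tune the same weights on the (small) target-language data.
finetuned = train(pretrained, target_data, epochs=20, lr=0.1)

print(classify(finetuned, "spiele musik"))
print(classify(finetuned, "wetter vorhersage"))
```

Fine-tuning here uses fewer epochs and a smaller learning rate than pretraining, a common (though not universal) choice that keeps the model close to what it learned from the source language.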

In a large-scale test, they created a corpus from one million utterances sampled from an English Alexa SLU system, plus random samples of 10,000 and 20,000 utterances from a German Alexa SLU system. The development set consisted of 2,000 utterances from the German system.

With bilingual input embeddings trained to group semantically similar words from both languages, the researchers found that a transferred model whose source data was the million English utterances and whose target data was the 10,000 German utterances classified intents more accurately than a monolingual model trained on 20,000 German utterances. With both the 10,000- and 20,000-utterance German data sets, the transferred model achieved a 4 percent improvement in slot classification score versus a monolingual model trained on only the German utterances.
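As a rough illustration of what "grouping semantically similar words from both languages" means, the following Python sketch looks up a word's nearest neighbor in the other language by cosine similarity. The three-dimensional vectors and the `nearest_in_language` helper are invented for this example; real bilingual embeddings are learned from data and have far more dimensions.

```python
import math

# Hypothetical bilingual embeddings keyed by (language, word) (assumption).
BILINGUAL = {
    ("en", "music"): (0.9, 0.1, 0.0),
    ("de", "musik"): (0.85, 0.15, 0.05),
    ("en", "weather"): (0.0, 0.2, 0.9),
    ("de", "wetter"): (0.05, 0.25, 0.85),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_in_language(word_key, language):
    """Return the entry in `language` whose vector is closest to word_key's."""
    query = BILINGUAL[word_key]
    candidates = [key for key in BILINGUAL if key[0] == language]
    return max(candidates, key=lambda key: cosine(query, BILINGUAL[key]))


print(nearest_in_language(("en", "music"), "de"))
print(nearest_in_language(("en", "weather"), "de"))
```

Because translation pairs sit close together in the shared space, a model trained on English inputs sees similar vectors when it is later given German ones.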

"Although the highway LSTM model was the top-performing model on the English-language test set, that doesn't guarantee that it will yield the best transfer learning results," they wrote. "In ongoing work, we're transferring the other models to the German-language context, too."
