Machine learning has paved the way for faster and more accurate language translation than ever before, but it’s no Babel fish. Cutting-edge systems from Google, Amazon, Microsoft, and others require artificially intelligent (AI) models to ingest millions of documents that have been translated by hand, which they use to find matching words and phrases in the target language. But that’s not a viable approach for the thousands of dialects that lack large corpora.
That’s why researchers at the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Laboratory (MIT CSAIL) took a different tack. In a paper that’ll be presented this week at the Conference on Empirical Methods in Natural Language Processing, they describe an unsupervised model — i.e., a model that learns from data that hasn’t been explicitly labeled or categorized — that can translate between texts in two languages without direct translational data between the two.
It follows Facebook’s forays into unsupervised machine translation. In August, Facebook AI Research (FAIR) — collaborating with the firm’s Applied Machine Learning division — devised a model that uses a combination of word-for-word translations, language models, and back translations to outperform prior systems on some language pairings.
“[Our] model sees the words in the two languages as sets of vectors, and maps [those vectors] from one set to the other by essentially preserving relationships,” Tommi Jaakkola, a CSAIL researcher and the paper’s coauthor, told MIT News. “The approach could help translate low-resource languages or dialects, so long as they come with enough monolingual content.”
Core to the approach is what’s called the Gromov-Wasserstein distance, a statistical metric that records the distance between points in one computational space and matches them to similarly distanced points in another. Here, it’s applied to embeddings — mathematical representations of words called vectors — with words of similar meanings clustered together. In the end, the model is able to align the vectors in embeddings that are most closely correlated by relative distances, a sign they’re likely to be direct translations.
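The core intuition — that words can be matched across languages purely from their distances to other words, without ever comparing the two spaces directly — can be sketched with toy data. The snippet below is a hypothetical illustration, not the paper's method: it builds two tiny "embedding" sets where language B is a rotated, shuffled copy of language A, then recovers the word correspondence by comparing each word's sorted profile of intra-language distances (a much cruder stand-in for the Gromov-Wasserstein coupling the researchers actually compute).

```python
import numpy as np

# Toy "embeddings": five words in language A, as 2-D vectors.
A = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [4.0, 4.0]])

# Language B has the same geometry, but rotated 90 degrees and shuffled,
# so absolute positions tell us nothing about which word is which.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
perm = [3, 0, 4, 1, 2]          # hidden word order in language B
B = (A @ R.T)[perm]

# Intra-language pairwise distance matrices: Gromov-Wasserstein compares
# these relational structures, never the raw coordinates.
CA = np.linalg.norm(A[:, None] - A[None, :], axis=-1)
CB = np.linalg.norm(B[:, None] - B[None, :], axis=-1)

# Match each word in A to the word in B whose sorted distance profile
# (its distances to every other word) is most similar.
profA = np.sort(CA, axis=1)
profB = np.sort(CB, axis=1)
match = [int(np.argmin(np.linalg.norm(profB - profA[i], axis=1)))
         for i in range(len(A))]
print(match)  # → [1, 3, 4, 0, 2], the inverse of the hidden shuffle
```

Because rotation preserves all pairwise distances, the relational fingerprints line up exactly and the hidden permutation is recovered — the same reason the model can align embeddings trained independently on monolingual corpora.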
The researchers’ system — which was trained and tested on fastText, a dataset of publicly available word embeddings covering 110 language pairs — assigns a probability that similarly distanced vectors in one language’s word embeddings will correspond with similar clusters in the second language. It also quantifies the similarity between languages with a single numerical score, computed from the distances of vectors from one another in the two embeddings.
The closer the vectors, the closer the score is to zero. Romance languages like French and Spanish tend toward 1, while Chinese falls between 6 and 9 when paired with other major languages.
Aligning word embeddings isn’t an entirely novel method, the researchers concede, but the system’s use of relational distances makes it more efficient than prior implementations, requiring a fraction of the computation power and little or no tuning.
“The model doesn’t know [there are months in a year],” for example, David Alvarez-Melis, a CSAIL doctoral student and the paper’s first author, said. “It just knows there is a cluster of 12 points that aligns with a cluster of 12 points in the other language, but they’re different from the rest of the words, so they probably go together well. By finding these correspondences for each word, it then aligns the whole space simultaneously.”
It’s not the only recent innovation in the machine translation space. In October, Baidu developed an AI system capable of translating two languages simultaneously. And in June, Google brought offline neural machine translation in 59 languages to Google Translate on iOS and Android.