Hugging Face dives into machine translation with release of 1,000 models

Hugging Face is taking its first step into machine translation this week with the release of more than 1,000 models. Researchers trained the models using unsupervised learning and the Open Parallel Corpus (OPUS). OPUS is a project undertaken by the University of Helsinki and global partners to gather and open-source a wide variety of language data sets, particularly for low-resource languages. Low-resource languages are those with less training data than more commonly used languages like English.

Started in 2010, the OPUS project incorporates popular data sets like JW300. Available in 380 languages, the Jehovah's Witness text is used by a number of open source projects for low-resource languages, such as Masakhane, which aims to create machine translation from English to 2,000 African languages. Translation can enable interpersonal communication between people who speak different languages and empower people around the world to participate in online and in-person commerce, something that will be especially important for the foreseeable future.

The launch Thursday means models trained with OPUS data now make up the majority of models provided by Hugging Face, and it makes the University of Helsinki's Language Technology and Research Group the largest contributing organization. Before this week, Hugging Face was best known for enabling easy access to state-of-the-art language models and language generation models, like Google's BERT, which can predict the next characters, words, or sentences that will appear in text.

With more than 500,000 pip installs, the Hugging Face Transformers library for Python includes pretrained versions of state-of-the-art NLP models such as Google AI's BERT and XLNet, Facebook AI's RoBERTa, and OpenAI's GPT-2.
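
For readers who want to try one of the translation models, the following is a minimal sketch rather than an official recipe. It assumes the models are published on the Hugging Face model hub under IDs of the form Helsinki-NLP/opus-mt-{source}-{target} and that they load through the library's MarianMTModel and MarianTokenizer classes; none of these details are stated in this article. The helper names opus_mt_model_name and translate are hypothetical, and the sketch also assumes the transformers, torch, and sentencepiece packages are installed.

```python
from typing import List


def opus_mt_model_name(src: str, tgt: str) -> str:
    """Build a hub model ID for a language pair.

    Assumption: models follow the 'Helsinki-NLP/opus-mt-{src}-{tgt}' pattern.
    """
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def translate(texts: List[str], src: str = "en", tgt: str = "de") -> List[str]:
    """Translate a batch of sentences with a pretrained model (hypothetical helper)."""
    # Imported lazily so the naming helper works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    name = opus_mt_model_name(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(name)  # downloads on first use
    model = MarianMTModel.from_pretrained(name)

    # Tokenize the batch, generate translations, and decode back to text.
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling translate(["Hello, world."], "en", "de") would download the English-to-German model on first use and return a list with one translated string. Not every language pair necessarily has a model, so an unknown pair would fail at load time.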

Hugging Face CEO Clément Delangue told VentureBeat that the venture into machine translation was a community-driven initiative that the company undertook to build more community around cutting-edge NLP, following a $15 million funding round in late 2019.

"Because we open source, and so many people are using our libraries, we started to see more and more groups of people in different languages getting together to work on pretraining some of our models in different languages, especially low resource languages, which are kind of like a bit forgotten by a lot of people in the NLP community," he said. "It made us realize that in our goal of democratizing NLP, a big part to achieve that was not only to get the best results in English, as we've been doing, but more and more provide access to other languages in the model and also provide translation."

Delangue also said the decision was due to recent advances in machine translation and sequence-to-sequence (Seq2Seq) models. Hugging Face first started working with Seq2Seq models in the past few months, Delangue said. Notable recent Seq2Seq models that can be applied to machine translation include T5 from Google and Facebook AI Research's BART, which is a denoising autoencoder for pretraining Seq2Seq models.

"Even a year ago we might not have done it just because the results of pure machine translation weren't that good. Now it's getting to a level where it's starting to make sense and starting to work," he said. Delangue added that Hugging Face will continue to explore data augmentation techniques for translation.

The news follows an integration earlier this week with Weights and Biases to power visualizations that track, log, and compare training experiments. Hugging Face brought its Transformers library to TensorFlow last fall.
