

Artificial speech translation is a rapidly emerging artificial intelligence (AI) technology. Initially created to aid communication among people who speak different languages, speech-to-speech translation (S2ST) technology has found its way into several domains. For example, global tech conglomerates are now using S2ST to directly translate shared documents and audio conversations in the metaverse.

At Cloud Next ’22 last week, Google announced its own speech-to-speech AI translation model, “Translation Hub,” using cloud translation APIs and AutoML translation. Now, Meta isn’t far behind.

Meta AI today announced the launch of the universal speech translator (UST) project, which aims to create AI systems that enable real-time speech-to-speech translation across all languages, even those that are spoken but not commonly written. 

“Meta AI built the first speech translator that works for languages that are primarily spoken rather than written. We’re open-sourcing this so people can use it for more languages,” said Mark Zuckerberg, cofounder and CEO of Meta. 


According to Meta, the model is the first AI-powered speech translation system for the unwritten language Hokkien, a Chinese language spoken in southeastern China and Taiwan and by many in the Chinese diaspora around the world. The system allows Hokkien speakers to hold conversations with English speakers, a significant step toward breaking down the global language barrier and bringing people together wherever they are located — even in the metaverse. 

This is a difficult task since, unlike Mandarin, English and Spanish, which have both written and spoken forms, Hokkien is predominantly oral.

How AI can tackle speech-to-speech translation

Meta says that today’s AI translation models are focused on widely spoken written languages, and that more than 40% of primarily oral languages are not covered by such translation technologies. The UST project builds upon the progress Zuckerberg shared during the company’s AI Inside the Lab event held back in February, about Meta AI’s universal speech-to-speech translation research for languages that are uncommon online. That event focused on using such immersive AI technologies for building the metaverse. 

To build UST, Meta AI focused on overcoming three critical translation system challenges. It addressed data scarcity by acquiring more training data in more languages and finding new ways to leverage the data already available. It addressed the modeling challenges that arise as models grow to serve many more languages. And it sought new ways to evaluate and improve on its results.

Meta AI’s research team worked on Hokkien as a case study for an end-to-end solution, from training data collection and modeling choices to benchmarking datasets. The team focused on creating human-annotated data, automatically mining data from large unlabeled speech datasets, and adopting pseudo-labeling to produce weakly supervised data. 

“Our team first translated English or Hokkien speech to Mandarin text, and then translated it to Hokkien or English,” said Juan Pino, researcher at Meta. “They then added the paired sentences to the data used to train the AI model.”

Meta AI’s Mark Zuckerberg demonstrates the company’s speech-to-speech AI translation model.

For the modeling, Meta AI applied recent advances in using self-supervised discrete representations as targets for prediction in speech-to-speech translation, and demonstrated the effectiveness of leveraging additional text supervision from Mandarin, a language similar to Hokkien, in model training. Meta AI says it will also release a speech-to-speech translation benchmark set to facilitate future research in this field. 

William Falcon, AI researcher and CEO/cofounder of Lightning AI, said that artificial speech translation could play a significant role in the metaverse as it helps stimulate interactions and content creation.

“For interactions, it will enable people from around the world to communicate with each other more fluidly, making the social graph more interconnected. In addition, using artificial speech translation for content allows you to easily localize content for consumption in multiple languages,” Falcon told VentureBeat. 

Falcon believes that a confluence of factors, such as the pandemic having massively increased the amount of remote work, as well as reliance on remote working tools, have led to growth in this area. These tools can benefit significantly from speech translation capabilities.

“Soon, we can look forward to hosting podcasts, Reddit AMA, or Clubhouse-like experiences within the metaverse. Enabling those to be multicast in multiple languages expands the potential audience on a massive scale,” he said.

How Meta’s universal speech translator (UST) works 

The model uses speech-to-unit translation (S2UT) to convert input speech directly into a sequence of discrete acoustic units, an approach Meta previously pioneered; output waveforms are then generated from those units. In addition, Meta AI adopted UnitY for a two-pass decoding mechanism, where the first-pass decoder generates text in a related language (Mandarin) and the second-pass decoder creates the acoustic units.
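Schematically, the two-pass flow can be sketched as below. This is a toy illustration of the dataflow only: every function is a stand-in for a learned model component, and none of the names or values correspond to Meta's actual code.

```python
# Toy sketch of a UnitY-style two-pass pipeline. Each function stands in
# for a trained neural component; the values are placeholders.

def encode_speech(waveform):
    # Stand-in for the speech encoder: map audio samples to hidden states.
    return [round(x * 0.5, 3) for x in waveform]

def first_pass_text_decoder(hidden):
    # Pass 1: emit text in a related written language
    # (Mandarin, in the Hokkien system).
    return "intermediate-text"

def second_pass_unit_decoder(hidden, text):
    # Pass 2: emit a sequence of discrete acoustic units,
    # conditioned on both the encoder states and the first-pass text.
    return [7, 7, 3, 12, 12, 5]

def unit_vocoder(units):
    # Stand-in for the unit-to-waveform vocoder that synthesizes speech.
    return [u / 16.0 for u in units]

def translate_speech(waveform):
    hidden = encode_speech(waveform)
    text = first_pass_text_decoder(hidden)
    units = second_pass_unit_decoder(hidden, text)
    return unit_vocoder(units)
```

The point of the intermediate text pass is that Mandarin, a related written language, provides extra supervision that the unwritten target language cannot.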

To enable automatic evaluation for Hokkien, Meta AI developed a system that transcribes Hokkien speech into a standardized phonetic notation called “Tâi-lô.” This allowed the data science team to compute BLEU scores (a standard machine translation metric) at the syllable level and quickly compare the translation quality of different approaches. 
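To make the syllable-level idea concrete, here is a minimal sentence-level BLEU over syllable tokens. It assumes (as a simplification) that Tâi-lô marks syllable boundaries with spaces and hyphens, and it omits the smoothing that production toolkits apply; the example strings are illustrative, not from Meta's benchmark.

```python
import math
import re
from collections import Counter

def syllables(tailo_text):
    # Assumption: Tâi-lô separates syllables with spaces and hyphens,
    # so splitting on both yields syllable tokens.
    return [s for s in re.split(r"[\s-]+", tailo_text) if s]

def _ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def syllable_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence-level BLEU computed over syllable tokens."""
    hyp, ref = syllables(hypothesis), syllables(reference)
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = _ngram_counts(hyp, n)
        ref_ngrams = _ngram_counts(ref, n)
        # clipped n-gram matches: each hypothesis n-gram counts at most
        # as often as it appears in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        if overlap == 0:
            return 0.0  # one empty precision zeroes the geometric mean
        log_precisions += math.log(overlap / sum(hyp_ngrams.values()))
    # brevity penalty discourages overly short hypotheses
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precisions / max_n)
```

Scoring syllables rather than words sidesteps the lack of standardized word segmentation for a primarily oral language: a perfect match scores 1.0, and any missing syllable n-gram pulls the score down.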

The model architecture of UST with single-pass and two-pass decoders. The blocks in shade illustrate the modules that were pretrained. Image source: Meta AI.

In addition to developing a method for evaluating Hokkien-English speech translations, the team created the first Hokkien-English bidirectional speech-to-speech translation benchmark dataset, based on a Hokkien speech corpus called Taiwanese Across Taiwan. 

Meta AI claims that the techniques it pioneered with Hokkien can be extended to many other unwritten languages — and eventually work in real time. For this purpose, Meta is releasing the Speech Matrix, a large corpus of speech-to-speech translations mined using LASER, Meta’s toolkit for language-agnostic sentence representations. This will enable other research teams to create their own S2ST systems. 

LASER embeds sentences from many languages into a single multimodal, multilingual representation space. Large-scale similarity search then identifies sentences that lie close together in that space — i.e., ones that are likely to have the same meaning in different languages. 
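The mining step can be sketched as a nearest-neighbor search over such embeddings. The sketch below uses plain cosine similarity and tiny hand-made vectors as stand-ins for LASER embeddings; Meta's actual pipeline operates at a far larger scale with more sophisticated margin-based scoring.

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mine_parallel(src_embs, tgt_embs, threshold=0.9):
    """For each source sentence, keep its nearest target sentence as a
    candidate translation pair if the similarity clears a threshold."""
    pairs = []
    for s_text, s_vec in src_embs.items():
        best_text, best_score = None, -1.0
        for t_text, t_vec in tgt_embs.items():
            score = cosine(s_vec, t_vec)
            if score > best_score:
                best_text, best_score = t_text, score
        if best_score >= threshold:
            pairs.append((s_text, best_text, round(best_score, 3)))
    return pairs

# Toy 3-dimensional vectors standing in for LASER's language-agnostic
# embeddings; real embeddings have hundreds of dimensions.
english = {"good morning": [0.9, 0.1, 0.0],
           "see you later": [0.1, 0.9, 0.2]}
hokkien = {"gâu-tsá": [0.88, 0.12, 0.01],
           "tsài-huē": [0.12, 0.85, 0.25]}
```

Because sentences with the same meaning land near each other regardless of language, a simple proximity test is enough to surface candidate translation pairs from otherwise unaligned corpora.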

The mined data in the Speech Matrix provides 418,000 hours of parallel speech for training translation models, covering 272 language directions. So far, more than 8,000 hours of Hokkien speech have been mined together with the corresponding English translations.

A future of opportunities and challenges in speech translation

Meta AI’s current focus is developing a speech-to-speech translation system that does not rely on generating an intermediate textual representation during inference. This approach has been demonstrated to be faster than a traditional cascaded system that combines separate speech recognition, machine translation and speech synthesis models.

Yashar Behzadi, CEO and founder of Synthesis AI, believes that technology needs to enable more immersive and natural experiences if the metaverse is to succeed.

He said that one of the current challenges for UST models is the computationally expensive training that’s needed because of the breadth, complexity and nuance of languages.

“To train robust AI models requires vast amounts of representative data. A significant bottleneck to building these AI models in the near future will be the privacy-compliant collection, curation and labeling of training data,” he said. “The inability to capture sufficiently diverse data may lead to bias, differentially impacting groups of people. Emerging synthetic voice and NLP technologies may play an important role in enabling more capable models.”

According to Meta, with improved efficiency and simpler architectures, direct speech-to-speech could unlock near-human-quality real-time translation for future devices like AR glasses. In addition, the company’s recent advances in unsupervised speech recognition (wav2vec-U) and unsupervised machine translation (mBART) will aid the future work of translating more spoken languages within the metaverse. 

With such progress in unsupervised learning, Meta aims to break down language barriers both in the real world and in the metaverse for all languages, whether written or unwritten.
