Baidu launches simultaneous language translation AI

Baidu has developed an AI system capable of translating from one language into another simultaneously, while the speaker is still talking. The Beijing company has dubbed it Simultaneous Translation with Anticipation and Controllable Latency (STACL), and claims it represents a "major breakthrough" in natural language processing.

STACL, unlike most AI translation systems, is capable of beginning a translation just a few seconds into a speaker's speech and finishing seconds after the end of a sentence. It's the opposite of consecutive interpretation, where a translator waits until the speaker pauses to start translating.

Baidu said it tackled the challenge by modeling the system after human interpreters. STACL directly predicts the target language words in the translation, and fuses translation and anticipation into a single model -- "wait-k" -- which always translates k words behind the speaker's speech to allow context for prediction. (The system's trained to use the available prefix of the source sentence to decide the next word in the translation.)

Here's how Baidu explains it:

"In [the example] Bùshí Zǒngtǒng zài Mòsīkē ('Bush President in Moscow') and the English translation so far 'President Bush,' which is k=2 words behind Chinese, our system accurately predicts that the next translation word must be 'meet' because Bush is likely 'meeting' someone (e.g., Putin) in Moscow, long before the Chinese verb appears."
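The wait-k idea can be illustrated with a minimal Python sketch. It is not Baidu's implementation: the function names are hypothetical, and `toy_predict` is a canned stand-in for a trained model. It only shows the schedule described above, in which each target word is predicted from a source prefix that stays k words ahead of the translation.

```python
from typing import Callable, List, Sequence


def wait_k_schedule(source_len: int, target_len: int, k: int) -> List[int]:
    """How many source words are visible when each target word is emitted."""
    # Target word t (0-indexed) sees the first k + t source words,
    # capped at the full sentence once the speaker has finished.
    return [min(k + t, source_len) for t in range(target_len)]


def wait_k_translate(
    source: Sequence[str],
    k: int,
    predict_next: Callable[[Sequence[str], Sequence[str]], str],
    max_len: int = 50,
    eos: str = "</s>",
) -> List[str]:
    """Emit target words one at a time, always k source words behind."""
    target: List[str] = []
    while len(target) < max_len:
        # Only the prefix heard so far is passed to the predictor.
        visible = source[: min(k + len(target), len(source))]
        word = predict_next(visible, target)
        if word == eos:
            break
        target.append(word)
    return target


# Hypothetical stand-in for a trained model: returns canned words in order.
def toy_predict(visible: Sequence[str], target: Sequence[str]) -> str:
    canned = ["President", "Bush", "meet"]
    return canned[len(target)] if len(target) < len(canned) else "</s>"


SOURCE = ["Bùshí", "Zǒngtǒng", "zài", "Mòsīkē"]

print(wait_k_schedule(source_len=6, target_len=7, k=2))  # [2, 3, 4, 5, 6, 6, 6]
print(wait_k_translate(SOURCE, k=2, predict_next=toy_predict))
```

With k=2, the third target word is predicted when only the four source words in Baidu's example are visible, which is the point at which the real system has to anticipate the verb.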

STACL's other key advantage is flexibility in latency. The lag can be set lower or higher depending on how closely the two languages are related -- lower for French and Spanish, for example, and higher for distant languages such as English and Chinese, or languages with different word orders such as English and German.

"It is more common for translation quality to suffer with low latency requirements, but our system sacrifices only a small loss in quality compared to conventional full-sentence (e.g. non-simultaneous) translation," Baidu wrote. "We are continuing to improve translation quality given low latency requirements."

So how well does STACL perform? According to Baidu, in Chinese-to-English simultaneous translation where the AI system lags behind the Chinese speech by about three seconds, the translation quality is about 3.4 BLEU points lower than conventional full-sentence (non-simultaneous) translation. (BLEU, short for "bilingual evaluation understudy," is a standard metric for evaluating machine-translated text.)

"Even with the latest advancement, we are fully aware of the many limitations of a simultaneous machine translation system," Baidu wrote. "The release of STACL is not intended to replace human interpreters, who will continue to be depended upon for their professional services for many years to come, but rather to make simultaneous translation more accessible."

The STACL announcement comes just months after Baidu said that DuerOS, its conversational AI assistant, had reached an install base of 100 million devices, up from 50 million six months prior.

"We used to be a search company, but in the AI era, we want to be an AI platform company," Baidu executive Kun Jing told VentureBeat in an interview last year.

STACL advances the firm's earlier work in speech recognition -- and more broadly, in AI. In 2016 and 2017, Baidu launched SwiftScribe, a web app powered by its DeepSpeech platform, and TalkType, a dictation-centric Android keyboard, respectively. And more recently, in July, it unveiled a custom-designed AI chip -- Kunlun AI -- for edge and cloud computing, alongside Baidu Brain 3.0, a suite of 110 AI services ranging from natural language processing to computer vision.

Baidu's not the only company making waves in AI-powered translation and transcription. Microsoft in March demonstrated a system that matched human performance in translating news from Chinese to English. Facebook has begun leveraging unsupervised machine learning to translate content from one language to another. And researchers from the University of Toronto developed an offline speech recognition model that's 97 percent accurate.
