Facebook's voice synthesis AI generates a second of speech in 500 milliseconds

Facebook today unveiled a highly efficient AI text-to-speech (TTS) system that can run in real time on regular processors. It currently powers Portal, the company's brand of smart displays, and it's available internally at Facebook as a service for other apps, like VR.

In tandem with a new data collection approach, which leverages a language model for curation, Facebook says the system -- which produces a second of audio in 500 milliseconds -- enabled it to create a British-accented voice in six months as opposed to over a year for previous voices.

Most modern AI TTS systems require graphics cards, field-programmable gate arrays (FPGAs), or custom-designed AI chips like Google's tensor processing units (TPUs) to run, train, or both. For instance, a recently detailed Google AI system was trained across 32 TPUs in parallel. Synthesizing a single second of humanlike audio can require outputting as many as 24,000 samples -- sometimes even more. And this can be expensive; Google's latest-generation TPUs cost between $2.40 and $8 per hour in Google Cloud Platform.

TTS systems like Facebook's promise to deliver high-quality voices without the need for specialized hardware. In fact, Facebook says its system attained a 160 times speedup compared with a baseline, making it fit for computationally constrained devices. Here's how it sounds:

[audio mp3="https://venturebeat.com/wp-content/uploads/2020/05/TTS_01_v002.mp3"][/audio]

"The system ... will play an important role in creating and scaling new voice applications that sound more human and expressive," the company said in a statement. "We're excited to provide higher-quality audio ... so that we can more efficiently continue to bring voice interactions to everyone in our community."

Components

Facebook's system has four parts, each of which focuses on a different aspect of speech: a linguistic front-end, a prosody model, an acoustic model, and a neural vocoder.

The front-end converts text into a sequence of linguistic features, such as sentence type and phonemes (units of sound that distinguish one word from another in a language, like p, b, d, and t in the English words pad, pat, bad, and bat). As for the prosody model, it draws on the linguistic features, style, speaker, and language embeddings -- i.e., numerical representations that the model can interpret -- to predict sentences' speech-level rhythms and their frame-level fundamental frequencies. ("Frame" refers to a window of time, while "frequency" refers to melody.)

Style embeddings let the system create new voices including "assistant," "soft," "fast," "projected," and "formal" using only a small amount of additional data on top of an existing training set. Only 30 to 60 minutes of data is required for each style, claims Facebook -- an order of magnitude less than the "hours" of recordings a similar Amazon TTS system takes to produce new styles.

Facebook's acoustic model leverages a conditional architecture to make predictions based on spectral inputs, or specific frequency-based features. This enables it to focus on information packed into neighboring frames and train a lighter and smaller vocoder, which consists of two components. The first is a submodel that upsamples (i.e., expands) the input feature encodings from frame rate (187 predictions per second) to sample rate (24,000 predictions per second). A second submodel similar to DeepMind's WaveRNN speech synthesis algorithm generates audio a sample at a time at a rate of 24,000 samples per second.
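
As a rough illustration of the upsampling step, the sketch below stretches frame-level features to sample rate by holding each frame's value for the samples that fall inside it. The helper name and the hold strategy are assumptions made for illustration; the article does not describe how Facebook's submodel expands its inputs, and a learned upsampler would be more elaborate.

```python
FRAME_RATE = 187      # frame-level predictions per second (figure from the article)
SAMPLE_RATE = 24_000  # audio samples per second (figure from the article)

def upsample_hold(frames, frame_rate=FRAME_RATE, sample_rate=SAMPLE_RATE):
    """Repeat each frame-level value so there is one value per audio sample.

    Hypothetical helper: nearest-frame hold, not Facebook's learned upsampler.
    """
    n_samples = len(frames) * sample_rate // frame_rate
    # Sample i falls inside frame floor(i * frame_rate / sample_rate).
    return [frames[i * frame_rate // sample_rate] for i in range(n_samples)]

one_second = list(range(FRAME_RATE))   # 187 dummy frame features
samples = upsample_hold(one_second)
print(len(samples))                    # one value per output sample
```

Because 24,000 is not a multiple of 187, each frame covers either 128 or 129 samples in this sketch.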

Performance boost

The vocoder's autoregressive nature -- that is, its requirement that samples be synthesized in sequential order -- makes real-time voice synthesis a major challenge. Case in point: An early version of the TTS system took 80 seconds to generate just one second of audio.
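
The figures quoted in this article can be cross-checked with a little arithmetic. The sketch below computes a real-time factor (compute time divided by audio duration) for the early and current versions. It assumes, without confirmation from Facebook, that the 80-second early version is the baseline behind the reported 160 times speedup; under that assumption the numbers agree.

```python
def real_time_factor(compute_seconds, audio_seconds):
    """Seconds of compute per second of audio; below 1.0 is faster than real time."""
    return compute_seconds / audio_seconds

early_rtf = real_time_factor(80.0, 1.0)   # early version: 80 s per second of audio
final_rtf = real_time_factor(0.5, 1.0)    # reported: 500 ms per second of audio
speedup = early_rtf / final_rtf           # assumes the early version is the baseline

print(early_rtf, final_rtf, speedup)
```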

The nature of the neural networks at the heart of the system allowed for optimization, fortunately. All models consist of neurons, which are layered, connected functions. Signals from input data travel from layer to layer and slowly "tune" the output by adjusting the strength (weights) of each connection. Neural networks don't ingest raw pictures, videos, text, or audio, but rather embeddings in the form of multidimensional arrays like scalars (single numbers), vectors (ordered arrays of scalars), and matrices (scalars arranged into one or more columns and one or more rows). A fourth entity type that encapsulates scalars, vectors, and matrices -- tensors -- adds in descriptions of valid linear transformations (or relations).

With the help of a tool called PyTorch JIT, Facebook engineers migrated from a training-oriented setup in PyTorch, Facebook's machine learning framework, to a heavily inference-optimized environment. Compiled operators and tensor-level optimizations, including operator fusion and custom operators with approximations for the activation function (mathematical equations that determine the output of a model), led to additional performance gains.
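
To give a sense of what an approximated activation function can look like, here is a minimal sketch that replaces tanh with a cheap rational formula and measures the error it introduces. The choice of tanh, the specific formula, and the name fast_tanh are all assumptions for illustration; the article does not say which activation or which approximation Facebook's custom operators use.

```python
import math

def fast_tanh(x):
    """Rational (Pade-style) approximation of tanh, clamped to [-1, 1].

    Illustration only: not taken from Facebook's implementation.
    """
    if x >= 3.0:
        return 1.0
    if x <= -3.0:
        return -1.0
    x2 = x * x
    return x * (27.0 + x2) / (27.0 + 9.0 * x2)

# Measure the worst-case error on a grid over [-6, 6].
grid = [i / 100.0 for i in range(-600, 601)]
max_error = max(abs(fast_tanh(x) - math.tanh(x)) for x in grid)
print(f"max abs error: {max_error:.4f}")
```

The trade-off is the usual one: a few multiplications and one division stand in for an exponential, at the cost of a small, bounded error.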

Another technique called unstructured model sparsification reduced the TTS system's inference complexity, achieving 96% unstructured sparsity (meaning only 4% of the model's variables, or parameters, are nonzero) without degrading audio quality. Pairing this with optimized sparse matrix operators on the inference model led to a 5 times speed increase.
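
A minimal sketch of what 96% unstructured sparsity means in practice follows. It zeroes the smallest-magnitude weights in a flat list until only 4% remain nonzero. Magnitude-based pruning is an assumption here, standing in for whatever criterion Facebook used; the article reports only the resulting sparsity level, and the function name is hypothetical.

```python
import random

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` of them are zero.

    Assumption: magnitude pruning is a stand-in, not Facebook's stated method.
    """
    n_zero = int(round(len(weights) * sparsity))
    # Indices of the n_zero weights with the smallest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_zero])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

random.seed(0)
dense = [random.gauss(0.0, 1.0) for _ in range(1000)]
sparse = prune_by_magnitude(dense, sparsity=0.96)
nonzero = sum(1 for w in sparse if w != 0.0)
print(nonzero / len(sparse))  # fraction of parameters that survive
```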

Blockwise sparsification, where nonzero parameters are restricted to blocks of 16-by-1 and stored in contiguous memory blocks, significantly reduced bandwidth utilization and cache usage. Various custom operators helped attain efficient matrix storage and compute, so that compute was proportional to the number of nonzero blocks in the matrix. And knowledge distillation, a compression technique where a small network (called the student) is taught by a larger trained neural network (called the teacher), was used to train the sparse model, with a denser model as the teacher.
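
The idea that compute scales with the number of nonzero blocks can be sketched as follows. This example stores only the 16-by-1 blocks that contain a nonzero value and multiplies the matrix by a vector while touching just those blocks. The block orientation (16 rows by 1 column), the storage layout, and the function names are assumptions for illustration, not details from Facebook.

```python
BLOCK = 16  # block height from the article; orientation (16 rows x 1 column) is assumed

def to_blocks(matrix):
    """Keep only the 16x1 column blocks that contain a nonzero value.

    Each stored block is (block_row, column, values), with the 16 values
    held together in one list to mimic contiguous storage.
    """
    blocks = []
    for block_row in range(len(matrix) // BLOCK):
        rows = matrix[block_row * BLOCK:(block_row + 1) * BLOCK]
        for col in range(len(matrix[0])):
            values = [row[col] for row in rows]
            if any(v != 0.0 for v in values):
                blocks.append((block_row, col, values))
    return blocks

def block_sparse_matvec(blocks, n_rows, x):
    """Compute y = W @ x, visiting only the stored nonzero blocks."""
    y = [0.0] * n_rows
    for block_row, col, values in blocks:
        base = block_row * BLOCK
        for k, v in enumerate(values):
            y[base + k] += v * x[col]
    return y

# A 32x4 matrix in which two of the eight blocks are nonzero.
W = [[0.0] * 4 for _ in range(32)]
for r in range(16):
    W[r][1] = 1.0          # block (0, 1)
    W[16 + r][3] = 2.0     # block (1, 3)
x = [1.0, 2.0, 3.0, 4.0]
blocks = to_blocks(W)
y = block_sparse_matvec(blocks, len(W), x)
print(len(blocks), y[0], y[16])
```

The inner loop runs once per stored block, so the six all-zero blocks cost nothing at multiplication time. The sketch assumes the row count is a multiple of 16.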

Finally, Facebook engineers distributed heavy operators over multiple processor cores on the same socket, chiefly by requiring nonzero blocks to be evenly distributed over the parameter matrix during training, and by segmenting and distributing matrix multiplication among several cores during inference.
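
Segmenting a matrix multiplication across workers can be sketched like this. The example splits the rows of a matrix into near-equal contiguous segments and hands each segment to a separate worker. The function names and the row-wise split are assumptions for illustration. Python threads are used only to show the structure; in pure Python they would not deliver the multi-core speedup that a native implementation gets.

```python
from concurrent.futures import ThreadPoolExecutor

def matvec_rows(matrix, x, start, stop):
    """Multiply rows [start, stop) of the matrix by the vector x."""
    return [sum(w * xi for w, xi in zip(matrix[r], x)) for r in range(start, stop)]

def parallel_matvec(matrix, x, n_workers=4):
    """Split the rows into near-equal contiguous segments, one per worker."""
    n_rows = len(matrix)
    bounds = [n_rows * i // n_workers for i in range(n_workers + 1)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(
            lambda i: matvec_rows(matrix, x, bounds[i], bounds[i + 1]),
            range(n_workers),
        )
        # map() yields results in submission order, so rows stay in order.
        return [value for part in parts for value in part]

W = [[float(r + c) for c in range(3)] for r in range(10)]
x = [1.0, 0.0, -1.0]
print(parallel_matvec(W, x))
```

Even segments only balance the load if each one holds a similar amount of work, which is presumably why the nonzero blocks were kept evenly spread across the matrix during training.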

Data collection

Modern commercial speech synthesis systems like Facebook's use data sets that often contain 40,000 sentences or more. To collect sufficient training data, the company's engineers adopted an approach that relies on a corpus of open domain speech recordings -- utterances -- and selects lines from large, unstructured data sets. A language model filters the data sets according to readability criteria, maximizing the phonetic and prosodic diversity present in the corpus while ensuring the language remains natural and readable.
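
A toy version of this kind of selection might look like the sketch below: discard lines that fail a readability check, then greedily pick the lines that add the most units not yet covered. Everything here is a stand-in chosen for illustration. Character bigrams substitute for phonetic units, a simple word check substitutes for the language model, and the function names are hypothetical; Facebook has not published its selection procedure in this article.

```python
def units(line):
    """Character bigrams as a crude stand-in for phonetic units (assumption)."""
    text = "".join(ch for ch in line.lower() if ch.isalpha() or ch == " ")
    return {text[i:i + 2] for i in range(len(text) - 1)}

def is_readable(line, max_words=12):
    """Placeholder for the language-model filter: keep short, alphabetic lines."""
    words = line.split()
    return 0 < len(words) <= max_words and all(
        w.strip(".,?!'").isalpha() for w in words
    )

def select_script(candidates, n_lines):
    """Greedily pick readable lines that add the most unseen units."""
    pool = [line for line in candidates if is_readable(line)]
    covered, chosen = set(), []
    while pool and len(chosen) < n_lines:
        best = max(pool, key=lambda line: len(units(line) - covered))
        if not units(best) - covered:
            break  # nothing new left to cover
        chosen.append(best)
        covered |= units(best)
        pool.remove(best)
    return chosen

corpus = [
    "What time is it?",
    "What time is it now?",
    "Call mom, please.",
    "x9#@ 0041 zz__",
]
print(select_script(corpus, n_lines=2))
```

The near-duplicate first line is skipped because it adds nothing once the longer variant is chosen, and the unreadable last line never enters the pool.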

Facebook says this led to fewer annotations and edits for audio recorded by a professional voice actor, as well as improved overall TTS quality; by automatically identifying script lines from a more diverse corpus, the method let engineers scale to new languages rapidly without relying on hand-generated data sets.

Future work

Facebook next plans to use the TTS system and data collection method to add more accents, dialects, and languages beyond French, German, Italian, and Spanish to its portfolio. It's also focusing on making the system even lighter and more efficient than it is currently so that it can run on smaller devices, and it's exploring features to make Portal's voice respond with different speaking styles based on context.

Last year, Facebook machine learning engineer Parthath Shah told The Telegraph the company was developing technology capable of detecting people's emotions through voice, initially by having employees and paid volunteers re-enact conversations. Facebook later disputed this report, but the seed of the idea appears to have germinated internally. In early 2019, company researchers published a paper on the topic of producing different contextual voice styles, as well as a paper that explores the idea of building expressive text-to-speech via a technique called joint style analysis.

Here's a sample:

[audio mp3="https://venturebeat.com/wp-content/uploads/2020/05/TTS_03_v004.mp3"][/audio]

"For example, when you’re rushing out the door in the morning and need to know the time, your assistant would match your hurried pace," Facebook proposed. "When you're in a quiet place and you're speaking softly, your AI assistant would reply to you in a quiet voice. And later, when it gets noisy in the kitchen, your assistant would switch to a projected voice so you can hear the call from your mom."

It's a step toward what Amazon accomplished with Whisper Mode, an Alexa feature that responds to whispered speech by whispering back. Amazon's assistant also recently gained the ability to detect frustration in a customer's voice as a result of a mistake it made, and apologetically offer an alternative action (e.g., offer to play a different song) -- the fruit of emotion recognition and voice synthesis research begun as far back as 2017.

Beyond Amazon, which offers a range of speaking styles (including a "newscaster" style) in Alexa and its Amazon Polly cloud TTS service, Microsoft recently rolled out new voices in several languages within Azure Cognitive Services. Among them are emotion styles like cheerful, empathetic, and lyrical, which can be adjusted to express different emotions to fit a given context.

"All these advancements are part of our broader efforts in making systems capable of nuanced, natural speech that fits the content and the situation," said Facebook. "When combined with our cutting-edge research in empathy and conversational AI, this work will play an important role in building truly intelligent, human-level AI assistants for everyone."
