Google's WaveNetEQ fills in speech gaps during Duo calls

Google today detailed an AI system called WaveNetEQ that it recently deployed to Duo, the company's cross-platform voice and video chat app. WaveNetEQ can realistically synthesize short snippets of speech to replace garbled audio caused by an unstable internet connection. It's fast enough to run on a smartphone while delivering state-of-the-art, natural-sounding audio quality, laying the groundwork for future chat apps optimized for bandwidth-constrained environments.

Here's how it sounds compared with Duo's old solution (the first is WaveNetEQ):

[audio wav="https://venturebeat.com/wp-content/uploads/2020/04/waveneteq_120_ms_2_63b829581a3291c144a030639139c199.wav"][/audio]

[audio wav="https://venturebeat.com/wp-content/uploads/2020/04/neteq_120_ms_2_8e86d7b2061dfb964b845ebefc1aebd9.wav"][/audio]

As Google explains, to ensure reliable real-time communication, it's necessary to deal with packets (i.e., formatted units of data) that are missing when the receiver needs them. (The company says that 99% of Duo calls need to deal with network issues and 10% of calls lose more than 8% of the total audio duration due to network issues.) If new audio isn't delivered continuously, audible glitches and gaps will occur, but repeating the same audio isn't ideal because it produces artifacts and reduces overall call quality.

Google's solution -- WaveNetEQ -- is what's called a packet loss concealment (PLC) module, which is responsible for creating data to fill in the gaps created by packet losses, excessive jitter, and other mishaps.
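
To make the role of such a module concrete, here is a minimal sketch of a playout loop that notices missing packets and asks a pluggable concealer to fill them. Everything in it is an assumption for illustration: the names `play_out` and `repeat_last`, the use of sequence numbers as dictionary keys, and the list-of-floats frames are hypothetical and are not Duo's actual code. The baseline concealer simply repeats the last frame, which is the approach the article notes tends to produce artifacts.

```python
from typing import Callable, Dict, List

Frame = List[float]
# A concealer receives the frames played so far and the frame length to produce.
Concealer = Callable[[List[Frame], int], Frame]


def repeat_last(history: List[Frame], frame_len: int) -> Frame:
    """Naive baseline (hypothetical): replay the last frame, or silence if none."""
    if history:
        return list(history[-1])
    return [0.0] * frame_len


def play_out(received: Dict[int, Frame], first_seq: int, last_seq: int,
             frame_len: int, conceal: Concealer) -> List[Frame]:
    """Walk the expected sequence numbers and fill every gap with concealed audio."""
    played: List[Frame] = []
    for seq in range(first_seq, last_seq + 1):
        frame = received.get(seq)
        if frame is None:
            # The packet was lost or arrived too late: synthesize a replacement.
            frame = conceal(played, frame_len)
        played.append(frame)
    return played


if __name__ == "__main__":
    packets = {0: [0.1, 0.2], 1: [0.3, 0.4], 3: [0.7, 0.8]}  # packet 2 is missing
    print(play_out(packets, 0, 3, 2, repeat_last))
```

In this framing, a model like WaveNetEQ would take the place of `repeat_last`, generating new audio from the history instead of replaying it.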

Architecturally, WaveNetEQ is a modified version of DeepMind's WaveRNN, a machine learning model for speech synthesis consisting of autoregressive and conditioning networks. The autoregressive network provides short- and mid-term speech structure by having each generated sample depend on the network's previous outputs, while the conditioning network influences the autoregressive network to produce audio consistent with the more slowly moving input features.

WaveNetEQ uses the autoregressive network to provide the audio continuation and the conditioning network to model long-term features, like voice characteristics. The spectrogram -- i.e., the visual representation of the spectrum of frequencies -- of the past audio signal is used as input for the conditioning network, which extracts information about the prosody and textual content. This condensed information is fed to the autoregressive network, which combines it with the audio of the recent past to predict the next sample in the waveform domain.
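
The division of labor described above can be sketched with a toy example. This is not a neural network and not Google's model: `frame_peaks` is a hypothetical stand-in for the conditioning network (it summarizes past audio into one slow-moving feature per frame), and `predict_next` is a hypothetical stand-in for the autoregressive network (it extrapolates from the two most recent samples and is constrained by the conditioning feature). The point is only the data flow: slow features are computed once from past audio, and each new sample is predicted from recent samples and then fed back in.

```python
from typing import List, Sequence


def frame_peaks(past_audio: Sequence[float], frame: int = 4) -> List[float]:
    """Toy conditioning features: peak absolute amplitude of each full frame."""
    peaks = []
    for start in range(0, len(past_audio) - frame + 1, frame):
        chunk = past_audio[start:start + frame]
        peaks.append(max(abs(x) for x in chunk))
    return peaks


def predict_next(recent: Sequence[float], envelope: float) -> float:
    """Toy autoregressive step: linear extrapolation, clamped to the envelope."""
    guess = 2.0 * recent[-1] - recent[-2]
    return max(-envelope, min(envelope, guess))


def continue_audio(past_audio: Sequence[float], n_samples: int,
                   frame: int = 4) -> List[float]:
    """Generate n_samples of continuation, feeding each output back as input."""
    if len(past_audio) < max(2, frame):
        raise ValueError("need at least one full frame (and two samples) of history")
    envelope = frame_peaks(past_audio, frame)[-1]  # slow-moving feature
    history = list(past_audio)
    generated: List[float] = []
    for _ in range(n_samples):
        nxt = predict_next(history, envelope)
        generated.append(nxt)
        history.append(nxt)  # the new sample becomes part of the recent past
    return generated


if __name__ == "__main__":
    print(continue_audio([0.0, 0.5, 1.0, 0.5], 4))
```

With the history `[0.0, 0.5, 1.0, 0.5]`, the toy predictor keeps following the downward slope until the envelope learned from the past audio stops it at -1.0.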

To train the WaveNetEQ model, Google fed the autoregressive network samples from a training data set as input for the next step, rather than using the last sample the model produced. This was to ensure WaveNetEQ learned valuable speech information -- even at an early stage of training, when its predictions were still low quality. The training corpus contained voice recordings from 100 speakers in 48 different languages, as well as a wide variety of background noises to ensure that the model could deal with noisy environments.

Once WaveNetEQ was fully trained and put to use in Duo audio and video calls, this technique of feeding in real samples (known as teacher forcing) was only used to "warm up" the model for the first sample. In production, WaveNetEQ's own output is passed back as input for the next step.
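
The difference between the two modes, training on real samples versus feeding the model its own output, can be illustrated with a small sketch. The technique of feeding ground-truth samples during training is commonly called teacher forcing. The `toy_model` below is a deliberately imperfect, hypothetical one-step predictor, and the function names are assumptions made for this example rather than anything from Google's code.

```python
from typing import Callable, List, Sequence

Predictor = Callable[[float], float]


def toy_model(prev_sample: float) -> float:
    """Hypothetical, deliberately poor predictor: always guesses half the input."""
    return 0.5 * prev_sample


def run_teacher_forced(model: Predictor, ground_truth: Sequence[float]) -> List[float]:
    """Training-style pass: every step sees the true previous sample."""
    return [model(prev_true) for prev_true in ground_truth[:-1]]


def run_free(model: Predictor, first_sample: float, n_steps: int) -> List[float]:
    """Production-style pass: one real sample to warm up, then feed back outputs."""
    predictions: List[float] = []
    prev = first_sample
    for _ in range(n_steps):
        prev = model(prev)  # the model's own output becomes the next input
        predictions.append(prev)
    return predictions


if __name__ == "__main__":
    truth = [8.0, 8.0, 8.0, 8.0]
    print(run_teacher_forced(toy_model, truth))  # errors stay the same size
    print(run_free(toy_model, truth[0], 3))      # errors compound step by step
```

With real samples as input, the poor predictor is wrong by the same amount at every step, so each step still yields a useful training signal. Fed its own output, the same predictor drifts further from the true signal each step, which is why a model that is still inaccurate benefits from the first mode during training.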

WaveNetEQ is applied to the audio data in Duo's jitter buffer so that once the real audio continues after packet loss, it seamlessly merges the synthetic and real audio streams. To find the best alignment between the two signals, the model generates slightly more output than is required and then cross-fades from one to the other, avoiding noticeable noise.
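
A rough sketch of that align-then-cross-fade step follows. It is an illustration under stated assumptions, not Duo's implementation: the alignment criterion (minimum squared error over a short overlap window), the linear fade, and the names `best_offset` and `crossfade_merge` are all choices made for this example. It also ignores the bookkeeping a real jitter buffer would need to keep the merged audio at the right overall length.

```python
from typing import List, Sequence


def best_offset(synthetic: Sequence[float], real: Sequence[float], overlap: int) -> int:
    """Find where in the synthetic audio the start of the real audio fits best."""
    target = real[:overlap]
    best, best_err = 0, float("inf")
    for off in range(len(synthetic) - overlap + 1):
        err = sum((synthetic[off + i] - target[i]) ** 2 for i in range(overlap))
        if err < best_err:
            best, best_err = off, err
    return best


def crossfade_merge(synthetic: Sequence[float], real: Sequence[float],
                    overlap: int) -> List[float]:
    """Keep synthetic audio up to the best match, then fade linearly into real audio."""
    if overlap < 1 or overlap > min(len(synthetic), len(real)):
        raise ValueError("overlap must fit inside both signals")
    off = best_offset(synthetic, real, overlap)
    merged = list(synthetic[:off])
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the real signal ramps up
        merged.append((1.0 - w) * synthetic[off + i] + w * real[i])
    merged.extend(real[overlap:])  # surplus synthetic samples are discarded
    return merged


if __name__ == "__main__":
    synthetic = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # includes extra samples
    real = [3.0, 4.0, 5.0, 6.0, 7.0]                  # audio resumes here
    print(crossfade_merge(synthetic, real, overlap=3))
```

Generating surplus synthetic samples is what gives the search room to work: the more candidate offsets there are, the better the chance of finding a point where the two signals already nearly agree, so the fade has little to smooth over.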

Google says that in practice WaveNetEQ can plausibly finish syllables up to 120 milliseconds in length.

WaveNetEQ is already available in Duo on the Pixel 4 and Pixel 4 XL -- it arrived on March 3 -- and Google says it's in the process of rolling out the system to additional devices. Models with Qualcomm Snapdragon 855 systems-on-chip should have it now, and those with Snapdragon 845 chipsets will get it in the coming days.
