DeepMind's WaveNet produces more human-like speech than Google's best systems

Google's DeepMind research lab today published its latest work in the area of speech synthesis, which is better known as text-to-speech (TTS). And the findings are fascinating: A trained version of the new WaveNet artificial neural network produces English and Chinese speech that sounds more natural than Google's latest implementations of two kinds of TTS systems.

WaveNet is a convolutional neural network, a type of model widely used in deep learning, the currently popular approach to artificial intelligence. After being trained on lots of data, these networks can make inferences about new data, and they can also be used to generate new data. Convolutional networks are widely used, often for image recognition, at Google and other companies, such as Facebook.
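To make the "generate new data" idea concrete, here is a minimal sketch of how a convolutional model can produce audio one sample at a time. It is a toy, not DeepMind's code: the function names, the fixed filter weights, and the seed values are all assumptions made for illustration. The real WaveNet stacks many learned layers and predicts a probability distribution over sample values rather than a single number.

```python
def causal_dilated_conv(signal, weights, dilation):
    """Convolve so that output[t] depends only on signal[t], signal[t - d], ...

    "Causal" means no output ever looks at a future sample.
    """
    out = []
    for t in range(len(signal)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # positions before the start act as zero padding
                total += w * signal[idx]
        out.append(total)
    return out


def generate(seed, steps, weights, dilation):
    """Autoregressive generation: predict a sample, append it, repeat."""
    samples = list(seed)
    for _ in range(steps):
        # The last filter output summarizes the most recent past samples;
        # this toy model uses it directly as the next sample.
        next_sample = causal_dilated_conv(samples, weights, dilation)[-1]
        samples.append(next_sample)
    return samples


if __name__ == "__main__":
    # Hypothetical two-tap averaging filter, chosen only for illustration.
    print(generate(seed=[1.0, 3.0], steps=2, weights=[0.5, 0.5], dilation=1))
    # prints [1.0, 3.0, 2.0, 2.5]
```

Each new sample is fed back in as input for the next prediction, which is why generation of this kind proceeds strictly in order.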

To train WaveNet, the DeepMind researchers used Google's single-speaker North American English and Mandarin TTS data sets, each recorded by a professional female speaker. They then compared it with a concatenative system driven by a hidden Markov model (HMM) and a parametric system based on a long short-term memory recurrent neural network (LSTM-RNN), both built from the same training data. The researchers had people rate how natural the speech from each system sounded, alongside natural speech samples.

In both Mandarin and North American English, WaveNet performed "significantly better" than the parametric and concatenative systems, Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu wrote in their paper. WaveNet was not, however, perceived to be more human than the actual human recordings.

Indeed, the WaveNet-generated speech samples that DeepMind is providing online today do sound human-like to me, or at least more human than the other systems. (Don't believe me? Listen for yourself.)

Google DeepMind took WaveNet beyond the domain of TTS and trained it on solo piano music on YouTube to produce new music. The results do not sound robotic; instead they're surprisingly expressive and passionate. (Again, check out the samples yourself.)

The DeepMind group didn't stop there. It also applied WaveNet to speech recognition. "[When] we trained WaveNet with two loss terms, one to predict the next sample and one to classify the frame, the model generalized better than with a single loss and achieved 18.8 PER on the test set, which is to our knowledge the best score obtained from a model trained directly on raw audio on TIMIT," the researchers wrote. (PER stands for phoneme error rate; lower is better.)
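For readers unfamiliar with the metric, phoneme error rate is conventionally the minimum number of substitutions, deletions, and insertions needed to turn the recognized phoneme sequence into the reference, divided by the reference length and expressed as a percentage. The sketch below shows that standard calculation. It is not the researchers' evaluation code, and the phoneme sequences in the example are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Made-up example: the recognizer drops one of four phonemes.
    reference = ["hh", "ah", "l", "ow"]
    hypothesis = ["hh", "ah", "ow"]
    print(phoneme_error_rate(reference, hypothesis))  # prints 25.0
```

By this measure, a score of 18.8 PER means roughly 19 errors for every 100 reference phonemes.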

DeepMind did not say whether Google has started using WaveNet in its existing products, but it would not be surprising if it had. DeepMind's AI already helps Google reduce energy use inside its data centers.

DeepMind received extensive media attention earlier this year when its AlphaGo artificially intelligent Go player beat high-ranking South Korean Go player Lee Sedol in a five-game series. Google acquired DeepMind for a reported $400 million in 2014. The group collaborates with the separate Google Brain team, although "the time zone difference between London and Mountain View makes really deep collaborations more challenging than one might like," as Google senior fellow Jeff Dean wrote in a Reddit AMA session last month.
