In a paper published on the preprint server arXiv.org, researchers at Google and the University of Illinois propose mixture invariant training (MixIT), an unsupervised approach to separating, isolating, and enhancing the voices of multiple speakers in an audio recording. This approach requires only single-channel (i.e., monaural) acoustic features, and the researchers claim it “significantly” improves speech separation performance by incorporating reverberant mixtures and a large amount of in-the-wild training data.
As the paper’s coauthors point out, audio perception suffers a fundamental problem: sounds are mixed together in a way that’s impossible to disentangle without knowledge of the sources’ characteristics. Attempts have been made to design algorithms capable of estimating each sound source from single-channel recordings, but most to date are supervised, meaning they train on audio mixtures created by adding isolated sounds together, with or without simulations of the environment. The result is that they fare poorly in the presence of acoustic reverberation or when there’s a mismatch in the distribution of sound types. This is due to several factors. First, it’s tough to match the characteristics of a real corpus, and the room characteristics are sometimes unknown. Second, data for every source type in isolation might not be readily available. Third, accurately simulating realistic acoustics is difficult.
The researchers claim MixIT addresses these challenges by training on acoustic mixtures without reference (isolated) source signals. Training examples are constructed by mixing together existing audio mixtures. The model separates each combined example into a number of sources, and the separated sources are then remixed to approximate the original mixtures.
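The remixing step described above can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes two input mixtures, a negative signal-to-noise ratio as the reconstruction loss, and a brute-force search over every way of assigning each estimated source to one of the two mixtures. The function names `neg_snr` and `mixit_loss` are hypothetical, and the "estimated sources" in the demo are simply the true sources standing in for a separation model's output.

```python
import itertools
import numpy as np


def neg_snr(reference, estimate, eps=1e-8):
    """Negative signal-to-noise ratio in dB (lower means a better match)."""
    signal = np.sum(reference ** 2)
    noise = np.sum((reference - estimate) ** 2)
    return -10.0 * np.log10((signal + eps) / (noise + eps))


def mixit_loss(mixtures, est_sources):
    """Search all assignments of estimated sources to reference mixtures.

    mixtures:    array of shape (n_mix, T), the original mixtures
    est_sources: array of shape (n_src, T), the model's separated outputs
    Returns (best_loss, best_assignment), where best_assignment[j] is the
    index of the mixture that source j is remixed into.
    """
    n_mix, n_src = mixtures.shape[0], est_sources.shape[0]
    best_loss, best_assign = np.inf, None
    for assign in itertools.product(range(n_mix), repeat=n_src):
        # Binary mixing matrix: each source (column) goes to exactly one mixture (row).
        mixing = np.zeros((n_mix, n_src))
        mixing[list(assign), np.arange(n_src)] = 1.0
        remixed = mixing @ est_sources
        loss = sum(neg_snr(mixtures[i], remixed[i]) for i in range(n_mix))
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


# Demo with synthetic signals (assumption: random noise stands in for audio).
rng = np.random.default_rng(0)
sources = rng.standard_normal((4, 1000))
mixtures = np.stack([sources[0] + sources[1], sources[2] + sources[3]])

# In training, the model would only see mixtures[0] + mixtures[1] as input.
# Here the true sources play the role of a perfect separator's output.
loss, assign = mixit_loss(mixtures, sources)
print(assign, loss)
```

In an actual training setup, the minimum loss would be backpropagated through the separation model. The exhaustive search is affordable in this sketch because the number of assignments grows as 2 to the power of the number of output sources, which stays small for a handful of outputs.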
In experiments, MixIT was trained using four Google Cloud tensor processing units (TPUs) to tackle three tasks: speech separation, speech enhancement, and universal sound separation. For speech separation, the researchers drew on the open source WSJ0-2mix and Libri2Mix data sets to extract over 390 hours of recordings of male and female speakers. They added a reverberation effect before feeding a mixture of the two sets (three-second clips from WSJ0-2mix and 10-second clips from Libri2Mix) to the model.
For the speech enhancement task, they collected non-speech sounds from FreeSound.org to test whether MixIT could be trained to remove noisy audio from a mixture containing LibriSpeech voices. And for the universal sound separation task, they used the recently released Free Universal Sound Separation data set to train MixIT to separate arbitrary sounds from an acoustic mixture.
The researchers report that in universal sound separation and speech enhancement, unsupervised training wasn’t as helpful compared with existing approaches — presumably because the test sets were “well-matched” to the supervised training domain. However, for universal sound separation, unsupervised training appeared to help slightly with generalization to the test set relative to the supervised-only training. While it didn’t reach supervised levels, the coauthors claim MixIT’s no-supervision performance was “unprecedented.”
[Audio samples: two recordings fed into the model, each followed by the separate audio sources the model isolated.]
“MixIT opens new lines of research where massive amounts of previously untapped in-the-wild data can be leveraged to train sound separation systems,” the researchers wrote. “An ultimate goal is to evaluate separation on real mixture data; however, this remains challenging because of the lack of ground truth. As a proxy, future experiments may use recognition or human listening as a measure of separation, depending on the application.”