In a paper accepted to the 2020 International Conference on Machine Learning (ICML), researchers at Facebook describe a method for separating up to five voices speaking simultaneously into a single microphone. The team claims the method surpasses previous state-of-the-art performance on several speech-source separation benchmarks, including benchmarks with challenging noise and reverberation.

Separating individual voices from overlapping conversation is a crucial step toward improving communication in a range of applications, like voice messaging and video tools. Beyond this, speech separation techniques like the one the researchers propose can be applied to related problems such as background noise suppression, for example in recordings of musical instruments.

Here’s an audio recording of two speakers: [audio sample embedded in the original post]

And here’s the speech Facebook’s model managed to separate: [audio sample embedded in the original post]

The researchers built their model around a novel recurrent neural network, a class of algorithm that employs a memory-like internal state to process variable-length sequences of inputs (e.g., audio). An encoder network maps the raw audio waveform to a latent representation, and a voice separation network then transforms that representation into an estimated audio signal for each speaker. The separation model needs foreknowledge of the total number of speakers, but a companion subsystem can automatically detect the number of speakers and select the appropriate model accordingly.
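To make the encoder/separator split concrete, here is a minimal PyTorch sketch. It is not Facebook’s actual architecture: the layer choices, sizes, and masking scheme are illustrative assumptions, and the real model is substantially more elaborate.

```python
import torch
import torch.nn as nn

class VoiceSeparator(nn.Module):
    """Toy encoder/separator in the spirit described above; all sizes are made up."""
    def __init__(self, n_speakers: int, latent_dim: int = 128, kernel: int = 16):
        super().__init__()
        self.n_speakers = n_speakers
        # Encoder: map the raw waveform to a latent representation.
        self.encoder = nn.Conv1d(1, latent_dim, kernel_size=kernel, stride=kernel // 2)
        # Separation network: a recurrent model over the latent sequence.
        self.rnn = nn.LSTM(latent_dim, latent_dim, num_layers=2,
                           batch_first=True, bidirectional=True)
        # One mask per speaker over the latent features.
        self.masks = nn.Linear(2 * latent_dim, n_speakers * latent_dim)
        # Decoder: map each masked latent back to a waveform.
        self.decoder = nn.ConvTranspose1d(latent_dim, 1, kernel_size=kernel,
                                          stride=kernel // 2)

    def forward(self, mix: torch.Tensor) -> torch.Tensor:
        # mix: (batch, samples) mono waveform
        latent = self.encoder(mix.unsqueeze(1))        # (batch, latent_dim, frames)
        rnn_out, _ = self.rnn(latent.transpose(1, 2))  # (batch, frames, 2*latent_dim)
        masks = self.masks(rnn_out)                    # (batch, frames, n_spk*latent_dim)
        b, t, _ = masks.shape
        masks = masks.view(b, t, self.n_speakers, -1).sigmoid()
        # Apply each speaker's mask to the shared latent and decode to audio.
        outs = []
        for s in range(self.n_speakers):
            masked = (latent.transpose(1, 2) * masks[:, :, s]).transpose(1, 2)
            outs.append(self.decoder(masked).squeeze(1))
        return torch.stack(outs, dim=1)                # (batch, n_speakers, samples)

mix = torch.randn(1, 16000)                 # one second of 16 kHz audio
estimates = VoiceSeparator(n_speakers=2)(mix)
print(estimates.shape)                      # torch.Size([1, 2, 16000])
```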

The researchers trained separate models for two, three, four, and five speakers. To handle a mixture with an unknown number of voices, they first fed it to the five-speaker model and counted how many output channels contained active speech. They then repeated the process with the model trained for that number of speakers, stopping either when all of a model’s output channels were active or when they reached the model with the smallest number of target speakers.
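The selection loop might look something like the sketch below; the energy threshold, function names, and the dict-of-models interface are assumptions for illustration, not details from the paper.

```python
import torch

def count_active(estimates: torch.Tensor, threshold: float = 1e-3) -> int:
    # Treat a channel as an active speaker if its mean energy exceeds a
    # (made-up) silence threshold. estimates: (batch, channels, samples).
    energy = estimates.pow(2).mean(dim=-1)
    return int((energy > threshold).any(dim=0).sum())

def select_and_separate(mix: torch.Tensor, models: dict) -> torch.Tensor:
    # models maps speaker count -> trained separator, e.g. {2: m2, ..., 5: m5}.
    k = max(models)                      # start with the largest (5-speaker) model
    while True:
        estimates = models[k](mix)       # (batch, k, samples)
        active = count_active(estimates)
        # Stop when every output channel is active, or no smaller model exists.
        if active >= k or k == min(models):
            return estimates
        k = max(active, min(models))     # re-run with the detected speaker count
```

Paired with the VoiceSeparator sketch above, models could simply be {n: VoiceSeparator(n) for n in range(2, 6)} after training.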

The researchers believe the system could improve audio quality for people who wear hearing aids, making it easier to pick out individual voices in crowded, noisy environments such as parties and restaurants. As a next step, they plan to prune and optimize the model until its performance is high enough for real-world use.

Facebook’s work follows the publication of a Google paper proposing mixture invariant training (MixIT), an unsupervised approach to separating, isolating, and enhancing the voices of multiple speakers in an audio recording. The coauthors claimed the approach requires only single-channel (i.e., monaural) acoustic features and “significantly” improves speech separation performance by incorporating reverberant mixtures and a large amount of in-the-wild training data.
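For context, the core idea behind MixIT can be sketched as follows: the network separates a sum of two real mixtures, and training searches over assignments of the estimated sources so that some grouping reconstructs both original mixtures, with no clean reference signals needed. The toy loss below uses mean squared error where the paper uses an SNR-based objective; everything here is an illustrative assumption, not Google’s implementation.

```python
import itertools
import torch

def mixit_loss(est_sources: torch.Tensor, mix1: torch.Tensor,
               mix2: torch.Tensor) -> torch.Tensor:
    """Mixture invariant training loss (sketch).

    The separator is run on mix1 + mix2 and produces est_sources of shape
    (n_sources, samples). We try every binary assignment of estimated
    sources to the two mixtures and keep the lowest reconstruction error.
    """
    n = est_sources.shape[0]
    best = None
    for bits in itertools.product([0, 1], repeat=n):
        a = torch.tensor(bits, dtype=est_sources.dtype).unsqueeze(1)
        rec1 = (est_sources * (1 - a)).sum(0)   # sources assigned to mix1
        rec2 = (est_sources * a).sum(0)         # sources assigned to mix2
        loss = ((rec1 - mix1) ** 2).mean() + ((rec2 - mix2) ** 2).mean()
        best = loss if best is None or loss < best else best
    return best

mix1, mix2 = torch.randn(16000), torch.randn(16000)
est = torch.randn(4, 16000)   # pretend the separator produced 4 sources
print(mixit_loss(est, mix1, mix2))
```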