Google's SoundFilter AI separates any sound or voice from mixed-audio recordings

Researchers at Google claim to have developed a machine learning model that can separate a sound source from noisy, single-channel audio based on only a short sample of the target source. In a paper, they say their SoundFilter system can be tuned to filter arbitrary sound sources, even those it hasn't seen during training.

The researchers believe a noise-eliminating system like SoundFilter could be used to create a range of useful technologies. For instance, Google drew on audio from thousands of its own meetings and YouTube videos to train the noise-canceling algorithm in Google Meet. Meanwhile, a team of Carnegie Mellon researchers created a "sound-action-vision" corpus to anticipate where objects will move when subjected to physical force.

SoundFilter treats the task of sound separation as a one-shot learning problem. The model receives as input the audio mixture to be filtered and a single short example of the kind of sound to be extracted. Once trained, SoundFilter is expected to extract this kind of sound from the mixture if it is present.

SoundFilter adopts what's known as a wave-to-wave neural network architecture that can be trained using audio samples without requiring labels that denote the type of source. A conditioning encoder takes the conditioning audio and computes the corresponding embedding (i.e., numerical representation), while a conditional generator takes the mixture audio and the conditioning embedding as input and produces the filtered output. The system assumes that the original audio collection consists of many clips a few seconds in length that contain the same type of sound for the whole duration. Beyond this, SoundFilter assumes that each such clip contains a single audio source, such as one speaker, one musical instrument, or one bird singing.
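To make the data flow concrete, here is a minimal sketch of how the two components fit together. It is not Google's implementation: the function names, the two-value embedding, and the simple gain applied to the mixture are all hypothetical stand-ins for learned neural networks, chosen only to show which audio goes into which component and that the output is a waveform the same length as the mixture.

```python
import math

def conditioning_encoder(conditioning_audio):
    """Toy stand-in (assumption): summarize the conditioning clip as a small embedding."""
    n = len(conditioning_audio)
    mean = sum(conditioning_audio) / n
    rms = math.sqrt(sum(x * x for x in conditioning_audio) / n)
    return [mean, rms]

def conditional_generator(mixture_audio, embedding):
    """Toy stand-in (assumption): a wave-to-wave map whose behavior depends on the embedding."""
    _, rms = embedding
    # Placeholder for a learned network: derive a gain from the embedding
    # and apply it to every sample of the mixture.
    gain = rms / (1.0 + rms)
    return [gain * x for x in mixture_audio]

def sound_filter(mixture_audio, conditioning_audio):
    """Encode the conditioning audio once, then filter the mixture with that embedding."""
    embedding = conditioning_encoder(conditioning_audio)
    return conditional_generator(mixture_audio, embedding)
```

In a real system both functions would be trained networks, and the embedding would have many more dimensions. The sketch only fixes the interface: the encoder never sees the mixture, and the generator never sees the raw conditioning audio.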

The model is trained to produce the target audio, given the mixture and the conditioning audio as inputs. A SoundFilter training example consists of three parts:

The target audio, which contains only one sound
A mixture, which contains two different sounds, one of which is the target audio
A conditioning audio signal, which is another example containing the same kind of sound as the target audio
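The three parts can be assembled from a collection of unlabeled, single-source clips. The sketch below is one plausible way to do that, not the researchers' published pipeline. It assumes that two non-overlapping crops of the same clip contain the same kind of sound, that the mixture is a plain sample-wise sum, and that every clip is at least twice the crop length. The function name and parameters are hypothetical.

```python
import random

def make_training_example(clips, crop_len, rng):
    """Build one (target, mixture, conditioning) triple from single-source clips.

    Assumes each clip is a list of samples at least 2 * crop_len long.
    """
    # Pick two different clips: one supplies the target, the other the interfering sound.
    target_clip, other_clip = rng.sample(clips, 2)

    # Assumption: adjacent, non-overlapping crops of one clip share the same kind of sound.
    start = rng.randrange(0, len(target_clip) - 2 * crop_len + 1)
    target = target_clip[start:start + crop_len]
    conditioning = target_clip[start + crop_len:start + 2 * crop_len]

    # The interfering sound is a random crop of the second clip.
    offset = rng.randrange(0, len(other_clip) - crop_len + 1)
    interference = other_clip[offset:offset + crop_len]

    # Assumption: the mixture is the sample-wise sum of target and interference.
    mixture = [t + n for t, n in zip(target, interference)]
    return target, mixture, conditioning
```

Because no step needs to know what kind of sound a clip contains, the procedure works without labels, which matches the article's description of the training data.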

In experiments, the researchers trained SoundFilter on two open source datasets: FSD50K (a collection of over 50,000 sounds) and LibriSpeech (around 1,000 hours of English speech). They report that the conditioning encoder learned to produce embeddings that represent the acoustic characteristics of the conditioning audio, enabling SoundFilter to successfully separate voices from mixtures of speakers, sounds from mixtures of sounds, and individual speakers/sounds from mixtures of speakers and sounds.

Here's one sample before SoundFilter processed it:

[audio wav="https://venturebeat.com/wp-content/uploads/2020/11/download-1.wav"][/audio]

Here's the sample post-processing:

[audio wav="https://venturebeat.com/wp-content/uploads/2020/11/download.wav"][/audio]

Here's another sample:

[audio wav="https://venturebeat.com/wp-content/uploads/2020/11/download-6.wav"][/audio]

And here's the post-processed result:

[audio wav="https://venturebeat.com/wp-content/uploads/2020/11/download-7.wav"][/audio]

"Our work could be extended by exploring how to use the embedding learned as part of SoundFilter as a representation for an audio event classifier," the researchers wrote. "In addition, it would be of interest to extend our approach from one-shot to many-shot."
