Google open-sources data set to train and benchmark AI sound separation models

Google today announced the release of a new data set -- the Free Universal Sound Separation data set, or FUSS for short -- intended to support the development of AI models that can separate distinct sounds from mixed recordings. The potential use cases are broad; in a commercial setting, for example, models trained on FUSS could be used to extract speech from conference calls.

It follows on the heels of a study by Google and the Idiap Research Institute in Switzerland describing two machine learning models -- a speaker recognition network and a spectrogram masking network -- that together "significantly" reduced the speech recognition word error rate (WER) on multispeaker signals. Elsewhere, tech giants including Alibaba and Microsoft have invested significant time and resources in solving the sound separation problem.

As Google Research scientists John Hershey, Scott Wisdom, and Hakan Erdogan explain in a blog post, the bulk of sound separation models assume the number of sounds in a mixture to be static, and they either separate mixtures of a small number of sound types (such as speech versus nonspeech) or different instances of the same sound type (like a first speaker versus a second speaker). The FUSS data set shifts the focus to the more general problem of separating a variable number of arbitrary sounds from one another.

To this end, the FUSS data set includes a diverse set of sounds, a realistic room simulator, and code to mix these elements together for multi-source, multi-class audio with ground truth. Sourcing audio clips from FreeSound.org, filtered to exclude those that humans can't separate when mixed together, Google researchers compiled 23 hours of audio consisting of 12,377 sounds useful for mixing. From these they generated 20,000 mixtures for training an AI model, 1,000 mixtures for validating it, and 1,000 mixtures for evaluating it.
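The mixing step described above can be illustrated with a minimal Python sketch. It is not the FUSS mixing code: the function name `mix_sources`, the gains, the 16 kHz sample rate, and the toy impulse responses (decaying noise) are all assumptions made for illustration. Each source is convolved with a room impulse response, scaled, and summed, and the reverberated sources are kept as the ground truth for the mixture.

```python
import numpy as np

def mix_sources(sources, rirs, gains):
    """Reverberate, scale, and sum sources (hypothetical helper, not FUSS code).

    Returns the mixture and the per-source references (the ground truth).
    """
    length = max(len(s) for s in sources)
    references = np.zeros((len(sources), length))
    for i, (src, rir, gain) in enumerate(zip(sources, rirs, gains)):
        # Convolving with the impulse response adds the room's reverberation;
        # the result is truncated to the length of the dry source.
        wet = gain * np.convolve(src, rir)[: len(src)]
        # Shorter sources are left zero-padded at the end.
        references[i, : len(wet)] = wet
    mixture = references.sum(axis=0)
    return mixture, references

rng = np.random.default_rng(0)
# Three stand-in "sounds" of different lengths (random noise, assumed 16 kHz).
sources = [rng.standard_normal(n) for n in (16000, 12000, 8000)]
# Toy impulse responses: exponentially decaying noise, one per source.
decay = np.exp(-np.arange(800) / 100.0)
rirs = [rng.standard_normal(800) * decay for _ in sources]

mixture, references = mix_sources(sources, rirs, gains=[1.0, 0.5, 0.25])
```

Because the mixture is built as the exact sum of the references, a separation model's output can be scored directly against them, and the number of sources can vary from one mixture to the next.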

The researchers say they developed their own room simulator using Google's TensorFlow machine learning framework. Given a sound source and a mic location, the simulator generates the impulse response of a box-shaped room with "frequency-dependent" reflective properties. FUSS ships with the precalculated room impulse responses used for each audio sample, along with mixing code. That's complemented by a pretrained, masking-based separation model that can reconstruct multi-source mixtures with high accuracy.
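To make "masking-based" concrete, here is a rough Python sketch of the general idea, not Google's pretrained model. A real system predicts the masks with a neural network; this example instead computes oracle ratio masks from the known sources, and it uses non-overlapping rectangular frames so the transform is trivially invertible. The frame length, the function names, and the test signals are assumptions.

```python
import numpy as np

FRAME = 256  # hypothetical frame length; non-overlapping frames keep this short

def stft(x):
    """Split x into non-overlapping frames and take the real FFT of each."""
    frames = x[: len(x) // FRAME * FRAME].reshape(-1, FRAME)
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """Invert stft() by inverse-transforming each frame and concatenating."""
    return np.fft.irfft(spec, n=FRAME, axis=1).reshape(-1)

def oracle_ratio_masks(source_specs, eps=1e-8):
    """One mask per source, each in [0, 1], summing to about 1 in every bin."""
    mags = np.abs(source_specs)
    return mags / (mags.sum(axis=0, keepdims=True) + eps)

def separate(mixture, masks):
    """Apply each mask to the mixture spectrogram and return time-domain estimates."""
    mix_spec = stft(mixture)
    return np.stack([istft(mask * mix_spec) for mask in masks])

rng = np.random.default_rng(0)
t = np.arange(4096) / 16000.0  # assumed 16 kHz sample rate
sources = np.stack([
    np.sin(2 * np.pi * 440.0 * t),      # a tone
    0.3 * rng.standard_normal(t.size),  # broadband noise
])
mixture = sources.sum(axis=0)

masks = oracle_ratio_masks(np.stack([stft(s) for s in sources]))
estimates = separate(mixture, masks)
```

Since the masks sum to roughly one in every time-frequency bin, the estimates add back up to the original mixture; the quality of each individual estimate depends entirely on how good the masks are.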

The Google team plans to release the code for the room simulator and to extend the simulator to address more computationally expensive acoustic properties, as well as materials with different reflective properties and novel room shapes. "Our hope is [the FUSS data set] will lower the barrier to new research, and particularly will allow for fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge," wrote Hershey, Wisdom, and Erdogan.

The FUSS data set is available on GitHub, and it will be used in the DCASE challenge as a component of the Institute of Electrical and Electronics Engineers' (IEEE) Sound Event Detection and Separation task. The released sound separation model will serve as a baseline for this competition and as a benchmark against which future experiments can demonstrate progress.
