MIT's PixelPlayer can isolate the sounds of instruments using AI

Equalizers are one way to pump up the bass in your favorite tunes, but researchers at the Massachusetts Institute of Technology's Computer Science and Artificial Intelligence Lab (CSAIL) have a better solution. Their system -- PixelPlayer -- uses artificial intelligence to distinguish between and isolate the sounds of instruments, and make them louder or softer.

Given a video as input, the fully trained PixelPlayer system splits the accompanying audio into its sources, estimates the volume of the sound associated with each pixel in the image, and "spatially localizes" it -- i.e., identifies the regions of the clip that generate similar sound waves.
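To make the idea of a per-pixel "volume" concrete, here is a minimal sketch in Python with NumPy. It assumes, purely for illustration, that some earlier stage has already produced a separate magnitude spectrogram for every pixel; the function name `loudness_map` and the array layout are hypothetical and are not taken from the paper.

```python
import numpy as np

def loudness_map(per_pixel_spec):
    """Sum spectrogram energy per pixel.

    per_pixel_spec: array of shape (H, W, F, T) holding one magnitude
    spectrogram (F frequency bins, T time frames) per pixel. This layout
    is an assumption made for the example.
    Returns an (H, W) array of energies.
    """
    return (per_pixel_spec ** 2).sum(axis=(2, 3))

# Toy input: a 2x2 frame where only the top-left pixel carries sound.
spec = np.zeros((2, 2, 3, 4))
spec[0, 0] = 1.0

vol = loudness_map(spec)
# Row/column index of the loudest pixel in the frame.
loudest = np.unravel_index(np.argmax(vol), vol.shape)
```

Pixels with similar energy profiles could then be grouped into regions, which is roughly what "spatially localizing" a sound source amounts to in this simplified picture.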

The system is described in "The Sound of Pixels," a new paper accepted at the upcoming European Conference on Computer Vision, scheduled for September in Munich, Germany.

"We expected a best-case scenario where we could recognize which instruments make which kinds of sounds," Hang Zhao, a Ph.D. student at CSAIL and a coauthor of the paper, said. "We were surprised that we could actually spatially locate the instruments at the pixel level. Being able to do that opens up a lot of possibilities, like being able to edit the soundtrack audio of individual instruments by a single click on the video."
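The kind of editing Zhao describes can be illustrated with a short Python sketch. It assumes the instruments have already been separated into individual waveforms and simply mixes them back together with a different gain for each; the `remix` helper and the synthetic "guitar" and "flute" tones are made up for the example, and mapping a click on the video to a particular source is left out.

```python
import numpy as np

def remix(sources, gains):
    """Mix separated sources back together with per-source gains.

    sources: dict mapping a source name to a 1-D waveform (equal lengths).
    gains:   dict mapping a source name to a multiplier; sources that are
             not listed keep a gain of 1.0.
    """
    return sum(gains.get(name, 1.0) * wave for name, wave in sources.items())

# Synthetic stand-ins for two separated instruments (one second at 8 kHz).
t = np.linspace(0, 1, 8000, endpoint=False)
sources = {
    "guitar": np.sin(2 * np.pi * 220 * t),
    "flute": 0.5 * np.sin(2 * np.pi * 880 * t),
}

louder_flute = remix(sources, {"flute": 2.0})  # double the flute
no_guitar = remix(sources, {"guitar": 0.0})    # mute the guitar
```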

At the core of PixelPlayer is a neural network trained on MUSIC (Multimodal Sources of Instrument Combinations), a dataset of 714 untrimmed, unlabeled videos from YouTube. (Five hundred videos -- 60 hours' worth -- were used for training, and the rest were used for validation and testing.) During the training process, the researchers fed the algorithm clips of performers playing acoustic guitars, cellos, clarinets, flutes, and other instruments.

That network is just one part of PixelPlayer's multipronged machine learning framework. After the trained video analysis network extracts visual features from the clips' frames, a second neural network -- an audio analysis network -- splits the sound into components and extracts features from them. Finally, an audio synthesizer network uses the output from the two networks to associate specific pixels with sound waves.
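How the third stage might combine the other two can be sketched in a few lines of Python with NumPy. The article does not give the formula, so this is an assumption: each pixel's visual feature vector weights a set of audio feature maps, and a sigmoid turns the result into a soft mask over the mixture spectrogram. The random arrays stand in for real network outputs, and the names `pixel_masks` and `separate` are hypothetical.

```python
import numpy as np

def pixel_masks(visual_feats, audio_feats):
    """Combine visual and audio features into one soft mask per pixel.

    visual_feats: (H, W, K) feature vector for each pixel.
    audio_feats:  (K, F, T) K audio feature maps over the spectrogram.
    Returns masks of shape (H, W, F, T) with values between 0 and 1.
    """
    # Weighted sum of the K audio maps, using each pixel's K visual weights.
    logits = np.einsum("hwk,kft->hwft", visual_feats, audio_feats)
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid

def separate(mixture_spec, masks):
    """Apply each pixel's mask to the (F, T) mixture spectrogram."""
    return masks * mixture_spec

# Random placeholders for the outputs of the video and audio networks.
rng = np.random.default_rng(0)
H, W, K, F, T = 4, 4, 8, 16, 10
visual = rng.normal(size=(H, W, K))
audio = rng.normal(size=(K, F, T))
mixture = np.abs(rng.normal(size=(F, T)))

masks = pixel_masks(visual, audio)
per_pixel = separate(mixture, masks)  # one spectrogram per pixel
```

In a trained system the features would come from the video and audio networks rather than a random number generator, and the resulting per-pixel spectrograms would be converted back to audio.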

PixelPlayer is entirely self-supervised, meaning that it doesn't require humans to annotate the data, and it's capable of identifying the sounds of more than 20 instruments. (Zhao said a larger dataset would allow it to recognize more, but that it would have trouble handling subtle differences between subclasses of instruments.) It can also recognize elements of music, like harmonic frequencies from a violin.

The researchers think PixelPlayer could aid in sound editing, or be used on robots to better understand environmental sounds that animals, vehicles, and other objects make.

"We expect our work can open up new research avenues for understanding the problem of sound source separation using both visual and auditory signals," they wrote.
