Researchers' AI system infers music from silent videos of musicians

In a study accepted to the upcoming 2020 European Conference on Computer Vision, MIT and MIT-IBM Watson AI Lab researchers describe an AI system -- Foley Music -- that can generate "plausible" music from silent videos of musicians playing instruments. They say it works on a variety of music performances and outperforms "several" existing systems in generating music that's pleasant to listen to.

The researchers believe an AI model that can infer music from body movements could serve as the foundation for a range of applications, from automatically adding sound effects to videos to creating immersive experiences in virtual reality. Studies from cognitive psychology suggest humans possess this skill -- even young children report that what they hear is influenced by the signals they receive from seeing a person speak, for example.

Foley Music extracts 2D keypoints of people's bodies (25 points) and fingers (21 points) from video frames as intermediate visual representations, which it uses to model body and hand movements. For the music, the system employs MIDI representations that encode the timing and loudness of each note. Given the keypoints and the MIDI events (which tend to number around 500), a "graph-transformer" module learns mapping functions that associate movements with music, capturing long-term relationships to produce accordion, bass, bassoon, cello, guitar, piano, tuba, ukulele, and violin clips.
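To make the idea of a MIDI event representation concrete, here is a minimal Python sketch that flattens a few notes into a time-ordered event sequence carrying timing and loudness. The event names (`SET_VELOCITY`, `NOTE_ON`, `NOTE_OFF`, `TIME_SHIFT`), the `Note` class, and the 10-millisecond time step are assumptions made for illustration; the study's exact vocabulary may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    pitch: int     # MIDI pitch number, 0-127
    start: float   # onset time in seconds
    end: float     # release time in seconds
    velocity: int  # loudness, 1-127


def notes_to_events(notes, time_step=0.01):
    """Flatten notes into a time-ordered list of (event_type, value) pairs.

    Hypothetical vocabulary: TIME_SHIFT advances the clock by `value` steps,
    SET_VELOCITY changes the loudness applied to the notes that follow.
    """
    # Collect every onset and release as a timestamped boundary.
    boundaries = []
    for note in notes:
        boundaries.append((note.start, 1, note))  # 1 = note on
        boundaries.append((note.end, 0, note))    # 0 = note off
    # At equal times, releases (0) sort before onsets (1).
    boundaries.sort(key=lambda b: (b[0], b[1], b[2].pitch))

    events = []
    now_steps = 0
    current_velocity = None
    for time, is_on, note in boundaries:
        steps = round(time / time_step)
        if steps > now_steps:
            events.append(("TIME_SHIFT", steps - now_steps))
            now_steps = steps
        if is_on:
            if note.velocity != current_velocity:
                events.append(("SET_VELOCITY", note.velocity))
                current_velocity = note.velocity
            events.append(("NOTE_ON", note.pitch))
        else:
            events.append(("NOTE_OFF", note.pitch))
    return events


# Two half-second notes played back to back at the same loudness.
example_events = notes_to_events([
    Note(pitch=60, start=0.0, end=0.5, velocity=80),
    Note(pitch=64, start=0.5, end=1.0, velocity=80),
])
```

A sequence like this is the kind of target a model could be trained to predict from keypoints, one event at a time.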

The system doesn't render the MIDI events into audio itself, but the researchers note the events can be imported into a standard synthesizer. Training a neural synthesizer to do this automatically is left to future work.
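As a rough illustration of what "rendering" event data involves, the Python sketch below turns a list of events into raw audio samples using plain sine waves. It is a toy stand-in, not the synthesizer the researchers used; the event names, the 10-millisecond time step, and the 8 kHz sample rate are all assumptions.

```python
import math


def midi_to_hz(pitch):
    """Convert a MIDI pitch number to a frequency (A4 = pitch 69 = 440 Hz)."""
    return 440.0 * 2 ** ((pitch - 69) / 12)


def render_events(events, time_step=0.01, sample_rate=8000):
    """Render (event_type, value) pairs to a list of float samples in [-1, 1].

    Hypothetical vocabulary: SET_VELOCITY, NOTE_ON, NOTE_OFF, and TIME_SHIFT
    (which advances the clock by `value` steps of `time_step` seconds).
    """
    samples = []
    active = {}    # pitch -> amplitude of each currently sounding note
    velocity = 64  # default loudness until a SET_VELOCITY event arrives
    for kind, value in events:
        if kind == "SET_VELOCITY":
            velocity = value
        elif kind == "NOTE_ON":
            active[value] = velocity / 127
        elif kind == "NOTE_OFF":
            active.pop(value, None)
        elif kind == "TIME_SHIFT":
            count = round(value * time_step * sample_rate)
            first = len(samples)
            for i in range(first, first + count):
                t = i / sample_rate
                mix = sum(
                    amp * math.sin(2 * math.pi * midi_to_hz(pitch) * t)
                    for pitch, amp in active.items()
                )
                # Average over sounding notes so the mix stays within [-1, 1].
                samples.append(mix / max(1, len(active)))
    return samples


# One second of A4 at full loudness, followed by half a second of silence.
audio = render_events([
    ("SET_VELOCITY", 127),
    ("NOTE_ON", 69),
    ("TIME_SHIFT", 100),
    ("NOTE_OFF", 69),
    ("TIME_SHIFT", 50),
])
```

The resulting samples could be scaled to 16-bit integers and written out with Python's standard `wave` module; a real synthesizer would add instrument timbre and note envelopes on top.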

In experiments, the researchers trained Foley Music on three data sets containing 1,000 music performance videos belonging to 11 categories: URMP, a high-quality multi-instrument video corpus recorded in a studio that provides a MIDI file for each recorded video; AtinPiano, a YouTube channel including piano video recordings with the camera looking down on the keyboard and hands; and MUSIC, an untrimmed video data set downloaded by querying keywords from YouTube.

The researchers had the trained Foley Music system generate MIDI clips for 450 videos. Then, they conducted a listening study that tasked volunteers from Amazon Mechanical Turk with rating 50 of those clips across four categories:

Correctness: How relevant the generated song was to the video content.
Noise: Which song had the least noise.
Synchronization: Which song best temporally aligned with the video content.
Overall: Which song they preferred to listen to.

The evaluators found Foley Music's generated music harder to distinguish from real recordings than music generated by the baseline systems, the researchers report. Moreover, the MIDI event representations appeared to help improve sound quality, semantic alignment, and temporal synchronization.

"The results demonstrated that the correlations between visual and music signals can be well established through body keypoints and MIDI representations. We additionally show our framework can be easily extended to generate music with different styles through the MIDI representations," the coauthors wrote. "We envision that our work will open up future research on studying the connections between video and music using intermediate body keypoints and MIDI event representations."

Foley Music comes a year after researchers at MIT's Computer Science and Artificial Intelligence Lab (CSAIL) detailed a system -- PixelPlayer -- that used AI to distinguish between and isolate the sounds of instruments. Given a video as input, the fully trained PixelPlayer splits the accompanying audio, identifies the source of each sound, and then calculates the volume of each pixel in the image and "spatially localizes" it -- i.e., identifies regions in the clip that generate similar sound waves.