In a new preprint study, researchers at Carnegie Mellon University claim sound can be used to predict an object’s appearance — and its motion. The coauthors created a “sound-action-vision” data set and a family of AI algorithms to investigate the interactions between audio, visuals, and movement. They say the results show representations derived from sound can be used to anticipate where objects will move when subjected to physical force.

While vision is foundational to perception, sound is arguably as important. It captures rich information often imperceptible through visual or force data, like the texture of dried leaves or the pressure inside a champagne bottle. But few systems and algorithms have exploited sound as a vehicle to build physical understanding. This oversight motivated the Carnegie Mellon study, which sought to explore the synergy between sound and action and discover what sort of inferences might be made.

The researchers first created the sound-action-vision data set by building a robot, Tilt-Bot, to tilt objects (including screwdrivers, scissors, tennis balls, cubes, and clamps) on a tray in random directions. The objects hit the thin walls of the plaster tray and produced sounds, which were added to the corpus one by one.

Four microphones mounted to the 30×30-centimeter tray (one on each side) recorded audio while an overhead camera captured RGB and depth information. Tilt-Bot moved each object around for an hour, and every time the object made contact with the tray, the robot created a log containing the sound, RGB and depth data, and tracking location of the object as it collided with the walls.

With the audio recordings from the collisions, the team used a method that enabled them to treat the recordings as images. This allowed the models to capture temporal correlations from single audio channels (i.e., recordings by one microphone) as well as correlations among multiple audio channels (recordings from several microphones).
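The article does not name the method, but one common way to treat audio as an image is to compute a spectrogram per microphone and stack the results as image channels. The sketch below illustrates that idea only and is not the researchers' code; the 16 kHz sample rate, the frame sizes, the synthetic noise input, and the function names are all assumptions made for the example.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude short-time Fourier transform of a 1-D signal."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Transpose so rows are frequency bins and columns are time steps.
    return np.log1p(magnitude).T


def audio_to_image(channels, frame_len=256, hop=128):
    """Stack one spectrogram per microphone into a (mics, freq, time) array."""
    return np.stack([spectrogram(ch, frame_len, hop) for ch in channels])


# Hypothetical input: four microphones, one second of noise at an assumed 16 kHz.
rng = np.random.default_rng(0)
clip = rng.standard_normal((4, 16000))
image = audio_to_image(clip)
print(image.shape)  # (4, 129, 124)
```

Within each channel, patterns along the time axis carry the temporal structure of a single recording, while differences across the four channels reflect what each microphone heard.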

The researchers then used the corpus — which contained sounds from 15,000 collisions between over 60 objects and the tray — to train a model to identify objects from audio. In a second, more challenging exercise, they trained a model to predict what actions were applied to an unseen object. In a third, they trained a forward prediction model to suss out the location of objects after they’d been pushed by a robotic arm.

Tilt-Bot

Above: Forward model predictions are visualized here as pairs of images. The left image is the observation before the interaction, while the right image is the observation after the interaction. Based on the object's ground-truth location before the interaction (shown as a green dot), the audio embedding of the object, and the action taken by the robot (shown as a red arrow), the trained forward model predicts the future object location (shown as a red dot).

The object-identifying model learned to predict the right object from sound 79.2% of the time, failing only when the generated sounds were too soft, according to the researchers. Meanwhile, the action prediction model achieved a mean squared error of 0.027 on a set of 30 previously unseen objects, or 42% better than a model trained only with images from the camera. And the forward prediction model was more accurate in its projections about where objects might move than one relying on visual information alone, the researchers say.
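For readers unfamiliar with the metric, mean squared error averages the squared gap between each predicted value and the true one, so lower is better. The sketch below shows the calculation, along with one plausible reading of a "percent better" comparison as the relative reduction in error. The function names and sample numbers are hypothetical and are not taken from the study.

```python
def mean_squared_error(predicted, actual):
    """Average of squared differences between predictions and true values."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


def relative_improvement(model_error, baseline_error):
    """Fraction by which model_error is lower than baseline_error (assumed reading)."""
    return (baseline_error - model_error) / baseline_error


# Hypothetical predictions and ground truth, for illustration only.
predicted = [0.1, -0.2, 0.4]
actual = [0.0, 0.0, 0.5]
error = mean_squared_error(predicted, actual)
print(round(error, 4))

# Hypothetical errors: a model at 0.5 versus a baseline at 1.0.
print(relative_improvement(0.5, 1.0))
```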

“In some domains, like forward model learning, we show that sound in fact provides more information than can be obtained from visual information alone,” the researchers wrote. “We hope that the Tilt-Bot data set, which will be publicly released, along with our findings, will inspire future work in the sound-action domain and find widespread applicability in robotics.”