Researchers develop AI framework that predicts object motion from image and tactile data

Recent AI research has pointed out the synergies between touch and vision. Touch enables the measurement of 3D surface and inertial properties, while vision provides a holistic view of objects' projected appearance. Building on this work, researchers at Samsung, McGill University, and York University investigated whether an AI system could predict the motion of an object from visual and tactile measurements of its initial state.

"Previous research has shown that it is challenging to predict the trajectory of objects in motion, due to the unknown frictional and geometric properties and indeterminate pressure distributions at the interacting surface," the researchers wrote in a paper describing their work. "To alleviate these difficulties, we focus on learning a predictor trained to capture the most informative and stable elements of a motion trajectory."

The researchers developed a sensor called See-Through-your-Skin that they claim can capture images while providing detailed tactile measurements. Alongside this, they created a framework called Generative Multimodal Perception that exploits visual and tactile data, when available, to learn a representation that encodes information about object pose, shape, and force, and to make predictions about object dynamics. To anticipate the resting state of an object during physical interactions, they used what they call resting state predictions, along with a visuotactile dataset of motions in dynamic scenes, including objects freefalling on a flat surface, sliding down an inclined plane, and being perturbed from their resting pose.

In experiments, the researchers say their approach was able to predict the raw visual and tactile measurements of the resting configuration of an object with high accuracy, with the predictions closely matching the ground-truth labels. Moreover, they claim their framework learned a mapping between the visual, tactile, and 3D pose modes so it could handle missing modalities, such as when tactile information was unavailable in the input. It could also predict instances where an object had fallen from the surface of the sensor, resulting in empty output images.
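
To make the idea of handling a missing modality concrete, here is a minimal sketch of one common technique, product-of-experts fusion of Gaussian estimates, in which each available modality contributes an estimate of a latent value and absent modalities are simply skipped. The article does not say that the researchers use this method; the function name `fuse_experts`, the `(mean, variance)` representation, and the standard-normal prior are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical representation: (mean, variance) of one latent dimension.
Gaussian = Tuple[float, float]


def fuse_experts(experts: Sequence[Optional[Gaussian]]) -> Gaussian:
    """Combine the available Gaussian estimates by precision weighting.

    A standard-normal prior N(0, 1) is always included (an assumption of
    this sketch), so the result is defined even if every modality is None.
    """
    precision = 1.0      # the prior contributes precision 1 ...
    weighted_mean = 0.0  # ... and mean 0
    for expert in experts:
        if expert is None:  # modality unavailable: skip it
            continue
        mean, var = expert
        precision += 1.0 / var
        weighted_mean += mean / var
    return weighted_mean / precision, 1.0 / precision


# Illustrative numbers only, not taken from the paper.
visual = (1.0, 0.5)
tactile = (2.0, 0.5)

both = fuse_experts([visual, tactile])    # vision and touch available
vision_only = fuse_experts([visual, None])  # tactile input missing
neither = fuse_experts([None, None])      # falls back to the prior
```

In this sketch, dropping the tactile estimate still yields a usable result; it is simply less certain (higher variance) than the estimate fused from both modalities.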

"If a previously unseen object is dropped into a human's hand, we are able to infer the object's category and guess at some of its physical properties, but the most immediate inference is whether it will come to rest safely in our palm or if we need to adjust our grasp on the object to maintain contact," the coauthors wrote. "[In our work,] we find that predicting object motions in physical scenarios benefits from exploiting both modalities: Visual information captures object properties such as 3D shape and location, while tactile information provides critical cues about interaction forces and resulting object motion and contacts."