In a paper published on the preprint server arXiv.org, researchers at MIT CSAIL, Nvidia, the University of Washington, and the University of Toronto describe an AI system that learns the physical interactions affecting materials like fabric by watching videos. They claim the system can extrapolate to interactions it hasn’t seen before, such as those involving multiple shirts and pants, enabling it to make long-term predictions.
Causal understanding is the basis of counterfactual reasoning, or the imagining of possible alternatives to events that have already happened. For example, in an image containing a pair of balls connected to each other by a spring, counterfactual reasoning would entail predicting the ways the spring affects the balls’ interactions.
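To make the spring example concrete, here is a minimal sketch of a counterfactual query, assuming two unit-mass balls on a line and a simple hand-written simulator. The function name `simulate`, the stiffness values, and the time step are all illustrative assumptions; none of this comes from the paper, whose system learns such dynamics from video rather than being given them.

```python
def simulate(k, steps=50, dt=0.01, rest_length=1.0):
    """Simulate two unit-mass balls on a line joined by a spring of stiffness k.

    Returns the final positions (x1, x2). All values are illustrative.
    """
    x1, x2 = 0.0, 2.0  # spring starts stretched to twice its rest length
    v1, v2 = 0.0, 0.0  # both balls start at rest
    for _ in range(steps):
        stretch = (x2 - x1) - rest_length
        force = k * stretch  # positive when stretched: pulls the balls together
        # Semi-implicit Euler: update velocities first, then positions.
        v1 += force * dt
        v2 -= force * dt
        x1 += v1 * dt
        x2 += v2 * dt
    return x1, x2


# Factual world: the spring is present.
factual = simulate(k=5.0)
# Counterfactual world: same starting state, but the spring is removed.
counterfactual = simulate(k=0.0)

print("with spring:   ", factual)
print("without spring:", counterfactual)
```

Comparing the two runs answers the "what if" question: with the spring, the balls are drawn toward each other; without it, they stay where they started.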
The researchers’ system, a Visual Causal Discovery Network (V-CDN), guesses at interactions with three modules: one for visual perception, one for structure inference, and one for dynamics prediction. The perception module is trained to extract certain keypoints (areas of interest) from videos, from which the inference module identifies the variables that govern interactions between pairs of keypoints. Meanwhile, the dynamics module learns to predict the future movements of the keypoints, drawing on a graph neural network created by the inference module.
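The three-stage flow can be sketched roughly as follows. This is a toy illustration, not the authors’ implementation: in V-CDN each stage is a learned neural network, whereas here each one is replaced by a simple hand-written heuristic. The function names (`perceive`, `infer_graph`, `predict_next`), the heatmap input format, and the threshold are all assumptions made for the example.

```python
import numpy as np


def perceive(heatmaps):
    """Perception stand-in: take the argmax of each per-keypoint heatmap.

    heatmaps: (T, K, H, W) array -> keypoints: (T, K, 2) array of (row, col).
    """
    T, K, H, W = heatmaps.shape
    flat = heatmaps.reshape(T, K, H * W).argmax(axis=-1)
    rows, cols = np.unravel_index(flat, (H, W))
    return np.stack([rows, cols], axis=-1).astype(float)


def infer_graph(keypoints, threshold=0.1):
    """Inference stand-in: link two keypoints if their distance barely changes."""
    diffs = keypoints[:, :, None, :] - keypoints[:, None, :, :]  # (T, K, K, 2)
    dists = np.linalg.norm(diffs, axis=-1)                       # (T, K, K)
    adjacency = dists.std(axis=0) < threshold
    np.fill_diagonal(adjacency, False)  # no self-edges
    return adjacency


def predict_next(keypoints, adjacency):
    """Dynamics stand-in: one round of message passing over the inferred graph.

    Each keypoint's velocity is averaged with those of its linked neighbors.
    """
    velocity = keypoints[-1] - keypoints[-2]                     # (K, 2)
    weights = adjacency.astype(float) + np.eye(len(adjacency))   # include self
    weights /= weights.sum(axis=1, keepdims=True)
    return keypoints[-1] + weights @ velocity


# Synthetic "video": keypoints 0 and 1 move right together, keypoint 2 moves down.
T, K, H, W = 5, 3, 16, 16
heatmaps = np.zeros((T, K, H, W))
for t in range(T):
    heatmaps[t, 0, 2, 2 + t] = 1.0
    heatmaps[t, 1, 4, 2 + t] = 1.0
    heatmaps[t, 2, 4 + 2 * t, 10] = 1.0

keypoints = perceive(heatmaps)
adjacency = infer_graph(keypoints)
next_positions = predict_next(keypoints, adjacency)
print(adjacency)
print(next_positions)
```

In this toy setup the inference step links only the two keypoints that move in lockstep, and the dynamics step then uses that graph when extrapolating each keypoint one frame ahead.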
The researchers studied V-CDN in a simulated environment containing fabric of various shapes: shirts, pants, and towels of varying appearances and lengths. They applied forces on the contours of the fabrics to deform them and move them around, with the goal of producing a single model that could handle fabrics of different types and shapes.
According to the researchers, the results show that V-CDN’s performance improved as it observed more video frames, consistent with the intuition that more observations provide a better estimate of the variables governing the fabrics’ behaviors. “The model neither assumes access to the ground truth causal graph, nor … the dynamics that describes the effect of the physical interactions,” they wrote. “Instead, it learns to discover the dependency structures and model the causal mechanisms end-to-end from images in an unsupervised way, which we hope can facilitate future studies of more generalizable visual reasoning systems.”
The researchers are careful to note that V-CDN doesn’t solve the grand challenge of causal modeling. Rather, they see their work as an initial step toward the broader goal of building physically grounded “visual intelligence” capable of modeling dynamic systems. “We hope to draw people’s attention to this grand challenge and inspire future research on generalizable physically grounded reasoning from visual inputs without domain-specific feature engineering,” they wrote.