Humans draw on an implicit understanding of the physical world to predict the motion of objects — and to infer interactions between them. If you’re presented with three frames of a stack of cans being toppled — one showing the cans stacked neatly on top of each other, the second showing a finger at the stack’s base, and the third showing the cans lying on their sides — you might guess that the finger was responsible for their demise.

Robots struggle to make those logical leaps. But in a paper from the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Laboratory, researchers describe a system — dubbed a Temporal Relation Network (TRN) — that essentially learns how objects change over time.

They aren’t the first to do so — Baidu and Google are among the firms that have investigated AI-assisted spatial-temporal modeling — but the MIT team claims its method strikes a better balance between accuracy and efficiency than previous approaches.


“We built an artificial intelligence system to recognize the transformation of objects, rather than [the] appearance of objects,” Bolei Zhou, a lead author on the paper, told MIT News. “The system doesn’t go through all the frames — it picks up key frames and, using the temporal relation of frames, recognize[s] what’s going on. That improves the efficiency of the system and makes it run in real time accurately.”
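In concrete terms, “picks up key frames” amounts to sparse sampling: rather than running the network over every frame, the system draws a handful of frames spread across the clip and reasons about how they relate. The sketch below shows one common way to do that; the function name, the eight-segment split, and the jittered pick within each segment are illustrative assumptions, not the exact strategy from the paper.

```python
import random

def sample_key_frames(num_frames: int, num_segments: int = 8) -> list:
    """Pick one frame index from each of num_segments equal chunks of a clip."""
    segment_len = num_frames / num_segments
    indices = []
    for seg in range(num_segments):
        start = int(seg * segment_len)
        end = max(start + 1, int((seg + 1) * segment_len))
        indices.append(random.randrange(start, end))  # jittered pick within the chunk
    return sorted(indices)

# Example: a 120-frame clip reduced to 8 representative, temporally ordered frames.
print(sample_key_frames(120))
```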

The researchers trained a convolutional neural network — a class of machine learning model that’s highly adept at analyzing visual imagery — on three datasets: TwentyBN’s Something-Something, which consists of more than 100,000 videos in 174 action categories; Jester, which has 150,000 videos of 27 hand gestures; and Carnegie Mellon University’s Charades, which comprises 10,000 videos of 157 categorized activities.

They then set the network loose on video files, which it processed by sampling frames, arranging them into ordered groups, and assigning a probability that the on-screen action matched a learned activity — such as tearing a piece of paper or raising a hand.
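Below is a hedged sketch, in PyTorch, of what “ordering frames in groups and assigning a probability” can look like in practice: per-frame CNN features are concatenated in temporal order for every pair of sampled frames, each pair is scored by a small multilayer perceptron, and the pair scores are averaged into per-activity probabilities. The ResNet-18 frame encoder, the layer sizes, and the pairs-only relation are assumptions for illustration; the authors’ released multi-scale TRN code also mixes in relations over larger groups of frames.

```python
from itertools import combinations

import torch
import torch.nn as nn
from torchvision.models import resnet18

class TwoFrameRelationHead(nn.Module):
    """Scores every ordered pair of frame features and averages the scores."""

    def __init__(self, feat_dim: int = 512, num_classes: int = 174):
        super().__init__()
        # Small MLP that looks at two frame features concatenated in temporal order.
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim), frames already in temporal order.
        num_frames = frame_feats.shape[1]
        scores = []
        for i, j in combinations(range(num_frames), 2):  # i < j preserves frame order
            pair = torch.cat([frame_feats[:, i], frame_feats[:, j]], dim=1)
            scores.append(self.mlp(pair))
        return torch.stack(scores).mean(dim=0)  # (batch, num_classes)

# Usage: encode 8 sampled frames with a (randomly initialized) ResNet-18 trunk,
# then turn the averaged relation scores into per-activity probabilities.
encoder = nn.Sequential(*list(resnet18().children())[:-1])  # drop the final fc layer
frames = torch.randn(1, 8, 3, 224, 224)                     # one clip of 8 RGB frames
feats = encoder(frames.flatten(0, 1)).reshape(1, 8, -1)     # (1, 8, 512) features
probs = TwoFrameRelationHead()(feats).softmax(dim=-1)       # (1, 174) probabilities
```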

So how’d it do? The model achieved 95 percent accuracy on the Jester dataset and outperformed existing models at forecasting activities given only partial information. After processing just 25 percent of a video’s frames, it beat the baseline and even managed to distinguish between actions like “pretending to open a book” and “opening a book.”

In future studies, the team plans to improve the model’s sophistication by implementing object recognition and adding “intuitive physics” — i.e., an understanding of the real-world properties of objects.

“Because we know a lot of the physics inside these videos, we can train module[s] to learn such physics laws and use those in recognizing new videos,” Zhou said. “We also open-source all the code and models. Activity understanding is an exciting area of artificial intelligence right now.”