Researchers propose framework that trains robots to mimic human actions in captioned videos

Humans learn new skills from vids on the regular, so why not robots? That's the crux of a new preprint paper ("V2CNet: A Deep Learning Framework to Translate Videos to Commands for Robotic Manipulation") published by researchers at the Istituto Italiano di Tecnologia in Genova, Italy and the Australian Centre for Robotic Vision, which describes a deep learning framework that translates clips to natural language commands which can be used to train semiautonomous machines.

"While humans can effortlessly understand the actions and imitate the tasks by just watching someone else, making the robots to be able to perform actions based on observations of human activities is still a major challenge in robotics," the paper's authors wrote. "In this work, we argue that there are two main capabilities that a robot must develop to be able to replicate human activities: understanding human actions, and imitating them ... By understanding human actions, robots may acquire new skills, or perform."

Toward that end, the team proposes a pipeline optimized for two tasks: video captioning and action recognition. It comprises a recurrent neural network "translator" step that models the long-term dependencies of visual features from input demonstrations and generates a string of instructions, plus a classification branch with a convolutional network that encodes temporal information to categorize the fine-grained actions.

The input of the classification branch is a set of features extracted from the video frames by a pretrained AI model. As the researchers explain, the translation and classification components are trained in such a way that the encoder portion encourages the translator to generate the correct fine-grained action, enabling it to "understand" the videos it ingests.

"By jointly training both branches, the network can effectively encode both the spatial information in each frame and the temporal information across the time axis of the video," they said. "[I]ts output can be combined with the vision and planning modules in order to let the robots perform different tasks."

To validate the model, the researchers created a new data set -- video-to-command (IIT-V2C) -- consisting of videos of human demonstrations manually segmented into 11,000 short clips (2 to 3 seconds in length) and annotated with a command sentence describing the current action. They used a tool to automatically extract the verb from the command, and used this verb as the action class for each video, resulting in 46 classes total (e.g., cutting and pouring).

In experiments involving IT-V2C and different feature extraction methods and recurrent neural networks, the scientists say their model successfully encoded the visual features for each video and generated associated commands. It also outperformed recent state of the art by "a substantial margin," they claim -- chiefly because of the TCN network, which they say improved translation by effectively learning fine-grained actions.

The authors say they will make the data set and source code available in open source.

More