Facebook's AI learns how to get around an office by watching videos

Humans undertake high-level planning every day, but it's not so easy for robots. Fortunately, a growing body of work suggests that hierarchical abstractions (namely visuomotor subroutines) can boost sample efficiency in reinforcement learning, an AI training technique that employs rewards to drive agents toward goals. Traditionally, these hierarchies must be hand-coded or acquired via end-to-end training, which requires time, attention, and lots of patience. But in a newly published preprint paper ("Learning Navigation Subroutines by Watching Videos") on arXiv.org, scientists at Facebook AI Research, the University of California at Berkeley, and the University of Illinois Urbana-Champaign describe a system that learns hierarchies by ingesting videos "pseudo-labeled" with an inverse machine learning model.

It calls to mind a pair of models Facebook open-sourced last year -- Talk the Walk -- that can navigate the streets of New York City using only 360-degree images, natural language, and a map with local landmarks like banks and restaurants for guidance.

"Every morning, when you decide to get a cup of coffee from the kitchen, you think of going down the hallway, turning left into the corridor and then entering the room on the right. Instead of deciding the exact muscle torques, you plan at this higher level of abstraction by composing these reusable lower level visuomotor subroutines to reach your goal," explained the coauthors. "These visuomotor subroutines ... enable planning which mitigates the known issue of high computational cost in classical planning and high sample complexity in reinforcement learning."

In the first phase of their proposed two-phase system, the researchers generated pseudo-labels by running a model trained via self-supervision on an agent's random exploration data. (In this context, the pseudo-labels are actions imagined by the agent.) In total, the agent started from 1,500 different locations spread over four environments and executed random actions for 30 steps from each, producing 45,000 interaction samples.
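To make the pseudo-labeling idea concrete, here is a minimal sketch in a toy grid world. It is not the researchers' implementation: the grid, the four action names, the position-only "observations," and the lookup-table inverse model are all assumptions made for illustration, whereas the system described in the paper works from first-person images. The sketch collects 1,500 starts times 30 random steps (45,000 samples), fits an inverse model that maps a change in observation to the action that most often caused it, and then uses that model to label an action-free "video."

```python
import random
from collections import Counter, defaultdict

# Hypothetical action set for a toy grid world (not from the paper).
ACTIONS = {"forward": (0, 1), "back": (0, -1), "left": (-1, 0), "right": (1, 0)}


def step(pos, action):
    """Apply an action to a grid position and return the new position."""
    dx, dy = ACTIONS[action]
    return (pos[0] + dx, pos[1] + dy)


def collect_random_interactions(num_starts=1500, steps_per_start=30, seed=0):
    """Run random actions from many start locations, recording each transition."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_starts):
        pos = (rng.randint(-50, 50), rng.randint(-50, 50))
        for _ in range(steps_per_start):
            action = rng.choice(list(ACTIONS))
            nxt = step(pos, action)
            samples.append((pos, nxt, action))
            pos = nxt
    return samples


def fit_inverse_model(samples):
    """Learn which action most often explains each observed displacement."""
    votes = defaultdict(Counter)
    for before, after, action in samples:
        delta = (after[0] - before[0], after[1] - before[1])
        votes[delta][action] += 1
    return {delta: counts.most_common(1)[0][0] for delta, counts in votes.items()}


def pseudo_label(video, inverse_model):
    """Assign an inferred action to each pair of consecutive frames."""
    labels = []
    for before, after in zip(video, video[1:]):
        delta = (after[0] - before[0], after[1] - before[1])
        labels.append(inverse_model.get(delta))  # None if never seen in training
    return labels


samples = collect_random_interactions()
inverse_model = fit_inverse_model(samples)
video = [(0, 0), (0, 1), (1, 1)]  # observations only, no recorded actions
labels = pseudo_label(video, inverse_model)
```

The point of the sketch is the division of labor: the agent only ever acts randomly, and the passive video supplies the purposeful trajectories once the inverse model has filled in the missing actions.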

In the system's second phase, roughly 217,000 pseudo-labeled videos sliced into 2.2 million individual clips were fed into a model that predicted the corresponding actions taken in the reference video. A separate network examined the sequence of actions in the reference video and encoded the behavior as a vector (i.e., a mathematical representation). Yet another model predicted which learned subroutines could be invoked for any given video frame by anticipating the inferred encoding of the trajectory from the first frame.
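As a rough illustration of what "slicing videos into clips" and "encoding behavior as a vector" can mean, the sketch below cuts one pseudo-labeled action sequence into fixed-length clips and summarizes each clip as a vector of action frequencies. Everything here is an assumption for illustration: the action vocabulary, the clip length of 4, and especially the hand-written frequency encoding, which stands in for the learned network the article describes and is far simpler than it.

```python
from collections import Counter

# Hypothetical action vocabulary (not taken from the paper).
ACTION_VOCAB = ["forward", "left", "right"]


def slice_into_clips(actions, clip_len):
    """Cut a sequence of actions into non-overlapping clips of clip_len.

    A trailing remainder shorter than clip_len is dropped.
    """
    last_start = len(actions) - clip_len
    return [actions[i:i + clip_len] for i in range(0, last_start + 1, clip_len)]


def encode_clip(clip):
    """Encode a clip as the fraction of steps spent on each action."""
    counts = Counter(clip)
    return [counts[a] / len(clip) for a in ACTION_VOCAB]


# Pseudo-labeled actions for one toy video (9 steps).
video_actions = ["forward"] * 4 + ["left", "left", "forward", "forward"] + ["right"]
clips = slice_into_clips(video_actions, clip_len=4)
codes = [encode_clip(c) for c in clips]
# codes[0] describes a "go straight" clip; codes[1] a "turn left, then go straight" clip.
```

A learned encoder would also capture the order of actions, which this frequency vector discards, but the output has the same shape: one fixed-size vector per clip that a downstream model can condition on.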

In experiments with a real-world robot deployed in an office environment, the researchers showed that using passive videos to learn skills (i.e., the most efficient way to travel to a target location) resulted in better performance than purely interactive methods, at least in previously unseen environments. Perhaps most impressive of all, the trained models learned to favor forward navigation and to avoid obstacles up to four times faster in navigation tasks than the next-best baseline, which enabled the robot to travel long distances completely autonomously.

"It is particularly striking that [the models] learned from a total of 45,000 interactions with the environment," the researchers wrote. "Successful learning from first-person videos allowed the agent to execute coherent trajectories, even though it had only ever executed random actions ... Furthermore, it outperforms state-of-the-art learning based techniques for learning skills, that were trained on multiple orders of magnitude more interaction samples (45,000 versus 10 million)."
