BabyWalk AI breaks complex navigation into simple steps

In a paper published this week on the preprint server arXiv.org, researchers affiliated with Google, Princeton, the University of Southern California, and Simon Fraser University propose BabyWalk, an AI that learns to navigate by breaking instructions into steps and completing them sequentially. They claim it achieves state-of-the-art results on several metrics and that it follows long instructions better than previous approaches.

If BabyWalk works as well in practice as the paper's coauthors assert, it could be a boon for companies developing autonomous machines bound for homes and factory floors. Highly robust robots must navigate the world by inferring their whereabouts from visual information (i.e., camera images), trajectories, and natural language instructions. The catch is that this entails training AI on an immense amount of data, which is computationally costly.

By contrast, BabyWalk adopts a two-phase learning process with a special memory buffer to turn its past experiences into contexts for future steps.

In the first phase, BabyWalk learns from demonstrations (a process known as imitation learning) to accomplish the shorter steps. It's given the steps paired with paths drawn by humans so it can internalize actions from shorter instructions; BabyWalk is tasked with following instructions so its trajectory matches the human's, given context from the trajectory up to the latest step.
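
As a rough illustration of how first-phase training examples might be assembled, the Python sketch below pairs each short step with its human-drawn path and attaches everything that came before it as context. The `StepExample` class, the `build_step_examples` function, and the room names are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StepExample:
    """One imitation-learning example (hypothetical structure)."""
    context: List[Tuple[str, List[str]]]  # earlier (step, path) pairs
    instruction: str                      # the short step to follow now
    target_path: List[str]                # human-drawn path for this step


def build_step_examples(steps: List[str], paths: List[List[str]]) -> List[StepExample]:
    """Pair every step with its path and with the history that precedes it."""
    if len(steps) != len(paths):
        raise ValueError("each step needs exactly one demonstrated path")
    examples = []
    for i, (step, path) in enumerate(zip(steps, paths)):
        history = list(zip(steps[:i], paths[:i]))
        examples.append(StepExample(history, step, path))
    return examples


# Made-up instruction split into two steps, with made-up viewpoint names.
steps = ["leave the bedroom", "turn left at the hallway"]
paths = [["bedroom", "door"], ["door", "hallway"]]
examples = build_step_examples(steps, paths)
```

The first example carries no context, while the second sees the first step and its path, mirroring the idea that the agent is given the trajectory up to the latest step.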

In the second phase, the agent is provided all of the human-drawn trajectories, historical context, and a long navigation instruction involving a number of steps. Here, BabyWalk employs curriculum-based reinforcement learning to maximize rewards on the navigation task with increasingly longer instructions.
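
One plausible way to picture such a curriculum is sketched below in Python: each "lecture" hands the agent a longer tail of the instruction to complete on its own, while the earlier steps serve only as context. The `curriculum_lectures` function and the exact split are assumptions made for illustration, not the researchers' implementation.

```python
from typing import Iterator, List, Tuple


def curriculum_lectures(steps: List[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (context, target) splits whose target grows by one step per lecture."""
    for k in range(1, len(steps) + 1):
        split = len(steps) - k
        # Steps before the split are context; steps after it must be navigated.
        yield steps[:split], steps[split:]


# Made-up three-step instruction.
steps = ["exit the bedroom", "walk down the hall", "stop at the stairs"]
lectures = list(curriculum_lectures(steps))
```

Under these assumptions, the first lecture asks the agent to complete only the final step, and the last lecture asks it to complete the whole instruction with no context.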

In experiments, the researchers trained BabyWalk on Room4Room, a benchmark for visually grounded natural language navigation in real buildings. Given 233,532 instructions averaging 58.4 words in length, the agent had to learn roughly 3.6 steps per instruction on average.

Judged by success rate, which measures how often an agent stops within a specified distance of the goal location, BabyWalk achieved an average of 27.6% across Room4Room and previously unseen data sets. That might seem low, but on another metric, coverage weighted by length score, which measures how closely the ground-truth path is followed, BabyWalk outperformed all baselines with a score of 47.9%. Moreover, on success rate weighted normalized dynamic time warping (SDTW), a separate metric that considers the similarity between the agent's path and the human's, BabyWalk once again beat the baselines with a score of 17.5%.
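
For readers who want a feel for these metrics, here is a minimal Python sketch of success rate and SDTW. It assumes paths are lists of 2D points compared by straight-line distance and uses a 3.0-unit success threshold; both are simplifications chosen for illustration, since the benchmark measures distances inside building navigation graphs. The normalization follows the published definition of normalized dynamic time warping as we understand it, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def dist(a: Point, b: Point) -> float:
    """Straight-line distance (an assumption; not graph distance)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def success(path: List[Point], goal: Point, threshold: float = 3.0) -> bool:
    """True if the agent's final position is within the threshold of the goal."""
    return dist(path[-1], goal) <= threshold


def success_rate(paths: List[List[Point]], goals: List[Point], threshold: float = 3.0) -> float:
    """Fraction of episodes that end close enough to their goal."""
    hits = sum(success(p, g, threshold) for p, g in zip(paths, goals))
    return hits / len(paths)


def dtw(reference: List[Point], query: List[Point]) -> float:
    """Classic dynamic time warping cost between two point sequences."""
    n, m = len(reference), len(query)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(reference[i - 1], query[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def sdtw(reference: List[Point], query: List[Point], threshold: float = 3.0) -> float:
    """Normalized DTW similarity, zeroed out when the episode is not a success."""
    if not success(query, reference[-1], threshold):
        return 0.0
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))


human = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
same = sdtw(human, human)                         # identical paths
shortcut = sdtw(human, [(0.0, 0.0), (2.0, 0.0)])  # skips the middle point
lost = sdtw(human, [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)])  # ends far from goal
```

In this toy setup, an agent that retraces the human path exactly scores 1.0, one that reaches the goal by a slightly different route scores a little lower, and one that stops far from the goal scores 0.0 no matter how faithful the rest of its path was.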

In future work, the researchers plan to investigate ways the gap between short and long tasks might be narrowed and to tackle more complicated variations between learning settings and the real physical world. In the near term, they plan to release BabyWalk's code and training data sets on GitHub.

Combined with other emerging techniques in robotics, BabyWalk could form the basis of an impressively self-sufficient machine. Google researchers recently proposed AI that enables robots to make decisions on the fly; teaches them how to move by watching animals; and helps robots navigate around humans in offices. And the coauthors of an Amazon paper described a robot that asks questions when it's confused about instructions.