Researchers at Facebook, the Georgia Institute of Technology, and Oregon State University describe a new task for AI in a preprint paper published this week: navigating a 3D environment by following natural language directions (e.g., “Go down the hall and turn left at the wooden desk”). They say this could lay the groundwork for robot assistants that follow natural language instructions.

The researchers’ task, which they call vision-and-language navigation in continuous environments (VLN-CE), takes place in Habitat, Facebook’s simulator for training AI agents to operate in settings meant to mimic real-world environments. Agents, represented by cylinders 1.5 meters tall and 0.2 meters in diameter, are placed in interiors sourced from the Matterport3D data set, a collection of 90 environments captured through over 10,800 panoramas and corresponding 3D meshes. The agents must take one of four actions (move forward 0.25 meters, turn left or right 15 degrees, or stop at the goal position) along a path and learn to avoid getting stuck on obstacles, like chairs and tables.
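To make the four-action space concrete, here is a minimal Python sketch of how such discrete actions could update an agent's position and heading. It is not the Habitat API or the researchers' code: the names `Action`, `Pose`, `step`, and `rollout` are hypothetical, the world is a flat obstacle-free plane, and only the step size (0.25 meters) and turn angle (15 degrees) come from the description above.

```python
import math
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The four discrete actions described for VLN-CE (names are hypothetical)."""
    FORWARD = "forward"
    TURN_LEFT = "turn_left"
    TURN_RIGHT = "turn_right"
    STOP = "stop"


FORWARD_STEP_M = 0.25   # meters moved per forward action (from the article)
TURN_ANGLE_DEG = 15.0   # degrees rotated per turn action (from the article)


@dataclass(frozen=True)
class Pose:
    """2D pose on a flat plane; heading 0 points along +x, counterclockwise is positive."""
    x: float = 0.0
    y: float = 0.0
    heading_deg: float = 0.0
    stopped: bool = False


def step(pose: Pose, action: Action) -> Pose:
    """Apply one action. Assumption: no obstacles, so forward moves always succeed."""
    if pose.stopped:
        return pose  # once the agent has stopped, the episode is over
    if action is Action.FORWARD:
        rad = math.radians(pose.heading_deg)
        return Pose(
            pose.x + FORWARD_STEP_M * math.cos(rad),
            pose.y + FORWARD_STEP_M * math.sin(rad),
            pose.heading_deg,
        )
    if action is Action.TURN_LEFT:
        return Pose(pose.x, pose.y, (pose.heading_deg + TURN_ANGLE_DEG) % 360.0)
    if action is Action.TURN_RIGHT:
        return Pose(pose.x, pose.y, (pose.heading_deg - TURN_ANGLE_DEG) % 360.0)
    # Action.STOP: keep the pose and mark the episode as finished
    return Pose(pose.x, pose.y, pose.heading_deg, stopped=True)


def rollout(actions, start=None) -> Pose:
    """Apply a sequence of actions from a starting pose and return the final pose."""
    pose = start if start is not None else Pose()
    for action in actions:
        pose = step(pose, action)
    return pose
```

Under these assumptions, a quarter turn to the left takes six turn actions and one meter of travel takes four forward actions, which gives a sense of why even short instructions can translate into long action sequences.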

The team distilled the environments into 4,475 trajectories of 4 to 6 nodes each, where each node corresponded to a 360-degree panoramic image taken at a location and indicated navigability. They used these trajectories to train two AI models: a sequence-to-sequence model with a policy that took a representation of the visual observations and instructions and used it to predict an action, and a two-network cross-modal attention model that tracked observations and made decisions based on instructions and features.

The researchers say that in experiments, the best-performing agents could follow instructions like “Turn left and enter the hallway,” even though these required the agents to turn an unknown number of times until they spotted visual landmarks. The agents navigated to goal locations in approximately a third of episodes in unseen environments, taking an average of 88 actions.

The agents occasionally failed, in one instance moving toward the wrong window while failing to first “pass the kitchen” as instructed. These failures were often the result of agents visually missing the objects referred to in the instructions, according to the coauthors.

“Crucially, VLN-CE … provides the [research] community with a testbed where these sort of integrative experiments studying the interface of high- and low-level control are possible,” wrote the coauthors.

Facebook has devoted considerable resources to solving the problem of autonomous robot navigation. In June, after revealing an initiative to teach six-legged hexapod robots to walk, it debuted PyRobot, a robot framework for its PyTorch machine learning framework. In 2018, the company open-sourced AI that can navigate New York City streets with 360-degree images. More recently, a Facebook team published a paper describing a system that learns how to get around an office by watching videos.