Might street-navigating AI be able to traverse previously unseen neighborhoods given sufficient training data? That’s what scientists at Google parent company Alphabet’s DeepMind investigate in a newly published paper (“Cross-View Policy Learning for Street Navigation”) on the preprint server Arxiv.org. In it, they describe transferring an AI policy trained with a ground-view corpus to target parts of a city using top-down visual information, an approach they say results in better generalization.

The work was inspired by the observation that humans can quickly adapt to a new city by reading a map, said the paper’s coauthors.

“The ability to navigate from visual observations in unfamiliar environments is a core component of intelligent agents and an ongoing challenge … [G]oal-driven street navigation agents have not so far been able to transfer to unseen areas without extensive retraining, and relying on simulation is not a scalable solution,” they wrote. “Our core idea is to pair the ground view with an aerial view and to learn a joint policy that is transferable across views.”

The researchers first collected regional aerial maps, which they paired with street-level views based on corresponding geographical coordinates. They then set up a three-stage transfer learning task: first training on ground-view data from the source region, then adapting the policy using aerial-view observations of the target region, and finally transferring to the target area using ground-view observations.
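The pairing step can be pictured as cropping, for each panorama location, an aerial patch centered on the same coordinate. The sketch below is a minimal illustration of that idea; the linear latitude/longitude-to-pixel mapping and the function names are assumptions for demonstration, not the paper's actual pipeline.

```python
import numpy as np

PATCH = 84  # side length of the aerial crop, matching the ground-view image size


def latlon_to_pixel(lat, lon, bounds, shape):
    """Map a (lat, lon) coordinate to a (row, col) pixel in the regional
    aerial map, assuming a simple linear projection (an illustrative
    simplification of real geo-referencing)."""
    (lat_min, lat_max), (lon_min, lon_max) = bounds
    rows, cols = shape[:2]
    row = int((lat_max - lat) / (lat_max - lat_min) * (rows - 1))
    col = int((lon - lon_min) / (lon_max - lon_min) * (cols - 1))
    return row, col


def aerial_patch(aerial_map, lat, lon, bounds):
    """Return the 84x84 aerial crop centered at the given coordinate,
    to be paired with the ground-view panorama captured there."""
    r, c = latlon_to_pixel(lat, lon, bounds, aerial_map.shape)
    half = PATCH // 2
    # Pad the map so crops near the border still come out 84x84.
    padded = np.pad(aerial_map, ((half, half), (half, half), (0, 0)))
    return padded[r:r + PATCH, c:c + PATCH]
```

Pairing each panorama with its crop this way yields the aligned two-view dataset the transfer stages operate on.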

The team’s machine learning system comprised three modules: a convolutional module responsible for visual perception, a long short-term memory (LSTM) module that captured location-specific features, and a recurrent policy module that produced a distribution over actions. It was deployed in StreetAir, a multi-view outdoor street environment built on top of StreetLearn, an interactive first-person collection of panoramic street-view photographs from Google’s Street View and Google Maps. Within StreetAir and StreetLearn, aerial images covering both New York City (Downtown and Midtown) and Pittsburgh (Allegheny and Carnegie Mellon University’s campus) were arranged so that, at each latitude and longitude coordinate, the environment returned an 84 x 84 aerial image the same size as the ground-view image centered at that location.
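At a shape level, the three-module pipeline can be sketched as a forward pass: perception features from the 84 x 84 observation feed an LSTM, whose hidden state feeds a softmax policy head over the five actions. The toy implementation below uses random weights and a mean-pool stand-in for the convolutional module; all layer sizes here are assumptions chosen for illustration, not the paper's hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

HID = 64        # LSTM hidden size (assumed for illustration)
FEAT = 256      # perception feature size (assumed)
N_ACTIONS = 5   # forward, turn left/right 22.5 deg, turn left/right 67.5 deg

W_percept = rng.normal(0, 0.05, (21 * 21 * 3, FEAT))
W_lstm = rng.normal(0, 0.05, (FEAT + HID, 4 * HID))
b_lstm = np.zeros(4 * HID)
W_policy = rng.normal(0, 0.05, (HID, N_ACTIONS))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv_module(image):
    """Stand-in for the convolutional perception module: 4x4 mean-pool
    the 84x84x3 observation, then project to a feature vector."""
    pooled = image.reshape(21, 4, 21, 4, 3).mean(axis=(1, 3))  # -> 21x21x3
    return np.tanh(pooled.reshape(-1) @ W_percept)


def lstm_step(x, h, c):
    """One step of a standard LSTM cell over the perception features."""
    z = np.concatenate([x, h]) @ W_lstm + b_lstm
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c


def policy_module(h):
    """Softmax distribution over the five navigation actions."""
    logits = h @ W_policy
    p = np.exp(logits - logits.max())
    return p / p.sum()


# One agent step: observation -> features -> recurrent state -> action distribution.
obs = rng.normal(size=(84, 84, 3))
h, c = np.zeros(HID), np.zeros(HID)
h, c = lstm_step(conv_module(obs), h, c)
action_probs = policy_module(h)
```

The same forward pass applies whether the observation is a ground-view panorama or its paired aerial crop, which is what makes a joint, view-transferable policy possible.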

The AI system, once trained, was tasked with learning to both localize itself and navigate a Street View graph of panoramic images given the latitude and longitude coordinates of a goal destination. Panoramas covering areas 2 to 5 kilometers on a side were spaced about 10 meters apart, and AI-guided agents were allowed one of five actions per turn: move forward, turn left or right by 22.5 degrees, or turn left or right by 67.5 degrees. Upon reaching within 100-200 meters of the goal, agents received a reward that reinforced behaviors leading to quick and accurate traversal.
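The turn/move interface and proximity-based reward described above can be sketched in a few lines. The helper names, the flat Euclidean distance, and the unit reward value are assumptions for illustration, not StreetLearn's actual API.

```python
import math

# The five discrete actions: forward, and four turn increments (degrees).
ACTIONS = {
    "forward": 0.0,
    "left_small": -22.5, "right_small": 22.5,
    "left_large": -67.5, "right_large": 67.5,
}


def step(heading_deg, action):
    """Turning actions rotate the agent's heading; 'forward' keeps the
    heading (and, in the real environment, moves ~10 m to the next panorama)."""
    return (heading_deg + ACTIONS[action]) % 360.0


def goal_reward(agent_xy, goal_xy, radius_m=100.0):
    """Unit reward once the agent comes within the goal radius
    (100-200 m in the paper's setup); zero otherwise."""
    return 1.0 if math.dist(agent_xy, goal_xy) <= radius_m else 0.0
```

Because the reward fires only near the goal, it is sparse: the agent must discover long action sequences that bring it within the radius, which is part of what makes transfer to unseen regions hard.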

In experiments, agents that tapped the aerial images to adapt to new environments achieved a reward metric of 190 at 100 million steps and 280 at 200 million steps, both significantly higher than those of agents that used only ground-view data (50 at 100 million steps and 200 at 200 million steps). The researchers say this indicates that their approach substantially improved the agents’ ability to acquire knowledge about target city regions.

“Our results suggest that the proposed method transfers agents to unseen regions with higher zero-shot rewards (transfer without training in the held-out ground-view environment) and better overall performance (continuously trained during transfer) compared to single-view (ground-view) agents,” the team wrote.