Facebook AI researchers have created a pair of AI systems that can navigate the streets of New York City using only 360-degree images, natural language, and a map with local landmarks like banks and restaurants for guidance. The research task and dataset, named Talk the Walk, are being open-sourced today, and initial results from the real-world training are being published on arXiv.
The two AI systems are trained to complete two specific tasks: The tourist bot must describe its surroundings to the guide bot, which then infers the tourist’s location from that description and a map.
Agents could only move forward, left, or right at intersections within two city blocks. Tourist agents could only describe their location to the guide, whose map had no street names.
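The movement constraints above can be pictured with a small sketch. This is an illustration only, not code from the Talk the Walk release: the `TouristState` class, the `step` function, the 4 x 4 grid size (chosen to match the 16 street corners mentioned later in the article), and the reading of "left" and "right" as turns in place are all assumptions.

```python
from dataclasses import dataclass

# Assumption: a 4 x 4 grid of street corners (16 corners in total).
GRID_SIZE = 4

# Headings in clockwise order: north, east, south, west, as (dx, dy) offsets.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


@dataclass(frozen=True)
class TouristState:
    """Hypothetical agent state: a corner on the grid plus a facing direction."""
    x: int
    y: int
    heading: int  # index into HEADINGS


def step(state: TouristState, action: str) -> TouristState:
    """Apply one of the three allowed actions and return the new state."""
    if action == "left":
        # Assumption: "left" turns the agent 90 degrees counterclockwise in place.
        return TouristState(state.x, state.y, (state.heading - 1) % 4)
    if action == "right":
        # Assumption: "right" turns the agent 90 degrees clockwise in place.
        return TouristState(state.x, state.y, (state.heading + 1) % 4)
    if action == "forward":
        dx, dy = HEADINGS[state.heading]
        nx, ny = state.x + dx, state.y + dy
        # Moves that would leave the grid are ignored, keeping the agent in bounds.
        if 0 <= nx < GRID_SIZE and 0 <= ny < GRID_SIZE:
            return TouristState(nx, ny, state.heading)
        return state
    raise ValueError(f"unknown action: {action!r}")
```

Starting at corner (0, 0) facing north, the sequence forward, right, forward would leave this hypothetical agent at corner (1, 1) facing east.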
Natural language used in the exercise was created from transcripts of text from humans who completed the same task.
“What sets this apart from those other datasets is we have actual natural language annotations, so it’s not some kind of artificially templated language, which other people have tried. This is the first instance where it’s real language with real visual perception,” Facebook AI research scientist Douwe Kiela told VentureBeat in a phone interview.
Talk the Walk involves two AI systems in a two-block radius in Hell’s Kitchen, East Village, Financial District, and Upper East Side in Manhattan, and the Williamsburg neighborhood in Brooklyn.
Complicating matters a bit, each of the neighborhoods follows a grid system, so the maps have no distinctive qualities. A two-block radius with 16 different street corners may seem small; however, the original study initially covered more ground, and the area had to be reduced because the task proved too hard for humans to complete.
“It’s an important task because it brings together a lot of different challenges that we need to solve if we want to make progress with AI research, so things like realistic 360 visual perception, map-based navigation, visual reasoning, natural language communication by dialogue — all of these things are important to solve problems in AI. And what this work is about is trying to bring all these problems together into an overarching, all-encompassing kind of solution,” Kiela said.
While 360-degree imagery and a map were part of the input used to train the systems, the task and benchmark dataset are primarily geared toward the advancement of conversational AI, said Kiela, whose work has centered on grounding, the practice of using multimodal methods to develop natural language understanding.
For the agents to reach one another, communication must succeed on both sides: the tourist has to tell the guide where it is in natural language, and the guide has to interpret the words the tourist agent generates.
“The long term vision of this kind of research is improving natural language understanding, and so that of course is interesting to humankind. Basically, if we can achieve artificial intelligence where agents actually understand natural language, then that would be kind of a pivotal moment for AI, and I think we’re not even close to that yet,” he said. “I really care about this long term vision, first and foremost, of how can we get to this kind of language understanding and how can we get AI that really has this kind of common sense that has been missing up until now.”
An attention mechanism called Masked Attention for Spatial Convolutions (MASC) was used to narrow the focus of the agents, and it produced results that at times made the agents twice as likely to complete the task.
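The article does not describe how MASC works internally. As a loose illustration of the general idea of a masked spatial convolution, the sketch below keeps a grid of beliefs about where the tourist is and applies a soft 3 x 3 mask that moves that belief to neighboring corners. Everything here is an assumption made for illustration: the `masked_shift` function, the 4 x 4 grid, and the hand-picked mask values are hypothetical, and in a trained system the mask would be predicted from the tourist's message rather than set by hand. It is not the researchers' implementation.

```python
import math


def softmax(values):
    """Turn a list of scores into weights that are positive and sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def masked_shift(belief, mask_logits):
    """Spread a 2D belief grid according to a soft 3 x 3 mask.

    belief: list of rows, each a list of probabilities over grid corners.
    mask_logits: nine scores, one per offset (row -1..1, column -1..1),
        listed row by row. Hypothetical stand-in for a predicted mask.
    """
    weights = softmax(mask_logits)
    n_rows, n_cols = len(belief), len(belief[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            for k, w in enumerate(weights):
                dr, dc = k // 3 - 1, k % 3 - 1
                rr, cc = r + dr, c + dc
                # Belief pushed off the edge of the grid is dropped.
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    out[rr][cc] += w * belief[r][c]
    return out


# Illustrative use: all belief starts at row 1, column 1 of a 4 x 4 grid.
belief = [[0.0] * 4 for _ in range(4)]
belief[1][1] = 1.0

# A mask peaked on the offset (row 0, column +1) moves the belief one column over.
logits = [0.0] * 9
logits[5] = 50.0
shifted = masked_shift(belief, logits)
```

With a sharply peaked mask like this one, almost all of the belief ends up at row 1, column 2. A flatter mask would spread it across several neighboring corners instead.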
The resulting task and dataset were made to act as a benchmark. The work is being open-sourced so others in the AI community can advance the current state of machine understanding of human communication skills.
“This is a difficult challenge, and that’s also one of the reasons we’re open-sourcing it and inviting people to think about this kind of problem. In general we should have more hard challenges in AI research and difficult problems for the community to tackle and realize also what the limitations are of what we can currently do. And so the open-sourcing thing is important to us, and that’s why we’re happy to share with the scientific community,” he said. “In my opinion this really is the way forward with AI. If we don’t have this, then it’s going to look like we’re making a lot of progress, but we’re not really making the kind of progress that we should be making.”
To view or download the dataset, visit the code.fb.com website.