Researchers create new model to evaluate AI-generated navigation instructions

A burgeoning subfield of AI focuses on leveraging models to improve the performance of robots that follow instructions given by people. These models generate directions (e.g., "Walk up the stairs and enter the first room on the left") that ostensibly improve robots' navigation performance in simulated and real-world environments. But a study coauthored by Google researchers finds that the models perform only slightly better than template-based techniques that don't rely on AI. Moreover, the coauthors assert that standard text evaluation metrics including BLEU, ROUGE, METEOR, and CIDEr are ineffective for evaluating the quality of the navigation instructions that the models generate.

Robots that follow natural language instructions could be useful in a range of settings, like industrial warehousing, where workers might not have free hands to operate controls. They're also a potential fit for care facilities like nursing homes, where patients and health care providers could instruct robots to perform tasks with verbal commands. Former Misty Robotics CEO Tim Enwall predicted that every home and office will have a robot within 20 years. Others, like Ken Goldberg, a professor at the University of California, Berkeley, anticipate it’ll be 5 to 10 years before we see a mass-produced home robot that can pick up after kids, tidy furniture, prep meals, and carry out other domestic chores.

The Google coauthors claim their experiments show efforts to improve navigation instruction generators have been hindered by a lack of suitable evaluation metrics. With the exception of SPICE, an image captioning metric first proposed by researchers affiliated with Australian National University and Macquarie University, the coauthors found that none of the standard metrics correlated with the outcomes of human wayfinding attempts.

"Existing instruction generators have substantial headroom for improvement," the researchers wrote in a paper detailing their work. "Our results are a timely reminder that textual evaluation metrics should always be validated against human judgments when applied to new domains."
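To make the idea of validating a metric against human outcomes concrete, here is a minimal sketch of one way such a check could be run. It computes a rank correlation (Kendall's tau, chosen here for illustration) between an automatic metric's scores and how often people succeeded in following each instruction. The function name, the numbers, and the choice of statistic are all assumptions made for this example and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    score = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            score += 1      # both orderings agree
        elif product < 0:
            score -= 1      # orderings disagree
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return score / n_pairs


# Hypothetical values, invented for illustration only:
# one automatic-metric score and one human success rate per instruction.
metric_scores = [0.31, 0.42, 0.18, 0.55, 0.27]
human_success = [0.60, 0.40, 0.70, 0.50, 0.80]

tau = kendall_tau(metric_scores, human_success)
print(f"Kendall's tau between metric and human success: {tau:.2f}")
```

A value near 1 would suggest the metric ranks instructions the way human wayfinders do, while a value near zero or below would suggest it does not track human outcomes on this data.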

To address this problem, the researchers developed an "instruction-trajectory compatibility" model they claim outperforms existing automatic evaluation metrics without needing reference instructions. They say it can be used in a reinforcement learning setting or to filter for high-quality navigation instructions, among other use cases.

"People [but not machines] are resourceful and may succeed in following poor quality instructions by expending additional effort ... Progress in natural language generation is increasing the demand for evaluation metrics that can accurately evaluate generated text in a variety of domains," the researchers wrote. "Generating grounded navigation instructions is one of the most promising directions for improving the performance of ... wayfinding [robots], and a challenging and important language generation task in its own right."