Researchers say we need better benchmarks to build more useful AI assistants

The promise of conversational AI is that, unlike virtually any other form of technology, all you have to do is talk. Natural language is the most natural and democratic form of communication. After all, humans are born capable of learning how to speak, but some never learn to read or use a graphical user interface. That's why AI researchers from Element AI, Stanford University, and CIFAR recommend academic researchers take steps to create more useful forms of AI that speak with people to get things done, including the elimination of existing benchmarks.

"As many current [language user interface] benchmarks suffer from low ecological validity, we recommend researchers not to initiate incremental research projects on them. Benchmark-specific advances are less meaningful when it is unclear if they transfer to real LUI use cases. Instead, we suggest the community to focus on conceptual research ideas that can generalize well beyond the current datasets," the paper reads.

The ideal way to create language user interfaces (LUIs), they say, is to identify a group of people who would benefit from their use, collect conversations and corresponding programs or actions, train a model, then ask users for feedback.

The paper, titled "Towards Ecologically Valid Research on Language User Interfaces," was published last week on preprint repository arXiv and promotes the creation of practical language models that can help people in their professional or personal lives. It identifies common shortcomings in existing popular benchmarks like SQuAD, which does not focus on working with target users, and CLEVR, which uses synthetic language.

Examples of speech interface challenges that academic researchers could pursue instead, the authors say, include AI assistants that can talk with citizens about government data or benchmarks for popular games like Minecraft. Facebook AI Research released data and code to encourage the development of a Minecraft assistant last year.

Some governments have explored the use of conversational AI to help guide citizens through important moments in life or navigate government services. The Computing Community Consortium (CCC) recommends the development of lifelong intelligent assistants to do things like help people through their daily tasks or help them adapt to big changes like a new job or hobby.

The paper's authors focus on language user interfaces such as an AI that can act as a personal assistant or speech interface for interacting with a home robot, but they draw a distinction between LUIs and AI models made for specific events like the Alexa Prize challenge, which rewards bots capable of holding a conversation with a human for 20 minutes.

Researchers identified a number of problematic characteristics among LUI benchmarks, such as the use of artificial tasks that can take place in environments not directly associated with the use case of the language model or the employment of synthetic language.

Some refer to the use of Amazon Mechanical Turk workers, a source of human labor AI researchers increasingly seem to rely on, as "ghost work." The authors criticize the practice because the workers are not considered potential users of LUIs.

One example of failure to work with a target population mentioned in the paper comes from the visual question-answering (VQA) task, which trains an AI system to recognize objects and words. The VQA data set is made up of questions humans think may stump a home robot. It gathers questions from Mechanical Turk workers but does not include questions from people who are blind or visually impaired, even though the data set was made in part to assist the visually impaired. The researchers conclude, "the population that would actually benefit from the language user interface rarely participates in the data collection effort."

The VizWiz VQA project found that people with visual impairments may ask questions differently, often asking questions that begin with "What" or that require the ability to read text. LUIs differ from conversational AI interfaces made for typed SMS or chat exchanges because people can word things differently when they speak than when they type. Scripted exchanges can also lead to the phenomenon in which the human learns the exact words a speech interface or AI assistant needs to hear in order to operate rather than using their own natural language, which defeats the purpose of creating natural language models in the first place.

The authors also criticized some benchmarks for lacking multi-turn dialogue. Multiple studies have found that people using AI to accomplish concrete tasks respond best to multi-turn dialogue: the ability to ask multiple questions or engage in a back-and-forth exchange instead of issuing a series of single, separate commands.

In other recent news in language models, Microsoft researchers said this week they created advanced NLP for health care professionals, and last month researchers developed a method for identifying bugs in cloud AI offerings from major companies like Amazon, Apple, and Google.
