A new paper published by researchers affiliated with Facebook and Tel Aviv University investigates whether machine learning language models can understand basic sets of instructions. The researchers propose a test dubbed the Turking Test to examine a model’s ability to follow natural language instructions. Despite what the researchers characterize as a lenient evaluation methodology, they observed that a pretrained language model performed poorly across all tasks.

One of the fundamental problems in AI is building a model that can generalize to previously unseen tasks. Recent work proposes a few-shot inference approach, in which a language model is conditioned on a few examples of a new task, followed by input for the model to process. This approach works well on a range of tasks, but the coauthors of this paper sought to determine whether language models could perform new tasks by conditioning them on instructions.
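To make the distinction concrete, here is a minimal Python sketch of the two ways of conditioning a model. The prompt layout, the function names, and the pluralization task are all illustrative assumptions; the article does not give the prompts used in the paper.

```python
def few_shot_prompt(examples, new_input):
    """Condition on a few worked examples, then append the new input."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {new_input}\nOutput:")
    return "\n\n".join(blocks)


def instruction_prompt(instruction, new_input):
    """Condition on a natural language instruction instead of examples."""
    return f"{instruction}\n\nInput: {new_input}\nOutput:"


# Hypothetical toy task (not from the paper): pluralize a word.
examples = [("cat", "cats"), ("box", "boxes")]
print(few_shot_prompt(examples, "dog"))
print()
print(instruction_prompt("Write the plural form of the word.", "dog"))
```

In both cases the model is expected to continue the text after the final "Output:". The only difference is whether the task is conveyed by demonstrations or by a description.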

The Turking Test consists of instruction-following benchmarks of varying syntactic complexity, beginning with “turking” tasks, where a model must create valid examples of popular natural language processing datasets. (This is meant to simulate tasks commonly carried out by laypeople on crowdsourcing platforms like Amazon Mechanical Turk.) Another portion of the test tasks the model with listing all the nouns that satisfy a simple condition in a given sentence. To pass the Turking Test, the model must also write the Nth word or character in a given sentence.
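The Nth-word and Nth-character tasks have a single correct answer, which makes them easy to score. The Python sketch below shows one way to compute the reference answer and check a model's output. The helper names and the lenient matching rule (compare only the first token, ignoring case and punctuation) are assumptions made for illustration, not the paper's actual evaluation code.

```python
import string


def nth_word(sentence, n):
    """Return the n-th whitespace-separated word (1-indexed)."""
    return sentence.split()[n - 1]


def nth_char(sentence, n):
    """Return the n-th character (1-indexed), counting spaces."""
    return sentence[n - 1]


def is_correct(model_output, expected):
    """Lenient check (an assumption): compare only the first token of the
    output, ignoring case and surrounding punctuation."""
    tokens = model_output.strip().split()
    if not tokens:
        return False
    first = tokens[0].strip(string.punctuation).lower()
    return first == expected.strip(string.punctuation).lower()


sentence = "The quick brown fox jumps"
expected = nth_word(sentence, 3)
print(expected)                              # brown
print(is_correct(" Brown.", expected))       # True
print(is_correct("the the the", expected))   # False
```

The last call mimics the kind of repetitive output that no reasonable scoring rule would accept as an attempt at the task.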

The researchers applied the Turking Test to OpenAI’s GPT-2, a model with 1.5 billion parameters (variables internal to the model that shape its predictions). Overall, the results were disappointing. GPT-2 achieved only 2% accuracy on the task of writing the Nth word, something the authors note an elementary school student can easily do. The model also ignored explicit restrictions and conditions that appear in the instructions, achieving only slightly higher accuracy on open-ended tasks than on those with specific answers.

“Analyzing the model’s error patterns reveals that the model tends to ignore explicit instructions and often generates outputs that cannot be construed as an attempt to solve the task,” the researchers wrote. “The fact that such a large percentage of outputs is comprised of senseless repetitions indicates that the model fails to understand these trivial instructions. Even though these tasks are similar and have almost identical instructions, we find that their repetition patterns significantly differ, suggesting the model is hyper-sensitive to small changes in the instructions.”

Language models have much to learn if they’re going to converse like thoughtful humans one day. Beyond an apparent inability to follow instructions, they are also vulnerable to bias and struggle to grasp general knowledge. Research suggests that benchmarks such as XTREME don’t measure models’ knowledge well and that models like T-ULRv2 can exhibit toxicity and prejudice against specific demographic groups.

Bridging the gaps will likely require new techniques and approaches. Responding to public reactions to GPT-3, Sam Altman, CEO of OpenAI (the firm behind GPT-2 and its successor, GPT-3), recently said the “hype is way too much. It’s impressive, but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but [cutting-edge language models] are just a very early glimpse. We have a lot still to figure out.”