For language models, analogies are a tough nut to crack, study shows

Analogies play a crucial role in commonsense reasoning. The ability to recognize analogies like "eye is to seeing what ear is to hearing," sometimes referred to as analogical proportions, shapes how humans structure knowledge and understand language. In a new study that looks at whether AI models can understand analogies, researchers at Cardiff University used benchmarks from education as well as more common datasets. They found that while off-the-shelf models can identify some analogies, they sometimes struggle with complex relationships, raising questions about the extent to which models capture knowledge.

Large language models learn to write humanlike text by internalizing billions of examples from the public web. Drawing on sources like ebooks, Wikipedia, and social media platforms like Reddit, they make inferences to complete sentences and even whole paragraphs. But studies demonstrate the pitfalls of this training approach. Even sophisticated language models such as OpenAI's GPT-3 struggle with nuanced topics like morality, history, and law, and often memorize answers found in the data on which they're trained.

Memorization isn't the only problem large language models face. Recent research shows that even state-of-the-art models fail to answer the bulk of math problems correctly. For example, a paper published by researchers at the University of California, Berkeley finds that large language models including GPT-3 can complete only 2.9% to 6.9% of problems from a dataset of over 12,500.

Analogy dataset

The Cardiff University researchers used a test dataset from an educational resource that included analogy problems from assessments of linguistic and cognitive abilities. One subset of problems was designed to be equivalent to analogy problems on the Scholastic Aptitude Test (SAT), the U.S. college admission test, while the other set was similar in difficulty to problems on the Graduate Record Examinations (GRE). In the interest of thoroughness, the coauthors combined the dataset with an analogy corpus from Google and with BATS, a dataset that includes a larger number of concepts and relations split into four categories: lexicographic, encyclopedic, derivational morphology, and inflectional morphology.

The word analogy problems are designed to be challenging. Solving them requires identifying nuanced differences between word pairs that belong to the same relation.
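To make the task format concrete, here is a minimal Python sketch of a multiple-choice analogy problem and how accuracy on a set of such problems might be computed. It is an illustration only, not the researchers' method: the names AnalogyProblem, solve, accuracy, toy_score, and TOY_RELATIONS are all hypothetical, and the toy scorer is a hand-written lookup table standing in for whatever score a language model would assign to a pair of word pairs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

WordPair = Tuple[str, str]
Scorer = Callable[[WordPair, WordPair], float]


@dataclass(frozen=True)
class AnalogyProblem:
    """A query pair, candidate pairs, and the index of the correct candidate."""
    query: WordPair
    candidates: Sequence[WordPair]
    answer: int


def solve(problem: AnalogyProblem, score: Scorer) -> int:
    """Return the index of the candidate the scorer rates highest."""
    scores = [score(problem.query, cand) for cand in problem.candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(problems: Sequence[AnalogyProblem], score: Scorer) -> float:
    """Fraction of problems where the top-scored candidate is the answer."""
    correct = sum(solve(p, score) == p.answer for p in problems)
    return correct / len(problems)


# Hypothetical, hand-labeled relations used only by the toy scorer below.
# A real evaluation would replace toy_score with a model-derived score.
TOY_RELATIONS: Dict[WordPair, str] = {
    ("eye", "seeing"): "organ:function",
    ("ear", "hearing"): "organ:function",
    ("hand", "glove"): "body-part:clothing",
    ("ear", "eye"): "organ:organ",
}


def toy_score(query: WordPair, candidate: WordPair) -> float:
    """Score 1.0 when both pairs carry the same known relation, else 0.0."""
    relation = TOY_RELATIONS.get(query)
    if relation is None:
        return 0.0
    return 1.0 if TOY_RELATIONS.get(candidate) == relation else 0.0


problem = AnalogyProblem(
    query=("eye", "seeing"),
    candidates=[("hand", "glove"), ("ear", "hearing"), ("ear", "eye")],
    answer=1,
)
print(solve(problem, toy_score))       # index of the chosen candidate
print(accuracy([problem], toy_score))  # share of problems answered correctly
```

The distractors here share words or topics with the query but not its relation, which is the kind of nuance such problems are built to probe. Keeping the scorer as a plain function argument means a different scoring strategy can be swapped in without changing the evaluation loop.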

In experiments, the researchers tested three language models based on the transformer architecture: Google's BERT, Facebook's RoBERTa, and GPT-2, the predecessor of GPT-3. The results show that difficult analogy problems, which are generally more abstract or contain obscure words (e.g., grouch, cantankerous, palace, ornate), present a major barrier. While the models could identify some analogies, not all of them achieved "meaningful improvement."

The researchers nevertheless leave open the possibility that language models can learn to solve analogy tasks when given the appropriate training data. "[Our] findings suggest that while transformer-based language models learn relational knowledge to a meaningful extent, more work is needed to understand how such knowledge is encoded, and how it can be exploited," the coauthors wrote. "[W]hen carefully tuned, some language models are able to achieve state-of-the-art results."
