3 essential abilities AI is missing

Over the past decade, deep learning has come a long way from a promising field of artificial intelligence (AI) research to a mainstay of many applications. However, despite this progress, some of its problems have not gone away. Among them is the lack of three essential abilities: to understand concepts, to form abstractions and to draw analogies. That's according to Melanie Mitchell, professor at the Santa Fe Institute and author of "Artificial Intelligence: A Guide for Thinking Humans."

During a recent seminar at the Institute of Advanced Research in Artificial Intelligence, Mitchell explained why abstraction and analogy are the keys to creating robust AI systems. While the notion of abstraction has been around since the term “artificial intelligence” was coined in 1955, this area has largely remained understudied, Mitchell says.

As the AI community devotes growing attention and resources to data-driven, deep learning–based approaches, Mitchell warns that what seems to be human-like performance by neural networks is, in fact, a shallow imitation that misses key components of intelligence.

From concepts to analogies

“There are many different definitions of ‘concept’ in the cognitive science literature, but I particularly like the one by Lawrence Barsalou: A concept is ‘a competence or disposition for generating infinite conceptualizations of a category,’” Mitchell told VentureBeat.

For example, when we think of a category like “trees,” we can conjure all kinds of different trees, both real and imaginary, realistic or cartoonish, concrete or metaphorical. We can think about natural trees, family trees or organizational trees.

“There is some essential similarity — call it ‘treeness’ — among all these,” Mitchell said. “In essence, a concept is a generative mental model that is part of a vast network of other concepts.”

AI scientists and researchers often refer to neural networks as learning concepts, but the key difference Mitchell points out is what these computational architectures learn. While humans create “generative” models that can form abstractions and use them in novel ways, deep learning systems are “discriminative” models that can only learn shallow differences between categories.

For instance, a deep learning model trained on many labeled images of bridges will be able to detect new bridges, but it won’t be able to look at other things that are based on the same concept — such as a log connecting two river shores or ants that form a bridge to fill a gap, or abstract notions of “bridge,” such as bridging a social gap.

Discriminative models have pre-defined categories for the system to choose among, e.g., is the photo a dog, a cat, or a coyote? Flexibly applying one's knowledge to a new situation requires something else, Mitchell explained.

“One has to generate an analogy, e.g., if I know something about trees, and see a picture of a human lung, with all its branching structure, I don't classify it as a tree, but I do recognize the similarities at an abstract level. I am taking what I know, and mapping it onto a new situation,” she said.

Why is this important? The real world is filled with novel situations. It is important to learn from as few examples as possible and be able to find connections between old observations and new ones. Without the capacity to create abstractions and draw analogies (the generative model), we would need to see an infinite number of training examples to be able to handle every possible situation.

This is one of the problems that deep neural networks currently suffer from. Deep learning systems are extremely sensitive to “out of distribution” (OOD) observations, instances of a category that are different from the examples the model has seen during training. For example, a convolutional neural network trained on the ImageNet dataset will suffer from a considerable performance drop when faced with real-world images where the lighting or the angle of objects is different from the training set.

Likewise, a deep reinforcement learning system trained to play the game Breakout at a superhuman level will suddenly deteriorate when a simple change is made to the game, such as moving the paddle a few pixels up or down.
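The effect of an out-of-distribution input can be sketched with a toy example. This is only an illustration under strong assumptions, not a real image model: each "image" is reduced to a single made-up feature (mean brightness), and the classifier is a simple threshold learned from the training data.

```python
# Toy illustration of distribution shift (not a real image model).
# Each "image" is reduced to one hypothetical feature: mean pixel brightness.

def fit_threshold(samples):
    """Learn a decision threshold halfway between the two class means."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in samples if y == label]
        means[label] = sum(values) / len(values)
    return (means[0] + means[1]) / 2

def accuracy(threshold, samples):
    """Fraction of samples that fall on the correct side of the threshold."""
    correct = sum((x > threshold) == bool(y) for x, y in samples)
    return correct / len(samples)

# Training data: class 0 is dark objects, class 1 is bright objects.
train = [(0.20, 0), (0.25, 0), (0.30, 0), (0.70, 1), (0.75, 1), (0.80, 1)]
# In-distribution test data: same lighting as the training set.
test_same = [(0.22, 0), (0.28, 0), (0.72, 1), (0.78, 1)]
# Shifted test data: the same objects in brighter light (+0.35 brightness).
test_shifted = [(x + 0.35, y) for x, y in test_same]

threshold = fit_threshold(train)
acc_same = accuracy(threshold, test_same)
acc_shifted = accuracy(threshold, test_shifted)
print(f"in-distribution accuracy: {acc_same:.2f}, shifted accuracy: {acc_shifted:.2f}")
```

In this sketch the classifier is perfect on test data that resembles its training set and drops to chance once the lighting changes, even though the objects themselves are the same.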

In other cases, deep learning models learn the wrong features in their training examples. In one study, Mitchell and her colleagues examined a neural network trained to classify images as “animal” or “no animal.” They found that instead of animals, the model had learned to detect images with blurry backgrounds: in the training dataset, the images of animals were focused on the animals and had blurry backgrounds, while non-animal images had no blurry parts.
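A minimal sketch of how such a shortcut can arise, assuming a made-up dataset with two binary cues per image (`animal_shape` and `blurry_background` are hypothetical names, and the "learner" simply keeps whichever single cue best matches the training labels). It is not the method used in the study, only an illustration of the failure mode.

```python
# Toy illustration of shortcut learning with hypothetical binary cues.
# "animal_shape" is the intended cue; "blurry_background" is a spurious one.

def example(shape, blur, label):
    return ({"animal_shape": shape, "blurry_background": blur}, label)

def best_single_feature(samples):
    """Pick the feature whose value agrees with the label most often."""
    names = samples[0][0].keys()

    def agreement(name):
        return sum(feats[name] == label for feats, label in samples)

    return max(names, key=agreement)

def accuracy(feature, samples):
    """Accuracy of predicting the label directly from one feature."""
    return sum(feats[feature] == label for feats, label in samples) / len(samples)

# Training set: blur tracks the label perfectly; the shape cue misses
# one animal (say, because it is partly hidden).
train = [
    example(1, 1, 1), example(1, 1, 1), example(1, 1, 1), example(0, 1, 1),
    example(0, 0, 0), example(0, 0, 0), example(0, 0, 0), example(0, 0, 0),
]
# Test set: sharp photos of animals and blurry photos without animals.
test = [example(1, 0, 1), example(1, 0, 1), example(0, 1, 0), example(0, 1, 0)]

chosen = best_single_feature(train)
train_acc = accuracy(chosen, train)
test_acc = accuracy(chosen, test)
print(f"chosen cue: {chosen}, train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Because the blur cue fits the training labels slightly better than the shape cue, the learner keeps it, and its accuracy collapses as soon as blur stops tracking the label.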

“More broadly, it's easier to ‘cheat’ with a discriminative model than with a generative model — sort of like the difference between answering a multiple-choice versus an essay question,” Mitchell said. “If you just choose from a number of alternatives, you might be able to perform well even without really understanding the answer; this is harder if you have to generate an answer.”

Abstractions and analogies in deep learning

The deep learning community has taken great strides to address some of these problems. For one, “explainable AI” has become a field of research for developing techniques to determine the features neural networks are learning and how they make decisions.

At the same time, researchers are working on creating balanced and diversified training datasets to make sure deep learning systems remain robust in different situations. The field of unsupervised and self-supervised learning aims to help neural networks learn from unlabeled data instead of requiring predefined categories.

One field that has seen remarkable progress is large language models (LLMs), neural networks trained on hundreds of gigabytes of unlabeled text data. LLMs can often generate text and engage in conversations in ways that are consistent and very convincing, and some scientists claim that they can understand concepts.

However, Mitchell argues that if we define concepts in terms of abstractions and analogies, it is not clear that LLMs are really learning concepts. For example, humans understand that the concept of “plus” is a function that combines two numerical values in a certain way, and we can use it very generally. On the other hand, large language models like GPT-3 can correctly answer simple addition problems most of the time but sometimes make “non-human-like mistakes” depending on how the problem is asked.

“This is evidence that [LLMs] don't have a robust concept of ‘plus’ like we do, but are using some other mechanism to answer the problems,” Mitchell said. “In general, I don't think we really know how to determine in general if a LLM has a robust human-like concept — this is an important question.”

Recently, scientists have created several benchmarks that try to assess the capacity of deep learning systems to form abstractions and analogies. An example is RAVEN, a set of problems that evaluate the capacity to detect concepts such as numerosity, sameness, size difference and position difference.

However, experiments show that deep learning systems can cheat such benchmarks. When Mitchell and her colleagues examined a deep learning system that scored very high on RAVEN, they realized that the neural network had found “shortcuts” that allowed it to predict the correct answer without even seeing the problem.

“Existing AI benchmarks in general (including benchmarks for abstraction and analogy) don't do a good enough job of testing for actual machine understanding rather than machines using shortcuts that rely on spurious statistical correlations,” Mitchell said. “Also, existing benchmarks typically use a random ‘training/test’ split, rather than systematically testing if a system can generalize well.”
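The difference between a random split and a systematic test of generalization can be sketched with a toy addition benchmark. Everything here is an assumption made for illustration: the dataset, the duplicated problems, and a "model" that only memorizes answers it has seen. It does not reproduce any benchmark named in this article.

```python
import random

# Toy dataset: every addition problem with operands 0..4, each repeated
# four times (a stand-in for the same question appearing in several sources).
problems = [(a, b) for a in range(5) for b in range(5)]
dataset = [(a, b, a + b) for a, b in problems for _ in range(4)]

def train_memorizer(examples):
    """A "model" that only stores the answers it has seen."""
    return {(a, b): total for a, b, total in examples}

def accuracy(table, examples):
    """Unseen problems count as wrong: the memorizer has no concept of plus."""
    hits = sum(table.get((a, b)) == total for a, b, total in examples)
    return hits / len(examples)

# Random split: copies of the same problem can land on both sides.
shuffled = dataset[:]
random.Random(0).shuffle(shuffled)
random_train, random_test = shuffled[:75], shuffled[75:]

# Systematic split: hold out every problem that involves the operand 4.
systematic_train = [ex for ex in dataset if 4 not in ex[:2]]
systematic_test = [ex for ex in dataset if 4 in ex[:2]]

acc_random = accuracy(train_memorizer(random_train), random_test)
acc_systematic = accuracy(train_memorizer(systematic_train), systematic_test)
print(f"random split: {acc_random:.2f}, systematic split: {acc_systematic:.2f}")
```

On the random split, the memorizer gets credit for every test problem whose duplicate it saw during training. On the systematic split it scores zero, which exposes that it never learned addition in the first place.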

Another benchmark is the Abstraction and Reasoning Corpus (ARC), created by AI researcher François Chollet. ARC is particularly interesting because it contains a very limited number of training examples, and the test set is composed of challenges that are different from the training set. ARC has become the subject of a contest on the Kaggle data science and machine learning platform. But so far, there has been very limited progress on the benchmark.

“I really like François Chollet's ARC benchmark as a way to deal with some of the problems/limitations of current AI and AI benchmarks,” Mitchell said.

She noted that she sees promise in the work being done at the intersection of AI and developmental learning, or “looking at how children learn and how that might inspire new AI approaches.”

Which architecture will allow AI systems to form abstractions and analogies like humans remains an open question. Some deep learning pioneers believe that bigger and better neural networks will eventually be able to replicate all functions of human intelligence. Other scientists believe that we need to combine deep learning with symbolic AI.

What is for sure is that as AI becomes more prevalent in applications we use every day, it will be important to create robust systems that are compatible with human intelligence and work — and fail — in predictable ways.