Google's DeepMind developed an IQ test for AI models

Can machines learn to reason abstractly? That's the subject of a new paper from Google subsidiary DeepMind titled "Measuring abstract reasoning in neural networks," which was presented at the International Conference on Machine Learning in Stockholm, Sweden this week.

The researchers define abstract reasoning as the ability to detect patterns and solve problems on a conceptual level. In humans, they note, verbal, spatial, and mathematical reasoning can be measured empirically with tests that task subjects with teasing out the relationships between shape positions and line colors. But those tests aren't perfect.

"Unfortunately, even in the case of humans, such tests can be invalidated if subjects prepare too much, since test-specific heuristics can be learned that shortcut the need for generally applicable reasoning," the researchers explained. "This potential pitfall is even more acute in the case of neural networks, given their striking capacity for memorization."

The team's solution was a generator that creates questions involving an abstract set of factors, including relations like "progression" and attributes like "color" and "size." They constrained those factors to create different sets of problems -- for example, puzzles that revealed the progression relation only when applied to the color of lines -- with which to test and train machine learning models. Highly proficient algorithms, the thinking went, were more than likely capable of inferring concepts they'd never seen before.

Most of the models did well in testing, with some achieving performance as high as 75 percent; the researchers found that model accuracy was strongly correlated with the ability to infer the underlying abstract concepts of tasks. They managed to improve performance by training the models to "reason" for answers, predicting the relations and attributes that should be considered to solve the puzzle.

"[Some models] learned to solve complex visual reasoning questions," the team wrote, "and to do so, [they] needed to induce and detect from raw pixel input the presence of abstract notions such as logical operations and arithmetic progressions, and apply these principles to never-before-observed stimuli."

But even Wild Relation Network (WReN), the best-performing neural network, had its limits: It couldn't extrapolate attribute values that it didn't see during training, and it performed worse on generalization tasks when trained on previously seen relations (e.g., a progression on the number of shapes) or new attributes (size).

"Our results show that it might be unhelpful to draw universal conclusions about generalization: the neural networks we tested performed well in certain regimes of generalization and very poorly in others," the team wrote in a blog post. "Their success was determined by a range of factors, including the architecture of the model used and whether the model was trained to provide an interpretable 'reason' for its answer choices."

The end result might be a mixed bag, but the researchers aren't giving up just yet. They intend to explore strategies for improving generalization and to explore the use of "richly structured, yet generally applicable" inductive biases in future models.

More