Researchers investigate why popular AI algorithms classify objects by texture, not by shape

In a paper accepted to the 2020 NeurIPS conference, Google and Stanford researchers explore the bias exhibited by certain kinds of computer vision algorithms -- convolutional neural networks (CNNs) -- trained on the open source ImageNet dataset. Unlike humans, ImageNet-trained CNNs tend to classify images by texture rather than by shape. Their work indicates that CNNs' bias toward textures may arise not from differences in their internal workings but from differences in the data that they see.

CNNs attain state-of-the-art results in computer vision tasks including image classification, object detection, and segmentation. Although their performance in several of these tasks approaches that of humans, recent findings show that CNNs differ in key ways from human vision. For example, one study compared humans to ImageNet-trained CNNs on a dataset of images with conflicting shape and texture information (e.g. an elephant-textured knife), concluding that models tend to classify according to material (e.g. "checkered") and humans to shape (e.g. "circle").
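To make the idea of texture bias concrete, here is a minimal sketch of how a shape-versus-texture score could be computed on cue-conflict images like the elephant-textured knife. The function name, the label format, and the exact definition (shape decisions divided by all decisions that match either cue) are assumptions made for illustration; the article does not spell out the metric the researchers used.

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of shape decisions among predictions matching either cue.

    Hypothetical helper: each image has one label for its shape and one
    for its texture. Predictions matching neither cue are ignored.
    """
    shape_hits = 0
    texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if shape == texture:
            continue  # no conflict between cues, so the image is uninformative
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
    decided = shape_hits + texture_hits
    if decided == 0:
        return None  # the model never picked either cue
    return shape_hits / decided


# Four knife-shaped images rendered with elephant texture (made-up predictions).
predictions = ["knife", "elephant", "elephant", "cat"]
shape_labels = ["knife"] * 4
texture_labels = ["elephant"] * 4
score = shape_bias(predictions, shape_labels, texture_labels)
```

Under this definition, a score above 0.5 means the model sided with shape more often than texture, and a score below 0.5 means the opposite. In the made-up example, one shape decision against two texture decisions gives a score of one third.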

The Google and Stanford team discovered that "naturalistic" data augmentation involving color distortion, noise, and blur can decrease this CNN texture bias, whereas "random-crop" augmentation increases the bias. Combining these observations, they trained models that classify ambiguous images by shape a majority of the time. These models also ostensibly outperform baselines on datasets that exemplify different notions of shape.
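The two kinds of augmentation contrasted above can be sketched in a few lines of NumPy. This is an illustration only: the function names, parameter values, and the simple box blur are assumptions, not the researchers' actual pipeline. Images are assumed to be float arrays of shape (height, width, 3) with values in [0, 1].

```python
import numpy as np


def color_distort(image, rng, strength=0.4):
    # Random overall brightness change plus a random per-channel scale.
    brightness = 1.0 + rng.uniform(-strength, strength)
    channel_scale = 1.0 + rng.uniform(-strength, strength, size=(1, 1, 3))
    return np.clip(image * brightness * channel_scale, 0.0, 1.0)


def add_gaussian_noise(image, rng, sigma=0.05):
    # Additive pixel noise, clipped back into the valid range.
    return np.clip(image + rng.normal(0.0, sigma, size=image.shape), 0.0, 1.0)


def box_blur(image, radius=1):
    # Mean over a (2r+1) x (2r+1) window, a stand-in for a smoother blur.
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    height, width, _ = image.shape
    size = 2 * radius + 1
    out = np.zeros(image.shape, dtype=np.float64)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + height, dx:dx + width]
    return out / (size * size)


def naturalistic_augment(image, rng):
    # Color distortion, then noise, then blur: the bias-reducing family.
    return box_blur(add_gaussian_noise(color_distort(image, rng), rng))


def random_crop(image, rng, size):
    # Keeps a random square patch: the kind of augmentation reported to raise texture bias.
    height, width, _ = image.shape
    top = rng.integers(0, height - size + 1)
    left = rng.integers(0, width - size + 1)
    return image[top:top + size, left:left + size]


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))  # random stand-in for a real photo
augmented = naturalistic_augment(image, rng)
patch = random_crop(image, rng, size=4)
```

One way to read the contrast: a small crop often shows surface pattern without the object's outline, while color distortion, noise, and blur degrade surface detail and leave the overall outline largely in place.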

CNN model architectures that perform better on ImageNet generally have less texture bias, according to the researchers, but architectures designed to match the human visual system don't have biases substantially different from ordinary CNNs. In the course of experimentation, the researchers also discovered that it's possible to extract more shape information from a CNN than is reflected in the model's classifications.

As the coauthors note, people who build and interact with tools for computer vision -- especially those without extensive training in machine learning -- often hold a mental model in which computer vision systems work much like human vision. But the paper's findings build on a body of work showing this view is incorrect. Differences between human and machine vision of the kind the coauthors studied could cause data scientists to make significant errors in anticipating and reasoning about the behavior of computer vision systems. The coauthors argue that enabling people from a range of backgrounds to build safe, predictable, and equitable models requires vision systems that perform at least roughly in accordance with their expectations.

"Making computer vision models that share the same inductive biases as humans is an important step towards this goal," the researchers wrote. "At the same time, we recognize the possible negative consequences of blindly constraining models' judgments to agree with people's: human visual judgments display forms of bias that should be kept out of computer models. More broadly, we believe that work like ours can have a beneficial impact on the internal sociology of the machine learning community. By identifying connections to developmental psychology and neuroscience, we hope to enhance interdisciplinary connections across fields, and to encourage people with a broader range of training and backgrounds to participate in machine learning research."