MIT researchers have concluded that the well-known ImageNet data set has “systematic annotation issues” and is misaligned with ground truth or direct observation when used as a benchmark data set.

“Our analysis pinpoints how a noisy data collection pipeline can lead to a systematic misalignment between the resulting benchmark and the real-world task it serves as a proxy for,” the researchers write in a paper titled “From ImageNet to Image Classification: Contextualizing Progress on Benchmarks.” “We believe that developing annotation pipelines that better capture the ground truth while remaining scalable is an important avenue for future research.”

When the Stanford University Vision Lab introduced ImageNet at the Conference on Computer Vision and Pattern Recognition (CVPR) in 2009, it was much larger than many previously existing image data sets. The ImageNet data set contains millions of photos and was assembled over the span of more than two years. ImageNet uses the WordNet hierarchy for data labels and is widely used as a benchmark for object recognition models. Until 2017, annual competitions with ImageNet also played a role in advancing the field of computer vision.

But after closely examining ImageNet’s “benchmark task misalignment,” the MIT team found that about 20% of ImageNet photos include multiple objects. Their analysis across multiple object recognition models revealed that having multiple objects in a photo can lead to a 10% drop in general accuracy. At the core of these issues, the authors said, are the data collection pipelines used to create large-scale image data sets like ImageNet.

“Overall, this [annotation] pipeline suggests that the single ImageNet label may not always be enough to capture the ImageNet image content. However, when we train and evaluate, we treat these labels as the ground truth,” report coauthor and MIT Ph.D. candidate Shibani Santurkar said in an International Conference on Machine Learning (ICML) presentation on the work. “Thus, this could cause a misalignment between the ImageNet benchmark and the real-world object recognition task, both in terms of features that we encourage our models to do [and] how we assess their performance.”

According to the researchers, an ideal approach for a large-scale image data set would be to collect images of individual objects in the world and have experts label them in exact categories, but that’s not cheap or easy to scale. Instead, ImageNet collected images from search engines and sites like Flickr. The scraped images were then reviewed by annotators on Amazon’s Mechanical Turk. The researchers note that the Mechanical Turk workers who labeled ImageNet photos were directed to focus on a single object and ignore other objects or occlusions. Other large-scale image data sets have followed a similar, and potentially problematic, pipeline, the researchers said.

To evaluate ImageNet, the researchers created a pipeline that asked human data labelers to choose from multiple labels and pick one that was most relevant to the photo. The most frequently selected label was then used to train models to determine what the researchers call an “absolute ground truth.”

“The key idea that we leverage is to actually augment the ImageNet labels using model predictions. Specifically, we take a wide range of models and aggregate their top five predictions to get a set of candidate labels,” Santurkar said. “Then we actually determine the validity of these labels by using human annotators, but instead of asking them whether a single label is valid, we repeat this process independently for multiple labels. This allows us to determine the set of labels that could be relevant for a single image.”
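The two steps Santurkar describes can be sketched in a few lines of Python. This is a minimal illustration rather than the researchers’ code: the function names, the example labels, the vote counts, and the 50% agreement threshold are all assumptions made for the example.

```python
def candidate_labels(model_top5):
    """Pool the top-five predictions of several models into one candidate set.

    model_top5 maps a model name to its ranked predictions for one image.
    Duplicates are dropped and order of first appearance is kept.
    """
    seen = {}
    for predictions in model_top5.values():
        for label in predictions[:5]:
            seen.setdefault(label, None)
    return list(seen)


def valid_labels(votes, num_annotators, min_frequency=0.5):
    """Keep every candidate that enough annotators judged valid.

    votes maps a label to the number of annotators who marked it as present.
    Each label is judged independently, so several labels can pass.
    min_frequency is a hypothetical cutoff chosen for this sketch.
    """
    return sorted(
        label
        for label, count in votes.items()
        if count / num_annotators >= min_frequency
    )


# Hypothetical predictions from two models for a single photo of a desk.
model_top5 = {
    "model_a": ["desk", "monitor", "keyboard", "mouse", "lamp"],
    "model_b": ["monitor", "desk", "screen", "laptop", "keyboard"],
}
candidates = candidate_labels(model_top5)

# Hypothetical results from 10 annotators, one yes/no question per candidate.
votes = {"desk": 9, "monitor": 8, "keyboard": 6, "mouse": 2,
         "lamp": 0, "screen": 5, "laptop": 1}
accepted = valid_labels(votes, num_annotators=10)
print(candidates)
print(accepted)
```

Because each candidate is checked on its own, a photo of a desk with a monitor and keyboard can end up with several accepted labels instead of the single label the original ImageNet pipeline would assign.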

But the team cautions that their approach is not a perfect match for ground truth since they also used non-expert data labelers. They conclude that it can be difficult for human annotators who are not experts to accurately label images in some instances. Choosing among 24 breeds of terrier could be difficult unless you’re a dog expert, for example.

The team’s paper was accepted for publication at ICML this week after being initially published in late May. The paper’s presentation at the conference followed MIT’s decision to remove the 80 Million Tiny Images data set from the internet and ask researchers with copies of the data set to delete them. These measures were taken after researchers drew attention to offensive labels in the data set, like the N-word, as well as sexist terms for women and other derogatory labels. Researchers who audited the 80 Million Tiny Images data set, which was released in 2006, concluded that these labels were incorporated as a result of the WordNet hierarchy.

ImageNet also used the WordNet hierarchy, and in a paper published at the ACM FAccT conference, ImageNet creators said they plan to remove virtually all of about 2,800 categories in the person subtree of the data set. They also cited other problems with the data set, such as a lack of image diversity.

Beyond large-scale image data sets used to train and benchmark models, the shortcomings of large-scale text data sets were a key theme at the Association for Computational Linguistics (ACL) conference earlier this month.

In other ImageNet-related news, Richard Socher left his job as Salesforce chief scientist this week to launch his own company. Socher helped compile the ImageNet data set in 2009. At Salesforce, he oversaw the launch of the company’s first cloud AI services, as well as Salesforce Research.

