MIT study finds 'systematic' labeling errors in popular AI benchmark datasets

The field of AI and machine learning is arguably built on the shoulders of a few hundred papers, many of which draw conclusions using data from a subset of public datasets. Large, labeled corpora have been critical to the success of AI in domains ranging from image classification to audio classification. That's because their annotations expose comprehensible patterns to machine learning algorithms, in effect telling machines what to look for in future datasets so they're able to make predictions.

But while labeled data is usually equated with ground truth, datasets can -- and do -- contain errors. The processes used to construct corpora often involve some degree of automatic annotation or crowdsourcing techniques that are inherently error-prone. This becomes especially problematic when these errors reach test sets, the subsets of datasets researchers use to compare progress and validate their findings. Labeling errors here could lead scientists to draw incorrect conclusions about which models perform best in the real world, potentially undermining the framework by which the community benchmarks machine learning systems.

A new paper and website published by researchers at MIT instill little confidence that popular test sets in machine learning are immune to labeling errors. In an analysis of 10 test sets from datasets that include ImageNet, an image database used to train countless computer vision algorithms, the coauthors found an average of 3.4% errors across all of the datasets. The quantities ranged from just over 2,900 errors in the ImageNet validation set to over 5 million errors in QuickDraw, a Google-maintained collection of 50 million drawings contributed by players of the game Quick, Draw!
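To put those counts in proportion, here is a rough back-of-the-envelope sketch. The error counts are the article's lower bounds ("just over" and "over"), the QuickDraw size is the 50 million drawings cited above, and the 50,000-image size of the ImageNet validation set is an assumption that does not appear in the article. The `error_rate` helper is a hypothetical name used only for illustration.

```python
# Approximate per-dataset error rates from the figures quoted above.
# Error counts are lower bounds taken from the article's wording.
# ASSUMPTION: the ImageNet validation set holds 50,000 images (not stated in the article).
datasets = {
    "ImageNet (validation)": {"errors": 2_900, "size": 50_000},
    "QuickDraw": {"errors": 5_000_000, "size": 50_000_000},
}


def error_rate(errors: int, size: int) -> float:
    """Share of examples in a set that carry a wrong label."""
    if size <= 0:
        raise ValueError("size must be positive")
    return errors / size


rates = {name: error_rate(d["errors"], d["size"]) for name, d in datasets.items()}

for name, rate in rates.items():
    print(f"{name}: at least {rate:.1%} mislabeled")
```

Under these assumptions both sets sit above the 3.4% figure, which is an average across all 10 test sets rather than a ceiling for any one of them.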

The researchers say the mislabelings make benchmark results from the test sets unstable. For example, when ImageNet and another image dataset, CIFAR-10, were corrected for labeling errors, larger models performed worse than their lower-capacity counterparts. That's because the higher-capacity models reflected the distribution of labeling errors in their predictions to a greater degree than smaller models -- an effect that increased with the prevalence of mislabeled test data.

Examples of the labeling errors the researchers found include:
Mislabeled images, like one breed of dog being confused for another or a baby being confused for a nipple.
Mislabeled text sentiment, like Amazon product reviews described as negative when they were actually positive.
Mislabeled audio of YouTube videos, like an Ariana Grande high note being classified as a whistle.

A previous study out of MIT found that ImageNet has "systematic annotation issues" and is misaligned with ground truth or direct observation when used as a benchmark dataset. The coauthors of that research concluded that about 20% of ImageNet photos contain multiple objects, leading to a drop in accuracy as high as 10% among models trained on the dataset.

In an experiment, the researchers filtered out the erroneous labels in ImageNet and benchmarked a number of models on the corrected set. The results were largely unchanged, but when the models were evaluated only on the erroneous data, those that performed best on the original, incorrect labels were found to perform the worst on the correct labels. The implication is that the models learned to capture systematic patterns of label error in order to improve their original test accuracy.
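The ranking reversal described above can be illustrated with a toy sketch. This is not the researchers' code or data: the labels, the predictions, and the model names `large_model` and `small_model` are invented for illustration, and `accuracy` and `rank_models` are hypothetical helpers. The sketch scores the same predictions twice, once against the original (erroneous) labels and once against the corrected ones, on a subset where every original label is wrong.

```python
from typing import Dict, List, Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the given labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot score an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def rank_models(preds_by_model: Dict[str, Sequence[int]],
                labels: Sequence[int]) -> List[str]:
    """Model names sorted from highest to lowest accuracy against `labels`."""
    return sorted(
        preds_by_model,
        key=lambda name: accuracy(preds_by_model[name], labels),
        reverse=True,
    )


# Invented toy data: six test examples whose original labels were all wrong.
original_labels = [0, 1, 2, 0, 1, 2]
corrected_labels = [1, 2, 0, 1, 2, 0]

predictions = {
    # Mostly reproduces the original (wrong) labels.
    "large_model": [0, 1, 2, 0, 1, 0],
    # Mostly matches the corrected labels.
    "small_model": [1, 2, 0, 0, 1, 0],
}

ranking_on_original = rank_models(predictions, original_labels)
ranking_on_corrected = rank_models(predictions, corrected_labels)

print("Ranked on original labels: ", ranking_on_original)
print("Ranked on corrected labels:", ranking_on_corrected)
```

In this toy case the model that scores best against the original labels scores worst against the corrected ones, which is the pattern the experiment reports on the erroneous subset.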
