Data labeling for AI research is highly inconsistent, study finds

Supervised machine learning, in which models learn from labeled training data, is only as good as the quality of that data. In a study published in the journal Quantitative Science Studies, researchers at consultancy Webster Pacific and the University of California's San Diego and Berkeley campuses investigated to what extent best practices around data labeling are followed in AI research papers, focusing on human-labeled data. They found that the types of labeled data range widely from paper to paper and that a "plurality" of the studies they surveyed gave no information about who performed labeling -- or where the data came from.

While labeled data is usually equated with ground truth, datasets can -- and do -- contain errors. The processes used to build them are inherently error-prone, which becomes problematic when these errors reach test sets, the subsets of datasets researchers use to compare progress. A recent MIT paper identified thousands to millions of mislabeled samples in datasets used to train commercial systems. These errors could lead scientists to draw incorrect conclusions about which models perform best in the real world, undermining benchmarks.

The coauthors of the Quantitative Science Studies paper examined 141 AI studies across a range of disciplines, including social sciences and humanities, biomedical and life sciences, and physical and environmental sciences. Of all the papers, 41% tapped an existing human-labeled dataset, 27% produced a novel human-labeled dataset, and 5% didn't disclose either way. (The remaining 27% used machine-labeled datasets.) Only half of the projects using human-labeled data revealed whether the annotators were given documents or videos containing guidelines, definitions, and examples they could reference as aids. Moreover, there was a "wide variation" in the metrics used to measure how often annotators agreed with one another on particular labels, with some papers failing to report any such metric.
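To make the idea of an agreement metric concrete, here is a minimal Python sketch of one common choice, Cohen's kappa, which corrects raw percent agreement for the agreement two annotators would reach by chance. The study does not say which metrics individual papers used, so this is an illustration only; the function name and the toy review labels are assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Illustrative helper (hypothetical name), not taken from the study.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label if each
    # annotator labeled at random according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout;
        # kappa is undefined, so treat it as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (made up): two annotators rating eight business reviews.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")  # raw agreement is 6/8 = 0.75, kappa is 0.50
```

In this toy case the annotators agree on 75% of items, but because half of that agreement would be expected by chance, kappa is only 0.5. Two papers reporting "75% agreement" and "kappa of 0.5" could therefore be describing the same labeling quality, which is one reason inconsistent metrics make studies hard to compare.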

Compensation and reproducibility

As a previous study by Cornell and Princeton scientists pointed out, a major venue for crowdsourcing labeling work is Amazon Mechanical Turk, where annotators mostly originate from the U.S. and India. This can lead to an imbalance of cultural and social perspectives. For example, research has found that models trained on ImageNet and Open Images, two large, publicly available image datasets, perform worse on images from Global South countries. Images of grooms are classified with lower accuracy when they come from Ethiopia and Pakistan than when they come from the U.S.

For annotators, labeling tasks tend to be monotonous and low-paying -- ImageNet workers made a median of $2 per hour in wages. Unfortunately, the Quantitative Science Studies survey shows that the AI field leaves the issue of fair compensation largely unaddressed. Most publications didn't indicate what type of reward they offered to labelers or even include a link to the training dataset.

Beyond doing a disservice to labelers, the lack of dataset links threatens to exacerbate the reproducibility problem in AI. At ICML 2019, 30% of authors failed to submit code with their papers by the start of the conference. And one report found that 60% to 70% of answers given by natural language processing models were embedded somewhere in the benchmark training sets, indicating that the models were often simply memorizing answers.
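A rough sense of how such overlap might be measured, when the training data is actually available, is sketched below in Python. This is not the method used in the report cited above; it is a simple substring check under stated assumptions, and the function names and toy data are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't hide a match."""
    return " ".join(text.lower().split())


def answer_overlap_rate(test_answers, training_texts):
    """Fraction of test answers that appear verbatim somewhere in the training texts.

    Illustrative helper (hypothetical name). Short answers will match often
    by coincidence, so the result is an upper bound on memorization at best.
    """
    if not test_answers:
        raise ValueError("test_answers must not be empty")

    corpus = [normalize(doc) for doc in training_texts]
    hits = sum(
        any(normalize(answer) in doc for doc in corpus)
        for answer in test_answers
    )
    return hits / len(test_answers)


# Toy data (made up for illustration).
training_texts = [
    "The capital of France is Paris.",
    "Water boils at 100 degrees Celsius.",
]
test_answers = ["Paris", "100 degrees celsius", "Canberra"]

rate = answer_overlap_rate(test_answers, training_texts)
print(f"{rate:.0%} of test answers appear in the training texts")
```

A check like this is only possible when a paper links to its training data, which is why missing dataset links make it harder for other researchers to tell learning from memorization.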

"Some of the papers we analyzed described in great detail how the people who labeled their dataset were chosen for their expertise, from seasoned medical practitioners diagnosing diseases to youth familiar with social media slang in multiple languages. That said, not all labeling tasks require years of specialized expertise, such as more straightforward tasks we saw, like distinguishing positive versus negative business reviews or identifying different hand gestures," the coauthors of the Quantitative Science Studies paper wrote. "Even the more seemingly straightforward classification tasks can still have substantial room for ambiguity and error for the inevitable edge cases, which require training and verification processes to ensure a standardized dataset."

Moving forward

The researchers avoid advocating for a single, one-size-fits-all solution to human data labeling. However, they call for data scientists who choose to reuse datasets to exercise as much caution around the decision as they would if they were labeling the data themselves -- lest bias creep in. An earlier version of ImageNet was found to contain photos of naked children, porn actresses, and college parties, all scraped from the web without those individuals' consent. Another popular dataset, 80 Million Tiny Images, was taken offline after an audit surfaced racist, sexist, and otherwise offensive annotations, such as nearly 2,000 images labeled with the N-word and labels like "rape suspect" and "child molester."

"We see a role for the classic principle of reproducibility, but for data labeling: does the paper provide enough detail so that another researcher could hypothetically recruit a similar team of labelers, give them the same instructions and training, reconcile disagreements similarly, and have them produce a similarly labeled dataset?" the researchers wrote. "[Our work gives] evidence to the claim that there is substantial and wide variation in the practices around human labeling, training data curation, and research documentation ... We call on the institutions of science -- publications, funders, disciplinary societies, and educators -- to play a major role in working out solutions to these issues of data quality and research documentation."