AI datasets are prone to mismanagement, study finds

Public datasets like Duke University's DukeMTMC are often used to train, test, and fine-tune machine learning algorithms that make their way into production, sometimes with controversial results. It's an open secret that biases in these datasets could negatively impact the predictions made by an algorithm, for example causing a facial recognition system to misidentify a person. But a recent study coauthored by researchers at Princeton reveals that computer vision datasets, particularly those containing images of people, present a range of ethical problems.

Generally speaking, the machine learning community now recognizes mitigating the harms associated with datasets as an important goal. But these efforts could be more effective if they were informed by an understanding of how datasets are used in practice, the coauthors of the report say. Their study analyzed nearly 1,000 research papers that cite three prominent datasets -- DukeMTMC, Labeled Faces in the Wild (LFW), and MS-Celeb-1M -- and their derivative datasets, as well as models trained on the datasets. The top-level finding is that the creation of derivatives and models and a lack of clarity around licensing introduce major ethical concerns.

Auditing datasets

DukeMTMC, LFW, and MS-Celeb-1M each contain thousands to millions of images curated to train object- and people-recognizing algorithms. DukeMTMC draws from surveillance footage captured on Duke University's campus in 2014, while LFW has photos of faces scraped from various Yahoo News articles. MS-Celeb-1M, meanwhile, which was released by Microsoft in 2016, comprises roughly 10 million facial photos of about 100,000 different people.

Problematically, two of the datasets -- DukeMTMC and MS-Celeb-1M -- were used by corporations tied to mass surveillance operations. Worse still, all three contain at least some people who didn't give their consent to be included, despite Microsoft's insistence that MS-Celeb-1M featured only "celebrities."

In response to blowback, the creators of DukeMTMC and MS-Celeb-1M took down their respective datasets, while the University of Massachusetts, Amherst team behind LFW updated its website with a disclaimer prohibiting "commercial applications." However, according to the Princeton study, these retractions fell short of making the datasets unavailable and actively discouraging their use.

The coauthors found that offshoots of MS-Celeb-1M and DukeMTMC containing the entire original datasets remain publicly accessible. MS-Celeb-1M, while taken down by Microsoft, survives on third-party sites like Academic Torrents. Twenty GitHub repositories host models trained on MS-Celeb-1M. And both MS-Celeb-1M and DukeMTMC have been used in over 120 research papers 18 months after the datasets were retracted.

The retractions present another challenge, according to the study: a lack of license information. While the DukeMTMC license can be found in GitHub repositories of derivatives, the coauthors were only able to recover the MS-Celeb-1M license -- which prohibits the redistribution of the dataset or derivatives -- from an archived version of its now-defunct website.

Derivatives and licenses

Creating new datasets from subsets of original datasets can serve a valuable purpose, for example enabling new AI applications. But altering the compositions with annotations and post-processing can lead to unintended consequences, raising responsible use concerns, the Princeton researchers note.

For example, a derivative of DukeMTMC -- DukeMTMC-ReID, a "person re-identification benchmark" -- has been used in research projects for "ethically dubious" purposes. Multiple derivatives of LFW label the original images with sensitive attributes including race, gender, and attractiveness. SMFRD, a spin-off of LFW, adds face masks to its images -- potentially violating the privacy of those who wish to conceal their face. And several derivatives of MS-Celeb-1M align, crop, or "clean" images in a way that might impact certain demographics.

Derivatives, too, expose the limitations of licenses, which are meant to dictate how datasets may be used, derived from, and distributed. MS-Celeb-1M was released under a Microsoft Research license agreement, which specifies that users may "use and modify [the] corpus for the limited purpose of conducting non-commercial research." However, the legality of using models trained on MS-Celeb-1M data remains unclear. As for DukeMTMC, it was made available under a Creative Commons license, meaning it can be shared and adapted as long as (1) attribution is given, (2) it's not used for commercial purposes, (3) derivatives are shared under the same license, and (4) no additional restrictions are added to the license. But as the Princeton coauthors note, there are many possible ambiguities in a "non-commercial" designation for a dataset, such as how nonprofits and governments may use it.

Recommendations

To address these and other ethical issues with AI datasets, the coauthors recommend that dataset creators be precise in license language about how datasets can be used and prohibit potentially questionable uses. They also advocate ensuring licenses remain available even if, as in the case of MS-Celeb-1M, the website hosting the dataset becomes unavailable.

Beyond this, the Princeton researchers say that creators should continuously steward a dataset, actively examine how it may be misused, and update its license, documentation, or access restrictions as necessary. They also suggest that dataset creators use "procedural mechanisms" to control derivative creation, for example by requiring explicit permission before a derivative can be created.

"At a minimum, dataset users should comply with the terms of use of datasets. But their responsibility goes beyond compliance," the coauthors wrote. "The machine learning community is responding to a wide range of ethical concerns regarding datasets and asking fundamental questions about the role of datasets in machine learning research. We provide a new perspective ... Through our analysis of the life cycles of three datasets, we showed how developments that occur after dataset creation can impact the ethical consequences, making them hard to anticipate a priori."
