In a recent conversation, Facebook AI research scientist Moustapha Cissé told me, “You are what you eat, and right now we feed our models junk food.” Well, just like you can’t eat better if you don’t know what’s in your food, you can’t train less biased models if you don’t know what’s in your training data. That’s why the recent paper “Datasheets for Datasets” is so interesting. In it, Timnit Gebru and her coauthors from Microsoft Research and elsewhere propose the equivalent of food nutrition labeling for datasets.

Given that many machine learning and deep learning model development efforts use public datasets such as ImageNet or COCO — or private datasets produced by others — it’s important to be able to convey the context, biases, and other material aspects of a training dataset to those interested in using it. The “Datasheets” paper explores the idea of using standardized datasheets to communicate this information to users of datasets, commercialized APIs, and pre-trained models. In addition to helping to communicate data biases, the authors propose that such datasheets can improve transparency and provide a source of accountability.

Beyond potential ethical issues, hidden data biases can cause unpredictability or failures in deployed systems when models trained on third-party data fail to generalize adequately to different contexts. Of course, the best option is to collect first-party data and use models built and trained by experts with deep domain knowledge. But widely available public datasets, more approachable machine learning tools, and readily accessible AI APIs and prebuilt models are democratizing AI and enabling a broader group of developers to incorporate AI into their applications. The authors suggest that datasheets for AI datasets and tools could go a long way in providing essential information to engineers who might not have domain expertise, and in doing so help mitigate some of the issues associated with dataset misuse.

This perspective echoes similar thoughts from Clare Gollnick, CTO of information security firm Terbium Labs, in our discussion on the reproducibility crisis in science and AI. She expressed concern about developers who turn first to deeper, more complex models to solve their problems, noting that they often run into generalization issues when those models are moved into production. By contrast, she finds that when researchers solve AI problems by capitalizing on a discovery that comes from a strong understanding of the domain at hand, the results are much more robust.

Gebru and her coauthors suggest in the paper that AI has yet to be subject to the kind of safety regulation that earlier emergent industries, such as the automotive, medical, and electrical industries, eventually faced. The paper points out:

“When cars first became available in the United States, there were no speed limits, stop signs, traffic lights, driver education, or regulations pertaining to seat belts or drunk driving. Thus, the early 1900s saw many deaths and injuries due to collisions, speeding, and reckless driving.”

Over the course of decades, the automobile industry and others iteratively developed regulations meant to protect the public good, while still allowing for innovation. The paper suggests that it’s not too early to start considering these types of regulations for AI, especially as we begin to use it in high-stakes areas like health care and the public sector. Such regulation will likely first apply to issues of privacy, bias, ethics, and transparency, and in fact, Europe’s impending General Data Protection Regulation (GDPR) takes on just these issues.

The proposed datasheets take cues from those associated with electrical components. Every electrical component sold has an accompanying datasheet that lists the component’s function, features, operating voltages, physical details, and more. These datasheets have become expected in the industry due to the need to understand a part’s behavior before purchase, as well as the liability issues that arise from a part’s misuse.

The authors suggest that those offering datasets or APIs should provide a datasheet that addresses a set of standardized questions covering the following topics:

  • The motivation for dataset creation
  • The composition of the dataset
  • The data collection process
  • The preprocessing of the data
  • The distribution of the data
  • The maintenance of the data
  • The legal and ethical considerations
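
To make the idea a little more concrete, here is a minimal sketch of how those seven topics might be captured in code. It is my own illustration, not something from the paper: the `Datasheet` class, its field names, and the `missing_sections` and `render` helpers are all hypothetical, and each topic is reduced to a single free-text field rather than the paper's full set of questions.

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    """Hypothetical container with one free-text field per topic listed above."""

    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    preprocessing: str = ""
    distribution: str = ""
    maintenance: str = ""
    legal_and_ethical: str = ""

    def missing_sections(self) -> list[str]:
        """Return the names of topics that have not been filled in yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Render the datasheet as plain text, marking unanswered topics."""
        lines = []
        for f in fields(self):
            title = f.name.replace("_", " ").capitalize()
            answer = getattr(self, f.name).strip() or "(not provided)"
            lines.append(f"{title}: {answer}")
        return "\n".join(lines)


# Placeholder answers only; a real datasheet would carry the producer's own text.
sheet = Datasheet(
    motivation="Placeholder: why the dataset was created.",
    composition="Placeholder: what the instances are and how many there are.",
)
print(sheet.missing_sections())
print(sheet.render())
```

A dataset producer could run a check like `missing_sections()` before publishing to see which topics are still unanswered, which is one lightweight way to make the standardized questions hard to skip.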

For the full breakdown of all of the questions, check out the paper; it goes into a bunch of additional detail and provides an example datasheet for the UMass Labeled Faces in the Wild dataset. It’s a thorough and easy-to-use model that has the potential for big impact.

Datasheets such as these will allow users to understand the strengths and limitations of the data that they’re using and guard against issues such as bias and overfitting. One can also argue that simply having datasheets at all forces both dataset producers and consumers to think differently about their data sources and to understand that the data is not a de facto source of truth but rather a living, breathing resource that requires careful consideration and maintenance.

Maybe it’s the electrical engineer in me, but I think this is a really interesting idea.

The original version of this story appeared in the This Week in Machine Learning & AI newsletter. Copyright 2018.

Sam Charrington is host of the podcast This Week in Machine Learning & AI (TWiML & AI) and founder of CloudPulse Strategies.