Proper data hygiene critical as enterprises focus on AI governance

Today's artificial intelligence/machine learning algorithms run on hundreds of thousands, if not millions, of data sets. The high demand for data has spawned services that collect, prepare, and sell them.

But data's rise as a valuable currency also subjects it to more extensive scrutiny. In the enterprise, greater AI governance must accompany machine learning's growing use.

In a rush to get their hands on the data, companies might not always do due diligence in the gathering process -- and that can lead to unsavory repercussions. Navigating the ethical and legal ramifications of improper data gathering and use is proving to be challenging, especially in the face of constantly evolving legal regulations and growing consumer awareness about privacy and consent.

The role of data in machine learning

Supervised machine learning, a subset of artificial intelligence, needs extensive banks of labeled data to do its job well. It "learns" from a variety of images, audio files or other kinds of data.

For example, a machine learning algorithm used in airport baggage screening learns what a gun looks like by seeing millions of pictures of guns -- and millions not containing guns. This means companies need to prepare such a training set of labeled images.

Similar situations play out with audio data, says Dr. Chris Mitchell, CEO of sound recognition technology company Audio Analytic. If a home security system is going to lean on AI, it needs to recognize a whole host of sounds, including window glass breaking and smoke alarms, according to Mitchell. Equally important, it needs to identify these sounds correctly despite potential background noise. To do that, it needs to train on target data, which is the exact sound of the fire alarm. It also needs non-target audio: sounds that are similar to -- but different from -- the fire alarm.
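As a rough illustration of what a target/non-target training set might look like, here is a minimal Python sketch. It is not Audio Analytic's format: the `LabeledClip` class, its fields, the file paths, and the example non-target sounds (a kitchen timer, a phone ringtone) are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LabeledClip:
    """One training example; all field names are illustrative assumptions."""
    path: str          # hypothetical file location
    description: str   # human-readable note on what the clip contains
    is_target: bool    # True for the sound to detect, False for look-alikes


def count_by_label(clips: List[LabeledClip]) -> Dict[str, int]:
    """Count target and non-target examples in a training set."""
    counts = {"target": 0, "non_target": 0}
    for clip in clips:
        counts["target" if clip.is_target else "non_target"] += 1
    return counts


# Hypothetical training set: the alarm itself, plus similar-sounding clips
# that the model should learn NOT to flag.
training_set = [
    LabeledClip("clips/alarm_01.wav", "fire alarm", True),
    LabeledClip("clips/alarm_02.wav", "fire alarm with background noise", True),
    LabeledClip("clips/timer_01.wav", "kitchen timer beeping", False),
    LabeledClip("clips/phone_01.wav", "phone ringtone", False),
]

print(count_by_label(training_set))
```

Keeping the label on every clip is what makes the set usable for supervised training, and a quick count like this shows whether both kinds of example are actually present.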

ML data headaches

As ML algorithms take on text, images, audio and various other data types, the need for data hygiene grows more acute. And as these algorithms gain traction and find new for-profit use cases in the real world, the provenance of the related data sets is increasingly coming under the microscope. Companies need to be prepared to answer these questions:

Where is the data from?
Who owns it?
Has the person who appears in the data, or who produced it, granted consent for its use?
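One way to make these three questions concrete is to store the answers alongside each data set. The following is a minimal Python sketch under that assumption, not a description of any company's practice; the `ProvenanceRecord` class, its field names and the `unanswered_questions` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical record answering the three provenance questions."""
    dataset_name: str
    source: Optional[str] = None            # Where is the data from?
    owner: Optional[str] = None             # Who owns it?
    consent_granted: Optional[bool] = None  # Has consent for use been granted?


def unanswered_questions(record: ProvenanceRecord) -> List[str]:
    """Return the provenance questions this record cannot yet answer."""
    gaps = []
    if not record.source:
        gaps.append("source")
    if not record.owner:
        gaps.append("owner")
    # None means "nobody checked"; False is an answer (consent was refused).
    if record.consent_granted is None:
        gaps.append("consent")
    return gaps


# Example: a purchased data set where only the vendor is known.
purchased = ProvenanceRecord("purchased_faces", source="third-party vendor")
print(unanswered_questions(purchased))
```

In this sketch a data set with any remaining gaps would be held back from training until someone fills them in.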

These questions place AI data governance needs at the root of ethical concerns and laws related to privacy and consent. If a facial recognition system scans people's faces, after all, shouldn't every person whose face is used by the algorithm have consented to such use?

Laws addressing privacy and consent are gaining traction. The European Union's General Data Protection Regulation (GDPR) gives individuals the right to grant consent to the use of their personal data and to withdraw it at any time. Meanwhile, a 2021 proposal from the European Union would set up a legal framework for AI governance that would disallow use of some kinds of data and require permission before collecting data.

Even buying datasets does not grant a company immunity from responsibility for their use. That became clear when the Federal Trade Commission slapped Facebook with a $5 billion fine over consumer privacy. One of the many prescriptions was a mandate for tighter control over third-party apps.

The take-home message is clear, Mitchell says: The buck starts and stops with the company using the data, no matter the data's origins. "It's now down to the machine learning companies to be able to answer the question: 'Where did my data come from?' It's their responsibility," Mitchell said.

Beyond fines and legal concerns, the strength of AI models depends on robust data. If a company has not done due diligence in tracking the provenance of its data and a consumer retracts permission tomorrow, extracting that consumer's data can prove to be a nightmare, because the paths data takes through AI systems are notoriously difficult to trace.
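To show why provenance tracking matters here, consider a simplified Python sketch of removing one person's data after consent is withdrawn. It assumes every sample was tagged with an identifier for the person it came from at collection time; the `Sample` class, the `subject_id` field and the `drop_withdrawn` function are hypothetical names for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Sample:
    sample_id: str
    subject_id: str  # hypothetical ID of the person the data came from
    path: str        # hypothetical file location


def drop_withdrawn(samples: Iterable[Sample], withdrawn: Set[str]) -> List[Sample]:
    """Keep only samples whose subject has not withdrawn consent."""
    return [s for s in samples if s.subject_id not in withdrawn]


samples = [
    Sample("s1", "person_a", "img/001.jpg"),
    Sample("s2", "person_b", "img/002.jpg"),
    Sample("s3", "person_a", "img/003.jpg"),
]

# person_a retracts permission, so both of their samples must go.
remaining = drop_withdrawn(samples, {"person_a"})
print([s.sample_id for s in remaining])
```

The filter itself is one line. The hard part is the assumption behind it: if nobody recorded which person each sample came from, there is nothing to filter on.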

The complicated consent landscape

Asking for consent is a good prescription, but one that's difficult to execute. For one thing, dataset use might be so far removed from the source that companies might not even know from whom to obtain consent.

Nor would consumers always know what they're consenting to, says Dr. James Giordano, director of the Program in Biosecurity and Ethics at the Cyber-SMART Center of Georgetown University and co-director of the Program in Emerging Technology and Global Law and Policy.

"The ethical-legal construct of consent, at its bare minimum, can be seen as exercising the rights of acceptance or refusal," Giordano said. "When I consent, I'm saying, 'Yes, you can do this.' But that would assume that I know what 'this' is."

That kind of informed consent is not always practical. After all, the data might have originally been collected for some unrelated purpose, and consumers and even companies might not know where the trail of data breadcrumbs actually leads.

"As a basic principle, 'When in doubt, ask for consent' is a sensible strategy to follow," Mitchell said.

So, company managers need to ensure robust, well-governed data is the foundation of ML models. "It's rather simple," Mitchell said. "You've got to put the hard work in. You don't want to take shortcuts."
