This article is part of a VB special issue. Read the full series here: Building the foundation for customer data quality.
As organizations work to glean as much knowledge as possible from the massive trove of customer information available today, it is becoming increasingly important to de-identify that data as it moves around and between an organization, the third parties it works with and the various applications that consume the data — particularly those in the cloud.
Of course, healthcare professionals in the U.S. have been well aware of this imperative for years, having labored under the Health Insurance Portability and Accountability Act (HIPAA) privacy standards since the mid-’90s. More recently, similar worries over personally identifiable information (PII) have become a top concern of regulators, consumers and companies around the globe.
IT research firm Gartner estimates that by the end of 2024, 75% of all consumer information globally will fall under some type of regulation. For its part, California has recently passed the California Consumer Privacy Act (CCPA) and the California Privacy Rights Act (CPRA), both dealing with consumer data and privacy. And the EU’s General Data Protection Regulation (GDPR) is starting to be enforced with vigor.
Meta’s recent $1.3 billion fine for moving Facebook user data from the EU to the U.S. is a painful reminder that regulators are taking the issue seriously. Had that data been de-identified, the fine might never have been levied, said Joseph Williams, a partner in the cybersecurity practice at Infosys Consulting.
And then there is the reputational threat to any organization that does not at least appear to protect its customers’ personal information, should the company be breached and that information end up in the hands of cybercriminals. Cybersecurity professionals believe that most consumers have been the unwitting victims of a data breach in the last 10 years, and much of that data is for sale on the Dark Web.
Some would argue that any data de-identification work is really just an exercise in virtue signaling, given the ease with which individuals can be identified today from cross-correlating publicly available data, said Williams.
“When you start to blend the processing power of AI with what’s out on the Dark Web and social media … and open datasets, suddenly they can put everything into automatic discovery mode and come up with all kinds of interesting things,” he said. “And so the de-identification of data as a technology approach is a way [for regulators] to say, ‘We have imposed these burdens on these companies in order to protect your privacy.’”
Data de-identification techniques and practices
Virtue signaling or not, there are many ways to de-identify data today, said Sameer Ansari, managing director and practice lead for the data privacy team at business consulting company Protiviti. The main challenge isn’t necessarily technical (although de-identifying large volumes of structured and unstructured data is no small task); it’s choosing the least disruptive technique that still achieves the required results.
“Some of it depends on what the problem actually is,” Ansari said. “So, starting at why are you looking for a solution and what industry are you in, there might be use cases where you’re saying, ‘Listen, masking [for example] is not an option.’ That’s going to be the challenge. It’s going to depend a lot on the use case, unfortunately.”
One technique being deployed today is redaction. This is where PII such as Social Security numbers, addresses and email addresses is either masked with symbols such as asterisks or replaced with synthetic or fake data, explained Anshumali Shrivastava, an associate professor of computer science at Rice University and founder of ThirdAI Corp.
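A minimal sketch of that masking approach might look like the following. The regular expressions here are illustrative, not an exhaustive PII detector, and any production redactor would need far broader pattern coverage:

```python
import re

# Illustrative patterns for two common PII types (sketch only).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact(text: str) -> str:
    """Mask Social Security numbers and email addresses with asterisks."""
    text = SSN_RE.sub("***-**-****", text)
    text = EMAIL_RE.sub("*****@*****", text)
    return text

print(redact("Contact jane.doe@example.com, SSN 123-45-6789"))
# → Contact *****@*****, SSN ***-**-****
```

Replacing matches with fixed-length placeholders, rather than same-length ones, also avoids leaking the length of the original value.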
Aggregation, where records in a dataset are generalized into groups such as age ranges, is also popular and effective, he said.
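Generalizing an exact age into a coarse bucket is a simple example of aggregation. The 10-year bucket width below is an arbitrary illustration; real policies would set widths based on re-identification risk requirements:

```python
# Generalize exact ages into coarse buckets so individuals are grouped
# rather than individually distinguishable (sketch; bucket width is arbitrary).
def age_bucket(age: int, width: int = 10) -> str:
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

ages = [23, 37, 41, 29]
print([age_bucket(a) for a in ages])
# → ['20-29', '30-39', '40-49', '20-29']
```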
Tokenization is a method that replaces the sensitive data with consistent replacement strings that have no meaningful value if breached, said Kayne McGladrey, a senior member of the IEEE, the world’s largest technical professional organization.
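One way to produce those consistent, meaningless replacement strings is a keyed hash: the same input always yields the same token, so joins across datasets still work, but the token reveals nothing without the secret key. This is a sketch only; the key name is a placeholder, and real deployments would manage keys in a vault or KMS and often keep a reversible token mapping:

```python
import hashlib
import hmac

# Assumption: in practice this key would live in a secrets manager, not in code.
SECRET_KEY = b"replace-with-vaulted-key"

def tokenize(value: str) -> str:
    """Deterministically map a sensitive value to an opaque token."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"

# The same SSN always maps to the same token; different values never collide
# in practice, and the token itself carries no meaningful information.
print(tokenize("123-45-6789") == tokenize("123-45-6789"))  # → True
```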
“One of the most common standards in the United States is the HIPAA Safe Harbor method, which requires the removal of all 18 identifiers of individuals, relatives, employers and household members,” he said.
Emerging trends in de-identification
The privacy vault method, where data passes through a “vault” to be de-identified, is gaining in popularity, said Infosys’ Williams. A vault can apply various de-identification techniques and relies on encryption keys to keep records from being re-identified after they pass through the vault.
“It wouldn’t do that for all the data, but … the privacy vault would mask [PII] to the customer support person [for example] who would be looking at my record. [The real data] would still be there, still be useful to the company, but … there’s no reason why the customer support person in the state of Washington needs to know my date of birth.”
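The role-aware masking Williams describes could be sketched as a simple policy lookup, where the vault decides per field which roles see cleartext. The role names and policy table here are hypothetical:

```python
# Hypothetical policy: which fields each role may see in cleartext.
POLICY = {
    "support": {"name"},
    "compliance": {"name", "dob", "ssn"},
}

def vault_view(record: dict, role: str) -> dict:
    """Return a copy of the record with fields masked per the role's policy."""
    visible = POLICY.get(role, set())
    return {k: (v if k in visible else "****") for k, v in record.items()}

record = {"name": "Ada", "dob": "1990-01-01", "ssn": "123-45-6789"}
print(vault_view(record, "support"))
# → {'name': 'Ada', 'dob': '****', 'ssn': '****'}
```

The real data stays intact inside the vault; each consumer only ever receives the projection its role entitles it to.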
Confidential computing also is an emerging technology meant to protect data in use, said McGladrey of the IEEE.
“Confidential computing can allow the processing of data from multiple parties without sharing the input data with those other parties,” he said. “For example, if an organization wants to perform processing on a large set of healthcare data collected from multiple third-party organizations, properly configured confidential computing potentially permits those third parties to provide their data for processing in aggregate. In this scenario, not even the cloud provider can see the cleartext data provided by the third parties, or the results.”
Another area of interest for de-identification advocates is synthetic data generation for research purposes, said Shrivastava. In this approach data is generated to mimic the real data it is replacing. Because the data retains the same statistical characteristics and patterns of the original information, data quality isn’t compromised. This method reduces the risks of exposing sensitive information when sharing datasets for scientific studies and research.
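As a toy illustration of the idea, a numeric column can be replaced with samples drawn from a distribution fitted to the original values, so the synthetic column preserves the mean and spread without reproducing any real record. Production synthetic-data tools go much further, preserving correlations across columns and handling categorical data:

```python
import random
import statistics

def synthesize(column, n, seed=0):
    """Sample n synthetic values matching the column's mean and std deviation."""
    mu = statistics.mean(column)
    sigma = statistics.stdev(column)
    rng = random.Random(seed)  # seeded for reproducibility of the sketch
    return [rng.gauss(mu, sigma) for _ in range(n)]

real = [54.2, 61.0, 48.7, 57.5, 52.1]
fake = synthesize(real, 1000)
# The synthetic mean tracks the real mean, but no original value is reused.
```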
The challenge ahead
For most organizations today, data de-identification isn’t going to protect them from the fallout of a serious data breach, but it will help them ensure that the customer data they share during the normal course of business is protected from casual or uninformed misuse and exposure.
Fortunately, from a technical perspective there are many ways to do this, including using a service from an organization’s existing software vendor, such as Salesforce or Snowflake.
The main challenge most organizations will face is understanding where and when it is needed and, when it is, what method of de-identification will serve the purpose at hand without causing a ripple effect that breaks other business processes along the way.