A deep dive into privacy-protecting databases

Privacy-protecting databases use a number of techniques to guard data. The complexity of these techniques has evolved as threats to data privacy have risen dramatically.

The simplest way to protect individuals' records in databases may be to assign digital pseudonyms that can be stored in a separate database. Researchers are given only the first database, with the pseudonyms relieving them of the obligation to protect people's real names. The database with real names may be stored in a second, more carefully protected location -- or even completely discarded.

More sophisticated approaches use encryption or a one-way function to compute the pseudonym. This can give users the ability to retrieve their information from the database by reconstructing the pseudonym. But anyone who accesses the database can't easily match the records up with names. My well-aged book, Translucent Databases, explored a number of different approaches to this, and there have been many innovations since then.

Some of the most complicated solutions are called "homomorphic encryption." In these systems, sensitive information is completely encrypted, but complex algorithms are specially designed to allow some basic operations without decryption. For example, some computers may add up a list of numbers from an accounting database without being able to unscramble associated protected values.

Homomorphic encryption is far from mature. Many of the early systems require too much computation to be practical, especially for large databases with many entries. They often require the encryption algorithms to be customized for the data analysis that might come. Still, mathematicians are doing exciting work in the area, and many recent innovations have dramatically reduced the workload involved.

In recent years, researchers have started seriously exploring how adding fake entries or shifting values by adding random noise can make it harder to identify individuals in a database. But if the noise is mixed in correctly, it will cancel out when computing some aggregated statistics, like averages -- a technique referred to as "differential privacy."

What are some use cases?

Saving time and money on security by removing more valuable data. A local version of the database stored at a branch may delete names to remove the danger of loss. The central database can keep complete records for compliance in a more secure building.
Sharing data with researchers. If a business or a school wants to cooperate with a research program, it may ship a version of the database with obscured personal information while withholding a complete version if it's ever necessary to discover the correct name connected to a record.
Encouraging compliance with rules for record-keeping while also maintaining customers' privacy.
Offering strategic protection for military operations while also sharing sufficient data with allies for planning.
A commerce system designed to minimize the danger of insider trading while still tracking all transactions for compliance and settlement.
A fraud detecting accounting system that balances disclosure with privacy.

Vendor approaches to encryption

Makers of established databases have long experimented with employing database encryption algorithms that scan and scramble the data in particular rows and columns so they can only be viewed by someone with the right access key. These encryption algorithms can protect privacy, but many approaches for protecting privacy try to avoid blanket encryption. The goal is to balance secrecy with sharing, to protect private information while revealing non-private information to researchers.

Often encryption algorithms are used as a component of this strategy. Personal information, like names and addresses, are encrypted, and the key for this encryption algorithm is only kept by trusted insiders. Other users receive access to the unencrypted sections.

One common technique involves using one-way functions like the SHA256 hash algorithm to create keys for particular records. Anyone can store and retrieve their personal information because they can compute the key for the data by hashing their name, for example. But attackers who might be browsing the data can't reverse the one-way function to recover the name.

Lately, that option doesn't require encryption, at least directly. Sometimes fake data is mixed into the database, and other times the actual data values are distorted by a small amount. Identifying the records of individual people becomes difficult because of the noise.

Some companies are extending their product line with libraries that add differential privacy to data collections. Google recently open-sourced its internal tool called Privacy-on-Beam, a collection of libraries written in C++, Go, and Java. Users can inject noise before or after storing the information in a Google Cloud database.

Microsoft also recently offered a differential privacy toolkit that was developed in collaboration with computer scientists at Harvard. The team demonstrated how the tool can be employed for a variety of use cases, like sharing a dataset used for training an artificial intelligence application or computing statistics used for planning marketing campaigns.

Oracle has also been exploring using the algorithms to help protect interactions with researchers training a machine learning algorithm. One recent use case explores mixing differential privacy algorithms with federated learning that works with a distributed database.

Is open source a way forward?

Many of the early explorers of differential privacy are working together on an open source project called OpenDP. It aims to build a diverse collection of algorithms that share a common framework and data structure. Users will be able to combine multiple algorithms and build a layered approach to protecting the data.

Another approach concentrates on auditing and fixing any data issues. The Privacera platform's suite of tools can search through files to identify and mask personally identifiable information (PII). It deploys a collection of machine learning techniques, and the tools are integrated with cloud APIs to simplify deployment across multiple clouds and vendors.

For more than a decade, IBM has been shipping homomorphic encryption. The company offers toolkits for Linux, iOS, and MacOS to accommodate developers who want to incorporate homomorphic encryption into their software. The company also offers consulting services and a cloud environment for storing and processing the data securely.

Is there anything privacy-protecting databases can't do?

The underlying math is often unimpeachable, but there can be many other weak links in the systems. Even if the algorithms don't have any known weak spot, attackers can sometimes find vulnerabilities.

In some cases, bad actors simply attack the operating system. In others, they go after the communications layer. Some sophisticated attacks combine information from multiple sources to reconstruct the hidden data inside.

But using privacy-protecting techniques on data continues to provide another layer of assurance that can simplify compliance. It can also enable types of collaboration that wouldn't be possible without it.