Can privacy and security be preserved in the course of large-scale textual data analysis? As it turns out, yes. In a recently published study, a team of Amazon researchers proposed a way to anonymize customer-supplied data. They claim that their approach, which works by rephrasing text samples and basing the analysis on the new phrasing, results in at least 20-fold greater guarantees on expected privacy.
“Questions about data privacy are frequently met with the answer ‘It’s anonymized! Identifying features have been scrubbed!’ However, studies … show that attackers can deanonymize data by correlating it with ‘side information’ from other data sources,” Tom Diethe, machine learning manager in the Amazon Alexa Shopping organization, wrote in a blog post.
The researchers’ differential privacy solution involved adding noise to make data related to specific people more difficult to trace. (This noise resulted in a loss of accuracy, but the expectation is that as the size of the data set grows, the trade-off between usefulness and privacy will become more manageable.) As Diethe notes, differential privacy provides a statistical assurance that aggregate data won’t leak information about which individuals are in the data set. Given the result of an analysis, the probabilities that either of two data sets, identical except for a single individual’s records, was the basis of the analysis should be virtually identical.
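To make the idea of adding calibrated noise concrete, here is a minimal sketch of the classic Laplace mechanism for a count query, the textbook example of epsilon-differential privacy. The function and variable names are our own, not from the Amazon paper, which applies a different mechanism to text.

```python
import math
import random

def laplace_sample(scale: float) -> float:
    # Inverse-CDF sampling from a zero-mean Laplace distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    # A count query has sensitivity 1: adding or removing one person's
    # record changes the true count by at most 1, so Laplace noise with
    # scale 1/epsilon yields epsilon-differential privacy.
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_sample(1.0 / epsilon)
```

A small epsilon means more noise and stronger privacy; a large epsilon means the reported count stays close to the true one.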
The difference between those probabilities is controlled by a parameter, epsilon, that must be determined in advance. With metric differential privacy, the type of differential privacy applied by the Amazon team, the bound on that difference is epsilon multiplied by the distance between the two data sets according to some metric, such that the data sets become increasingly difficult to distinguish the more similar they are.
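The scaling can be stated in one line: under metric differential privacy, the likelihoods of any output for two inputs differ by at most a factor of exp(epsilon times the distance between them). A tiny sketch (the function name is ours, for illustration only):

```python
import math

def metric_dp_bound(epsilon: float, distance: float) -> float:
    # Under metric differential privacy, the probabilities of any output
    # under inputs x and x' differ by at most a factor of
    # exp(epsilon * d(x, x')): the closer the inputs, the tighter the bound.
    return math.exp(epsilon * distance)
```

At distance zero the bound is exp(0) = 1, meaning the two inputs are perfectly indistinguishable; the guarantee relaxes smoothly as the inputs move farther apart.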
The study’s coauthors analyzed the privacy implications of different choices of epsilon in the context of natural language processing, where word embeddings — mappings from words to points in a vector space of real numbers — often depend on the frequency with which words co-occur. This presents a challenge where differential privacy is concerned, because adding noise in the embedding space usually produces a new point that is not the location of any valid word embedding. Once such a point is generated, a search must be performed to find the nearest valid embedding.
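That two-step procedure, perturb the embedding and then snap to the nearest real word, can be sketched as follows. The tiny vocabulary and the Gaussian noise are illustrative stand-ins; the paper’s mechanism draws noise calibrated to the chosen metric.

```python
import math
import random

# Toy two-dimensional embeddings (values invented for illustration).
EMBEDDINGS = {
    "ibuprofen": (0.9, 0.1),
    "medication": (0.7, 0.3),
    "drug": (0.5, 0.5),
    "guitar": (-0.8, 0.2),
}

def perturb_word(word: str, scale: float) -> str:
    # Step 1: add noise to the word's embedding. (Gaussian noise here is
    # an illustrative stand-in for the paper's metric-calibrated noise.)
    x, y = EMBEDDINGS[word]
    noisy = (x + random.gauss(0.0, scale), y + random.gauss(0.0, scale))
    # Step 2: the noisy point almost never coincides with a real word's
    # embedding, so search for the nearest valid embedding in the vocabulary.
    def distance_to(candidate: str) -> float:
        cx, cy = EMBEDDINGS[candidate]
        return math.hypot(noisy[0] - cx, noisy[1] - cy)
    return min(EMBEDDINGS, key=distance_to)
```

With little noise the word usually maps back to itself; with more noise it is increasingly likely to be replaced by a semantic neighbor, which is the source of the privacy gain.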
The team attempted to predict, for a given epsilon value, the likelihood that a word in a string of words will be overwritten by noise, as well as the number of semantically related words that will fall within a fixed distance of it in the embedding space. The space in question is a hyperbolic space, where the distance between embeddings indicates semantic similarity and where an embedding’s position within the space indicates where it falls in a semantic hierarchy. (For instance, the embeddings of the words “ibuprofen,” “medication,” and “drug” might lie near one another in the space, but their positions indicate which terms are more specific and which more general.)
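A common way to realize such hierarchy-aware geometry is the Poincaré-ball model of hyperbolic space, in which general terms sit near the origin and specific terms near the boundary. A sketch of its distance function follows; the coordinates for the three example words are invented for demonstration, not taken from the paper.

```python
import math

def poincare_distance(u, v):
    # Distance in the Poincare-ball model of hyperbolic space:
    # d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    norm_u = sum(a * a for a in u)
    norm_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * diff_sq / ((1.0 - norm_u) * (1.0 - norm_v)))

# Hypothetical coordinates: the general term near the origin, the most
# specific term near the boundary of the unit ball.
drug = (0.1, 0.0)
medication = (0.5, 0.05)
ibuprofen = (0.85, 0.1)
```

Because distances blow up near the boundary, specific terms end up far from everything else while general terms stay comparatively close to the whole hierarchy, which is what makes substitution toward general terms natural in this geometry.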
According to the researchers, this technique tends to substitute more general terms for more specific ones and thus makes personal data harder to extract. They plan to present their findings at the ACM Web Search and Data Mining (WSDM) Conference next month in Houston, Texas.