'Anonymized' X-ray datasets can reveal patient identities

Chest X-rays are used around the world to screen for diseases from pneumonia to COPD. But while they play a critical role in clinical care, spotting certain abnormalities in X-rays can be challenging for radiologists. That's given rise to AI-powered disease classification systems that analyze X-rays, some of which have demonstrated promising performance. However, these systems need large amounts of patient data to learn to make diagnoses, which can have frightening privacy implications if the data isn't properly anonymized.

A study coauthored by researchers at the University of Erlangen-Nuremberg in Erlangen, Germany sought to determine the extent to which patient data could be compromised by an X-ray classification system. Drawing on a public dataset of over 112,000 chest X-rays, they developed a technique -- a deep learning-based reidentification model -- that can identify whether two X-ray images are from the same person with 95.55% accuracy, suggesting that at least some datasets are vulnerable to attack.

As the researchers note, publicly available datasets that are supposedly anonymized might contain sensitive patient-related information, including diagnoses, treatment histories, and clinical institutions. If an X-ray of a known person is accessible to a malicious attacker and a properly working reidentification model exists, then the model could be used to compare the given X-ray to each individual image in an X-ray dataset. In this way, a person could be linked to the sensitive data contained in the dataset.
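
To make the comparison step concrete, here is a minimal sketch in Python of how such a linkage could work in principle. It is an illustration under stated assumptions, not the researchers' method: the `Record` type, the toy feature vectors, the 0.95 threshold, and the `same_patient_score` function (a plain cosine similarity standing in for a trained reidentification model) are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    """One entry of a hypothetical public dataset (assumption, not real data)."""
    features: list  # toy stand-in for an X-ray image
    metadata: dict  # e.g. diagnosis, clinical institution


def same_patient_score(a, b):
    """Placeholder for a trained reidentification model.

    Cosine similarity is used only so the sketch runs; it is NOT the
    deep learning model described in the study.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def link_records(query, dataset, threshold=0.95):
    """Compare the query with every record; return metadata of likely matches."""
    return [
        record.metadata
        for record in dataset
        if same_patient_score(query, record.features) >= threshold
    ]


# Toy dataset with invented values, for illustration only.
dataset = [
    Record([0.9, 0.1, 0.3], {"diagnosis": "pneumonia", "site": "hospital A"}),
    Record([0.1, 0.8, 0.5], {"diagnosis": "no finding", "site": "hospital B"}),
]

# Hypothetical X-ray of a known person that the attacker already holds.
known_xray = [0.88, 0.12, 0.31]
matches = link_records(known_xray, dataset)
print(matches)
```

In this toy run only the first record clears the threshold, so its metadata becomes attached to the known person. The point of the sketch is that the attack needs nothing more than one reference image and read access to the dataset.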

The coauthors say their technique is robust to "non-rigid" transformations that might appear between two images of the same person in a public dataset, such as deformations in the shape of the lungs. They hypothesize that noisy image patterns characteristic of individual patients appear in the datasets, making people easier to identify. But even datasets that show little correlation between noise patterns and identities can be compromising, according to the coauthors.

"Reidentification is applicable for data that was acquired in various hospitals around the world where other preprocessing steps may be taken before data publication compared to the ChestX-ray14 dataset," the researchers wrote in a paper describing their work. "We conclude that publicly available medical chest X-ray data is not entirely anonymous. Using a deep learning-based reidentification network enables an attacker to compare a given radiograph with public datasets and to associate accessible metadata with the image of interest. Thus, sensitive patient data is exposed to a high risk of falling into the unauthorized hands of an attacker who may disseminate the gained information against the will of the concerned patient."

Data leakage of this kind would require an attacker to gain access to an image of a known person. However, even if an attacker has only a fraction of an image of an unknown patient, the researchers say their technique could be used to find the same patient across various datasets. Assuming multiple datasets contain the same patient but different metadata, an attacker might be able to obtain a complete picture of the patient.

Given the increased frequency of medical records breaches, this isn't an unrealistic scenario. In 2017, 27% of exploits were related to health care data. And in the first half of 2019 alone, more than 31 million patient records were breached -- more than twice the 15 million records breached in all of 2018.

"We hypothesize that collecting patient information by this means could significantly help an attacker infer the true identity of the patient," the researchers wrote. "We therefore urge that conventional anonymization techniques be reconsidered and that more secure methods be developed to resist the potential attacks by deep learning-based algorithms."

Solutions to these challenges in health care data will necessarily entail a combination of techniques, approaches, and paradigms. Securing data requires data-loss prevention, policy and identity management, and encryption technologies, including those that allow organizations to track actions that affect their data. On the privacy front, experts agree that transparency is the best policy -- deidentification capabilities that remove or obfuscate personal information are table stakes for health systems, as are privacy-preserving methods like differential privacy, federated learning, and homomorphic encryption.

"I think [federated learning] is really exciting research, especially in the space of patient privacy and an individual's personally identifiable information," Andre Esteva, head of medical AI at Salesforce Research, told VentureBeat in a previous interview. "Federated learning has a lot of untapped potential ... [it's] yet another layer of protection by preventing the physical removal of data from [hospitals] and doing something to provide access to AI that's inaccessible today for a lot of reasons."
