Microsoft researchers tap AI for anonymous data sharing for health care providers

The use of images to build diagnostic models of diseases has become an active research topic in the AI community. But capturing the patterns that link a condition to an image requires exposing a model to a rich variety of medical cases. It's well-known that images from a single source can be biased by demographics, equipment, and means of acquisition, which means a model trained on such images may perform poorly for other populations.

In search of a solution, researchers at Microsoft and the University of British Columbia developed a framework called Federated Learning with a Centralized Adversary (FELICIA). It extends a family of models called generative adversarial networks (GANs) to a federated learning environment using a "centralized adversary." The team says FELICIA could enable stakeholders like medical centers to collaborate with each other and improve models in a privacy-preserving, distributed data-sharing way.

GANs are two-part AI models consisting of a generator that creates samples and a discriminator that attempts to differentiate between the generated samples and real-world samples. As for federated learning, it entails training algorithms across decentralized devices holding data samples without exchanging those samples. Local algorithms are trained on local data samples and the weights, or learnable parameters of the algorithms, are exchanged between the algorithms at some frequency to generate a global model.
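
The weight exchange described above can be illustrated with a minimal sketch of federated averaging, one common way to combine locally trained parameters into a global model. This is not taken from the FELICIA paper, and the article does not say which aggregation rule the researchers used; the function name `federated_average`, the hospital variables, and the toy weight values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# A model's weights, keyed by parameter name (hypothetical layout for illustration).
Weights = Dict[str, List[float]]


def federated_average(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Combine per-client weights into global weights.

    Each client's contribution is weighted by its number of local samples.
    Only weights are exchanged; the raw data never leaves the client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Toy example: two sites with different amounts of local data (assumed values).
hospital_a = {"layer1": [0.0, 2.0], "bias": [1.0]}
hospital_b = {"layer1": [4.0, 6.0], "bias": [3.0]}
global_weights = federated_average([hospital_a, hospital_b], [100, 300])
print(global_weights)
```

In this sketch the site with more samples pulls the global weights toward its own values, which is one reason a federated model can still inherit bias when one participant dominates the data.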

With FELICIA, the researchers propose duplicating the discriminator and generator architectures of a "base GAN" to other component generator-discriminator pairs. A privacy discriminator is selected to be nearly identical in design to the other discriminators, and most of the optimization effort is dedicated to training the base GAN on the whole training data to generate realistic -- but synthetic -- medical image scans.

In experiments, the researchers simulated two hospitals with different populations, assuming a "very restrictive" regulation that prevents sharing images as well as models that have had access to images. The team used a dataset of handwritten digits (MNIST) to see whether FELICIA could help generate high-quality synthetic data even when both data owners have biased coverage. They also used a more complex dataset (CIFAR10) to show how utility could be significantly improved when a certain type of image was underrepresented in the data. And they tested FELICIA in a federated learning setting with medical imagery using a popular skin lesion image dataset.

According to the researchers, the results of the experiments show that FELICIA has potentially wide application in health care research settings. For example, it could be used to augment an image dataset to improve diagnostics, like the classification of cancer pathology images. "The data from one research center is often biased toward the dominating population of the available data for training. FELICIA could help mitigate bias by allowing sites from all over the world to create a synthetic dataset based on a more general population," the researchers wrote in a paper describing their work.

In the future, the researchers plan to implement FELICIA with a GAN that can generate "highly complex" medical images, such as CT scans, X-rays, and histopathology slides, in real-world federated learning settings with "non-local" data owners.
