Electronic medical records (EMRs) are a rich resource for data scientists, including those building AI to predict disease incidence, responses to treatment, and other patient outcomes. But EMRs are often distributed across geographic locations, which complicates analyses because the data sets must first be transmitted to the machine (or machines) on which the AI system resides.

Fortunately, researchers at MIT CSAIL, Harvard University Medical School, and Tsinghua University’s Academy of Arts and Design have developed what they believe to be one of the first federated (that is, decentralized) approaches to EMR model training. In a newly published paper (“Patient Clustering Improves Efficiency of Federated Machine Learning to predict mortality and hospital stay time using distributed Electronic Medical Records”) on the preprint server arXiv.org, they describe an architecture that sources data from local hospitals, learns a model for each community, and aggregates the computed results on a server.

They say their technique not only reduces data transmission costs between the hospitals and the model-hosting server but also exposes dissimilarities among communities that otherwise might have escaped notice.

“Generated by individual patients and in diverse hospitals/clinics, EMRs are distributed and sensitive in nature. This may impede adoption of machine learning on EMRs in reality, and has entailed researchers to raise concerns on central storage of EMRs and on security, cost-effectiveness, privacy, and availability of medical data sharing,” the team wrote. “These concerns can be addressed by federated machine learning that keeps both data and computation local in distributed silos and aggregates locally computational results to train a global predictive model.”
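
To make the idea concrete, here is a minimal sketch of federated averaging, one common way to aggregate locally computed results into a global model. It is an illustration only, not the paper's implementation: the logistic-regression model, the synthetic records, and every function name (`local_update`, `federated_average`, `train_federated`, `make_silo`) are assumptions made for this example.

```python
import math
import random

def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme values.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def local_update(weights, records, lr=0.1, epochs=5):
    """Run a few epochs of SGD on one silo's records.

    Only the updated weights are returned; the raw records stay local.
    """
    w = list(weights)
    for _ in range(epochs):
        for features, label in records:
            pred = sigmoid(sum(wi * xi for wi, xi in zip(w, features)))
            for i, xi in enumerate(features):
                w[i] -= lr * (pred - label) * xi
    return w

def federated_average(updates):
    """Average silo weights, weighting each silo by its sample count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]

def train_federated(silos, dim, rounds=20):
    """Server loop: send out global weights, collect and average local updates."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        updates = [(local_update(global_w, records), len(records)) for records in silos]
        global_w = federated_average(updates)
    return global_w

def make_silo(rng, n):
    """Synthetic stand-in for one hospital: label is 1 when x0 > x1."""
    records = []
    for _ in range(n):
        x0, x1 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        records.append(([x0, x1, 1.0], 1 if x0 > x1 else 0))  # last feature is a bias term
    return records

rng = random.Random(0)
silos = [make_silo(rng, n) for n in (40, 60, 80)]  # three hypothetical hospitals
weights = train_federated(silos, dim=3)
print(weights)
```

In this sketch the server only ever sees weight vectors and sample counts, which is the property the researchers highlight: data and computation stay in the silos, and only computed results travel.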

To validate their approach, the researchers considered the critical care data of 200,859 patients admitted to 208 hospitals from across the U.S., with a focus on three dimensions: drugs administered to patients during the first 48 hours, discharge status (indicating patients’ condition after leaving the intensive care unit), and hospital stay time. Post-extraction, they were left with a data set of 126,490 patients from 58 hospitals, which they narrowed by selecting 50 hospitals with a patient count of over 600 and randomly sampling 560 patients from each. This yielded a final corpus of 28,000 samples.
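
Fifty hospitals at 560 patients each works out to 28,000 records. The selection step can be sketched as follows, using synthetic hospital rosters; the function name `sample_cohort`, the ID format, and the roster sizes are hypothetical, and the sketch assumes that exactly the hospitals above the 600-patient threshold are kept.

```python
import random

def sample_cohort(patients_by_hospital, min_patients=600, per_hospital=560, seed=0):
    """Keep hospitals with more than `min_patients` patients and draw a
    fixed-size random sample (without replacement) from each."""
    rng = random.Random(seed)
    cohort = {}
    for hospital in sorted(patients_by_hospital):
        patients = patients_by_hospital[hospital]
        if len(patients) > min_patients:
            cohort[hospital] = rng.sample(patients, per_hospital)
    return cohort

# Synthetic stand-in: 58 hospitals, 50 of which exceed the 600-patient threshold.
hospitals = {}
for i in range(58):
    size = 601 + 10 * i if i < 50 else 300
    hospitals[f"H{i:02d}"] = [f"H{i:02d}-P{j}" for j in range(size)]

cohort = sample_cohort(hospitals)
total = sum(len(patients) for patients in cohort.values())
print(len(cohort), total)  # prints "50 28000"
```

Sampling the same number of patients from every hospital keeps any single large site from dominating the training data.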

The scientists then grouped the 28,000 patients into five communities, based on shared features. (For instance, one community focused on neurologic and endocrine diseases, while another captured pulmonary, cardiovascular, and gastrointestinal diseases.) The team clustered these at the hospital level to reveal potential sources of bias. Some communities were larger than others, and with respect to geographic distribution, one captured mostly Southern hospitals, while another comprised hospitals in Western states.
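
The article does not say which clustering algorithm the team used, so the sketch below substitutes plain k-means as a stand-in and runs it on random two-dimensional points rather than real drug features. It groups patients into five communities and then tallies, per hospital, how many patients fall into each one, which is the kind of hospital-level view that can surface geographic or size imbalances. All names here (`kmeans`, `community_mix`, the `H0` to `H3` hospital labels) are assumptions for illustration.

```python
import random
from collections import Counter

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: alternate nearest-center assignment and center updates.

    Returns the labels from the last assignment step and the final centers.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave an empty cluster's center where it was
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def community_mix(hospital_ids, labels):
    """Count how many of each hospital's patients land in each community."""
    mix = {}
    for hospital, label in zip(hospital_ids, labels):
        mix.setdefault(hospital, Counter())[label] += 1
    return mix

# Synthetic stand-in: 200 patients with two made-up features, spread over 4 hospitals.
rng = random.Random(1)
points = [[rng.random(), rng.random()] for _ in range(200)]
hospital_ids = [f"H{i % 4}" for i in range(200)]

labels, centers = kmeans(points, k=5)
mix = community_mix(hospital_ids, labels)
for hospital in sorted(mix):
    print(hospital, dict(mix[hospital]))
```

A hospital whose counts are concentrated in one community would be a signal worth checking before trusting a model trained across all sites.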

With the preprocessed data in hand, the paper’s authors set about predicting two things: mortality and stay time. In experiments involving both the same and different hospitals in the training and test data sets, their algorithm achieved accuracy close to that of centralized learning, they say, and furthermore outperformed prior art with respect to every prediction task.

They note the limitations of their model, chiefly the small number of features it considers (it ignores patient characteristics like age, weight, and height) and the shortcomings of its clustering methods. Still, they believe it’s an encouraging step toward a scalable, robust EMR analysis framework with few of the shortcomings of today’s most popular techniques.

“[Our work] could be extended to other biomedical informatics applications, such as medical image recognition or decision-making on medical planning across multiple health care silos with large, distributed, and privacy-sensitive data,” they wrote.