Some FDA-approved AI medical devices are not 'adequately' evaluated, Stanford study says

Some AI-powered medical devices approved by the U.S. Food and Drug Administration (FDA) are vulnerable to data shifts and bias against underrepresented patients. That's according to a Stanford study published in Nature Medicine last week, which found that even as AI becomes embedded in more medical devices -- the FDA approved over 65 AI devices last year -- the accuracy of these algorithms isn't necessarily being rigorously studied.

Although the academic community has begun developing guidelines for AI clinical trials, there aren't established practices for evaluating commercial algorithms. In the U.S., the FDA is responsible for approving AI-powered medical devices, and the agency regularly releases information on these devices, including performance data.

The coauthors of the Stanford research created a database of FDA-approved medical AI devices and analyzed how each was tested before it gained approval. Almost all of the AI-powered devices -- 126 out of 130 -- approved by the FDA between January 2015 and December 2020 underwent only retrospective studies at their submission, according to the researchers. And none of the 54 approved high-risk devices were evaluated by prospective studies, meaning test data was collected before the devices were approved rather than concurrent with their deployment.

The coauthors argue that prospective studies are necessary, particularly for AI medical devices, because in-the-field usage can deviate from the intended use. For example, most computer-aided diagnostic devices are designed to be decision-support tools rather than primary diagnostic tools. A prospective study might reveal that clinicians are misusing a device for diagnosis, leading to outcomes that differ from what would be expected.

There's evidence to suggest that these deviations can lead to errors. Tracking by the Pennsylvania Patient Safety Authority in Harrisburg found that from January 2016 to December 2017, electronic health record (EHR) systems were responsible for 775 problems during laboratory testing in the state, with human-computer interactions responsible for 54.7% of events and the remaining 45.3% caused by a computer. Furthermore, a draft U.S. government report issued in 2018 found that it is not uncommon for clinicians to miss alerts -- some AI-informed -- ranging from minor issues about drug interactions to those that pose considerable risks.

The Stanford researchers also found a lack of patient diversity in the tests conducted on FDA-approved devices. Among the 130 devices, 93 didn't undergo a multisite assessment, while 4 were tested at only one site and 8 at only two sites. The reports for 59 devices didn't mention the sample size of the studies. Of the 71 device studies that did report it, the median sample size was 300, and just 17 considered how the algorithm might perform on different patient groups.

Previous studies have shown that much of the data used today to train AI algorithms for diagnosing diseases might perpetuate inequalities, partly because of a reluctance to release code, datasets, and techniques. A team of U.K. scientists found that almost all eye disease datasets come from patients in North America, Europe, and China, meaning eye disease-diagnosing algorithms are less certain to work well for racial groups from underrepresented countries. In another study, researchers from the University of Toronto, the Vector Institute, and MIT showed that widely used chest X-ray datasets encode racial, gender, and socioeconomic bias.

Beyond basic dataset challenges, models lacking sufficient peer review can encounter unforeseen roadblocks when deployed in the real world. Scientists at Harvard found that algorithms trained to recognize and classify CT scans could become biased toward scan formats from certain CT machine manufacturers. Meanwhile, a whitepaper published by Google revealed challenges in implementing an eye disease-predicting system in hospitals in Thailand, including issues with scan accuracy. And studies conducted by companies like Babylon Health, a well-funded telemedicine startup that claims to be able to triage a range of diseases from text messages, have been repeatedly called into question.

The coauthors of the Stanford study argue that information about the number of sites in an evaluation must be "consistently reported" in order for clinicians, researchers, and patients to make informed judgments about the reliability of a given AI medical device. Multisite evaluations are important for understanding algorithmic bias and reliability, they say, and can help in accounting for variations in equipment, technician standards, image storage formats, demographic makeup, and disease prevalence.

"Evaluating the performance of AI devices in multiple clinical sites is important for ensuring that the algorithms perform well across representative populations," the coauthors wrote. "Encouraging prospective studies with comparison to standard of care reduces the risk of harmful overfitting and more accurately captures true clinical outcomes. Postmarket surveillance of AI devices is also needed for understanding and measurement of unintended outcomes and biases that are not detected in prospective, multicenter trial."
