DeepMind claims its AI predicts macular degeneration more accurately than experts

In 2018, Google Health and Alphabet's DeepMind released a peer-reviewed paper detailing an AI system that could recommend treatment for more than 50 eye diseases with 94% accuracy. Created in collaboration with Moorfields Eye Hospital NHS Foundation Trust and University College London (UCL) Institute of Ophthalmology, the underlying models ostensibly referred patients at a rate on par with human optometrists.

Now, in a follow-up study published in the journal Nature Medicine, DeepMind claims its system can not only spot one disease -- macular degeneration -- with high accuracy but also predict the disease's progression within a six-month period. It's a lofty assertion in light of a Google-published whitepaper that found an eye disease-predicting system was impractical in the real world. The coauthors of this latest study, who say the system matches or outperforms human experts, note that it could be used to target preventative treatments and even identify novel indicators of macular degeneration.

There's motivation aplenty, particularly considering eye-analyzing AI has been shown to accurately predict conditions like diabetic retinopathy and glaucoma. Macular degeneration is the leading cause of blindness in the developed world; in the U.S. alone, an estimated 148,000 adults each year progress from the early, mild form of the condition to the sight-threatening late form known as exudative, or "wet," age-related macular degeneration (exAMD). Once exAMD develops, sight is lost precipitously and often can't be fully restored, making the point of conversion from the early form to exAMD a critical moment in the management of the disease.

Early diagnoses

DeepMind and Google Health's AI approaches the problem of predicting exAMD conversion from two angles. It first identifies the subtle, early signs of conversion, and then it models the future risk of conversion.

The system predicts the onset of conversion -- in this case, wet macular degeneration, which is characterized by blood vessels that grow under the retina and leak -- based on eye tissue in 3D optical coherence tomography scans, imaging tests that use light waves to take pictures of the retina. An AI model processes the scans and automatically labels potentially significant features, after which another model makes progression predictions from the labeled scans and a third model takes the original scans to make its own predictions. The predictions are then combined to assign a risk factor within a six-month window.
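The two-branch design described above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the article does not say how the two predictions are combined, so the unweighted average used here, the function name `combine_risk`, and the example probabilities are all assumptions.

```python
def combine_risk(p_from_labeled_scan: float, p_from_raw_scan: float) -> float:
    """Combine two six-month conversion probabilities into one risk score.

    Assumption: an unweighted average. The article only says the
    predictions "are combined", not how.
    """
    for p in (p_from_labeled_scan, p_from_raw_scan):
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
    return (p_from_labeled_scan + p_from_raw_scan) / 2.0


# Hypothetical outputs of the two prediction models for a single scan.
risk = combine_risk(0.20, 0.40)
print(f"combined six-month risk: {risk:.2f}")
```

Keeping the two branches separate until the final step means that a feature missed by the labeling model can still influence the result through the raw-scan branch.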

As the researchers explain, the inclusion of the second prediction model -- the one that operates on the raw retinal scans -- was motivated by research suggesting that features not yet captured by the segmentation model, such as reticular pseudodrusen (a type of lesion on the retina) and tissue reflectivity, signal early exAMD conversion. As for the six-month prediction window, it was chosen to enable the model to project at least two three-month follow-up appointment intervals ahead, a "clinically actionable" amount of time.

DeepMind researchers and coauthors tested the system on 2,795 patients (average age 79) from seven Moorfields sites in the U.K., all of whom had been diagnosed with wet macular degeneration in one eye. The bulk of the data set was reserved for training and validating the system, while the remainder was used to evaluate the fully trained system's performance.

The system achieved an area under the receiver operating characteristic curve (AUC) -- a common indicator of model quality for which 0.5 is pure chance and 1.0 is perfect -- of 0.745 on the test set and 0.886 compared with the ground truth of when ill patients received treatment. But in further tests, the researchers adjusted the trade-off between sensitivity (the proportion of actual conversions the system correctly flags; higher is better) and specificity (the proportion of non-converting cases the system correctly clears; higher is also better, though raising one typically lowers the other) to achieve different clinical outcomes, accounting for factors like visit scheduling and treatments.
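For readers unfamiliar with AUC, one way to read it is as the probability that a randomly chosen converting eye receives a higher risk score than a randomly chosen non-converting eye. The sketch below computes it that way by brute force. The function name `auc` and the scores are made up for illustration and are not data from the study.

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one score in each group")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical risk scores, not taken from the paper.
converted = [0.9, 0.7, 0.4]
not_converted = [0.6, 0.3, 0.2, 0.1]
print(f"AUC: {auc(converted, not_converted):.3f}")
```

A model that scores every eye identically lands at 0.5 under this definition, and one that ranks every converting eye above every non-converting eye reaches 1.0.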

For instance, a "conservative" configuration (90% specificity, 34% sensitivity) corresponded to false positives (incorrect predictions of conversion) in only 9.6% of scans, which would likely lead a clinician to treat most patients. The percentage jumps to 43.4% at a "liberal" configuration (55% specificity, 80% sensitivity), which might make the clinician hesitate to administer treatments.
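The conservative and liberal configurations can be thought of as different alert thresholds applied to the same risk scores. The sketch below assumes that framing. The helper `operating_point`, the thresholds, and the toy scores are hypothetical and do not reproduce the study's figures.

```python
def operating_point(scores, labels, threshold):
    """Return (sensitivity, specificity) when alerting on score >= threshold.

    labels: 1 for eyes that converted, 0 for eyes that did not.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        alert = score >= threshold
        if label == 1 and alert:
            tp += 1
        elif label == 1:
            fn += 1
        elif alert:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy data for illustration only.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [0, 0, 1, 1, 1, 0]

for name, threshold in [("conservative", 0.5), ("liberal", 0.3)]:
    sens, spec = operating_point(scores, labels, threshold)
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Raising the threshold produces fewer alerts, so fewer false alarms but more missed conversions; lowering it does the reverse.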

The researchers note that in the case of the 103 patients who experienced conversion in their other eye during the study, given the conservative configuration, the system produced true positives (accurate predictions) in at least one scan 40.8% of the time throughout the preceding six months. In the liberal configuration, it predicted at least one positive 77.7% of the time.

In a separate experiment, the researchers attempted to apply the system to predictions outside of the original six-month window. At the conservative and liberal configurations, 23.6% and 25.8% of all eyes with false-positive predictions, respectively, were merely "early": the eyes went on to convert more than six months after the prediction. For patients with a follow-up of at least 24 months after initial prediction, the share of false-positive alerts within 24 months was 35.2% for the conservative configuration and 32.8% for the liberal configuration.

Human baseline

In an effort to establish a human expert baseline with which to compare the system's performance, the researchers randomly selected a portion of the test set, choosing at least one scan in the six months before the conversion appeared. For each case, they had three retinal specialists and three optometrists make two predictions about whether an eye would convert within six months: one from a single scan and another prediction from scans and accompanying historical scans, retinal snapshots called fundus images, and patient demographic and visual acuity data.

The experts performed better than chance, but their performance varied substantially, ranging in sensitivity from 18% to 56% and in specificity from 61% to 93% for the single-scan task. When given additional information, their specificity improved, reaching 77.4% to 98.6%, while sensitivity ranged from 8.5% to 41.5%. And across all predictions, the experts disagreed between 18% and 52% of the time.

DeepMind says its system outperformed the majority of the experts in a balanced (i.e., not overly conservative or liberal) configuration, achieving higher performance than five and matching one (an optometrist) for the single-scan task. In cases where the experts had access to each patient's previous scans, fundus images, and additional clinical information, DeepMind's model again outperformed five while one person (a retinal specialist) matched its accuracy.

Future work

Despite the promise the results appear to hold for eye disease prediction, the researchers concede there's much work to be done.

While the data set used for training, testing, and validating the system was a clinically representative demographic from Moorfields Eye Hospital, it wasn't fully representative of a global population. Macular degeneration is multifactorial, with location, genetics, race, sex, and lifestyle factors such as smoking and diet known to play a role in risk. Moreover, the system was only tested on one type of scanner, meaning it might adapt poorly to scanners from other device manufacturers. And it doesn't account for differences in treatment regimes and other factors correlating with the number of scans a patient undergoes, nor tissue features that might have been missed during the scan-labeling step.

That said, the coauthors believe their work demonstrates the potential of AI in conversion diagnosis -- especially in light of the fact it's not a routine clinical task. Studies have explored various preventative treatments of exAMD, which involve regularly administered injections. But little evidence suggests that clinicians -- even the experts recruited for this study -- can consistently predict a patient's imminent exAMD conversion.

Of course, were the system to go through clinical trials and receive regulatory approval before making its way into production, it would have to overcome challenges beyond a lack of robustness and generalizability. In the aforementioned whitepaper, the rollout of Google's system was stymied by variation in the eye-screening process, which resulted in low-quality retinal images. Poor internet connectivity hampered things, as did patients' wariness of setting up follow-up appointments.

Moorfields will be a test case for this. Thanks to an earlier agreement with DeepMind, if clinical trials prove successful, the health system will be able to use the AI for free across all 30 of its hospitals and community clinics for an initial period of five years.