Researchers at Babylon Health, the well-funded U.K.-based startup that facilitates telemedical consultations between patients and health experts, claim they’ve developed an AI system capable of matching expert clinicians’ triage decisions in 85% of cases. If it holds up to scrutiny, the system could help relieve the overloaded U.S. health care system, which is anticipated to face a shortfall of between 21,000 and 55,000 primary care doctors by 2023.
Triaging in this context refers to the process of uncovering enough medical evidence to determine the appropriate point of care for a patient. Clinicians plan a sequence of questions in order to make a fast and accurate decision, inferring the causes of a condition and updating their plan with each new piece of information.
The Babylon Health team sought an automated approach built on reinforcement learning, an AI training paradigm that spurs software agents to complete tasks via a system of rewards. They combined this with judgments from medical experts made over a data set of patient presentations that encapsulated roughly 597 elements of observable symptoms or risk factors.
The researchers’ AI agent, a deep Q-network, learned an optimized policy from a data set of 1,374 expert-crafted clinical vignettes. Each vignette was associated with an average of 3.36 expert triage decisions made by separate clinicians, and the validity of each vignette was independently reviewed by two clinicians.
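The article doesn't detail the network itself, but the general idea behind a deep Q-network can be sketched. The snippet below is only an illustration of the standard one-step Q-learning target and epsilon-greedy action selection, with a plain list of numbers standing in for the neural network's output. The discount factor, the sample values, and the function names are assumptions, not details from Babylon's paper.

```python
import random

GAMMA = 0.99  # hypothetical discount factor, not taken from the paper


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """One-step Q-learning target: r if the episode ended, else r + gamma * max Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# Five made-up action values, standing in for a network's output for one patient state.
sample_q = [0.1, 0.5, 0.2, 0.0, 0.3]
greedy_action = epsilon_greedy(sample_q, epsilon=0.0, rng=random.Random(0))
target = td_target(reward=0.0, next_q_values=sample_q, done=False, gamma=0.9)
```

In a real deep Q-network, the values in `sample_q` would come from a neural network, and training would nudge the network's estimate for the chosen action toward `target`.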
At each step, the agent either asks for more information or makes one of four triage decisions. With each new episode, the training environment is configured with a new clinical vignette. The environment then processes the evidence and triage decisions for that vignette and returns a value. If the agent picks a triage action, it receives a final reward.
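The ask-or-triage loop described above can be sketched as a toy environment. This is a minimal illustration, not Babylon's implementation: the `Vignette` fields, the class and function names, the reward (the share of experts who chose the same triage decision), and the handling of an "ask" with no evidence left are all assumptions made for the example.

```python
import random
from dataclasses import dataclass

ASK = 0                        # request one more piece of evidence
TRIAGE_ACTIONS = (1, 2, 3, 4)  # four triage decisions; the labels are placeholders
MAX_QUESTIONS = 23             # question cap reported in the article


@dataclass
class Vignette:
    evidence: list       # hypothetical: findings revealed one per question
    expert_triage: list  # hypothetical: triage decisions from several clinicians


class TriageEnv:
    """Toy environment: one vignette per episode, evidence revealed on request."""

    def __init__(self, vignettes, seed=0):
        self.vignettes = list(vignettes)
        self.rng = random.Random(seed)
        self.vignette = None
        self.revealed = []

    def reset(self):
        # Each new episode is configured with a new clinical vignette.
        self.vignette = self.rng.choice(self.vignettes)
        self.revealed = []
        return tuple(self.revealed)

    def step(self, action):
        if action == ASK:
            limit = min(MAX_QUESTIONS, len(self.vignette.evidence))
            if len(self.revealed) >= limit:
                # Assumption: asking with nothing left ends the episode unrewarded.
                return tuple(self.revealed), 0.0, True
            self.revealed.append(self.vignette.evidence[len(self.revealed)])
            return tuple(self.revealed), 0.0, False
        if action not in TRIAGE_ACTIONS:
            raise ValueError(f"unknown action: {action}")
        # Assumption: final reward is the fraction of experts who agree.
        votes = self.vignette.expert_triage
        return tuple(self.revealed), votes.count(action) / len(votes), True


def run_episode(env, policy):
    """Run one episode and return (total reward, number of steps taken)."""
    state, done, total, steps = env.reset(), False, 0.0, 0
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
        steps += 1
    return total, steps


def demo_policy(state):
    # Hypothetical fixed policy: ask twice, then pick triage decision 2.
    return ASK if len(state) < 2 else 2


env = TriageEnv([Vignette(evidence=["fever", "cough"], expert_triage=[2, 2, 3])])
total_reward, steps = run_episode(env, demo_policy)
```

A learned policy would replace `demo_policy`, deciding from the evidence gathered so far whether another question is worth asking.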
To validate this approach, the researchers evaluated the model on a test set of 126 previously unseen vignettes using three target metrics: appropriateness, safety, and the average number of questions asked (between 0 and 23). During training on 1,248 vignettes, those metrics were evaluated over a sliding window of 20 vignettes, and during testing they were evaluated over the whole test set.
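The two evaluation modes described above, a sliding window during training and a single pass over the whole test set, can be illustrated with a short sketch. It assumes each vignette yields a simple pass/fail outcome for a metric such as appropriateness or safety; the article doesn't say how those outcomes are scored, and the class and function names here are hypothetical.

```python
from collections import deque

TRAIN_SIZE, TEST_SIZE, WINDOW = 1248, 126, 20  # figures reported in the article


class WindowedRate:
    """Running pass rate over the most recent `window` vignettes."""

    def __init__(self, window=WINDOW):
        self.values = deque(maxlen=window)  # older outcomes drop off automatically

    def update(self, outcome):
        self.values.append(1.0 if outcome else 0.0)
        return sum(self.values) / len(self.values)


def overall_rate(outcomes):
    """Pass rate over an entire set, as used for the test vignettes."""
    outcomes = [1.0 if o else 0.0 for o in outcomes]
    return sum(outcomes) / len(outcomes)


# Made-up outcomes: 20 passes followed by 5 failures.
outcomes = [True] * 20 + [False] * 5
tracker = WindowedRate()
for outcome in outcomes:
    latest_window_rate = tracker.update(outcome)

whole_set_rate = overall_rate(outcomes)
```

With these made-up outcomes, the windowed rate reflects only the last 20 vignettes, so it reacts faster to recent failures than the whole-set rate does.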
The team reports that the best-performing model achieved an appropriateness score of 85% and a safety score of 93% while asking an average of 13.34 questions. That’s on par with the human baseline on appropriateness and safety (84% and 93%, respectively), with about 42% fewer questions than the 23 the baseline used.
“By learning when best to stop asking questions, given a patient presentation, the [system] is able to produce an optimized policy [that] reaches the same performance as supervised methods while requiring less evidence. It improves upon clinician policies by combining information from several experts for each of the clinical presentations,” wrote the paper’s coauthors, who point out that the agent isn’t trained to ask specific questions and can be used in conjunction with any question-answering system. “This … approach can produce triage policies tailored to health care settings with specific triage needs.”
It’s worth noting that Babylon Health, which is backed by the U.K.’s National Health Service (NHS), has flirted with controversy. Nearly three years ago, it tried and failed to gain a legal injunction to block publication of a report from the NHS care standards watchdog. In February, it publicly attacked a U.K. doctor who raised around 100 test results he considered concerning. And it recently received a reprimand from U.K. regulators for misleading advertising.
The thoroughness of its studies has also been called into question.
The Royal College of General Practitioners, the British Medical Association, Fraser and Wong, and the Royal College of Physicians issued statements questioning claims made in a 2018 paper published by Babylon researchers that asserted its AI could diagnose common diseases as well as human physicians. “[There is no evidence it] can perform better than doctors in any realistic situation, and there is a possibility that it might perform significantly worse,” wrote the coauthors of a 2018 paper published in the Lancet. “Symptom checkers bring additional challenges because of heterogeneity in their context of use and experience of patients.”
In response to the criticism, Babylon said that “[s]ome media outlets may have misinterpreted what was claimed” but said it “[stood] by [its] original science and results.” It described the 2018 test as a “preliminary piece of work” that pitted the company’s AI against a “small sample of doctors,” and it referred to the study’s conclusion: “Further studies using larger, real-world cohorts will be required to demonstrate the relative performance of these systems to human doctors.”
In this latest paper, Babylon disclosed that the chief investigator and most coinvestigators were paid employees.