Mihaela van der Schaar, a Turing Fellow and professor of ML, AI, and health at the University of Cambridge and UCLA, believes that when it comes to applying AI to health care, we need a new way to think about the “complex class of problems” it presents. “We need new problem formulations. And there are many ways to conceive of a problem in medicine, and there are many ways to formalize the conception,” said van der Schaar during her keynote at the all-digital ICLR 2020 conference last week.

She drew a distinction between the problems AI is generally good at solving and those the field of medicine poses, especially when there are resource constraints. “Machine learning has accomplished wonders on well-posed problems, where the notion of a solution is well-defined and solutions are verifiable,” she said. For example, ML can determine if an image is of a cat or a dog, and a human can easily check the answer. It’s also adept at games like Go, where there’s a clear winner at the end.

“But medicine is different,” she said. “Problems are not well-posed, and the notion of a solution is often not well-defined. And solutions are hard to verify.” She pointed to the example of two seemingly similar cancer patients whose outcomes are different, and different over time. “It is difficult to understand, as well as to verify,” she said.

But van der Schaar sees enormous opportunities for AI to positively affect health care. In her keynote, and throughout the AI for Affordable Healthcare workshop that ran during the conference, the same challenges emerged as themes: data issues, resource issues, and the need for AI model explainability and human expertise. Researchers discussed both broad and specific ways they’re tackling them.

Challenges in health care

Medical data is notoriously problematic. In addition to the standard concerns over bias, there are inconsistencies in how medical data is collected, labeled, annotated, and handled from hospital to hospital and country to country. A great deal of medical data comes in the form of images, like X-rays and CT scans, and the quality of those images may vary widely depending on the machines available. Data is also often simply missing from health records.

That ties into the problem of scarce or unavailable resources. People who live near well-funded, top-tier hospitals may often have a homogeneous (and false) view of the resources and tools other hospitals have at their disposal. For example, high-end CT scanners can produce clearer, crisper images than older or lower-end machines. The expertise of medical staff at a well-resourced hospital versus a less-resourced one can vary just as widely, so interpreting results from tests like medical scans depends on both the quality of the image and who’s looking at it.

Test results, and recommended follow-up, are not as cleanly objective as we’d like to think. A tiny white spot on a mammogram could be a microcalcification — or just an artifact of a noisy image. The task of interpreting that image requires a skilled clinician. Ideally, AI can help, especially when the available clinician doesn’t possess as much expertise. Either way, the interpretation of that scan is one decision that leads the patient down a road where they and their health care providers will have more decisions to make, including for treatment and potential lifestyle changes.

For any clinician, the output of an AI model needs to be accurate, of course, but also explainable, interpretable, and trustworthy. In practice, there are often tradeoffs between explainability and accuracy, and it’s critically important to get that balance right.

Google recently discovered the ramifications of AI that works in the lab but struggles in the face of these real-life challenges.

Solutions begin by thinking “meta”

Van der Schaar’s approach to solving these difficult problems using ML involves many specific techniques and models, but fundamentally, much of it is about not getting too bogged down in one problem related to one disease. Creating a model for each disease is too inefficient, she said in her talk, so she advocates making a set of automated ML (AutoML) methods that can do so at scale. “We are going to make machine learning do the crafting,” she said.

After rattling off several AutoML models that have been used in the past, and explaining the shortcomings of each, van der Schaar suggested AutoPrognosis, a tool she and coauthor Ahmed M. Alaa detailed in a paper from 2018. AutoPrognosis, as they describe it, is “a system for automating the design of predictive modeling pipelines tailored for clinical prognosis.” The idea is that instead of trying to find a single best predictive modeling pipeline, clinicians should use “ensembles” of pipelines.
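The ensemble idea can be illustrated with a toy sketch. The two pipelines, the patient fields, and the weights below are hypothetical, and the weighted average is only one simple way to combine pipelines; it is not the method described in the AutoPrognosis paper.

```python
from typing import Callable, Dict, List

Patient = Dict[str, float]
Pipeline = Callable[[Patient], float]  # returns a risk estimate in [0, 1]


def pipeline_age(p: Patient) -> float:
    # Hypothetical pipeline: risk rises with age, capped at 1.0.
    return min(p["age"] / 100.0, 1.0)


def pipeline_bp(p: Patient) -> float:
    # Hypothetical pipeline: risk rises with systolic pressure above 100.
    return min(max((p["systolic_bp"] - 100.0) / 100.0, 0.0), 1.0)


def ensemble_risk(pipelines: List[Pipeline], weights: List[float], p: Patient) -> float:
    """Weighted average of the pipelines' risk estimates."""
    total = sum(weights)
    return sum(w * f(p) for f, w in zip(pipelines, weights)) / total


# Assumed weights, e.g. derived from each pipeline's validation performance.
patient = {"age": 60.0, "systolic_bp": 150.0}
risk = ensemble_risk([pipeline_age, pipeline_bp], [1.0, 3.0], patient)
```

The point of the sketch is that no single pipeline has to be declared the winner: each contributes in proportion to how much it is trusted.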

It’s a complex and layered approach, but even this, van der Schaar noted, is not enough — it provides prediction but lacks interpretability. “We do not need to have only predictions,” she said. “We need interpretations associated with it.” And before ML models can turn into actionable intelligence, there’s a great deal clinicians need to understand about these models.

Unpacking the black box

Van der Schaar listed three key things clinicians need before they can take action on, with, or from an ML model: transparency, risk understanding, and avoidance of implicit bias.

She drew a distinction between interpretability and explainability. “Explainability is tailored interpretability because different users seek different forms of understanding,” she said. For example, a clinician wants to know why a certain treatment was given, whereas a researcher wants a hypothesis to take to the lab, and patients want to know if they should make a lifestyle change.

Dr. Chris Paton, a medical doctor from the University of Oxford, said during his portion of the AI for Affordable Healthcare workshop that it’s important for ML makers seeking explainability to understand how a clinician thinks. “When clinicians make decisions, they normally have a mental model in their head about how the different components of that decision are coming together. And that makes them able to be confident in their diagnosis,” he said. “So if they don’t know what those parameters coming together are — [if] they’re just seeing a kind of report on a screen — they’ll have very little understanding [with] a particular patient how confident they should be in that [diagnosis].”

But he also noted that the need for explainability decreases as the risk to the patient decreases. For example, checking a person’s heart rate while they jog carries inherently less risk than trying to diagnose a serious illness, where the ramifications of inaccuracy or misunderstanding can be grave.

Van der Schaar believes it’s possible to make black box models more explainable with a technique called symbolic metamodels. A metamodel, of course, is a model of a model. The idea is that you don’t need access to the internals of a black box model; you just need to be able to query it with inputs and observe its outputs. “A symbolic metamodel outputs a transparent function that describes the prediction of the black box model,” she said. Ostensibly, that obviates the need to see inside the black box model, which preserves intellectual property, while also granting some explainability.
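The query-only idea can be shown with a deliberately simple stand-in. The sketch below fits a straight line to a black box's answers using ordinary least squares, which is a swapped-in technique chosen for brevity and not the symbolic metamodeling method from the talk; `black_box` and `fit_line` are hypothetical names.

```python
from typing import Callable, List, Tuple


def black_box(x: float) -> float:
    # Stand-in for an opaque model; the fitting code never reads this body.
    return 0.3 * x + 0.1


def fit_line(query: Callable[[float], float], xs: List[float]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept to the black box's answers by least squares."""
    ys = [query(x) for x in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Query the black box on a grid of inputs and recover a readable formula.
slope, intercept = fit_line(black_box, [i / 10 for i in range(11)])
```

The fitted slope and intercept form a transparent function a person can read, obtained without ever opening the black box.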

Overcoming low or noisy data

Several of the presentations during the AI for Affordable Healthcare workshop highlighted ways to overcome data limitations, including incomplete or inconsistent data and noisy data — particularly as it pertains to imaging.

Edward Choi, assistant professor at KAIST (Korea Advanced Institute of Science and Technology), gave a talk called “Knowledge Graph and Representation Learning for Electronic Health Records” (EHR). The goal is to combine the health expertise of a clinician (i.e., “domain knowledge”) with neural networks to model EHR where there’s a low amount of data. If a disease is rare, for example, there might not be much data on it at all.

Other ailments, like heart failure, present a different sort of data problem. “What makes [heart failure] very difficult is … the cues are very subtle in the intermediate stages,” he said. He explained that by the time the symptoms of heart failure are obvious and it’s easy to diagnose, it’s usually too late for the patient: there’s already a high morbidity rate. Early intervention is necessary. “The key is to predict it as soon as possible,” he said.

A way to do that is to track a patient’s EHR longitudinally — over the course of, say, a year — to find patterns and clues to an impending heart failure. In his research, he used recurrent neural networks (RNNs) to convert the sparse representation — the longitudinal data from a patient’s hospital visits over time, represented by various codes in the EHR — into a compact representation. At the end of it, the output is a 1 or a 0, with a 1 indicating that the patient could experience heart failure.
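A minimal sketch of that flow, assuming each visit is encoded as a multi-hot vector over medical codes and using a plain Elman-style recurrent cell. The weights here are hand-picked and untrained, and the function and parameter names are hypothetical; the actual architecture in Choi's research may differ.

```python
import math
from typing import List


def rnn_predict(visits: List[List[float]], w_in: List[List[float]],
                w_rec: List[List[float]], w_out: List[float]) -> int:
    """Run a plain recurrent cell over a visit sequence and return 1 or 0."""
    hidden = [0.0] * len(w_out)
    for visit in visits:
        # Each step folds the sparse visit vector into the compact hidden state.
        hidden = [
            math.tanh(
                sum(w_in[j][i] * visit[i] for i in range(len(visit)))
                + sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
            )
            for j in range(len(hidden))
        ]
    logit = sum(w * h for w, h in zip(w_out, hidden))
    prob = 1.0 / (1.0 + math.exp(-logit))
    return 1 if prob >= 0.5 else 0


# Illustrative weights for 3 medical codes and 2 hidden units.
W_IN = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]
W_REC = [[0.5, 0.0], [0.0, 0.5]]
W_OUT = [2.0, -1.0]

at_risk = rnn_predict([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]], W_IN, W_REC, W_OUT)
not_at_risk = rnn_predict([[0.0, 1.0, 0.0]], W_IN, W_REC, W_OUT)
```

In a real system the weights would be learned from labeled patient histories rather than set by hand.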

The next step in his research was a model called GRAM (graph-based attention model) for health care representation learning, which he and his co-researchers created to improve upon the aforementioned RNN technique. The two big challenges they tackled were data insufficiency and interpretation. “In order to properly use RNNs, or any large-scale neural networks … you need a lot of data to begin with,” he said.

The solution they came up with was to rely on established medical ontologies, which he described as “hierarchical clinical constructs and relationships among medical concepts.” They’re taxonomies of disease classifications, structured like a tree, with branches of items related to that disease. At the bottom of the chart is a “leaf” — a five-digit code for that disease. (The codes were designed to help with billing processes.)

These ontologies are developed by domain experts, so they’re reliable in that sense. And by looking at closely related concepts within those ontologies, experts can infer similar knowledge between them. For example, if a rare disease has an ontology that’s similar to a more common disease, they can apply the knowledge from one to the other.
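One way an attention mechanism over an ontology could let a rare code borrow from its ancestors is sketched below. The toy ontology, the embeddings, and the dot-product scoring are all assumptions made for illustration, not GRAM's published formulation.

```python
import math
from typing import Dict, List, Optional

# Hypothetical toy ontology: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "rare_dx": "cardiac", "common_dx": "cardiac", "cardiac": "root", "root": None,
}

# Hypothetical basic embeddings, one vector per concept.
EMBED: Dict[str, List[float]] = {
    "rare_dx": [0.0, 0.0],  # little data, so nearly uninformative
    "common_dx": [1.0, 0.5],
    "cardiac": [1.0, 1.0],
    "root": [0.0, 1.0],
}


def lineage(code: str) -> List[str]:
    """Return the code followed by its ancestors up to the root."""
    chain: List[str] = []
    node: Optional[str] = code
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def attention_weights(code: str) -> List[float]:
    """Softmax over a simplified dot-product score between leaf and each concept."""
    leaf = EMBED[code]
    scores = [sum(a * b for a, b in zip(leaf, EMBED[c])) for c in lineage(code)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def final_embedding(code: str) -> List[float]:
    """Attention-weighted mix of the code's own embedding and its ancestors'."""
    chain = lineage(code)
    weights = attention_weights(code)
    dim = len(EMBED[code])
    return [sum(w * EMBED[c][d] for w, c in zip(weights, chain)) for d in range(dim)]


rare_vector = final_embedding("rare_dx")
```

Because the rare code's own embedding carries no signal in this toy setup, its final representation is drawn from the broader concepts above it in the tree.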

Another extension of Choi’s research is Graph Convolutional Transformer (GCT), which he developed while at Google Health. Unlike GRAM, it addresses what to do when EHR data has no structure. “When we start the training to do some kind of supervised prediction test, like heart failure prediction, we assume that everything in the [doctor or hospital] visit is connected to each other,” he said. He explained that everything in the EHR creates a sort of graph comprising interconnected nodes. If you have a fully fleshed-out graph, that’s a “ground truth” starting point. With patient visits where some of the nodes are missing data, the model is designed to infer that missing data to predict outcomes.

Medical codes simply aren’t always included in data, whether by oversight or because the data comes from a country that doesn’t use them. That’s part of the challenge that Constanza Fierro, a Google software engineer, tackled with coauthors in a paper titled “Predicting Unplanned Readmissions with Highly Unstructured Data.” They looked at “unplanned readmissions,” which is when a patient is unexpectedly rehospitalized less than 30 days after discharge. These readmissions are important to examine because they’re expensive and may indicate poor health care quality.
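The readmission definition above translates into a simple labeling rule. The sketch below assumes discharge and admission dates plus a planned/unplanned flag are available; the function name and parameters are hypothetical and not taken from the paper.

```python
from datetime import date


def is_unplanned_readmission(discharge: date, next_admission: date,
                             planned: bool, window_days: int = 30) -> bool:
    """Label a return visit as an unplanned readmission within the window."""
    if planned:
        return False
    gap = (next_admission - discharge).days
    # "Less than 30 days after discharge": day 30 itself falls outside the window.
    return 0 <= gap < window_days


within_window = is_unplanned_readmission(date(2020, 1, 1), date(2020, 1, 20), planned=False)
outside_window = is_unplanned_readmission(date(2020, 1, 1), date(2020, 1, 31), planned=False)
```

Labels like these would serve as the prediction target for a readmission model.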

“This is why clinics and the government would like to have a way to predict which patients are highly probable to be readmitted, so they can focus on these patients — giving them follow-up calls or follow-up visits” and prevent readmissions, said Fierro in a presentation during the workshop. She said that there’s been a lot of work done on this task, and deep learning has proven useful — but all of the research has been done on English-language data sets and in developed countries, largely following standard codes and stored data. “The problem with this is that in developing countries, such as Chile, codes for treatment are not stored,” she said, adding that codes for diagnosis are only sometimes stored, often depending on the doctor involved.

Fierro and her colleagues propose a deep learning architecture that can achieve state-of-the-art results on Spanish-language medical data that is highly unstructured or noisy. In the paper abstract, the authors claim that “our model achieves results that are comparable to some of the most recent results obtained in U.S. medical centers for the same task.”

“I hope this work motivates other people to test the latest deep learning techniques in developing countries so we can understand what are the challenges, and we can try different ways to overcome them,” said Fierro at the conclusion of her workshop talk.

Imaging’s manifold challenges

GCT is a knowledge graph that works by stripping away bits and pieces from a full graph — to “sparsify,” Choi said. A somewhat similar approach called IROF, or “iterative removal of features,” is designed to strip images down to only the parts an AI model needs in order to be accurate. Like Choi’s work, the proposed advantage of IROF is twofold: It’s helpful for accuracy when data is poor (the missing EHR data in Choi’s work, or in this case, a blurry image), and it also helps clinicians with explainability.

“In most cases, the human [clinician] will make the final diagnosis, and the machine learning model will only provide a classification to aid the human in the diagnosis,” said Laura Rieger, a computer science Ph.D. student from the Technical University of Denmark, in her presentation during the AI for Affordable Healthcare workshop. She explained that although there are many existing evaluation methods, models are often dependent upon certain data types and data sets, and visual checks can be misleading. She said there needs to be an objective measure to tell a researcher which explanation method is right for their task. IROF, she said, provides that objective measure with low computational cost, and using little data.

The process starts with an existing data set and an image for input. The researcher sends the input (an image) through the neural network and gets a simple, correct output. (In her example, Rieger used a picture of a monkey, which was easy for the neural network to identify.) They check the explanation method, which outputs a black and white image in which the lighter parts of the image are more important for the classification accuracy than the dark parts. Then, using computer vision, they can algorithmically segment the images by pixels and color space. “I can transfer this segmentation to my explanation and can see that I have more and less important segments, or superpixels, for my classification,” she said.

They can drill down to the superpixels that are the most important for classification, and replace them with “mean values.” When they rerun the test, the probability of a correct classification drops. They can then identify the second-most important part of the image, apply the mean value again, and rerun it again. Eventually, with those results in a chart, you get a curve of the degradation.

“If my explanation method is reliable, it will identify important parts for the classification as important; they will be removed first, and the curve will go down fast. If it’s bad, it will remove unimportant parts of the image first,” she said.

After many such tests, with many subsequent (and similar) curves added to the chart, the area over the curve offers a single quantitative value for the quality of the explanation method — that’s the IROF score. The method works on medical images, too, Rieger said.
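The removal loop and the resulting score can be sketched for a single image as follows. The toy classifier, the four-segment image, and the fill value of 0.0 standing in for the "mean value" are assumptions for illustration; the scoring details in Rieger's work may differ.

```python
from typing import Callable, List

# Hypothetical classifier: each segment contributes a fixed share of the probability.
WEIGHTS = [0.5, 0.25, 0.125, 0.125]


def toy_model(segments: List[float]) -> float:
    return sum(w * s for w, s in zip(WEIGHTS, segments))


def irof_score(model: Callable[[List[float]], float], segments: List[float],
               ranking: List[int], fill_value: float = 0.0) -> float:
    """Area over the degradation curve when segments are removed in ranked order."""
    current = list(segments)
    baseline = model(current)
    drops = []
    for index in ranking:
        current[index] = fill_value  # replace the segment with the assumed mean value
        drops.append(1.0 - model(current) / baseline)
    return sum(drops) / len(drops)


image = [1.0, 1.0, 1.0, 1.0]
# A good explanation ranks the most influential segment first; a bad one ranks it last.
good = irof_score(toy_model, image, ranking=[0, 1, 2, 3])
bad = irof_score(toy_model, image, ranking=[3, 2, 1, 0])
```

The better ranking produces the larger score because the classification probability collapses sooner, which is the behavior Rieger describes for a reliable explanation method.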

Sarah Hooper, an electrical engineering Ph.D. student at Stanford University, presented work that’s designed to help clinicians triage CT scans even when the image quality is poor. Hooper said CT scans are widely used in health care systems, especially for head scans that can show life-threatening abnormalities, like strokes or fractures.

“CT triage, which automatically identifies abnormal from normal images, could help ensure that patients with abnormal images are seen quickly,” she said. “This type of triage tool could be particularly valuable in health care systems with fewer trained radiologists.”

Hooper and her coauthors wanted to create an automated head CT triage tool that simply labels an image as “normal” or “abnormal” — a challenge magnified by images that are noisy or have artifacts. Hooper explained that the quality of CT image scans differs significantly in many parts of the world, and most of the work that’s been done on automatic image classification like this so far has used “relatively homogenous and high-quality” images. Her work focuses on using a convolutional neural network (CNN) to perform image classification on a range of lower-quality images that are often the result of lower-quality CT imaging systems.

The team started with a data set of 10,000 radiologist-labeled images and simulated noisy images from those real ones to test their CNN. They used CatSim to create the synthetic data set, which included the types of noisy images clinicians are likely to see in the wild, like limited-angle scans, and trained a model on them.

Figure: Simulating noisy images from real data.
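The general idea of degrading clean scans can be sketched without a CT simulator. The example below is a rough stand-in that does not use CatSim: it adds Gaussian noise whose spread grows as the tube current drops, on the assumption that noise scales with the inverse square root of the current. The function names and the `base_sigma` parameter are hypothetical.

```python
import random
from typing import List

Image = List[List[float]]


def simulate_low_dose(image: Image, current_fraction: float,
                      base_sigma: float = 0.01, seed: int = 0) -> Image:
    """Add Gaussian noise that grows as the tube current fraction shrinks."""
    if not 0.0 < current_fraction <= 1.0:
        raise ValueError("current_fraction must be in (0, 1]")
    rng = random.Random(seed)
    # Assumption: noise standard deviation scales with 1 / sqrt(current).
    sigma = base_sigma / current_fraction ** 0.5
    return [[px + rng.gauss(0.0, sigma) for px in row] for row in image]


def mean_abs_error(a: Image, b: Image) -> float:
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


clean = [[0.5] * 32 for _ in range(32)]
full_dose = simulate_low_dose(clean, current_fraction=1.0)
tenth_dose = simulate_low_dose(clean, current_fraction=0.1)
```

Pairs of clean and degraded images like these keep the original radiologist labels, so a classifier can be evaluated on noisy inputs without new annotation.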

On two of the three types of degraded images (tube current and projection), they found that their triage model worked well; after retraining to focus on the third type (limited-angle scans), their model performed admirably on those images, too. “This may seem a bit surprising, as the limited-angle scans are difficult for the human eye to interpret,” she said. But the information needed for triage is still there in the images, and the CNN can find it.

Other imaging work presented during the workshop looks at automating 3D ultrasound scans for carotid plaque, diagnosing malaria from microscopy images, using AI to enhance stereo camera information in computer-assisted surgeries, using image quality transfer to artificially enhance MRI images, improving image classification for breast cancer screenings, and more.

Revolution and optimism

The challenges facing AI researchers and clinicians are not particularly unique to health care, broadly speaking — data quality and explainability are omnipresent issues in any AI work — but they take a unique form in the medical field. Missing data from medical records, a lack of data on rare diseases, low-quality imaging, and the need for AI models to jibe with clinical processes are critical factors for diagnoses and treatments when lives hang in the balance.

But van der Schaar is optimistic, especially about ways AI can make a difference around the world. In response to a question during a Q&A chat, she wrote, “I believe ML can be very useful in countries with limited access to health care. If made robust, ML methods can assist with enhanced (remote) diagnosis, treatment recommendations (and second opinions), [and] monitoring. It can also help with monitoring patients effectively, in a personalized way and at a low cost. Also, one can discover who is the best local expert to treat/diagnose a patient.”

“Machine learning really is a powerful tool, if designed correctly — if problems are correctly formalized and methods are identified to really provide new insights for understanding these diseases,” she said at the conclusion of her keynote. “Of course, we are at the beginning of this revolution, and there is a long way to go. But it’s an exciting time. And it’s an important time to focus on such technologies. I really believe that machine learning can open clinicians and medical researchers [to new possibilities] and provide them with powerful new tools to better [care for] patients.”