Why self-supervised learning is a medical AI game-changer

Self-supervised learning has been a fast-rising trend in artificial intelligence (AI) over the past couple of years, as researchers seek to take advantage of large-scale unannotated data to develop better machine learning models.

In 2020, Yann LeCun, Meta’s chief AI scientist, said supervised learning, which entails training an AI model on a labeled data set, would play a diminishing role as self-supervised learning came into wider use.

“Most of what we learn as humans and most of what animals learn is in a self-supervised mode, not a reinforcement mode,” he told a virtual session audience during the International Conference on Learning Representations (ICLR) 2020. And in a 2021 Meta blog post, LeCun explained that self-supervised learning “obtains supervisory signals from the data itself, often leveraging the underlying structure in the data.” Because of that, it can make use of a “variety of supervisory signals across co-occurring modalities (e.g., video and audio) and across large datasets – all without relying on labels.”

Growing use of self-supervised learning in medicine

Those advantages have led to the growing use of self-supervised learning in healthcare and medicine, thanks to the vast amount of unstructured data available in that industry – including electronic health records and datasets of medical images, bioelectrical signals, and sequences and structures of genes and proteins. Previously, developing medical applications of machine learning required manual annotation of data, often by medical experts.

This was a bottleneck to progress, said Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School. Rajpurkar leads a research lab focused on deep learning for label-efficient medical image interpretation, clinician-AI collaboration design, and open benchmark curation.

“We've seen a lot of exciting advancements with our labeled data sets,” he told VentureBeat, but a “paradigm shift” was necessary to go from 100 algorithms that do very specific medical tasks to the thousands needed without going about a laborious, intensive process.

That is where self-supervised learning, with its ability to predict any unobserved or hidden part of an input from any observed or unhidden part of an input, has been a game-changer.
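
To make the idea of predicting a hidden part of an input from the observed part concrete, here is a minimal sketch in Python. It is an illustration only, not code from Rajpurkar's lab: the function names, the toy signal and the interpolation "model" are all assumptions. The point is that the training target is cut from the unlabeled data itself, so no annotator is involved.

```python
from typing import List, Tuple


def make_masked_example(
    signal: List[float], start: int, length: int, mask_value: float = 0.0
) -> Tuple[List[float], List[float]]:
    """Hide signal[start:start+length]; the hidden values become the target."""
    if start < 0 or length <= 0 or start + length > len(signal):
        raise ValueError("mask span must lie inside the signal")
    target = signal[start:start + length]
    masked = signal[:start] + [mask_value] * length + signal[start + length:]
    return masked, target


def mean_squared_error(predicted: List[float], target: List[float]) -> float:
    """Average squared difference between the guess and the hidden values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Hypothetical unlabeled signal (a made-up ramp, not real clinical data).
signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
masked, target = make_masked_example(signal, start=2, length=2)

# Stand-in "model": guess the hidden values by linear interpolation between
# the observed neighbors on each side of the mask. A real system would train
# a neural network to make this guess and minimize the loss below.
left, right = masked[1], masked[4]
prediction = [left + (right - left) * (i + 1) / 3 for i in range(2)]
loss = mean_squared_error(prediction, target)
```

In practice, the same recipe is applied to patches of an image or segments of a recording, and the network that learns to fill in the gaps is then reused as a starting point for downstream tasks.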

Highlighting self-supervised learning

In a recent review paper in Nature Biomedical Engineering, Rajpurkar, along with cardiologist, scientist and author Eric Topol and student researcher Rayan Krishnan, highlighted self-supervised methods and models used in medicine and healthcare, as well as promising applications of self-supervised learning for developing models that leverage multimodal datasets, and the challenges of collecting unbiased data to train them.

The paper, Rajpurkar said, was aimed at “communicating the opportunities and challenges that underlie this shift in paradigm we're going to see over the upcoming years in many applications of AI, most certainly including medicine.”

With self-supervised learning, Rajpurkar explained that he “… can learn about a certain data source, whether that's a medical image or signal, by using unlabeled data. That allows me a great starting point to do any task I care about within medicine and beyond without actually collecting large labeled datasets.”

Big achievements unlocked

In 2019 and 2020, Rajpurkar’s lab saw some of the first big achievements that self-supervised learning was unlocking for interpreting medical images, including chest X-rays.

“With a few modifications to algorithms that helped us understand natural images, we reduced the number of chest X-rays that had to be seen with a particular disease before we could start to do well at identifying that disease,” he said.

Rajpurkar and his colleagues applied similar principles to electrocardiograms.

“We showed that with some ways of applying self-supervised learning, in combination with a bit of physiological insights in the algorithm, we were able to leverage a lot of unlabeled data,” he said.

Since then, he has also applied self-supervised learning to lung and heart sound data.

“What's been very exciting about deep learning as a whole, but especially in the recent year or two, is that we've been able to transfer our methods really well across modalities,” Rajpurkar said.

Self-supervised learning across modalities

For example, another soon-to-be-published paper showed that even with zero annotated examples of diseases on chest X-rays, Rajpurkar’s team was able to detect diseases on chest X-rays and classify them nearly at the level of radiologists across a variety of pathologies.

“We basically learned from images paired with radiology reports that were dictated at the time of their interpretation, and combined these two modalities to create a model that could be applied in a zero-shot way – meaning labeled samples were not necessary to be able to classify different diseases,” he said.
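
One common way to get zero-shot behavior from paired images and text is to place both in a shared embedding space and compare an image against text prompts for each candidate disease. The sketch below assumes that setup; it is not taken from the paper Rajpurkar describes. The embeddings are made-up toy vectors, and the prompt labels, function names and temperature value are hypothetical. In a real system the vectors would come from image and text encoders trained on image-report pairs.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_scores(
    image_embedding: List[float],
    prompt_embeddings: Dict[str, List[float]],
    temperature: float = 0.1,
) -> Dict[str, float]:
    """Softmax over image-to-prompt similarities; no labeled images needed."""
    sims = {
        label: cosine_similarity(image_embedding, emb) / temperature
        for label, emb in prompt_embeddings.items()
    }
    max_sim = max(sims.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - max_sim) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy embeddings (assumptions for illustration, not outputs of a real encoder).
prompt_embeddings = {
    "pneumonia": [0.9, 0.1, 0.0],
    "no pneumonia": [0.1, 0.9, 0.0],
}
image_embedding = [0.8, 0.2, 0.1]

scores = zero_shot_scores(image_embedding, prompt_embeddings)
predicted = max(scores, key=scores.get)
```

Adding a new disease to such a classifier means writing a new text prompt rather than collecting and labeling new images, which is what makes the approach zero-shot.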

Whether the data is proteins, images or text, the process borrows from the same set of frameworks, methods and terminology, in a way that is more unified than it was even two or three years ago.

“That’s exciting for the field because it means that a set of advances on a general set of tools helps everybody working across and on these very specific modalities,” he said.

In medical image interpretation, which has been Rajpurkar’s research focus for many years, this is “absolutely revolutionary,” he said. “Rather than thinking of solving problems one at a time and iterat[ing] this process 1,000 times, I can solve a much larger set of problems all at once.”

Momentum to apply methods

These possibilities have created momentum toward developing and applying self-supervised learning methods in medicine and healthcare, and likely in other industries that can also collect data at scale, Rajpurkar said, especially those without the sensitivity associated with medical data.

Going forward, he added, he is interested in getting closer to solving the full swath of tasks that a medical expert performs.

“The goal has always been to enable intelligent systems that can increase the accessibility of medicine and healthcare to a large audience,” he said, adding that what excites him is building solutions that don’t just solve one narrow problem: “We're working toward a world with models that combine different signals so physicians or patients are able to make intelligent decisions about diagnoses and treatments.”