Google says its AI detects 26 skin conditions as accurately as dermatologists

Image Credit: Khari Johnson / VentureBeat

Skin conditions are among the most common kinds of ailments globally, just behind colds, fatigue, and headaches. In fact, it’s estimated that 25% of all treatments provided to patients around the world are for skin conditions and that up to 37% of patients seen in the clinic have at least one skin complaint.

The enormous case workload and a global shortage of dermatologists have forced sufferers to seek out general practitioners, who tend to be less accurate than specialists when it comes to identifying conditions. This trend motivated researchers at Google to investigate an AI system capable of spotting the most common dermatological disorders seen in primary care. In a paper (“A Deep Learning System for Differential Diagnosis of Skin Diseases“) and accompanying blog post, they report that, when presented with images and metadata about a patient case, the system identifies 26 skin conditions with accuracy they claim is on par with that of U.S. board-certified dermatologists.

“We developed a deep learning system (DLS) to address the most common skin conditions seen in primary care,” wrote Google software engineer Yuan Liu and Google Health technical program manager Dr. Peggy Bui. “This study highlights the potential of the DLS to augment the ability of general practitioners who did not have additional specialty training to accurately diagnose skin conditions.”

Above: A schematic illustrating the AI system’s architecture.

Image Credit: Google

As Liu and Bui further explained, dermatologists don’t give just one diagnosis for any skin condition — instead, they generate a ranked list of possible diagnoses (a differential diagnosis) to be systematically narrowed by subsequent lab tests, imaging, procedures, and consultations. So too does the Google researchers’ system, which processes inputs that include one or more clinical images of the skin abnormality and up to 45 types of metadata (e.g., self-reported components of the medical history, such as age, sex, and symptoms).
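The paper’s exact architecture isn’t reproduced here, but one common pattern for this kind of multimodal input — and a plausible reading of the schematic — is to concatenate an image embedding with an encoded metadata vector before a classification head that scores all 26 conditions. The sketch below is purely illustrative; the dimensions, random weights, and fusion strategy are assumptions, not the paper’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: a CNN-style image embedding and an encoded
# vector for the up-to-45 metadata fields (age, sex, symptoms, ...).
# Sizes are assumptions for the sketch, not taken from the paper.
image_embedding = rng.standard_normal(256)
metadata_vector = rng.standard_normal(45)

# Simple late fusion: concatenate the two modalities.
fused = np.concatenate([image_embedding, metadata_vector])  # shape (301,)

# Toy linear classification head over 26 skin conditions.
W = rng.standard_normal((26, fused.size))
logits = W @ fused

# Sorting the scores yields a ranked differential diagnosis.
ranked_conditions = np.argsort(logits)[::-1]
print(ranked_conditions[:3])  # indices of the top-3 predicted conditions
```

In practice the image branch would be a trained convolutional network and the head would be trained end to end; the point here is only the fusion-then-rank structure.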

The research team says it evaluated the model with 17,777 de-identified cases from 17 primary care clinics across two states. They bifurcated the corpus and used the portion of records dated between 2010 and 2017 to train the AI system, reserving the portion from 2017 to 2018 for evaluation. During training, the model leveraged over 50,000 differential diagnoses provided by over 40 dermatologists.

In a test of the system’s diagnostic accuracy, the researchers compiled diagnoses from three U.S. board-certified dermatologists. Just over 3,750 cases were aggregated to derive the ground truth labels, and the AI system’s ranked list of skin conditions achieved 71% and 93% top-1 and top-3 accuracies, respectively. Furthermore, when the system was compared against three categories of clinicians (dermatologists, primary care physicians, and nurse practitioners) on a subset of the validation data set, the team reports that its top three predictions demonstrated a top-3 diagnostic accuracy of 90%, comparable to dermatologists (75%) and “substantially higher” than primary care physicians (60%) and nurse practitioners (55%).
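The top-1 and top-3 metrics quoted above have a simple definition: a case counts as correct if the ground-truth condition appears anywhere in the model’s top-k ranked predictions. A minimal sketch, with toy labels rather than the study’s data:

```python
def top_k_accuracy(ranked_predictions, ground_truth, k):
    """Fraction of cases where the true condition appears among the
    model's top-k ranked differential diagnoses."""
    hits = sum(
        truth in preds[:k]
        for preds, truth in zip(ranked_predictions, ground_truth)
    )
    return hits / len(ground_truth)

# Toy cases (hypothetical labels): each prediction is a ranked list,
# most likely condition first.
preds = [
    ["eczema", "psoriasis", "acne"],
    ["acne", "rosacea", "eczema"],
    ["melanoma", "nevus", "seborrheic keratosis"],
]
truth = ["eczema", "eczema", "nevus"]

print(top_k_accuracy(preds, truth, 1))  # 0.333... (only the first case hits)
print(top_k_accuracy(preds, truth, 3))  # 1.0 (every truth is in some top-3)
```

This is why top-3 accuracy is naturally higher than top-1: it credits the model whenever the correct condition is anywhere in the short differential it would hand a clinician.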

Above: The AI system’s accuracy, trained on different data sets.

Image Credit: Google

Lastly, in order to evaluate potential bias toward skin type, the team tested the AI system’s performance based on the Fitzpatrick skin type, a scale that ranges from Type I (“pale white, always burns, never tans”) to Type VI (“darkest brown, never burns”). Focusing on skin types that represent at least 5% of the data, they found that the model’s performance was similar, with a top-1 accuracy ranging from 69% to 72%, and a top-3 accuracy from 91% to 94%.
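The subgroup check described above amounts to partitioning the evaluation set by Fitzpatrick type and computing accuracy within each partition. A minimal sketch with hypothetical labels (the grouping logic is the point, not the data):

```python
from collections import defaultdict

def accuracy_by_group(predictions, labels, groups):
    """Top-1 accuracy computed separately per subgroup (e.g. per
    Fitzpatrick skin type), to surface performance gaps across groups."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        totals[group] += 1
        hits[group] += int(pred == label)
    return {g: hits[g] / totals[g] for g in totals}

# Toy example (hypothetical cases, not the study's data)
preds  = ["eczema", "acne", "eczema", "psoriasis"]
labels = ["eczema", "acne", "psoriasis", "psoriasis"]
types  = ["II", "II", "IV", "IV"]
print(accuracy_by_group(preds, labels, types))  # {'II': 1.0, 'IV': 0.5}
```

The study’s restriction to skin types covering at least 5% of the data corresponds to dropping groups whose `totals` are too small for the per-group estimate to be meaningful.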

The researchers credit the presence of metadata in the training corpus with the system’s overall accuracy and say the results suggest their approach might “help prompt clinicians … to consider possibilities” that weren’t originally in their differential diagnoses. However, they note that their training corpus was taken from only a single teledermatology service; that some Fitzpatrick skin types were too rare in their data set to allow meaningful training or analysis; and that their system didn’t accurately detect some skin conditions, such as melanoma, due to a lack of available data samples.

“We believe these limitations can be addressed by including more cases of biopsy-proven skin cancers in the training and validation sets,” wrote Liu and Bui. “The success of deep learning to inform the differential diagnosis of skin disease is highly encouraging of such a tool’s potential to assist clinicians. For example, such a DLS could help triage cases to guide prioritization for clinical care or help non-dermatologists initiate dermatologic care more accurately and could potentially improve access [to care].”