Breast cancer is the second leading cancer-related cause of death among women in the U.S. It’s estimated that in 2015, 232,000 women were diagnosed with the disease and approximately 40,000 died from it. And while diagnostic exams like mammography have come into wide practice — in 2014, over 39 million breast cancer screenings were performed in the U.S. alone — they’re not always reliable. About 10 to 15 percent of women who undergo a mammogram are asked to return following an inconclusive analysis.

That’s why researchers at New York University are investigating an AI-driven technique that promises much higher precision than today’s tests. In a newly published paper (“Deep Neural Networks Improve Radiologists’ Performance in Breast Cancer Screening”), they describe a deep convolutional neural network, a class of machine learning algorithm commonly used in image classification, that notches an area under the ROC curve (AUC) of 0.895 in predicting the presence of a cancerous breast tumor. Moreover, they claim that when the AI system’s predictions are averaged with the probability of malignancy predicted by a radiologist, the resulting AUC is higher than either method achieves separately.

The code and best-performing pretrained models are available on GitHub.

“In this work, we propose a novel neural network architecture … to efficiently handle a large dataset of high-resolution breast mammograms with biopsy-proven labels,” the paper’s authors explain. “We experimentally show that our model is as accurate as an experienced radiologist and that it can improve the accuracy of radiologists’ diagnoses when used as a second reader … The fact that previously unseen exams with malignancies were found by the network to be similar further corroborates that our model exhibits strong generalization capabilities.”

The team began by sourcing a data set comprising 229,426 digital screening mammography exams (1,001,093 images) from 141,473 patients, each of which contained at least four images corresponding to the four views typically used in mammography screenings (right craniocaudal, left craniocaudal, right mediolateral oblique, and left mediolateral oblique). They extracted labels from 5,832 exams with at least one biopsy performed within 120 days of the screening mammogram, and then recruited a team of radiologists — all of whom were provided supporting pathology reports — to indicate where the biopsies were taken “at the pixel level.”

For each mammographed breast, the researchers assigned two binary labels, one for the absence or presence of malignant findings and one for the absence or presence of benign findings, for a total of four labels per exam, which they used to train the convolutional network. They additionally fed an auxiliary, patch-level AI system with labels from the radiologists’ pixel-level analyses, and they used both it and the primary model’s predictions to create heatmaps for exam images estimating the probability of malignant and benign findings.
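The labeling scheme is easier to see in code. What follows is only a minimal sketch: the class names, field names, and the ordering of the four-element vector are assumptions made for illustration and are not taken from the NYU code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BreastLabels:
    """Two independent binary labels for one breast (hypothetical names)."""
    malignant: bool  # True if malignant findings are present
    benign: bool     # True if benign findings are present


@dataclass(frozen=True)
class ExamLabels:
    """One screening exam covers both breasts, so it carries four labels."""
    left: BreastLabels
    right: BreastLabels

    def as_vector(self) -> List[int]:
        # Ordering is an assumption made for this sketch:
        # [left malignant, left benign, right malignant, right benign]
        return [
            int(self.left.malignant),
            int(self.left.benign),
            int(self.right.malignant),
            int(self.right.benign),
        ]


# Made-up example: benign findings in the left breast, nothing in the right.
exam = ExamLabels(
    left=BreastLabels(malignant=False, benign=True),
    right=BreastLabels(malignant=False, benign=False),
)
print(exam.as_vector())  # [0, 1, 0, 0]
```

Because each breast gets two separate yes/no labels rather than a single category, this representation allows a breast to be marked as having both malignant and benign findings, or neither.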

In experiments involving a test set of 740 randomly selected biopsied and non-biopsied screening mammogram exams, despite noise and ambiguity in certain labels, the models together achieved AUCs between 0.738 and 0.895 in predicting malignant tumors and between 0.642 and 0.779 in predicting benign tumors across patient populations.

To further validate the model, the paper’s authors conducted a reader study with 14 readers (12 attending radiologists, a resident, and a medical student), whom they tasked with analyzing the test set of 1,480 breasts. The readers’ individual AUCs varied from 0.705 to 0.860, but when each was provided the AI system’s predictions to cross-check with their own, the group’s average AUC increased to 0.891.
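To make the AUC figures and the averaging idea concrete, here is a minimal sketch in plain Python. It computes AUC as the probability that a randomly chosen malignant case scores higher than a randomly chosen non-malignant one, then averages two sets of predictions. The function names and the six-case data set are invented for illustration and were deliberately built so that the two readers make different mistakes. None of it comes from the paper, and the toy result says nothing about real-world performance.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (the Mann-Whitney U formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_predictions(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Element-wise mean of two readers' predicted probabilities."""
    if len(a) != len(b):
        raise ValueError("Both readers must score the same cases")
    return [(x + y) / 2 for x, y in zip(a, b)]


# Made-up toy data, NOT from the paper: 1 = malignant, 0 = not malignant.
labels = [1, 1, 1, 0, 0, 0]
model = [0.9, 0.3, 0.8, 0.2, 0.6, 0.1]        # hypothetical model outputs
radiologist = [0.5, 0.9, 0.2, 0.1, 0.3, 0.7]  # hypothetical reader estimates
hybrid = average_predictions(model, radiologist)

print(round(roc_auc(labels, model), 3))        # 0.889
print(round(roc_auc(labels, radiologist), 3))  # 0.667
print(round(roc_auc(labels, hybrid), 3))       # 1.0
```

In this contrived case the model and the radiologist each misrank different exams, so averaging cancels their errors and the hybrid separates the two classes perfectly. When two readers make the same mistakes, averaging would offer little or no gain.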

The researchers concede that the training data set is relatively small, and that despite the model’s architectural simplicity, it’d be impossible to train end-to-end with most consumer hardware. Still, they say that it’s a promising step toward a generalizable cancer-screening model that might help to guide clinicians in making diagnostic decisions.

“[The results] suggest that our network and radiologists learned different aspects of the task and that our model could be effective as a tool providing radiologists a second reader,” the researchers wrote. “With this contribution, research groups that are working on improving screening mammography, which may not have access to a large training dataset like ours, will be able to directly use our model in their research or to use our pretrained weights as an initialization to train models with less data. By making our models public, we invite other groups to validate our results and test their robustness to shifts in the data distribution.”

NYU isn’t the only institution applying AI to breast cancer detection. Last year, Google said a system it developed (dubbed Lymph Node Assistant, or LYNA) achieved an area under the receiver operating characteristic curve (AUC) of 99 percent. Baidu also claims to have architected a deep learning algorithm that outperforms human pathologists in its ability to identify breast cancer metastasis.