Researchers release data set of CT scans from coronavirus patients

In an effort to spur the development of systems that can quickly spot signs of the novel coronavirus, a team of researchers at the University of California, San Diego this week released a data set -- the COVID-CT-Dataset -- that they claim is the largest of its kind. It contains 275 CT scans collected from 143 patients with confirmed cases of COVID-19. To demonstrate its potential, they trained an AI model that achieved an accuracy of 85% -- a figure they say could be improved with further model optimization.

It's important to note that the U.S. Centers for Disease Control and Prevention recommends against the use of CT scans or X-rays for COVID-19 diagnosis, as do the American College of Radiology (ACR) and radiological organizations in Canada, New Zealand, and Australia. That's because even the best AI systems sometimes can't tell the difference between COVID-19 and common lung infections like bacterial or viral pneumonia.

However, some, including Intel's Xu Cheng, assert that systems from companies like Alibaba, RadLogics, Lunit, DarwinAI, Infervision, and Qure.ai might play a role in triage by indicating that further testing is required. "Simply put, it's challenging for health care professionals and government officials to allocate resources and stop the spread of the virus without knowing who is infected, where they are located, and how they are affected," Cheng said last week in reference to a system from Huiying Medical that detects COVID-19 with a claimed 96% accuracy.

In this case, the researchers didn't collect the scans themselves. As they point out in a paper, the images used in much of the existing research on COVID-19-diagnosing systems haven't been shared due to privacy concerns. Instead, the team scoured 760 COVID-19 studies published from January 19 to March 25 and used a tool called PyMuPDF to extract low-level structure information. This allowed them to locate the embedded figures within the studies and identify the captions associated with the figures.
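
To make the caption-matching step concrete, here is a minimal sketch in Python of one way a figure could be paired with its caption once a PDF parser has reported bounding boxes for images and text blocks. The `Box` and `TextBlock` types, the `find_caption` helper, the 40-point gap threshold, and the sample layout are all hypothetical and chosen for illustration; the paper's actual heuristics are not described here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """Axis-aligned rectangle on a page; y grows downward (assumed convention)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class TextBlock:
    box: Box
    text: str


def horizontal_overlap(a: Box, b: Box) -> float:
    """Width of the horizontal span shared by two boxes (0 if they do not overlap)."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def find_caption(figure: Box, blocks: List[TextBlock], max_gap: float = 40.0) -> Optional[str]:
    """Return the nearest caption-like text block directly below the figure, if any."""
    best = None
    best_gap = max_gap
    for block in blocks:
        gap = block.box.y0 - figure.y1  # vertical distance below the figure
        looks_like_caption = block.text.lstrip().lower().startswith(("fig.", "figure"))
        overlaps = horizontal_overlap(figure, block.box) > 0
        if 0 <= gap <= best_gap and looks_like_caption and overlaps:
            best, best_gap = block, gap
    return best.text if best else None


# Hypothetical page layout: one embedded figure and three text blocks.
figure = Box(50, 100, 300, 250)
blocks = [
    TextBlock(Box(50, 40, 500, 90), "Patients were scanned on admission."),
    TextBlock(Box(50, 260, 300, 290), "Figure 2: Axial chest CT image."),
    TextBlock(Box(50, 300, 500, 400), "Figure 2 shows ground-glass opacities."),
]
caption = find_caption(figure, blocks)
print(caption)  # prints: Figure 2: Axial chest CT image.
```

The sketch prefers the closest block below the figure, so body text that merely mentions the figure farther down the page is not mistaken for the caption.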

Next, the researchers manually selected clinical COVID-19 scans by reading the captions. For scans they weren't able to judge from the caption, they looked for text referring to the figure to make a decision. Any figure containing multiple CT scans as sub-figures was manually split into individual images.

The team concedes that because the data set is small, training models on it could lead to overfitting, in which a model performs well on the training data but generalizes poorly to test data. To mitigate this problem, they pretrained an AI system on the National Institutes of Health's ChestX-ray14 data set -- a large collection of chest X-ray images -- and fine-tuned it on the COVID-CT data set. Additionally, they augmented each image in COVID-CT by cropping, flipping, and transforming it to create synthesized pairs.
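
As a rough illustration of the augmentation idea, the following Python sketch crops and flips a tiny grayscale image to produce two synthesized views of it. It uses plain nested lists rather than an imaging library, and the function names, crop size, flip probability, and 4x4 sample "scan" are assumptions made for the example, not details from the paper.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel intensities


def horizontal_flip(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def random_crop(image: Image, height: int, width: int, rng: random.Random) -> Image:
    """Cut a height x width window from a random position inside the image."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]


def augment(image: Image, crop_size: int, rng: random.Random) -> Image:
    """Apply a random square crop, then flip with probability 0.5 (assumed value)."""
    cropped = random_crop(image, crop_size, crop_size, rng)
    if rng.random() < 0.5:
        cropped = horizontal_flip(cropped)
    return cropped


rng = random.Random(0)  # fixed seed so the run is repeatable
scan = [[r * 4 + c for c in range(4)] for r in range(4)]  # hypothetical 4x4 image
views = [augment(scan, 3, rng) for _ in range(2)]  # two synthesized views of one scan
for view in views:
    print(view)
```

Each call yields a slightly different view of the same scan, which is how augmentation stretches a small data set without adding new patients.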

After training the model on 146 non-COVID scans and 183 COVID-19 scans, the researchers report that their baseline AI system demonstrated high precision but low recall. Recall in this context refers to the ability of the model to find all the relevant cases within the data set. As a next step, the team says it will continue to improve the method to achieve better accuracy.
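
For readers unfamiliar with the two metrics, this short Python sketch computes precision and recall from true and predicted labels. The labels are invented for illustration (1 means COVID-19, 0 means non-COVID) and are not results from the paper; they simply show how a cautious model can be right every time it flags a case yet still miss half of the real ones.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of flagged cases that are real
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of real cases that were flagged
    return precision, recall


# Made-up labels: four COVID-19 scans and four non-COVID scans.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]  # the model flags only two of the four real cases

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
# prints: precision=1.00, recall=0.50
```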

Concerningly, it's unclear whether any of the researchers notified patients whose scans they scraped from the publicly available studies. We've reached out for clarification and will update this post when we hear back.
