Researchers release data set to evaluate COVID-19 chatbots and search engines

In a paper published this week on the preprint server arXiv.org, researchers at Facebook, New York University, and the University of Waterloo detail a question-answering data set -- CovidQA -- that comprises submissions from the COVID-19 Open Research Dataset Challenge, a collection of tasks based on scientific questions developed with the World Health Organization and the National Academies of Sciences, Engineering, and Medicine. They say that CovidQA, which is a work in progress, could help gauge the accuracy of chatbots and search engines that answer questions about the novel coronavirus.

Countries, health systems, and nonprofits around the world are employing AI natural language tools to triage potential COVID-19 patients. But as our investigation in early April revealed, chatbots, in particular, rely on inconsistent medical data sources and privacy practices. Data sets like CovidQA could be used to empirically compare the accuracy of the answers supplied by COVID-19 chatbots, exposing gaps in their knowledge and giving users greater peace of mind.

Version 0.1 of CovidQA contains 124 question-document pairs, 27 questions, and 85 unique articles, which the paper's coauthors created from the literature review page of the COVID-19 Open Research Dataset Challenge. Broad topics like "decontamination based on physical science" are deconstructed into multiple questions -- for example, "UVGI intensity used for inactivating COVID-19" and "purity of ethanol to inactivate COVID-19." Each question is associated with both a keyword query (i.e., what a user might type into a search engine) and a "well-formed" natural language question.

For example, one annotated question-document pair within CovidQA falls under the category "asymptomatic shedding" and the subcategory "proportion of patients who were asymptomatic," with the associated query "proportion of patients who were asymptomatic" and question "What proportion of patients were asymptomatic?" The answer entry -- e.g., "49 (14.89%) were asymptomatic" -- also contains the title of the scientific study from which it was sourced.
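To make the shape of such an annotation concrete, here is a minimal Python sketch of one record. The class name and field names are illustrative assumptions, not the data set's actual schema, and the study title is left as a placeholder because the article does not give it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedPair:
    """One question-document pair (hypothetical field names)."""
    category: str
    subcategory: str
    keyword_query: str   # what a user might type into a search engine
    question: str        # the "well-formed" natural language question
    exact_answer: str    # answer span taken from the source article
    source_title: str    # title of the study the answer came from


# Values are the ones quoted in the article; the title is a placeholder.
pair = AnnotatedPair(
    category="asymptomatic shedding",
    subcategory="proportion of patients who were asymptomatic",
    keyword_query="proportion of patients who were asymptomatic",
    question="What proportion of patients were asymptomatic?",
    exact_answer="49 (14.89%) were asymptomatic",
    source_title="<title of the source study>",
)
```

Keeping the keyword query and the natural language question as separate fields would let the same record be used to test either style of input.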

The researchers note that CovidQA is too small to train a supervised machine learning model (i.e., a model that learns from labeled data) -- at least not without supplementary data from another source. But they assert that it can be used to evaluate a model by feeding it a question or keyword query and observing how it scores the relevance of each sentence. If the top-scoring sentence contains the exact answer, the model answered the question correctly.
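The evaluation described above amounts to checking whether a model's top-ranked sentence contains the exact answer. A rough Python sketch follows, under stated assumptions: the token-overlap scorer is a toy stand-in for a real model, the function names are hypothetical, and the example sentences are invented for illustration (only the query and answer come from the article).

```python
import re
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def token_overlap_score(query: str, sentence: str) -> float:
    """Toy scorer: fraction of query tokens that also appear in the sentence."""
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / len(q) if q else 0.0


def top_sentence_is_correct(query: str, sentences: List[str], exact_answer: str,
                            score: Scorer = token_overlap_score) -> bool:
    """True if the highest-scoring sentence contains the exact answer."""
    best = max(sentences, key=lambda sent: score(query, sent))
    return exact_answer.lower() in best.lower()


def precision_at_1(examples: List[Tuple[str, List[str], str]],
                   score: Scorer = token_overlap_score) -> float:
    """Share of examples whose top-ranked sentence contains the answer."""
    hits = sum(top_sentence_is_correct(q, sents, ans, score)
               for q, sents, ans in examples)
    return hits / len(examples)


QUERY = "proportion of patients who were asymptomatic"
ANSWER = "49 (14.89%) were asymptomatic"
# Invented sentences standing in for an article's text.
SENTENCES = [
    "The cohort included hospitalized patients.",
    "Of these patients, 49 (14.89%) were asymptomatic.",
    "Fever was the most common symptom.",
]

print(precision_at_1([(QUERY, SENTENCES, ANSWER)]))
```

Any model that assigns a relevance score to a query-sentence pair could be passed in as `score` in place of the toy scorer.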

In an experiment, the researchers found that BioBERT, a BERT-based biomedical language representation model designed for text mining tasks, performed the best out of several models tested against CovidQA, correctly ranking answers to questions on average 40.4% of the time. The hope is to eventually improve CovidQA so that it can evaluate models' ability to detect when an answer isn't present in documents contained within the data set and to generalize the data set into a methodology for rapidly building biomedical test sets.

"The empirical nature of modern [natural language processing] research depends critically on evaluation resources that can guide progress," wrote the coauthors. "For rapidly emerging domains, such as the ongoing COVID-19 pandemic, it is likely that no appropriate domain-specific resources are available at the outset. Thus, approaches to rapidly build evaluation products are important."
