Microsoft, White House, and Allen Institute release coronavirus data set for medical and NLP researchers

The COVID-19 Open Research Dataset (CORD-19), a repository of more than 29,000 scholarly articles on the coronavirus family from around the world, is being released today for free. The data set is the result of work by Microsoft Research, the Allen Institute for AI, the National Library of Medicine at the National Institutes of Health (NIH), the White House Office of Science and Technology Policy (OSTP), and others. It includes machine-readable research from more than 13,000 scholarly articles. The aim is to empower the medical and machine learning research communities to mine text data for insights that can help fight COVID-19.

"The White House worked with the National Academies of Sciences, Engineering, and Medicine and the World Health Organization to identify dozens of high-priority scientific questions related to COVID-19 to inform the call to action," White House CTO Michael Kratsios said today in a teleconference. "Artificial intelligence can be incredibly powerful to help scientists summarize and analyze the information."

The corpus of data comes with a call to action urging AI researchers to create data and text mining techniques to assist medical researchers. Increased data sharing and collaboration among scientific professionals could certainly play a role in combating COVID-19.

"Our goal in creating this open data set and [Kaggle] Q&A challenge for coronavirus is to stimulate the AI community to create tools that can help scientists stay on top of thousands of articles to enable them to develop approaches to addressing the COVID-19 pandemic," Microsoft chief scientific officer Eric Horvitz said during the call. A Microsoft tool was used to perform worldwide indexing and mapping of scholarly articles. "With a million new publications being published each year across all of biomedicine, AI will grow in importance as a critical companion to scientists."

Text mining can enable researchers to evaluate hypotheses, formulate research plans, understand seminal works, and do things like create question-answering bots. As part of the news today, the Allen Institute's Semantic Scholar will deploy an adaptive feed of existing coronavirus-related research.

"By interacting with the feed, you train it to understand your interests and what relevance means to you. So while the feed might start with kind of the top papers on coronavirus, depending on what papers you interact with and what you find useful and not useful, it will learn your preferences. So each scholar would get [a] somewhat different ordering of papers because their interest in the problem is different," Semantic Scholar general manager Doug Raymond told VentureBeat in a phone interview.

Semantic Scholar's personalized adaptive feed builds on work the Allen Institute has done on NLP technology like the ELMo language model and the AllenNLP library to understand relationships between the contents of papers. Machine learning experts speaking with VentureBeat said Transformer-based advances in text generation and NLP are among the most significant developments of 2019, with more ahead in 2020.

"It's because we've had significant advances in NLP in the last couple years, the utility of a data set like this [will] likely be greater than it was a few years ago because there [are] more readily available tools," Raymond said.

Allen Institute for AI director Oren Etzioni said AI can help accelerate progress and unearth answers to questions but stressed that AI will augment humans and will not solve the problem on its own.

Multiple organizations are using NLP to fight COVID-19. Harvard Medical School developed a tool to review relevant data, such as patient records, social media, and public health data. BlueDot, a company that uses tools like NLP to scour news articles, public health data, and other sources, reportedly spotted the beginning of the COVID-19 outbreak before the World Health Organization sounded the alarm. In China, tech giants like Alibaba, through Alibaba Cloud's Damo Academy, are applying state-of-the-art NLP to text analysis of medical records and to epidemiological investigations by China CDC officials. Last week, Alibaba's StructBERT was named the top-performing NLP system in the world on the GLUE benchmark leaderboard.

Websites like PubMed and Microsoft's Academic Graph now have COVID-19 resource pages for medical researchers to browse. Partnerships with repositories of published literature and preprints, like arXiv.org and medrxiv.org, will help keep the data set up to date. The Chan Zuckerberg Initiative and Georgetown University's Center for Security and Emerging Technology have also agreed to contribute knowledge. The joint effort has coalesced in the past week, and the most urgent unanswered questions will be listed on the Kaggle website, White House deputy CTO Lynne Parker said today.

As part of a five-year collaboration initiative, Harvard Medical School and the Guangzhou Institute will share $115 million in research funding provided by China Evergrande Group. Work at the Guangzhou Institute will be led by Zhong Nanshan, who currently heads the Chinese 2019-nCoV Expert Taskforce and is director general of China's State Key Laboratory of Respiratory Diseases.

Other forms of AI being applied to combat COVID-19 include disinfecting robots and deep learning models that predict mortality rates and detect COVID-19 from CT scan imagery. Governments around the world have also turned to tech like GPS tracking, self-screening apps, text alerts, and movement tracking with smartphones. Other efforts underway include an antibody discovery initiative between AbCellera and DARPA's Pandemic Prevention Platform program and Autonomous Diagnostics to Enable Prevention and Therapeutics (ADEPT) that's designed to stop disease outbreaks within 60 days.

The news of the open data set comes a week after White House CTO Michael Kratsios first shared a demo of the research repository during a teleconference with tech giants like Apple, Amazon, Facebook, Google, Microsoft, and Twitter. This teleconference was aimed at finding ways to fight the pandemic using artificial intelligence and data collected by tech companies.

Few details were shared about the teleconference, but the White House said the government and businesses discussed information sharing and the creation of new tech tools. Anonymous sources told the Washington Post that an Amazon employee offered the company's cloud reporting services for tracking travelers. VentureBeat reached out to Amazon for more details but has not heard back. As the number of COVID-19 cases in the United States continues to rise, President Trump has repeatedly been criticized for spreading misinformation.

Shortly after declaring a national emergency to accelerate federal funding last Friday, President Trump, Vice President Pence, and other administration officials said Google was creating a screening website that seemingly promised broad coverage. However, Google said in a statement that Alphabet subsidiary Verily is working on a screening site -- as part of its Project Baseline -- but that at launch it will only be available in two locations in the San Francisco Bay Area. Use of the site requires a Google account.

On Sunday, Google CEO Sundar Pichai announced the company is now working with the U.S. government to create a self-screening website for people wondering whether they should seek medical attention.
