Microsoft researchers claim 'state-of-the-art' biomedical NLP model

In a paper published on the preprint server arXiv.org, Microsoft researchers propose an AI technique they call domain-specific language model pretraining for biomedical natural language processing (NLP). After compiling a "comprehensive" biomedical NLP benchmark from publicly available data sets, the coauthors claim they achieved state-of-the-art results on tasks including named entity recognition, evidence-based medical information extraction, document classification, and more.

Previous studies have shown that, in specialized domains like biomedicine, training an NLP model on domain-specific data sets can deliver accuracy gains. But a prevailing assumption is that "out-of-domain" text is still helpful, and the researchers question this assumption. They posit that "mixed-domain" pretraining can be viewed as a form of transfer learning, where the source domain is general text (such as newswire and the web) and the target domain is specialized text (such as biomedical papers). Building on this, they show that pretraining a biomedical NLP model purely on domain-specific text outperforms starting from a generic language model, demonstrating that mixed-domain pretraining isn't always the right approach.

To facilitate their work, the researchers compared how modeling choices for pretraining and task-specific fine-tuning affect biomedical NLP applications. As a first step, they created a benchmark dubbed the Biomedical Language Understanding & Reasoning Benchmark (BLURB), which focuses on publications available from PubMed and covers tasks such as relation extraction, sentence similarity, and question answering, including classification-style tasks like yes/no question answering. To compute a summary score, the corpora within BLURB are grouped by task type and scored individually; the scores are averaged within each task type, and those averages are then averaged across task types.
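One reading of the summary score described above is a macro-average: average the data set scores within each task type, then average those per-type results so that a task type with many data sets doesn't dominate. The sketch below illustrates that arithmetic only. The function name, task-type labels, data set names, and scores are all hypothetical and are not taken from the paper or the leaderboard.

```python
from statistics import mean


def summary_score(scores_by_task_type):
    """Macro-average: mean within each task type, then mean across task types."""
    per_type = {
        task_type: mean(list(dataset_scores.values()))
        for task_type, dataset_scores in scores_by_task_type.items()
    }
    overall = mean(list(per_type.values()))
    return overall, per_type


# Hypothetical scores, made up purely for illustration.
example_scores = {
    "named_entity_recognition": {"dataset_a": 80.0, "dataset_b": 90.0},
    "relation_extraction": {"dataset_c": 70.0},
    "question_answering": {"dataset_d": 60.0, "dataset_e": 70.0},
}

overall, per_type = summary_score(example_scores)
print(per_type)
print(round(overall, 2))
```

Averaging all five hypothetical scores directly would give 74.0, while the macro-average gives roughly 73.33, because the single relation extraction data set carries the same weight as each multi-data-set task type.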

To evaluate their pretraining approach, the study coauthors generated a vocabulary and trained a model on the latest collection of PubMed documents: 14 million abstracts containing 3.2 billion words and totaling 21GB. Training took about five days on one Nvidia DGX-2 machine with 16 V100 graphics cards, running 62,500 steps at a batch size that kept the computation comparable to that used in previous biomedical pretraining experiments. (Here, "batch size" refers to the number of training examples used in one iteration.)
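To make the relationship between steps and batch size concrete: the number of training examples a model processes is the step count multiplied by the batch size. The sketch below shows that arithmetic. The step count comes from the figures above, but the batch size is an assumed value chosen for illustration, since it isn't stated here.

```python
def examples_processed(steps: int, batch_size: int) -> int:
    """Total training examples seen = iterations x examples per iteration."""
    return steps * batch_size


STEPS = 62_500              # step count reported for the training run
ASSUMED_BATCH_SIZE = 8_192  # assumption for illustration, not stated in this article

total = examples_processed(STEPS, ASSUMED_BATCH_SIZE)
print(f"{total:,} training examples processed")
```

Under that assumed batch size, the run would process 512 million training examples. Halving the batch size while doubling the step count would leave the total unchanged, which is why the two figures are usually compared together.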

Compared with biomedical baseline models, the researchers say their model -- PubMedBERT, which is built atop Google's BERT -- "consistently" outperforms the others on most biomedical NLP tasks. Interestingly, adding the full text of articles from PubMed to the pretraining text (16.8 billion words) led to a slight degradation in performance, which the researchers partly attribute to noise in the data; the degradation disappeared once the pretraining time was extended.

"In this paper, we challenge a prevailing assumption in pretraining neural language models and show that domain-specific pretraining from scratch can significantly outperform mixed-domain pretraining such as continual pretraining from a general-domain language model, leading to new state-of-the-art results for a wide range of biomedical NLP applications," the researchers wrote. "Future directions include: further exploration of domain-specific pretraining strategies; incorporating more tasks in biomedical NLP; extension of the BLURB benchmark to clinical and other high-value domains."

To encourage research in biomedical NLP, the researchers created a leaderboard featuring the BLURB benchmark. They've also released their pretrained and task-specific models as open source.
