Microsoft Research Unveils New BERT-Based Biomedical NLP AI Model

To help accelerate natural language processing in biomedicine, Microsoft Research developed a BERT-based AI model that outperforms previous biomedical natural language processing (NLP) methods. The work promises to help researchers rapidly advance research in this field.

The model, built on top of Google’s BERT, can classify documents, extract medical information, detect specific named entities, and much more.

To develop the model, called PubMedBERT, the researchers compiled a comprehensive biomedical NLP benchmark, sourced from publicly available datasets.

“Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board,” the researchers stated in their paper, Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing.

The model was trained using vocabulary from the PubMed dataset, which includes over 14 million abstracts and 3.2 billion words. The researchers used an NVIDIA DGX-2 with 16 NVIDIA V100 GPUs, and NVIDIA’s NGC TensorFlow implementation.

Source: Microsoft. “Top: The prevailing mixed-domain paradigm assumes that out-domain text is still helpful and typically initializes domain-specific pretraining with a general-domain language model and inherits its vocabulary. Bottom: Domain-specific pretraining from scratch derives the vocabulary and conducts pretraining using solely in-domain text.”
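The key idea in the caption above is that PubMedBERT derives its subword vocabulary solely from in-domain biomedical text, rather than inheriting a general-domain vocabulary. As a rough illustration (PubMedBERT itself uses BERT’s WordPiece tokenizer, not the simplified byte-pair-encoding sketch below, and the toy corpus here is invented for the example), this is what learning merge rules from an in-domain corpus looks like:

```python
from collections import Counter

def learn_bpe_merges(corpus: str, num_merges: int):
    """Learn BPE-style merge rules from raw in-domain text.

    Each word starts as a sequence of characters plus an end-of-word
    marker; the most frequent adjacent symbol pair is merged repeatedly.
    """
    # Word frequencies, with words represented as symbol tuples.
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Apply the winning merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Toy "in-domain" corpus: frequent biomedical character sequences
# (e.g. "lymph") win merges early, so domain terms stay intact.
merges = learn_bpe_merges("lymphoma lymphoma lymphocyte glioma", num_merges=5)
```

Trained on general-domain text instead, the same procedure would spend its merge budget on everyday English fragments, which is why the paper argues for pretraining the vocabulary from scratch on PubMed text.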

The researchers have released their state-of-the-art pretrained and task-specific models for the community, and have also created a leaderboard featuring a comprehensive benchmark called BLURB.

Read more>