Developer Blog: Efficient BERT: Finding Your Optimal Model with Multimetric Bayesian Optimization

This is the first post in a series about distilling BERT with multimetric Bayesian optimization. Part 2 discusses the setup for the Bayesian optimization experiment, and Part 3 discusses the results.

You’ve all heard of BERT: Ernie’s partner in crime. Just kidding! I mean the natural language processing (NLP) architecture developed by Google in 2018. That’s much less exciting, I know. However, much like the beloved Sesame Street character who helps children learn the alphabet, BERT helps models learn language. Based on Vaswani et al.’s Transformer architecture, BERT leverages Transformer blocks to create a malleable architecture suitable for transfer learning.

Before BERT, each core NLP task (language generation, language understanding, neural machine translation, entity recognition, and so on) had its own architecture and corpora for training a high-performing model. With the introduction of BERT, there was suddenly a strong-performing, generalizable model that could be transferred to a variety of tasks. Essentially, BERT allows a variety of problems to share off-the-shelf pretrained models and moves NLP closer to standardization, much as ResNet did for computer vision. For more information, see Why are Transformers important?, Sebastian Ruder’s excellent analysis The State of Transfer Learning in NLP, or BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (PDF).
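To make the off-the-shelf transfer idea concrete, here is a minimal sketch (not from the original post) that loads a pretrained BERT checkpoint from the Hugging Face transformers library and attaches a fresh task-specific classification head; the checkpoint name and the two-label setup are illustrative assumptions.

```python
# Minimal transfer-learning sketch (illustrative, not from the original post).
# Assumes the Hugging Face transformers library is installed.
from transformers import BertTokenizer, BertForSequenceClassification

# Off-the-shelf pretrained encoder plus a freshly initialized 2-class head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One forward pass; fine-tuning on a downstream task would update these weights.
inputs = tokenizer("BERT transfers to many NLP tasks.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```

The same pretrained encoder can be reused across tasks by swapping the head, which is what makes BERT behave like a shared starting point rather than a task-specific architecture.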

But BERT is really, really large. BERT-Base has 110M parameters and BERT-Large has 340M, compared to roughly 94M for the original ELMo model. This makes BERT costly to train, too complex for many production systems, and too large for federated learning and edge computing.
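As a quick sanity check on those numbers, the parameter count of a checkpoint can be computed directly; this is a small illustrative snippet (not part of the original post) that assumes the Hugging Face transformers library is available.

```python
# Count the parameters of a pretrained BERT-Base checkpoint (illustrative sketch).
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
n_params = sum(p.numel() for p in model.parameters())
print(f"BERT-Base parameters: {n_params / 1e6:.1f}M")  # roughly 110M
```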

To address this challenge, many teams have compressed BERT to a manageable size, including Hugging Face’s DistilBERT, Rasa’s pruning technique for BERT, Utterworks’ fast-bert, and many more. These works focus on shrinking BERT for language understanding while retaining model performance.
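For a rough sense of how much a distilled model shrinks the footprint, here is a hedged comparison of BERT-Base against DistilBERT using the same parameter-counting approach as above; the checkpoint names are the standard Hugging Face ones and exact counts may vary slightly by release.

```python
# Compare BERT-Base to DistilBERT by parameter count (illustrative sketch;
# assumes the Hugging Face transformers library and access to the model hub).
from transformers import BertModel, DistilBertModel

def count_params(model) -> int:
    return sum(p.numel() for p in model.parameters())

bert = BertModel.from_pretrained("bert-base-uncased")
distilbert = DistilBertModel.from_pretrained("distilbert-base-uncased")

print(f"BERT-Base:  {count_params(bert) / 1e6:.1f}M parameters")
print(f"DistilBERT: {count_params(distilbert) / 1e6:.1f}M parameters")
# DistilBERT trims roughly 40% of the parameters while keeping most of BERT's accuracy.
```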

Read the full post on the NVIDIA Developer Blog.