
Paper Review: Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

ModernBERT: modern design, better training, and more data!

Andrew Lukyanenko

Paper

Code

Weights

Blogpost

ModernBERT introduces modern optimizations to BERT: trained on 2 trillion tokens with an 8,192-token sequence length, it delivers state-of-the-art results across diverse classification and retrieval tasks, including code-related domains. ModernBERT is also the most speed- and memory-efficient encoder, making it well suited for inference on common GPUs.

The approach

The authors update BERT with modern approaches, including better architecture design and training techniques, as well as using more data.

Architectural Improvements

  • Bias terms are disabled in all linear layers except the final decoder layer and are also removed from LayerNorms to allocate more parameters to linear layers.
  • The authors use rotary positional embeddings instead of absolute positional embeddings for superior performance in both short- and long-context models.
  • A pre-normalization block with LayerNorm is used for training stability, including an additional LayerNorm after the embedding layer while removing the redundant one in the first attention layer.
  • The GeGLU activation function is used as an improvement over GeLU (see the sketch below).
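
To make these choices more concrete, here is a minimal sketch of a pre-norm feed-forward block with bias-free LayerNorm and linear layers and a GeGLU activation. This is my own PyTorch illustration, not the authors' code; the class and dimension names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    """Pre-norm feed-forward block: bias-free LayerNorm and linear layers, GeGLU activation."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # bias=False in nn.LayerNorm requires PyTorch >= 2.1
        self.norm = nn.LayerNorm(d_model, bias=False)
        self.wi = nn.Linear(d_model, 2 * d_hidden, bias=False)  # gate and value in one projection
        self.wo = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.wi(self.norm(x)).chunk(2, dim=-1)
        return x + self.wo(F.gelu(gate) * value)  # GeGLU: GeLU(gate) * value, plus residual
```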

Efficiency Improvements

  • ModernBERT alternates attention types: every third layer uses global attention (implemented with Flash Attention 3), while the remaining layers use local sliding-window attention (implemented with Flash Attention 2). A sketch of the layer pattern follows this list.
  • Unpadding (sequence packing) is implemented to avoid wasting compute on padding tokens by concatenating sequences into a single unpadded batch. Combined with Flash Attention, ModernBERT's unpadding achieves 10–20% performance improvements over other unpadding methods.
  • PyTorch's torch.compile is used to improve training throughput by 10% with minimal compilation overhead.
  • ModernBERT is designed as a Deep & Narrow model for better downstream performance, with dimensions chosen to run efficiently on common GPUs such as the NVIDIA T4 and RTX 3090. ModernBERT-base has 22 layers, 149M parameters, and a hidden size of 768; ModernBERT-large has 28 layers, 395M parameters, and a hidden size of 1024, maximizing efficiency across tensor cores and GPU architectures.
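
As a rough illustration of the global/local alternation (my own sketch, not the authors' implementation; the window size and layer indexing are assumptions for illustration):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask allowing each token to attend only to tokens within `window` positions."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

def attention_mask_for_layer(layer_idx: int, seq_len: int) -> torch.Tensor | None:
    """Every third layer attends globally; the remaining layers use local attention."""
    if layer_idx % 3 == 0:
        return None                                  # None -> full global attention
    return sliding_window_mask(seq_len, window=64)   # illustrative local window size

# Example: layers 0, 3, 6, ... are global; all others are local.
print([("global" if attention_mask_for_layer(i, 8) is None else "local") for i in range(6)])
```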

Training

  • ModernBERT is trained on 2 trillion tokens of primarily English text, including web documents, code, and scientific literature.
  • It uses a modern BPE tokenizer based on a modified OLMo tokenizer, which improves token efficiency and performance on code-related tasks. The tokenizer uses the same special tokens ([CLS] and [SEP]) and templating to maintain compatibility with the original BERT model while having a vocabulary size of 50,368.
  • To address minibatch-size variance caused by unpadding, a greedy sequence packing algorithm is used, achieving over 99% packing efficiency and ensuring uniform batch sizes during training (see the sketch after this list).
  • ModernBERT uses a 30% masking rate for the MLM objective and removes the Next-Sentence Prediction objective for efficiency and improved performance.
  • Weights for the larger model are initialized via tiling from the base model, as in Microsoft’s Phi family of models.
  • It uses StableAdamW for better and more stable training; a modified trapezoidal LR schedule with warmup and decay to enable continual training; the batch size is gradually increased during warmup.
  • Context Length Extension: the model is initially trained on a 1,024-token sequence length and then extended to 8,192 tokens by adjusting the RoPE theta value and training further.
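
A minimal sketch of what a greedy packing step can look like (my own simplified illustration, not the authors' implementation, which also has to keep the token count per minibatch near-uniform):

```python
from typing import List

def greedy_pack(seq_lengths: List[int], max_tokens: int = 8192) -> List[List[int]]:
    """Greedily group sequence indices into packs whose total length stays within max_tokens."""
    packs: List[List[int]] = []
    current: List[int] = []
    current_len = 0
    for i, length in enumerate(seq_lengths):
        if current and current_len + length > max_tokens:
            packs.append(current)            # close the current pack and start a new one
            current, current_len = [], 0
        current.append(i)
        current_len += length
    if current:
        packs.append(current)
    return packs

# Example: five documents packed into near-uniform 8192-token batches.
print(greedy_pack([4000, 3000, 2000, 8000, 500]))  # [[0, 1], [2], [3], [4]]
```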

Experiments

ModernBERT achieves state-of-the-art performance across various downstream tasks (short- and long-context retrieval, Natural Language Understanding, Code Understanding), demonstrating significant improvements over previous encoder models like BERT, RoBERTa, and GTE-en-MLM.

For short-context inputs, ModernBERT processes 512-token sequences faster than other recent encoders, although it is slightly slower than the original BERT and RoBERTa models due to their lower parameter counts. In long-context tasks, ModernBERT processes long documents 2.65 times faster at the BASE size and 3 times faster at the LARGE size compared to the next-fastest models.

ModernBERT-base can process batch sizes twice as large as any other model for both short and long contexts. ModernBERT-large, while slightly less memory efficient than the original BERT-large for short-context inputs, handles batches that are at least 60% larger than those of other large models.

Additional details

  • Using an Exponential Moving Average of the weights never improved performance. ModernBERT-base is the result of averaging the 3 best-performing annealing checkpoints with the final one; averaging did not yield successful results for the large model.
  • Design choices to maximize performance: attention heads — multiples of 64, embedding matrix — power of 2 or multiple of 64, weight matrix dimensions — multiples of 64, weight matrix — divisible into 128x256 blocks, number of blocks — divisible by the number of streaming multiprocessors.
  • PyTorch’s distributed random sampler returns sequentially biased samples once the number of samples is somewhere between 500 million and 1 billion, so the authors switched to NumPy’s PCG64DXSM random generator (see the sketch below).
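
For reference, a minimal sketch of drawing sample indices with NumPy's PCG64DXSM bit generator; the seed and dataset size here are placeholders, not values from the paper.

```python
import numpy as np

# Generator backed by the PCG64DXSM bit generator (available in NumPy >= 1.21).
rng = np.random.Generator(np.random.PCG64DXSM(seed=42))

num_samples = 10_000_000                     # placeholder dataset size
epoch_order = rng.permutation(num_samples)   # unbiased shuffle of sample indices
print(epoch_order[:5])
```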
