Paper Review: SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Visual Language Encoder with Improved Semantic Understanding, Localization, and Dense Features!

Andrew Lukyanenko

Paper

Project

SigLIP 2 is a new family of multilingual vision-language encoders that improve upon the original SigLIP by adding caption-based pretraining, self-supervised learning (self-distillation, masked prediction), and online data curation. As a result, SigLIP 2 models achieve superior performance in zero-shot classification, image-text retrieval, and visual representation extraction for VLMs. They also show significant gains in localization and dense prediction tasks and support multiple resolutions while preserving aspect ratios.

SigLIP 2 is released in four model sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).

The approach

Architecture, training data, optimizer

SigLIP 2 keeps the architecture of the original SigLIP, so users can easily swap in the new encoder weights. It uses the ViT architecture with learned positional embeddings; the image and text encoders are identically sized, except for the largest vision model, which is paired with a So400m-sized text encoder. Representations are pooled using an attention-based MAP head. Text inputs are tokenized with the multilingual Gemma tokenizer (256k vocabulary) and capped at 64 tokens.
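
For intuition, here is a minimal PyTorch sketch of an attention-based pooling (MAP) head: a single learned probe token cross-attends over the un-pooled encoder tokens, and the result is used as the pooled representation. The class name and hyperparameters are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MAPHead(nn.Module):
    """Attention-based pooling: one learned query attends over all tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.probe = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # learned query token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) un-pooled encoder outputs
        batch = tokens.shape[0]
        query = self.probe.expand(batch, -1, -1)
        pooled, _ = self.attn(query, tokens, tokens)   # probe cross-attends to tokens
        pooled = pooled + self.mlp(self.norm(pooled))  # small residual MLP
        return pooled.squeeze(1)                       # (batch, dim)
```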

The WebLI dataset (10 billion images and 12 billion alt-texts across 109 languages) is used for training. The training data mix is 90% English and 10% non-English. The model is trained on 2048 TPUv5e chips with a fully-sharded data-parallel strategy.

Training with Sigmoid loss and decoder

SigLIP 2 combines the SigLIP and LocCa losses during pretraining. Unlike CLIP's softmax-based contrastive loss, the sigmoid loss treats every image-text pair as an independent binary classification problem (matching or not), applying a logistic loss to the pairwise similarities.
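
As a rough sketch, assuming L2-normalized embeddings and learnable scalars `t` (temperature) and `b` (bias), the sigmoid image-text loss can be written as:

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    """Sigmoid image-text loss: each (image, text) pair is an independent
    binary classification problem (positive on the diagonal, negative elsewhere)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t + b  # (n, n) pairwise similarities
    labels = 2 * torch.eye(logits.shape[0], device=logits.device) - 1  # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * logits).sum() / logits.shape[0]
```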

LocCa adds a transformer decoder with cross-attention to the un-pooled vision encoder representation. This decoder, which has fewer layers than the text encoder, trains on three tasks: image captioning, referring expression prediction, and grounded captioning. Region-caption pairs are automatically labeled using n-gram extraction and open-vocabulary detection.
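
A hedged sketch of the decoder wiring (the layer count, width, and task formatting here are placeholders, not LocCa's exact configuration): a causal transformer decoder cross-attends to the un-pooled vision tokens and predicts the target text for whichever of the three tasks the example encodes.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Toy decoder with cross-attention to the un-pooled vision tokens.
    The task (captioning, referring expression, grounded captioning) is
    encoded in the target text; this sketch only shows the wiring."""

    def __init__(self, vocab_size: int, dim: int = 768, layers: int = 6, heads: int = 12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, text_ids: torch.Tensor, vision_tokens: torch.Tensor) -> torch.Tensor:
        # text_ids: (batch, txt_len); vision_tokens: (batch, img_len, dim), same width as dim
        txt = self.embed(text_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(txt.shape[1]).to(txt.device)
        hidden = self.decoder(txt, vision_tokens, tgt_mask=causal)
        return self.lm_head(hidden)  # next-token logits for the captioning-style losses
```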

Training with self-distillation and masked prediction

In the local-to-global consistency loss, inspired by SILC, the vision encoder acts as the student network, processing local (partial) crops of the image and learning to match the full-image representation generated by a teacher network. The teacher’s parameters are updated as an Exponential Moving Average (EMA) of the student’s past parameters. The authors use one teacher with eight students.
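
The EMA teacher update itself is simple to sketch; the momentum value below is illustrative (DINO-style methods typically use values close to 1), not the paper's setting.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.996):
    """Teacher parameters follow an exponential moving average of the student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```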

In the masked prediction loss, based on TIPS, 50% of embedded image patches in the student model are replaced with mask tokens. The student is then trained to match the teacher’s features at the masked locations. Unlike the first loss, which focuses on the full image representation, this loss applies to individual per-patch features. Both teacher and student see the same global image.
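
A small sketch of the masking step, assuming a learned `mask_token` vector and a 50% masking ratio; the student then encodes the masked sequence and is trained to match the teacher's per-patch features at the masked positions.

```python
import torch

def mask_patches(patch_emb: torch.Tensor, mask_token: torch.Tensor, ratio: float = 0.5):
    """Replace a random subset of patch embeddings with a learned mask token.
    Returns the masked sequence and the boolean mask (True = replaced)."""
    batch, seq_len, dim = patch_emb.shape
    mask = torch.rand(batch, seq_len, device=patch_emb.device) < ratio
    masked = torch.where(mask.unsqueeze(-1), mask_token.view(1, 1, dim), patch_emb)
    return masked, mask

# The student encodes `masked`; the loss compares its features to the teacher's
# per-patch features only at positions where `mask` is True.
```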

These additional losses are added at 80% of training completion, with the teacher initialized from the student model while the extra parameters (heads, mask tokens, and optimizer parameters) are initialized randomly. The original image is used for computing SigLIP and LocCa losses, while augmented views are used for the new losses to ensure image-text alignment remains unaffected.

Adaptation to different resolutions

To obtain fixed-resolution checkpoints at multiple resolutions, SigLIP 2 resumes training from the original checkpoints (sequence length 256, patch size 16) at 95% of training completion. The positional embeddings are resized to match the target sequence length. Training continues at the new resolution with all loss functions applied.
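
Resizing learned positional embeddings to a new grid is typically done with bilinear interpolation; a minimal sketch, assuming a square source grid and no class token:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_emb: torch.Tensor, new_grid: tuple[int, int]) -> torch.Tensor:
    """Bilinearly resize learned positional embeddings to a new patch grid.
    pos_emb: (old_h * old_w, dim) for a square grid with old_h == old_w."""
    old_len, dim = pos_emb.shape
    old_side = int(old_len ** 0.5)
    grid = pos_emb.reshape(1, old_side, old_side, dim).permute(0, 3, 1, 2)  # (1, dim, h, w)
    grid = F.interpolate(grid, size=new_grid, mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(new_grid[0] * new_grid[1], dim)
```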

NaFlex extends the ideas of FlexiViT and NaViT to allow a single ViT model to support multiple predefined sequence lengths while also processing images at their native aspect ratio. This minimizes aspect ratio distortion, which is particularly useful for tasks like OCR and document image processing.

NaFlex resizes images so that their dimensions remain multiples of the patch size. The resized image is then split into patches, with patch coordinates and padding information added if the sequence length is smaller than the target length. Positional embeddings are bilinearly resized with anti-aliasing to match the non-square patch grid of the resized input.
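
An illustrative heuristic for picking the target resolution (not the exact NaFlex rule): scale the image so the patch count fits the sequence-length budget, then round each side down to a multiple of the patch size.

```python
import math

def naflex_target_size(height: int, width: int, patch: int = 16, max_patches: int = 256):
    """Pick a target (h, w) that roughly preserves the aspect ratio, is a
    multiple of the patch size, and yields at most `max_patches` patches."""
    # Scale so the total number of patches fits the budget.
    scale = math.sqrt(max_patches * patch * patch / (height * width))
    h = max(patch, int(height * scale) // patch * patch)
    w = max(patch, int(width * scale) // patch * patch)
    # Shrink further if rounding pushed us over the budget.
    while (h // patch) * (w // patch) > max_patches:
        if h >= w:
            h -= patch
        else:
            w -= patch
    return h, w

# Example: a 480x640 image with patch size 16 and a 256-token budget maps to
# roughly 208x288 pixels, i.e. 13x18 = 234 patches; the remaining positions
# up to the target sequence length are padding tokens.
```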

NaFlex training starts from the default SigLIP 2 checkpoints, which were initially trained with non-aspect preserving resizing to 256px (sequence length 256). At 90% of training completion, it switches to aspect-preserving resizing and uniformly samples sequence lengths from 128, 256, 576, 784, 1024.

To keep complexity manageable, self-distillation and masked prediction losses are not applied during this training.

Distillation via active data curation

To enhance the performance of the smallest fixed-resolution models, SigLIP 2 applies knowledge distillation during a short fine-tuning stage (4B examples, using only the sigmoid image-text loss).

The authors use the ACID method for implicit “distillation through data.” At each training step, the teacher model and the current learner model score the examples of a larger super-batch by their “learnability,” and the most informative examples are selected to form the training batch. However, instead of two teachers, a single strong teacher model is first fine-tuned on 1B examples from a curated high-quality dataset. This fine-tuned teacher, which blends diverse pretraining knowledge with high-quality curated data, is then used as the scoring model in ACID. This results in implicit knowledge transfer, achieving results comparable to ACED without explicit softmax distillation.
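
A simplified sketch of the selection step, assuming per-example losses have already been computed for a super-batch with both models; the exact learnability score and selection procedure follow the ACID paper and may differ in detail.

```python
import torch

def select_learnable_batch(learner_loss: torch.Tensor,
                           teacher_loss: torch.Tensor,
                           batch_size: int) -> torch.Tensor:
    """Score each super-batch example by 'learnability' (hard for the learner,
    easy for the teacher) and keep the indices of the top-scoring subset."""
    learnability = learner_loss - teacher_loss  # per-example scores
    return torch.topk(learnability, batch_size).indices

# Usage sketch: compute per-example sigmoid losses for a super-batch with both
# models, keep the indices returned here, and train the learner on that subset.
```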

Experiments

SigLIP 2 outperforms SigLIP and other open-weight baselines in zero-shot classification and image-text retrieval, despite supporting many languages. The gains in retrieval recall are especially large for the smaller models, thanks to the distillation stage.

The NaFlex variant excels on OCR/document-based retrieval tasks, but, for natural image benchmarks, the standard B-sized model outperforms NaFlex, likely due to its distillation step.

SigLIP 2 is evaluated for visual representation extraction in VLMs by integrating it with the Gemma 2 2B LLM and training on 50M multimodal examples. Results show that SigLIP 2 outperforms SigLIP across all resolutions and model sizes.

SigLIP 2 demonstrates strong performance across dense prediction and localization tasks:

  • In semantic segmentation, depth estimation, and surface normal estimation, SigLIP 2, evaluated with a linear layer or DPT decoder, outperforms previous CLIP-style vision encoders, including SigLIP, by a significant margin.
  • In open-vocabulary segmentation, SigLIP 2 surpasses SigLIP and even the larger OpenCLIP G/14 model.
  • In referring expression comprehension, SigLIP 2 outperforms SigLIP, CLIP, and image-captioning pretraining models. However, it is outperformed by LocCa, most likely due to SigLIP 2’s multilingual pretraining versus LocCa’s English-only data.
  • In open-vocabulary detection, SigLIP 2 improves upon SigLIP with the most significant gains in LVIS rare categories. It also achieves better results than OWL-ViT, most likely due to using SigLIP instead of CLIP.
