LLaVA-OneVision (GitHub Link) is a new LLaVA-based multimodal large language model, developed independently from the original LLaVA team. If you're unfamiliar with LLaVA, consider checking the previous post on its basics. In this article, we'll explore LLaVA-OneVision and how it expands vision-language models to work across single images, multiple images, and videos.
* SigLIP vs. CLIP
- CLIP (from OpenAI) is trained with a contrastive objective that applies a softmax over all image-text pairings in a batch.
- SigLIP (from Google) follows the same contrastive idea but replaces the batch-wide softmax with an independent sigmoid (binary) loss for each image-text pair, which can improve performance in some open-set tasks.
Both SigLIP and CLIP are strong vision encoders. LLaVA-OneVision uses SigLIP to encode images (or video frames), then passes the outputs through a simple projection layer and a large language model.
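To make the difference in training objectives concrete, here is a minimal sketch of the two losses. It is not taken from either codebase, and the temperature/bias values and tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

# img_emb, txt_emb: L2-normalized embeddings of matched image-text pairs, shape (B, D).

def clip_softmax_loss(img_emb, txt_emb, temp=0.07):
    # CLIP: cross-entropy (softmax) over every pairing in the batch,
    # in both image-to-text and text-to-image directions.
    logits = img_emb @ txt_emb.T / temp              # (B, B) similarity matrix
    targets = torch.arange(len(img_emb))             # true pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

def siglip_sigmoid_loss(img_emb, txt_emb, scale=10.0, bias=-10.0):
    # SigLIP: each image-text pair is scored independently with a sigmoid;
    # labels are +1 on the diagonal (matches) and -1 elsewhere.
    logits = img_emb @ txt_emb.T * scale + bias
    labels = 2 * torch.eye(len(img_emb)) - 1
    return -F.logsigmoid(labels * logits).mean()
```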
“Higher AnyRes” Strategy
Key Idea: handle high-resolution images and unusual aspect ratios (Fig.2)
- Split each image (or video frame) into multiple crops at a chosen resolution.
- Encode each crop into visual tokens using the SigLIP encoder.
- (New Step): If the total token count is too large (especially for high-resolution images), apply bilinear interpolation to reduce tokens per crop.
(Bilinear Interpolation: a method for resizing a 2D grid by taking weighted averages of neighboring values. Here it is applied to each crop's grid of visual features, shrinking the number of tokens per crop so the total token count does not grow too large.)
In Figure 2,
- (a) shows the Higher AnyRes strategy with bilinear interpolation.
- (b) is the original AnyRes approach (without interpolation), which can produce more tokens than desired for high-resolution images.
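A minimal sketch of this token-reduction step is shown below, assuming the encoder returns a square grid of visual tokens per crop; the function name, token budget, and shapes are illustrative rather than the official implementation.

```python
import math
import torch
import torch.nn.functional as F

def reduce_tokens(crop_features, max_total_tokens=7290):
    # crop_features: (num_crops, tokens_per_crop, dim) visual tokens from the encoder.
    num_crops, num_tokens, dim = crop_features.shape
    if num_crops * num_tokens <= max_total_tokens:
        return crop_features                          # already within budget

    # Target tokens per crop so the overall count stays under the budget.
    side = int(math.sqrt(num_tokens))                 # e.g. 27 for 729 tokens
    new_side = max(1, int(math.sqrt(max_total_tokens // num_crops)))

    # Fold the tokens back into a 2D grid, shrink it with bilinear
    # interpolation, then flatten it back into a token sequence.
    grid = crop_features.permute(0, 2, 1).reshape(num_crops, dim, side, side)
    grid = F.interpolate(grid, size=(new_side, new_side),
                         mode="bilinear", align_corners=False)
    return grid.reshape(num_crops, dim, -1).permute(0, 2, 1)
```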
Scenarios: (Fig.3)
1. Single Image
- Typically produces 729 visual tokens (a 27×27 patch grid) at a standard resolution (e.g., 384×384).
2. Multi-Image
- Each image is encoded into about 729 tokens, so more images mean more tokens.
3. Video
- Each frame is treated like a single image, but you can reduce tokens per frame if you have many frames. This keeps the total token count from getting too large and helps the model handle longer videos.
By balancing the total number of tokens across these scenarios, LLaVA-OneVision can transfer knowledge effectively between single-image, multi-image, and video tasks. All of this is done while keeping the minimalist design (SigLIP + projection + LLM) that comes from the original LLaVA approach.
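As a rough back-of-the-envelope illustration of this balancing, the snippet below compares token totals in the three scenarios; the crop, image, and frame counts (and the reduced per-frame token count) are illustrative assumptions, not the paper's exact configuration.

```python
TOKENS_PER_VIEW = 729                         # 27x27 patch grid at 384x384

# 1. Single image with Higher AnyRes: one base view plus a few high-res crops.
single_image = TOKENS_PER_VIEW * (1 + 4)      # e.g. 4 crops  -> 3645 tokens

# 2. Multi-image: every image keeps its full token grid.
multi_image = TOKENS_PER_VIEW * 8             # e.g. 8 images -> 5832 tokens

# 3. Video: many frames, so tokens per frame are reduced (e.g. interpolated
#    down to a 14x14 grid) to keep the total in the same ballpark.
video = 14 * 14 * 32                          # e.g. 32 frames -> 6272 tokens

print(single_image, multi_image, video)
```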
Projection Layer Training Strategy
In LLaVA-OneVision, the projection layer is a simple two-layer MLP that converts visual features from the SigLIP encoder into tokens that the language model can understand. Unlike previous LLaVA versions, where the focus was primarily on single-image tasks, the projection layer here is trained with a new strategy that prepares it for a broader range of scenarios.
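Such a two-layer MLP projector can be sketched as follows; the layer widths are placeholders rather than the exact model dimensions.

```python
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim=1152, llm_dim=4096):
        super().__init__()
        # Linear -> GELU -> Linear: maps each visual token into the
        # language model's embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens):         # (batch, num_tokens, vision_dim)
        return self.proj(visual_tokens)       # (batch, num_tokens, llm_dim)
```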
Stage-1 (Language-Image Alignment):
- The projection layer is initially trained exclusively using image-text pairs. This step ensures that visual features are properly aligned with the language model’s embedding space. The focus here is solely on achieving a robust mapping from vision to language.
Stage-1.5 (High-Quality Knowledge Learning):
- Next, high-quality data is injected to further refine the projection layer’s performance. This stage leverages carefully curated instruction data, allowing the layer to capture more nuanced and diverse visual representations.
Stage-2 (Visual Instruction Tuning):
- Finally, the projection layer, along with the rest of the model, is fine-tuned on a mixture of single-image, multi-image, and video data. This step adapts the projection layer to handle different visual scenarios, ensuring smooth task transfer across modalities.
The new strategy makes the projection layer more robust and flexible, ultimately enhancing the model’s ability to transfer knowledge across single-image, multi-image, and video tasks.
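One way to express the stage-wise schedule is to toggle which modules receive gradients. The sketch below assumes a model object exposing vision_encoder, projector, and llm attributes; treating Stage-1.5 like Stage-2 (full-model training) is an assumption on my part, since the text above only spells out what is trained in Stage-1 and Stage-2.

```python
def set_trainable(model, stage):
    # Freeze everything first.
    for p in model.parameters():
        p.requires_grad = False

    if stage == "1":                                  # language-image alignment
        modules = [model.projector]                   # train the projector only
    else:                                             # "1.5" and "2": full model
        modules = [model.vision_encoder, model.projector, model.llm]

    for m in modules:
        for p in m.parameters():
            p.requires_grad = True
```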
* High-Quality Data Generation in Three Steps
- Automated Generation: Use strong pre-trained models (e.g., GPT-4) to automatically produce detailed descriptions and instructions from images.
- Manual Curation: Experts carefully filter and refine the generated data to ensure accuracy and diversity.
- Diverse Data Sources: Aggregate data from various sources such as OCR, chart analysis, and synthetic datasets (including for Chinese tasks) to cover a broad range of visual scenarios.
Key points on LLaVA-OneVision
Unified Representation:
- Uses the “Higher AnyRes” strategy to split images and video frames into balanced visual tokens (with bilinear interpolation when needed).
Multi-Modal Training:
- Employs a three-stage training pipeline: language-image alignment, high-quality knowledge learning, and joint fine-tuning on single-image, multi-image, and video data.
Minimalist Design:
- Combines a SigLIP encoder, a simple projection layer, and a powerful language model for efficient multi-modal learning.