If you want to quickly deploy LLaVA, try the Hugging Face versions:
1. LLaVA (Initial Release)
- GitHub: LLaVA
- Paper: Visual Instruction Tuning (NeurIPS 2023) → arXiv
- Description: The first version of LLaVA introduced vision-language alignment using large language models (LLMs) and an image encoder.
2. LLaVA-1.5
- Paper: Improved Baselines with Visual Instruction Tuning (CVPR 2024) → arXiv
- Improvements: Trained on a wider range of visual instruction datasets for better generalization.
- Available Models:
- llava-hf/llava-1.5-7b-hf
- llava-hf/llava-1.5-13b-hf
3. LLaVA-NeXT (LLaVA-1.6)
- Updates:
- Higher image resolution
- Expanded reasoning and OCR datasets
- More architecture variations
- Available Model Variants:
- llava-hf/llava-v1.6-mistral-7b-hf
- llava-hf/llava-v1.6-vicuna-7b-hf
- llava-hf/llava-v1.6-vicuna-13b-hf
- llava-hf/llava-v1.6-34b-hf
- llava-hf/llama3-llava-next-8b-hf
- llava-hf/llava-next-72b-hf
- llava-hf/llava-next-110b-hf
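To quickly try one of the checkpoints above, a minimal inference sketch with the Hugging Face transformers library looks like the following (class names and the prompt template follow the llava-hf model cards; verify them against your installed transformers version):
```python
# Minimal LLaVA-1.5 inference sketch using Hugging Face transformers.
# Requires a recent transformers release that includes the LLaVA classes.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("your_image.jpg")  # any local image
# LLaVA-1.5 uses a Vicuna-style prompt; <image> marks where the visual tokens go.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```
The LLaVA-NeXT checkpoints use separate classes and prompt formats; a sketch for those appears further down.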
---------------------------------------------------------------------
If you want to keep exploring how LLaVA evolved, read on.
From LLaVA to LLaVA‑1.5
Typical LVLM Architecture: Vision Encoder + Connector + Language Decoder
LLaVA follows a common architecture for large vision-language models (LVLMs): a pre-trained visual encoder processes images, a connector aligns visual features to text embeddings, and a pre-trained language decoder generates text responses.
• Visual Encoder: CLIP ViT‑L/14
- Developed by OpenAI, CLIP (Contrastive Language–Image Pre-training) employs contrastive learning to align images and text within a shared embedding space.
- Trained on a vast dataset of 400 million image-text pairs, enabling robust open-set recognition capabilities.
- Processes images by dividing them into non-overlapping patches, each measuring 14×14 pixels, resulting in a total of 256 patches for a 224×224 pixel input image.
• Language Decoder: Vicuña‑7B/13B
- Fine-tuned from the LLaMA base model using 70K user-shared ChatGPT conversations.
- Reportedly achieves over 90% of ChatGPT’s quality.
- The overall size of LLaVA is primarily determined by this language model (i.e., ~7B or ~13B parameters).
High-quality data for post-training is important!
• Connector: A Learnable Layer Aligning Vision and Text
- Converts the visual encoder’s feature dimension (e.g., 1024 for CLIP ViT‑L/14) to match the language model’s token embedding dimension (e.g., 4096 for Vicuña‑7B).
- Initially a linear projection in LLaVA; later improved to an MLP for better alignment in LLaVA‑1.5.
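To make the connector concrete, here is a small PyTorch sketch of both variants, assuming CLIP ViT‑L/14 patch features (1024-dim) and Vicuña‑7B token embeddings (4096-dim); the real LLaVA code organizes this differently, so treat it as a schematic:
```python
import torch
import torch.nn as nn

VISION_DIM = 1024  # CLIP ViT-L/14 patch feature width
TEXT_DIM = 4096    # Vicuna-7B token embedding width

# LLaVA: a single linear projection from vision space into the text embedding space.
linear_connector = nn.Linear(VISION_DIM, TEXT_DIM)

# LLaVA-1.5: a 2-layer MLP with a GELU non-linearity for better alignment.
mlp_connector = nn.Sequential(
    nn.Linear(VISION_DIM, TEXT_DIM),
    nn.GELU(),
    nn.Linear(TEXT_DIM, TEXT_DIM),
)

patch_features = torch.randn(1, 256, VISION_DIM)  # one 224x224 image -> 256 patch tokens
visual_tokens = mlp_connector(patch_features)     # (1, 256, 4096): ready to be concatenated
print(visual_tokens.shape)                        # with the text token embeddings
```
Either way, the connector is a tiny module compared to the two pre-trained towers it bridges.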
How LLaVA Is Trained
LLaVA introduced a new way to create multimodal instruction-following data—one of the key innovations for its success.
1. Data Generation with GPT‑4
- Prompt text-only GPT‑4 to simulate “image-based” instructions.
- Provide GPT‑4 with detailed text descriptions, bounding boxes, and sample Q&A pairs.
- Manually add a few examples to guide GPT‑4, which then generates ~158K instruction-answer pairs.
2. Two-Stage Training (with Supervised Fine-Tuning; see the parameter-freezing sketch after this list)
- Stage 1: Feature Alignment (Connector Pre-training)
- Use ~595K image-text pairs (e.g., from Conceptual Captions) to train the connector so that CLIP’s outputs align properly with Vicuña’s text embeddings.
- This step typically completes in about 4 hours on 8× A100 GPUs.
- Stage 2: Instruction Fine-Tuning
- Fine-tune the entire model on 158K multimodal Q&A data plus the ScienceQA dataset.
- Instruction fine-tuning takes around 10 hours, while the ScienceQA step takes ~4 hours.
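The two-stage recipe above boils down to which parameters are trainable at each stage. A minimal sketch, with placeholder module names rather than the actual LLaVA code:
```python
import torch.nn as nn

# Tiny placeholder modules standing in for the real components (illustration only).
vision_encoder = nn.Linear(8, 8)  # stands in for CLIP ViT-L/14
connector = nn.Linear(8, 8)       # stands in for the projection layer(s) above
language_model = nn.Linear(8, 8)  # stands in for Vicuna-7B/13B

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1 - feature alignment on ~595K image-text pairs:
# only the connector is trained; both pre-trained towers stay frozen.
set_trainable(vision_encoder, False)
set_trainable(language_model, False)
set_trainable(connector, True)

# Stage 2 - instruction fine-tuning on the 158K GPT-4-generated samples (+ ScienceQA):
# the connector and the language model are trained; the vision encoder stays frozen.
set_trainable(vision_encoder, False)
set_trainable(language_model, True)
set_trainable(connector, True)
```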
Resource Efficiency
LLaVA’s impressive results come from relatively modest resources and data sizes. This highlights the power of high-quality instruction datasets for quickly boosting model performance.
Transition to LLaVA‑1.5
Building on LLaVA’s success, LLaVA‑1.5 introduced multiple enhancements to both data and model structure:
Data Updates
- Increased variety in training prompts (more VQA datasets, better prompt formatting).
- Included Optical Character Recognition (OCR) data to handle text in images.
Resolution Scaling
- Boosted input resolution from 224×224 up to 336×336, allowing the model to capture more visual detail (a quick token count appears at the end of this section).
Connector Upgrade
- Moved from a simple linear projection to a more expressive MLP, improving alignment between CLIP and Vicuña.
Data Efficiency
- A notable finding: randomly downsampling LLaVA-1.5’s training mixture by up to 75% does not significantly reduce performance. This again suggests that data quality matters more than sheer quantity for LLM/LVLM post-training.
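To put the resolution scaling above in concrete terms, here is a quick token count, assuming the 14×14 patch size of CLIP ViT‑L/14:
```python
# Back-of-the-envelope: visual tokens produced by a ViT with 14x14 patches
# (patch tokens only, ignoring any special tokens).
def num_visual_tokens(height: int, width: int, patch: int = 14) -> int:
    return (height // patch) * (width // patch)

print(num_visual_tokens(224, 224))  # 256 tokens: original LLaVA input size
print(num_visual_tokens(336, 336))  # 576 tokens: LLaVA-1.5 input size
```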
---------------------------------------------------------------------
LLaVA‑NEXT: Key updates
Higher Input Resolution:
- LLaVA-NeXT increases the input image resolution to 4× the pixel count, supporting multiple aspect-ratio configurations (up to 672×672, 336×1344, and 1344×336). This enhancement allows the model to capture finer visual details.
Enhanced Visual Reasoning & OCR:
- With an improved visual instruction tuning data mixture, LLaVA-NeXT delivers superior reasoning and OCR capabilities—vital for more accurate and robust multimodal understanding.
Expanded Capabilities:
- The model exhibits better visual conversation skills, broader world knowledge, and improved logical reasoning, making it effective across a wider range of applications.
Efficient Deployment:
- Despite its advanced features, LLaVA-NeXT maintains the minimalist design and data efficiency of LLaVA‑1.5. For instance, the largest 34B variant completes training in about 1 day on 32 A100 GPUs, demonstrating a highly cost-effective training process.
For more details, please refer to the LLaVA-NeXT blog.
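To try one of the LLaVA-NeXT checkpoints listed earlier, the loading pattern is similar to LLaVA-1.5 but uses the dedicated LLaVA-NeXT classes in transformers; the prompt template below is for the Mistral-based variant and should be checked against the corresponding model card:
```python
# LLaVA-NeXT (LLaVA-1.6) inference sketch; requires a transformers version
# that ships the LLaVA-NeXT classes.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("your_image.jpg")  # high-resolution inputs are tiled internally
prompt = "[INST] <image>\nDescribe this image in detail. [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```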
Key points on LLaVA
Architecture:
- A classic vision-language setup combining a CLIP-based visual encoder and a language decoder, bridged by a connector.
High-Quality Instruction Data:
- Uses novel GPT‑4-generated, multimodal instruction data, emphasizing the importance of quality over quantity.
Two-Stage Training:
- Involves connector pre-training on ~595K image-text pairs followed by supervised fine-tuning on 158K multimodal Q&A samples (plus ScienceQA).