Friday, August 16, 2024

LLaVA’s Journey: From LLaVA to LLaVA-1.5 and LLaVA-NeXT (1.6)

Large Language and Vision Assistant (LLaVA) has evolved through several versions, each bringing improvements in model architecture and dataset diversity.

If you want to quickly deploy LLaVA, try the Hugging Face versions:

👉 LLaVA on Hugging Face
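
For example, here is a minimal sketch of running the llava-1.5-7b-hf checkpoint with the transformers library (assuming transformers, torch, Pillow, and requests are installed; the model ID and prompt template follow the llava-hf model card, and the COCO image URL is just a placeholder):

```python
# Minimal LLaVA-1.5 inference sketch with Hugging Face Transformers.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any test image works; this COCO URL is just a placeholder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# llava-1.5-hf checkpoints expect the "USER: <image>\n... ASSISTANT:" template.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```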

1. LLaVA (Initial Release)

  • GitHub: LLaVA
  • Paper: Visual Instruction Tuning (NeurIPS 2023) → arXiv
  • Description: The first version of LLaVA introduced vision-language alignment using large language models (LLMs) and an image encoder.

2. LLaVA-1.5

  • Paper: Improved Baselines with Visual Instruction Tuning (CVPR 2024) → arXiv
  • Improvements: Trained on a wider range of visual instruction datasets for better generalization.
  • Available Models:
    • llava-hf/llava-1.5-7b-hf 
    • llava-hf/llava-1.5-13b-hf 

3. LLaVA-NeXT (LLaVA-1.6)

  • Updates:
    • Higher image resolution
    • Expanded reasoning and OCR datasets
    • More architecture variations
  • Available Model Variants:
    • llava-hf/llava-v1.6-mistral-7b-hf
    • llava-hf/llava-v1.6-vicuna-7b-hf
    • llava-hf/llava-v1.6-vicuna-13b-hf
    • llava-hf/llava-v1.6-34b-hf
    • llava-hf/llama3-llava-next-8b-hf
    • llava-hf/llava-next-72b-hf
    • llava-hf/llava-next-110b-hf
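
Recent versions of transformers ship dedicated LlavaNextProcessor / LlavaNextForConditionalGeneration classes for these checkpoints. Below is a hedged sketch for the Mistral-7B variant; since the prompt template differs per backbone, it relies on the processor's chat template rather than a hand-written prompt string:

```python
# LLaVA-NeXT inference sketch; the chat template handles the backbone-specific prompt format.
import requests
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```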

---------------------------------------------------------------------

If you want to keep exploring how LLaVA evolved, read on.

From LLaVA to LLaVA‑1.5

Typical LVLM Architecture: Vision Encoder + Connector + Language Decoder

LLaVA follows a common architecture for large vision-language models (LVLMs): a pre-trained visual encoder processes images, a connector aligns visual features to text embeddings, and a pre-trained language decoder generates text responses.
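
A toy PyTorch sketch of this layout (not LLaVA's actual implementation; dimensions and placeholder modules are illustrative) shows how image features end up as extra "tokens" in front of the text:

```python
# Toy illustration of the LVLM data flow (not LLaVA's actual code).
import torch
import torch.nn as nn

class ToyLVLM(nn.Module):
    def __init__(self, vision_dim=1024, text_dim=4096, vocab_size=32000):
        super().__init__()
        # Stand-ins for the three parts: in LLaVA these are a frozen CLIP ViT-L/14,
        # a learnable projector, and Vicuna's embedding table + decoder stack.
        self.vision_encoder = nn.Identity()                # yields (B, num_patches, vision_dim)
        self.connector = nn.Linear(vision_dim, text_dim)   # LLaVA v1: a single linear layer
        self.embed_tokens = nn.Embedding(vocab_size, text_dim)
        self.decoder = nn.Identity()                       # placeholder for the LLM decoder

    def forward(self, patch_features, input_ids):
        image_tokens = self.connector(self.vision_encoder(patch_features))  # (B, P, text_dim)
        text_embeds = self.embed_tokens(input_ids)                          # (B, T, text_dim)
        # Image tokens are prepended to the text embeddings and decoded together.
        return self.decoder(torch.cat([image_tokens, text_embeds], dim=1))

# Example shapes: 256 CLIP patch features of dim 1024 plus a 16-token prompt.
out = ToyLVLM()(torch.randn(1, 256, 1024), torch.randint(0, 32000, (1, 16)))
print(out.shape)  # torch.Size([1, 272, 4096])
```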

Visual Encoder: CLIP ViT‑L/14

  • Developed by OpenAI, CLIP (Contrastive Language–Image Pre-training) employs contrastive learning to align images and text within a shared embedding space.

  • Trained on a vast dataset of 400 million image-text pairs, enabling robust open-set recognition capabilities.

  • Processes images by dividing them into non-overlapping patches, each measuring 14×14 pixels, resulting in a total of 256 patches for a 224×224 pixel input image.
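
A quick sanity check of the patch arithmetic (the 336×336 case becomes relevant for LLaVA-1.5 later in this post):

```python
# Patch-count check for a ViT with 14x14 patches.
def num_patches(image_size: int, patch_size: int = 14) -> int:
    return (image_size // patch_size) ** 2

print(num_patches(224))  # 256 patches -> LLaVA (CLIP ViT-L/14 at 224x224)
print(num_patches(336))  # 576 patches -> LLaVA-1.5 (CLIP ViT-L/14 at 336x336)
```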

Language Decoder: Vicuña‑7B/13B

  • Fine-tuned from the LLaMA base model using 70K user-shared ChatGPT conversations.
  • Reportedly achieves over 90% of ChatGPT’s quality.
  • The overall size of LLaVA is primarily determined by this language model (i.e., ~7B or ~13B parameters).
  • High-quality data for post-training is important!

Connector: A Learnable Layer Aligning Vision and Text

  • Projects the visual encoder’s features (dimension 1024 for CLIP ViT‑L/14) into the language model’s token embedding space (dimension 4096 for Vicuña‑7B).
  • Initially a linear projection in LLaVA; later improved to an MLP for better alignment in LLaVA‑1.5.
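
A sketch of the two connector designs, with illustrative dimensions (LLaVA-1.5's projector is a two-layer MLP with a GELU activation in between):

```python
# The two connector designs, with illustrative dimensions.
import torch.nn as nn

vision_dim, text_dim = 1024, 4096

linear_connector = nn.Linear(vision_dim, text_dim)   # LLaVA: single linear projection

mlp_connector = nn.Sequential(                       # LLaVA-1.5: two-layer MLP with GELU
    nn.Linear(vision_dim, text_dim),
    nn.GELU(),
    nn.Linear(text_dim, text_dim),
)
```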

How LLaVA Is Trained

LLaVA introduced a new way to create multimodal instruction-following data—one of the key innovations for its success.

1. Data Generation with GPT‑4

  • Prompt text-only GPT‑4 to simulate “image-based” instructions.
  • Provide GPT‑4 with detailed text descriptions, bounding boxes, and sample Q&A pairs.
  • Manually add a few examples to guide GPT‑4, which then generates ~158K instruction-answer pairs.
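
The sketch below is illustrative only: it shows how an image can be described to a text-only model via captions and bounding boxes, in the spirit of LLaVA's pipeline. The exact prompts and data schema used by the authors differ; see the paper and repository for the real templates.

```python
# Illustrative only: representing an image to a text-only LLM via captions and boxes.
def build_context(captions, boxes):
    """boxes: list of (category, [x1, y1, x2, y2]) with normalized coordinates."""
    lines = ["Captions:"] + [f"- {c}" for c in captions]
    lines.append("Objects (category, normalized box):")
    lines += [f"- {cat}: {coords}" for cat, coords in boxes]
    return "\n".join(lines)

context = build_context(
    captions=["A group of people standing around a food truck."],
    boxes=[("person", [0.12, 0.30, 0.25, 0.86]), ("truck", [0.40, 0.20, 0.95, 0.90])],
)
print(context)
# This context, plus a few hand-written example Q&A pairs (few-shot), is sent to
# text-only GPT-4, which returns new instruction-answer pairs about the "image".
```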

2. Two-Stage Training (with Supervised Fine-Tuning)

  • Stage 1: Feature Alignment (Connector Pre-training)
    • Use ~595K image-text pairs (e.g., from Conceptual Captions) to train the connector so that CLIP’s outputs align properly with Vicuña’s text embeddings.
    • This step typically completes in about 4 hours on 8× A100 GPUs.
  • Stage 2: Instruction Fine-Tuning
    • Fine-tune the connector and the language model (keeping the visual encoder frozen) on the 158K multimodal Q&A data plus the ScienceQA dataset.
    • Instruction fine-tuning takes around 10 hours, while the ScienceQA step takes ~4 hours.
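
In code terms, the two stages mainly differ in which parameters receive gradients. Here is an illustrative sketch reusing the ToyLVLM class from earlier (note that LLaVA keeps the visual encoder frozen in both stages):

```python
# Which parameters train in each stage (illustrative; reuses ToyLVLM from above).
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

model = ToyLVLM()

# Stage 1: feature alignment on ~595K image-text pairs -- only the connector learns.
set_trainable(model, False)
set_trainable(model.connector, True)

# Stage 2: instruction fine-tuning on the 158K instruction data (+ ScienceQA)
# -- the language-model parameters are unfrozen too; the vision encoder stays frozen.
set_trainable(model.embed_tokens, True)
set_trainable(model.decoder, True)
```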

Resource Efficiency

LLaVA’s impressive results come from relatively modest resources and data sizes. This highlights the power of high-quality instruction datasets for quickly boosting model performance.

Transition to LLaVA‑1.5

Building on LLaVA’s success, LLaVA‑1.5 introduced multiple enhancements to both data and model structure:

  Data Updates

  • Increased variety in training prompts (more VQA datasets, better prompt formatting).
  • Included Optical Character Recognition (OCR) data to handle text in images.

  Resolution Scaling

  • Boosted input resolution from 224×224 up to 336×336, allowing the model to capture more visual detail.

  Connector Upgrade

  • Moved from a simple linear projection to a more expressive MLP, improving alignment between CLIP and Vicuña.

  Data Efficiency

  • A notable finding: randomly downsampling LLaVA’s training mixture by up to 75% does not significantly reduce performance, which again suggests that high-quality data matters more than sheer quantity for LLM/LVLM post-training.

---------------------------------------------------------------------

LLaVA‑NeXT: Key updates

  Higher Input Resolution:

  • LLaVA-NeXT increases the input image resolution to 4× the pixel count of LLaVA-1.5, supporting multiple aspect ratios (up to 672×672, 336×1344, and 1344×336). This allows the model to capture finer visual details; see the tiling sketch below.
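
The sketch below illustrates the tiling idea only (the actual LLaVA-NeXT preprocessing selects the best-matching grid from a set of allowed resolutions and also processes a resized global view of the image):

```python
# Illustrative tiling sketch (not the exact LLaVA-NeXT preprocessing).
from PIL import Image

TILE = 336
GRIDS = [(2, 2), (1, 4), (4, 1)]   # ~672x672, 336x1344, 1344x336 (cols, rows)

def split_into_tiles(image, grid):
    cols, rows = grid
    resized = image.resize((cols * TILE, rows * TILE))
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
            tiles.append(resized.crop(box))
    return tiles

# A wide image maps naturally to the 4x1 grid; each tile (plus a 336x336 global
# view of the whole image) is encoded by CLIP as in LLaVA-1.5.
tiles = split_into_tiles(Image.new("RGB", (1344, 420)), GRIDS[2])
print(len(tiles))  # 4
```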

  Enhanced Visual Reasoning & OCR:

  • With an improved visual instruction tuning data mixture, LLaVA-NeXT delivers superior reasoning and OCR capabilities—vital for more accurate and robust multimodal understanding.

  Expanded Capabilities:

  • The model exhibits better visual conversation skills, broader world knowledge, and improved logical reasoning, making it effective across a wider range of applications.

  Efficient Deployment:

  • Despite its advanced features, LLaVA-NeXT maintains the minimalist design and data efficiency of LLaVA‑1.5. For instance, the largest 34B variant completes training in about 1 day on 32 A100 GPUs, demonstrating a highly cost-effective training process.

For more details, please refer to the LLaVA-NeXT blog.

Key points on LLaVA

  Architecture:

  • A classic vision-language setup: a CLIP-based visual encoder, a connector, and a language decoder.

  High-Quality Instruction Data:

  • Uses novel GPT‑4-generated, multimodal instruction data, emphasizing the importance of quality over quantity.

  Two-Stage Training:

  • Involves connector pre-training on ~595K image-text pairs followed by supervised fine-tuning on 158K multimodal Q&A samples (plus ScienceQA).
