Friday, October 18, 2024

OPERA [CVPR 2024 Highlight] explained + a little follow-up experiment

OPERA is a decoding method for multimodal LLMs, aiming to mitigate hallucinations by discouraging the model from over-trusting certain “summary” tokens and providing a fallback mechanism if a partial over-trust pattern emerges. It requires no extra training data or model fine-tuning, yet significantly reduces hallucination.

[https://arxiv.org/pdf/2311.17911]

Definition

Anchor Patterns/Knowledge Aggregation Patterns

Modern LLMs develop "anchor patterns" in their attention mechanisms. Instead of processing all previous tokens equally, they focus on a few key "summary" tokens (often punctuation or short words). These anchors compile information to guide future outputs. However, this selective attention can cause hallucinations when important visual details are missed, leading models to invent non-existent elements such as cars or trees that aren't actually in the image. (* Note that attention usually aggregates on period tokens ("."), so keep that in mind.)

Positive Correlation with Hallucinations

Figure 4 shows that more anchor tokens (visible as column-like attention patterns) correlate with increased hallucinations. This suggests these aggregation patterns directly contribute to factual errors in image descriptions rather than being merely harmless computational artifacts.

OPERA: A Beam Search-Based Decoding (Proposed Method)

OPERA modifies standard Beam Search by incorporating two main tricks (if you need a quick refresher on Beam Search, check a few YouTube videos):

1. Over-Trust Logit Penalty

2. Retrospection-Allocation Strategy

Together, these components discourage the model from following an anchor token’s lead and allow it to “roll back” if the partial over-trust pattern becomes too strong.

Over-Trust Logit Penalty

  1. Local Window on Self-Attention:
    • We examine the attention weights (ω) for recent tokens, using a window of size k to analyze how recent tokens interact with each other.
  2. Pattern Identification:
    • After eliminating forward attention (the upper triangle), we apply a scaling factor σ to highlight attention patterns.
    • Column-wise multiplication reveals tokens that accumulate excessive influence (potential anchors).
  3. Measuring Over-Reliance:
    • The maximum column product φ(ω<t) quantifies the model's over-dependence on individual tokens.
  4. Dynamic Correction:
    • During generation, we penalize next-token predictions by subtracting α · φ(ω<t) from logits.
    • Mathematically:
      p(xt | x<t) = Softmax[ H(ht) - α·φ(ω<t) ]_{xt}
    • This encourages the model to consider broader context rather than fixating on a single influential token (a rough sketch of the penalty computation follows below).
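
To make this concrete, here is a minimal PyTorch sketch of the column-wise product score φ and the penalized logits. The tensor shapes, the trick of replacing masked entries with 1, and the function names are my own simplifications rather than the official implementation. Note that the penalty only matters when it differs across candidates; in OPERA each of the Ncan candidates gets its own φ (its own attention row enters the window), which is what lets the penalty reshape the candidate ranking.

```python
import torch

def over_trust_score(attn_window: torch.Tensor, sigma: float = 50.0) -> torch.Tensor:
    """phi(ω<t): column-wise product over a local k x k self-attention window.

    attn_window: attention among the last k generated tokens, averaged over
    heads, shape [k, k]. This is a sketch, not the paper's exact code.
    """
    k = attn_window.size(0)
    causal = torch.tril(torch.ones(k, k, dtype=torch.bool, device=attn_window.device))
    # Scale so that strong anchor columns give factors > 1 while diffuse attention stays < 1.
    scaled = sigma * attn_window
    # Drop forward attention (upper triangle) by setting those entries to 1,
    # so they do not zero out the column products.
    scaled = torch.where(causal, scaled, torch.ones_like(scaled))
    col_prod = scaled.prod(dim=0)   # multiply down each column of the window
    return col_prod.max()           # over-reliance on the strongest "anchor" column


def penalized_logits(logits: torch.Tensor, attn_window: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Subtract α·φ(ω<t) from a candidate's logits before re-ranking."""
    return logits - alpha * over_trust_score(attn_window)
```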

Retrospection-Allocation Strategy

Even with logit penalties, all beam candidates may develop the same over-trust pattern. OPERA's rollback approach addresses this:

  1. Pattern Recognition:
    • We monitor maximum column-wise score positions in recent tokens (window size l). When the same anchor location appears frequently (≥ r times), we identify a persistent over-trust pattern.
  2. Strategic Reselection:
    • For an anchor at position s, we revert to the sequence before position s+1 was generated.
    • We then select an alternative token from the candidates, excluding the previously chosen one.
    • This process repeats up to β times to prevent excessive rollbacks.

Through this two-step approach, OPERA reduces hallucinations by preventing premature fixation on summary tokens and strategically backtracking when necessary to break established anchor patterns (a rough sketch of the rollback logic follows below).
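
Here is a rough sketch of the retrospection trigger and the rollback itself, assuming we record the argmax column position (the candidate anchor) at each decoding step. The window size l = 20 and the helper names are illustrative assumptions; r and β follow the values quoted later in this post.

```python
from collections import Counter

def rollback_position(anchor_history: list[int], l: int = 20, r: int = 15):
    """Return the anchor position if the same location won the column-wise
    maximum at least r times within the last l steps, otherwise None.
    (l = 20 is an assumed value for illustration.)"""
    recent = anchor_history[-l:]
    if not recent:
        return None
    pos, count = Counter(recent).most_common(1)[0]
    return pos if count >= r else None


def retrospect(sequence: list[int], anchor_pos: int) -> list[int]:
    """Roll back to the prefix ending at the anchor position s, so the token
    at s+1 can be re-chosen from the remaining candidates (excluding the one
    tried before). OPERA caps the number of rollbacks at β to avoid loops."""
    return sequence[: anchor_pos + 1]
```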

Experiments

All methods are configured with their default settings. OPERA uses Beam Search (Nbeam = 5) enhanced with the Over-Trust Penalty and Retrospection-Allocation mechanisms.

Implementation Details

  • Scaling Factor (σ): 50, ensuring strong anchor tokens produce products > 1, while weaker attention patterns remain < 1.
  • Candidate Count (Ncan): Default 5; higher values improve exploration at increased computational cost.
  • Standard Parameters: α = 1, β = 5, and r = 15 remain consistent across all MLLM implementations (collected in the snippet below).
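
Putting the quoted settings in one place (the variable names below are my own; the window sizes used for the penalty and the retrospection check are not quoted in this post, so they are left out):

```python
# Default OPERA settings as quoted above (names are illustrative).
OPERA_CONFIG = {
    "num_beams": 5,       # Nbeam: beam width for Beam Search
    "num_candidates": 5,  # Ncan: more candidates = better exploration, more compute
    "sigma": 50,          # scaling factor for the column-wise products
    "alpha": 1,           # weight of the over-trust logit penalty
    "beta": 5,            # maximum number of rollbacks
    "r": 15,              # repeated-anchor count that triggers a rollback
}
```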

Key points on OPERA

  1. Anchor Detection: OPERA identifies column-like attention patterns focused on punctuation or short words, preventing over-reliance on these tokens at the expense of visual information.
  2. Attention Penalty: By penalizing candidate tokens that excessively depend on anchors, OPERA reduces the continuation of hallucinated narratives once problematic patterns emerge.
  3. Strategic Backtracking: When all beams fixate on the same anchor, OPERA rolls back and chooses alternative generation paths, effectively resetting the model's focus.
---------------------------------------------------------------------

Additional Experiment (Bonus)

Setup:

  • Input: [image] ⬇️
  • Model: LLaVA-1.5-7B
  • Sample Caption: In the image, a zebra is standing in a field with dry leaves scattered around. It appears to be grazing on the leaves, possibly searching for food. Apart from the zebra, there are a few other animals in the scene, including a horse and a cow

  1. Let's examine the attention map. The coordinates represent each token in the caption. After one forward pass, we visualize how each token attends to the previous tokens (a minimal extraction sketch follows after this list).
  2. We can observe the phenomena identified in OPERA. The orange boxes highlight how aggregation corresponds to hallucinated tokens, while the yellow boxes show aggregation on periods. The leftmost column displays the attention weight on the image tokens (the sum over all image token weights).
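
For anyone who wants to reproduce the attention map, the sketch below shows roughly how it can be extracted with Hugging Face transformers. The model id, the layer/head averaging, the image path, and the prompt format are assumptions of this sketch, not necessarily the exact script behind the figures; depending on the transformers version, the <image> placeholder may or may not already be expanded into the 576 image tokens in input_ids.

```python
import torch
import matplotlib.pyplot as plt
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"   # HF port of LLaVA-1.5-7B (assumed checkpoint)
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="eager",  # eager attention so attention weights are returned
)

image = Image.open("zebra.jpg")          # placeholder path
caption = "In the image, a zebra is standing in a field with dry leaves scattered around."
prompt = f"USER: <image>\nDescribe the image. ASSISTANT: {caption}"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Average over layers and heads -> [seq_len, seq_len] attention map.
attn = torch.stack([a.float().cpu() for a in out.attentions]).mean(dim=(0, 2))[0]

# Attention mass each token places on the image tokens (the leftmost column in the figures).
image_mask = (inputs["input_ids"][0] == model.config.image_token_index).cpu()
image_attn = attn[:, image_mask].sum(dim=-1)
print(image_attn)

plt.imshow(attn.numpy(), cmap="viridis")
plt.xlabel("attended token")
plt.ylabel("generating token")
plt.savefig("attention_map.png")
```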

However, testing with additional examples reveals that attention aggregation occurs in both hallucination and non-hallucination cases. Even when examining attention to image tokens, there's no significant reduction in image attention during hallucination generation.

A fair conclusion is that aggregation is a natural property of autoregressive next-token prediction in LLMs when generating contextual text. While it becomes more pronounced around hallucinated tokens, the relationship is not so obvious.

Regarding attention weights on the image tokens, the following examples, labeled [index]_[token], show that the weight placed on image tokens is generally substantial regardless of whether the model is describing an existing object or a hallucinated one. However, accurate descriptions correlate with a more precise distribution of attention over the relevant objects. An interesting observation; hope you enjoy it :).

[5_z(ebra)]

[54_horse]

[57_cow]




