OPERA is a decoding method for multimodal LLMs that mitigates hallucinations by discouraging the model from over-trusting certain “summary” tokens and by providing a fallback mechanism when an over-trust pattern has already taken hold in the partially generated sequence. It requires no extra training data or model fine-tuning, yet significantly reduces hallucination.
Paper: https://arxiv.org/pdf/2311.17911
Definition
Anchor Patterns / Knowledge-Aggregation Patterns: column-like patterns in the self-attention map where many subsequent tokens place most of their attention on a single earlier token, often a punctuation mark or other low-semantic word, which then acts as a “summary” of the preceding context.
Positive Correlation with Hallucinations
Figure 4 shows that more anchor tokens (visible as column-like attention patterns) correlate with increased hallucinations. This suggests these aggregation patterns directly contribute to factual errors in image descriptions rather than being merely harmless computational artifacts.
OPERA: A Beam-Search-Based Decoding Method (Proposed Approach)
OPERA modifies standard beam search by incorporating two main tricks (if you need a quick refresher on beam search, check some YouTube videos):
1. Over-Trust Logit Penalty
2. Retrospection-Allocation Strategy
Together, these components discourage the model from following an anchor token’s lead and allow it to “roll back” if the partial over-trust pattern becomes too strong.
1. Over-Trust Logit Penalty
- Local Window on Self-Attention:
- We examine the self-attention weights (ω) of the most recent tokens, using a window of size k to analyze how these tokens attend to one another.
- Pattern Identification:
- After eliminating forward attention (upper triangle), we apply scaling factor σ to highlight attention patterns.
- Column-wise multiplication reveals tokens that accumulate excessive influence (potential anchors).
- Measuring Over-Reliance:
- The maximum column-wise product, φ(ω_{<t}), quantifies the model's over-dependence on individual tokens.
- Dynamic Correction:
- During generation, we penalize next-token predictions by subtracting α · φ(ω_{<t}) from the logits.
- Mathematically:
p(x_t | x_{<t}) = Softmax[ H(h_t) - α · φ(ω_{<t}) ]_{x_t}
- This encourages the model to consider the broader context rather than fixating on single influential tokens (see the sketch below).
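Below is a minimal sketch of how the penalty could be computed, assuming the self-attention weights over the generated tokens are available as a single (t, t) matrix (e.g., averaged over heads), and assuming, as in OPERA's beam-search use, that the penalty is evaluated per candidate token so that it actually changes the candidate ranking. Function and variable names are illustrative, not the authors' code.

```python
import torch

def column_product_score(attn: torch.Tensor, k: int, sigma: float = 50.0) -> torch.Tensor:
    """phi(omega_{<t}): maximum column-wise product inside the local k x k window."""
    w = attn[-k:, -k:]                              # local window on the most recent k tokens
    w = torch.tril(sigma * w)                       # drop forward attention (upper triangle), scale by sigma
    w = torch.where(w > 0, w, torch.ones_like(w))   # masked entries contribute 1 to the product
    return w.prod(dim=0).max()                      # column-wise product, then max over columns

def penalized_scores(cand_logits: torch.Tensor, cand_attn: torch.Tensor,
                     k: int, alpha: float = 1.0) -> torch.Tensor:
    """Candidate-level scores used for beam selection: H(h_t) - alpha * phi.

    cand_logits: (N,) raw logits of the N candidate tokens.
    cand_attn:   (N, t, t) self-attention after appending each candidate.
    """
    phi = torch.stack([column_product_score(a, k) for a in cand_attn])  # (N,)
    return cand_logits - alpha * phi
```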
2. Retrospection-Allocation Strategy
Even with logit penalties, all beam candidates may develop the same over-trust patterns. Our rollback approach addresses this:
- Pattern Recognition:
- We monitor maximum column-wise score positions in recent tokens (window size l). When the same anchor location appears frequently (≥ r times), we identify a persistent over-trust pattern.
- Strategic Reselection:
- For an anchor at position s, we revert to the sequence before position s+1 was generated.
- We then select an alternative token from the candidates, excluding the previously chosen one.
- This process repeats at most β times to prevent excessive rollbacks (a minimal sketch of the detection step follows below).
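Here is a minimal sketch of the retrospection check, assuming we record at every decoding step the position of the maximum column-wise score; the names (anchor_history, l, r) are illustrative, and the rollback itself is only outlined in the closing comment.

```python
from collections import Counter
from typing import Optional

def detect_persistent_anchor(anchor_history: list, l: int, r: int = 15) -> Optional[int]:
    """Return the anchor position s if the same position holds the column-wise
    maximum at least r times within the last l steps, otherwise None."""
    if len(anchor_history) < l:
        return None
    pos, count = Counter(anchor_history[-l:]).most_common(1)[0]
    return pos if count >= r else None

# Rollback (schematic): if detect_persistent_anchor(...) returns s and fewer than
# beta rollbacks have been spent, truncate the sequence to tokens[: s + 1] and
# reselect the next token from the candidates, excluding the previously chosen one.
```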
Experiments
All methods are configured with their default settings. OPERA uses beam search (N_beam = 5) enhanced with the Over-Trust Penalty and Retrospection-Allocation mechanisms.
Implementation Details
- Scaling Factor (σ): 50, ensuring strong anchor tokens produce products > 1, while weaker attention patterns remain < 1.
- Candidate Count (N_can): Default 5; higher values improve exploration at increased computational cost.
- Standard Parameters: α = 1, β = 5, and r = 15 remain consistent across all MLLM implementations (these defaults are gathered into the sketch below).
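For reference, here is one way to bundle the defaults above for an OPERA-style decoding loop; the dataclass and its field names are assumptions, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class OperaConfig:
    num_beams: int = 5       # N_beam
    num_candidates: int = 5  # N_can
    sigma: float = 50.0      # scaling factor for the column-wise product
    alpha: float = 1.0       # weight of the over-trust penalty
    beta: int = 5            # maximum number of rollbacks
    r: int = 15              # anchor repetitions that trigger a rollback
```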
Key points on OPERA
- Anchor Detection: OPERA identifies column-like attention patterns focused on punctuation or short words, preventing over-reliance on these tokens at the expense of visual information.
- Attention Penalty: By penalizing candidate tokens that excessively depend on anchors, OPERA reduces the continuation of hallucinated narratives once problematic patterns emerge.
- Strategic Backtracking: When all beams fixate on the same anchor, OPERA rolls back and chooses alternative generation paths, effectively resetting the model's focus.
Additional Experiment (Bonus)
Setup:
- Input: [image]
- Model: LLaVA-1.5-7B
- Sample Caption: In the image, a zebra is standing in a field with dry leaves scattered around. It appears to be grazing on the leaves, possibly searching for food. Apart from the zebra, there are a few other animals in the scene, including a horse and a cow.
- Let's examine the attention map. The two axes correspond to the tokens in the caption. After one forward pass, we visualize how each token attends to the previous tokens.
- We can observe the phenomena identified in the OPERA paper. The orange boxes highlight how aggregation corresponds to hallucinated tokens, while the yellow boxes show aggregation on periods. The leftmost column displays the attention on the image tokens (the sum of the weights over all image tokens). A sketch of how such a map can be extracted follows below.
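Here is a minimal sketch of how such a map could be produced, assuming a Hugging Face-style multimodal causal LM (such as LLaVA-1.5-7B) whose forward pass accepts output_attentions=True and, for simplicity, that the image tokens occupy the first positions of the sequence; model/processor loading and the exact number of image tokens are left out as assumptions.

```python
import torch

@torch.no_grad()
def caption_attention_map(model, inputs, num_image_tokens):
    """Run one forward pass, average attention over layers and heads, keep only
    the rows of the text tokens, and collapse all image-token columns into a
    single leftmost column (the summed attention on image tokens)."""
    out = model(**inputs, output_attentions=True)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]        # (seq, seq)
    text = attn[num_image_tokens:]                                # rows: text tokens only (simplified layout)
    img_col = text[:, :num_image_tokens].sum(-1, keepdim=True)    # summed weight on image tokens
    return torch.cat([img_col, text[:, num_image_tokens:]], dim=-1)
```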
A fair conclusion is that aggregation is an inherent behavior of autoregressive next-token prediction when an LLM generates contextual text. While it becomes more pronounced around hallucinated tokens, the relationship is not clear-cut.
Regarding the attention weights on the image tokens, the examples below, labeled [index]_[token], show that the weight placed on image tokens is generally substantial regardless of whether the token describes an existing object or a hallucinated one. However, accurate descriptions correlate with a more precise distribution of attention over the relevant objects. An interesting observation; hope you enjoy it :).
[5_z(ebra)], [54_horse]: attention-over-image visualizations for these two tokens.