Friday, October 18, 2024

OPERA [CVPR 2024 Highlight] explained + a little follow-up experiment

OPERA is a decoding method for multimodal LLMs that aims to mitigate hallucinations by discouraging the model from over-trusting certain “summary” tokens and by providing a fallback mechanism when an over-trust pattern emerges. It requires no extra training data or model fine-tuning, yet significantly reduces hallucinations.

[https://arxiv.org/pdf/2311.17911]

Definition

Anchor Patterns/Knowledge Aggregation Patterns

Modern LLMs develop "anchor patterns" in their attention mechanisms. Instead of processing all previous tokens equally, they focus on a few key "summary" tokens (often punctuation or short words). These anchors aggregate information that guides future outputs. However, this selective attention can cause hallucinations when important visual details are missed, leading models to invent non-existent elements such as cars or trees that aren't actually in the image. (* Note that attention usually aggregates on period tokens (.), so be careful about that.)

Positive Correlation with Hallucinations

Figure 4 shows that more anchor tokens (visible as column-like attention patterns) correlate with increased hallucinations. This suggests these aggregation patterns directly contribute to factual errors in image descriptions rather than being merely harmless computational artifacts.

OPERA: A Beam Search-Based Decoding (Proposed Method)

OPERA modifies standard Beam Search by incorporating two main tricks (if you need a quick refresher on Beam Search, check some YouTube videos):

1. Over-Trust Logit Penalty

2. Retrospection-Allocation Strategy

Together, these components discourage the model from following an anchor token’s lead and allow it to “roll back” if the partial over-trust pattern becomes too strong.

Over-Trust Logit Penalty

  1. Local Window on Self-Attention:
    • We examine the attention weights (ω) for recent tokens, using a window of size k to analyze how recent tokens interact with each other.
  2. Pattern Identification:
    • After eliminating forward attention (the upper triangle), we apply a scaling factor σ to highlight attention patterns.
    • Column-wise multiplication reveals tokens that accumulate excessive influence (potential anchors).
  3. Measuring Over-Reliance:
    • The maximum column product φ(ω<t) quantifies the model's over-dependence on individual tokens.
  4. Dynamic Correction:
    • During generation, we penalize next-token predictions by subtracting α · φ(ω<t) from logits.
    • Mathematically:
      p(xt | x<t) = Softmax[H(ht) - α·φ(ω<t)]xt
    • This encourages the model to consider broader context rather than fixating on single influential tokens.
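
To make the mechanics concrete, here is a minimal NumPy sketch of how the column-wise over-trust score φ and the penalized candidate scores could be computed, assuming we already have the local self-attention window for the last k generated tokens (e.g. averaged over heads and layers). The function names are mine, not the paper's reference code.

```python
# A minimal sketch of the Over-Trust Logit Penalty, assuming `attn_window` is the
# k x k self-attention among the last k generated tokens (averaged over heads/layers).
# Names are illustrative; this is not the official OPERA implementation.
import numpy as np

def over_trust_score(attn_window: np.ndarray, sigma: float = 50.0) -> float:
    """phi: the maximum column-wise product over the scaled, causal attention window."""
    k = attn_window.shape[0]
    local = np.tril(attn_window)       # keep only causal (lower-triangular) attention
    scaled = sigma * local             # sigma pushes strong anchor columns above 1
    col_products = np.array([np.prod(scaled[j:, j]) for j in range(k)])
    return float(col_products.max())   # the most over-trusted recent token

def penalized_candidate_scores(cand_logits, cand_attn_windows,
                               alpha: float = 1.0, sigma: float = 50.0) -> np.ndarray:
    """Each Top-N candidate gets its own window (as if it had been appended),
    so candidates that reinforce an anchor pattern are penalized more."""
    return np.array([
        logit - alpha * over_trust_score(w, sigma)
        for logit, w in zip(cand_logits, cand_attn_windows)
    ])
```

Because the penalty is computed per candidate, it changes the ranking inside the beam rather than merely shifting all logits by a constant.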

Retrospection-Allocation Strategy

Even with the logit penalty, all beam candidates may develop the same over-trust patterns. OPERA's rollback approach addresses this:

  1. Pattern Recognition:
    • We monitor maximum column-wise score positions in recent tokens (window size l). When the same anchor location appears frequently (≥ r times), we identify a persistent over-trust pattern.
  2. Strategic Reselection:
    • For an anchor at position s, we revert to the sequence before position s+1 was generated.
    • We then select an alternative token from the candidates, excluding the previously chosen one.
    • Rollback is allowed at most β times, to prevent excessive backtracking.

Through this two-step approach, OPERA reduces hallucinations by preventing premature fixation on summary tokens and strategically backtracking when necessary to break established anchor patterns.
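
Here is a small, self-contained sketch of the retrospection trigger and the rollback step, under the same caveat: `should_rollback` and `rollback` are illustrative names, and the bookkeeping in the official code may differ.

```python
# An illustrative sketch of the Retrospection-Allocation trigger and rollback.
# `location_history` records, for each decoding step, the position of the
# maximum column-wise score; l and r follow the notation above.
from collections import Counter

def should_rollback(location_history, l, r=15):
    """Return the anchor position s if one location dominates the last l
    recorded positions at least r times, otherwise None."""
    recent = location_history[-l:]
    if not recent:
        return None
    anchor, count = Counter(recent).most_common(1)[0]
    return anchor if count >= r else None

def rollback(sequence, s, candidates, tried):
    """Revert to the tokens up to (and including) position s, then pick an
    alternative candidate for position s+1 that has not been tried yet."""
    truncated = sequence[: s + 1]
    alternatives = [c for c in candidates if c not in tried]
    return truncated, (alternatives[0] if alternatives else None)
```

In a full decoding loop, the trigger would be checked after every step, and the β budget from the implementation details below would cap how many times the rollback is invoked.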

Experiments

All methods are configured with their default settings. OPERA uses Beam Search (Nbeam = 5) enhanced with the Over-Trust Penalty and Retrospection-Allocation mechanisms.

Implementation Details

  • Scaling Factor (σ): 50, ensuring strong anchor tokens produce products > 1, while weaker attention patterns remain < 1.
  • Candidate Count (Ncan): Default 5; higher values improve exploration at increased computational cost.
  • Standard Parameters: α = 1, β = 5, and r = 15 remain consistent across all MLLM implementations.
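
Collected as a single illustrative config (the key names are mine, not from the official repository):

```python
# The defaults above, gathered into one place for reference.
opera_config = {
    "num_beams": 5,       # beam width (N_beam)
    "num_candidates": 5,  # Top-N candidates per step (N_can)
    "sigma": 50,          # scaling factor for the column products
    "alpha": 1,           # weight of the over-trust logit penalty
    "beta": 5,            # maximum number of rollbacks
    "r": 15,              # retrospection threshold
}
```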

Key points on OPERA

  1. Anchor Detection: OPERA identifies column-like attention patterns focused on punctuation or short words, preventing over-reliance on these tokens at the expense of visual information.
  2. Attention Penalty: By penalizing candidate tokens that excessively depend on anchors, OPERA reduces the continuation of hallucinated narratives once problematic patterns emerge.
  3. Strategic Backtracking: When all beams fixate on the same anchor, OPERA rolls back and chooses alternative generation paths, effectively resetting the model's focus.
---------------------------------------------------------------------

Additional Experiment (Bonus)

Setup:

  • Input: [image] ⬇️
  • Model: LLaVA-1.5-7B
  • Sample Caption: In the image, a zebra is standing in a field with dry leaves scattered around. It appears to be grazing on the leaves, possibly searching for food. Apart from the zebra, there are a few other animals in the scene, including a horse and a cow

  1. Let's examine the attention map. The axes correspond to the tokens in the caption. After one forward pass, we visualize how each token attends to previous tokens.
  2. We can observe the phenomena identified in OPERA. The orange boxes highlight how aggregation corresponds to hallucinated tokens, while yellow boxes show aggregation on periods. The leftmost column displays attention weights on image tokens (sum of all image token weights).
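
For reference, here is a rough sketch of one way to pull such an attention map out of LLaVA-1.5-7B with Hugging Face transformers. The checkpoint, prompt format, and the head-averaged, last-layer view are my assumptions, not necessarily the exact setup used here.

```python
# A rough sketch of extracting an attention map from LLaVA-1.5-7B with Hugging
# Face transformers. The checkpoint, prompt format, and the head-averaged,
# last-layer view are simplifying assumptions, not the exact setup used above.
import torch
import matplotlib.pyplot as plt
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="eager",  # eager attention so attention weights are returned
)

image = Image.open("zebra.jpg")  # placeholder path for the input image
prompt = "USER: <image>\nDescribe the image. ASSISTANT: In the image, a zebra is standing in a field..."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# [batch, heads, seq_len, seq_len]; take the last layer and average over heads.
attn = outputs.attentions[-1][0].mean(dim=0).float().cpu().numpy()

plt.imshow(attn, cmap="viridis")
plt.xlabel("attended-to token")
plt.ylabel("generating token")
plt.show()
```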

However, testing with additional examples reveals that attention aggregation occurs in both hallucination and non-hallucination cases. Even when examining attention to image tokens, there's no significant reduction in image attention during hallucination generation.

A fair conclusion is that aggregation is an inherent property of autoregressive next-token prediction when an LLM generates contextual text. While it can become more pronounced around hallucinated tokens, the relationship is not clear-cut.

Regarding attention weights on the image tokens, the following examples, labeled [index]_[token], show that the weights on image tokens are generally substantial regardless of whether the model is describing an existing object or a hallucinated one. However, accurate descriptions correlate with a more precise distribution of attention over the relevant objects, which is an interesting observation. Hope you enjoy that :).

[5_z(ebra)]

[54_horse]

[57_cow]




Sunday, October 6, 2024

Strawberry🍓

* How many "r"s are in the word "strawberry"?

The Strawberry Rumor

On August 8, Sam Altman, the CEO of OpenAI, posted a cryptic tweet featuring a photo of strawberry plants in his garden, along with the caption “I love summer in the garden.”

Speculation soon began swirling around the idea that this tweet might be a playful hint at a new AI model under Project Strawberry—a project rumored to be the next step toward GPT-5. In previous rumors, Project Strawberry was referred to as “Project Q*,” suggesting an experimental or transitional initiative within OpenAI’s pipeline.

OpenAI o1

On September 12, OpenAI unveiled a model known as “o1.”

The official page: https://openai.com/index/learning-to-reason-with-llms/

This new model, trained with Reinforcement Learning techniques, is designed to excel at complex reasoning tasks such as mathematics and coding. What sets o1 apart from many of its predecessors is its capacity to “think before answering”—it produces a longer hidden chain of thought internally before presenting an answer to the user.

LLM Basics: The Previous Paradigm

Traditional Large Language Model (LLM) pipelines typically follow three stages:

1. Pre-training – Train on massive text corpora (the scaling law: bigger models + more data = better performance).

2. Post-training – Often includes techniques like Reinforcement Learning from Human Feedback (RLHF) to refine the base model into a chatbot or specialized assistant.

3. Inference – The final next-token prediction process for user queries: “The sky is [ ] …”
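
As a tiny illustration of stage 3, here is next-token prediction in code, using GPT-2 only because it is small and public; nothing here is specific to o1.

```python
# Inference as next-token prediction: score every vocabulary item as the
# continuation of "The sky is". GPT-2 is used only because it is small and public.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The sky is", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token

top = torch.topk(next_token_logits, k=5).indices
print([tokenizer.decode(t) for t in top])  # likely candidates such as " blue", " falling", ...
```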

o1’s paradigm

  • Smaller Model Size: Compared to GPT-4 or Llama 3’s largest variants (70B/405B parameters), o1 might be relatively small. 
  • Chain-of-Thought Reasoning: o1 can generate a hidden “chain of thought” (CoT) before producing an answer, indicative of its focus on multi-step reasoning.
  • Built via Reinforcement Learning: While RLHF is used in many LLMs, o1 supposedly employs additional RL techniques, potentially going beyond simple human feedback loops.
  • Task-Specific Strength: Although powerful in math and coding challenges, o1 is reportedly less adept than GPT-4o in tasks like personal writing or broad creative composition.

What’s the Secret Sauce?

Reinforcement Learning (RL) vs. RLHF

A key talking point in the community is the difference between true RL and RLHF:

1. RL Is Powerful

  • Consider AlphaGo, which was trained with real self-play and Monte Carlo Tree Search (MCTS), exploring thousands of potential moves before deciding on the next best move.
  • Go is a clear example of a game with a final win/loss outcome, analogous to reasoning tasks where a solution is either correct or incorrect.

2. RLHF Is Not

  • RLHF typically means human annotators rate or rank the model outputs, providing a reward signal.
  • While it can guide a model to produce more coherent or polite responses, it lacks the self-play and deep exploration features that characterize classic RL.
    (If you want to be super-human, human feedback is not enough.)

What if AlphaGo were trained purely with RLHF? Human evaluators would have to label or rank each move. This would be enormously labor-intensive and might miss the power of self-play. Hence the distinction: RLHF is partially “RL,” but does not harness the full potential of iterative self-improvement that true RL can offer.

Tree-of-Thought, Multi-Agent Debate, and Verifiers

In large-scale reasoning tasks—especially math or coding—a linear Chain-of-Thought might not be enough. Some hypothesize that o1 could be using:

1. Tree-of-Thought Exploration

  • Rather than generating a single chain of reasoning, the model expands multiple branches, evaluating different solution paths in parallel, akin to MCTS. 

[https://jyopari.github.io/MCTS.html]
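
Since o1's internals are not public, the following is only a toy sketch of the idea: expand several candidate reasoning steps per node and keep the highest-scoring branches. Here `propose_steps` and `score_partial_solution` are hypothetical stand-ins for a generator model and a verifier.

```python
# A toy sketch of tree-of-thought style search. `propose_steps` and
# `score_partial_solution` are hypothetical stand-ins for a generator model and
# a learned verifier; this is speculation about the idea, not o1's actual algorithm.

def propose_steps(partial_solution: str, k: int = 3) -> list[str]:
    # Stand-in for "ask the model for k candidate next reasoning steps".
    return [f"candidate step {i}" for i in range(k)]

def score_partial_solution(partial_solution: str) -> float:
    # Stand-in for a verifier / value model; here just a dummy heuristic.
    return float(len(partial_solution))

def tree_of_thought_search(question: str, depth: int = 3, k: int = 3, beam: int = 2) -> str:
    """At each depth, expand every kept branch into k candidate next steps and
    keep only the `beam` highest-scoring partial solutions."""
    frontier = [question]
    for _ in range(depth):
        children = [node + "\n" + step
                    for node in frontier
                    for step in propose_steps(node, k)]
        children.sort(key=score_partial_solution, reverse=True)
        frontier = children[:beam]
    return frontier[0]

print(tree_of_thought_search("Prove that the sum of two even numbers is even."))
```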


2. Multi-Agent Debate

Multiple “agents” (or multiple copies of the model) could debate or verify each other’s answers, leading to more robust final solutions.

The “infinite monkey theorem” suggests that, given enough random attempts, a correct solution might eventually appear. But we need an effective verifier to pick the correct one out of many.

Majority voting works if the model’s error patterns are random, but in practice, errors can cluster. A specialized verifier is thus potentially more reliable.

[https://arxiv.org/abs/2407.21787]
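
A minimal contrast between the two selection rules, with `verifier_score` as a hypothetical placeholder for a learned verifier and purely made-up toy data:

```python
# Majority voting vs. verifier-based (best-of-N) selection over sampled answers.
# The toy data makes wrong answers cluster, which is exactly the failure case
# for majority voting described above.
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers: list[str], verifier_score) -> str:
    return max(answers, key=verifier_score)

answers = ["11", "11", "12", "11", "12"]               # clustered errors on "11"
print(majority_vote(answers))                          # -> "11" (the clustered wrong answer)
print(best_of_n(answers, lambda a: float(a == "12")))  # -> "12" (the verifier picks it out)
```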

Key Points on o1

1. Small yet Powerful: Despite its smaller size compared to GPT-4, o1 excels in math and coding tasks by generating a hidden chain of thought before providing an answer.

2. Reinforcement Learning Focus: Goes beyond standard RLHF, possibly integrating true RL methods (self-play, multi-agent debate) to refine its reasoning.

