How many "r"s are in the word "strawberry"?
The Strawberry Rumor
On August 8, Sam Altman, the CEO of OpenAI, posted a cryptic tweet featuring a photo of strawberry plants in his garden, along with the caption “I love summer in the garden.”
Speculation soon began swirling around the idea that this tweet might be a playful hint at a new AI model under Project Strawberry—a project rumored to be the next step toward GPT-5. In previous rumors, Project Strawberry was referred to as “Project Q*,” suggesting an experimental or transitional initiative within OpenAI’s pipeline.
OpenAI o1
On September 12, OpenAI unveiled a model known as “o1.”
Official page: https://openai.com/index/learning-to-reason-with-llms/
This new model, trained with Reinforcement Learning techniques, is designed to excel at complex reasoning tasks such as mathematics and coding. What sets o1 apart from many of its predecessors is its capacity to “think before answering”—it produces a longer hidden chain of thought internally before presenting an answer to the user.
LLM Basics: The Previous Paradigm
Traditional Large Language Model (LLM) pipelines typically follow three stages:
1. Pre-training – Train on massive text corpora (the scaling law: bigger models + more data = better performance).
2. Post-training – Often includes techniques like Reinforcement Learning from Human Feedback (RLHF) to refine the base model into a chatbot or specialized assistant.
3. Inference – The final next-token prediction process for user queries: “The sky is [ ] …”
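To make stage 3 concrete, here is a minimal greedy decoding loop for the "The sky is [ ]" example. It is only a sketch, using the small open GPT-2 checkpoint through the Hugging Face transformers library (an assumed setup for illustration; it says nothing about how o1 itself is served):

```python
# Minimal sketch of stage 3 (inference): greedy next-token prediction.
# Assumes the Hugging Face "transformers" library and the open "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The sky is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):                            # generate 5 tokens, one at a time
        logits = model(input_ids).logits          # [batch, seq_len, vocab]
        next_id = logits[0, -1].argmax()          # greedy: pick the most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))             # the prompt continued token by token
```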
o1’s Paradigm
- Smaller Model Size: Compared to GPT-4 or Llama 3’s largest variants (70B/405B parameters), o1 might be relatively small.
- Chain-of-Thought Reasoning: o1 can generate a hidden “chain of thought” (CoT) before producing an answer, indicative of its focus on multi-step reasoning (a rough imitation of this pattern is sketched after this list).
- Built via Reinforcement Learning: While RLHF is used in many LLMs, o1 supposedly employs additional RL techniques, potentially going beyond simple human feedback loops.
- Task-Specific Strength: Although powerful in math and coding challenges, o1 is reportedly less adept than GPT-4o in tasks like personal writing or broad creative composition.
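The “think before answering” behavior can be imitated, very roughly, with any ordinary chat model: prompt it to write its reasoning in a scratchpad section, then show the user only the final answer. The sketch below just parses a hypothetical model output; the tags and the prompt template are illustrative assumptions, not OpenAI’s actual mechanism (o1’s chain of thought is hidden at the system level, not by string parsing):

```python
import re

# Hypothetical prompt template: ask the model to reason inside <scratchpad> tags,
# then give the user-visible answer inside <answer> tags.
COT_PROMPT = (
    "Think step by step inside <scratchpad>...</scratchpad>, "
    "then give only the final answer inside <answer>...</answer>.\n\n"
    "Question: {question}"
)

def extract_visible_answer(model_output: str) -> str:
    """Drop the hidden reasoning and return only the user-facing answer."""
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    return match.group(1).strip() if match else model_output

# Hard-coded stand-in for a real model response:
raw = (
    "<scratchpad>s-t-r-a-w-b-e-r-r-y: 'r' appears at positions 3, 8, 9 -> 3 times.</scratchpad>"
    "<answer>There are 3 'r's in \"strawberry\".</answer>"
)
print(extract_visible_answer(raw))  # -> There are 3 'r's in "strawberry".
```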
What’s the Secret Sauce?
Reinforcement Learning (RL) vs. RLHF
A key talking point in the community is the difference between true RL and RLHF:
1. RL Is Powerful
- Consider AlphaGo, which was trained with real self-play and Monte Carlo Tree Search (MCTS), exploring thousands of potential moves before deciding on the next best move.
- Go is a clear example of a game with a final win/loss outcome, analogous to reasoning tasks where a solution is either correct or incorrect.
2. RLHF Is Not
- RLHF typically means human annotators rate or rank the model outputs, providing a reward signal.
- While it can guide a model to produce more coherent or polite responses, it lacks the self-play and deep exploration features that characterize classic RL.
What if AlphaGo were trained purely with RLHF? Human evaluators would have to label or rank each move. This would be enormously labor-intensive and might miss the power of self-play. Hence the distinction: RLHF is partially “RL,” but does not harness the full potential of iterative self-improvement that true RL can offer.
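The contrast can be summarized in code. For a reasoning task with a checkable answer, true RL can use a cheap, automatic outcome reward, while RLHF relies on a learned reward model trained from human preference labels. The function names and the `reward_model.score` call below are illustrative assumptions, not a description of o1’s actual training setup:

```python
# Sketch: two kinds of reward signals, assuming the task (e.g. a math problem)
# has a verifiable ground-truth answer.

def outcome_reward(model_answer: str, ground_truth: str) -> float:
    """'True RL'-style reward: automatic, exact, and infinitely repeatable,
    like the win/loss signal in Go. No human in the loop per episode."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def rlhf_reward(model_answer: str, reward_model) -> float:
    """RLHF-style reward: a model trained on human rankings scores the answer.
    It approximates human preference, which is expensive to collect and does
    not guarantee the answer is actually correct."""
    return reward_model.score(model_answer)  # hypothetical reward-model API

# With an automatic outcome reward, the model can attempt a problem thousands of
# times (self-play-style exploration); with RLHF, every reward ultimately traces
# back to a finite pool of human labels.
print(outcome_reward("42", "42"))   # 1.0
print(outcome_reward("41", "42"))   # 0.0
```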
Tree-of-Thought, Multi-Agent Debate, and Verifiers
In large-scale reasoning tasks—especially math or coding—a linear Chain-of-Thought might not be enough. Some hypothesize that o1 could be using:
1. Tree-of-Thought Exploration
- Rather than generating a single chain of reasoning, the model expands multiple branches, evaluating different solution paths in parallel, akin to MCTS.
- Reference: https://jyopari.github.io/MCTS.html
2. Multi-Agent Debate
- Multiple “agents” (or multiple copies of the model) could debate or verify each other’s answers, leading to more robust final solutions.
- The “infinite monkey theorem” suggests that, given enough random attempts, a correct solution might eventually appear. But we need an effective verifier to pick the correct one out of many.
- Majority voting works if the model’s error patterns are random, but in practice errors can cluster. A specialized verifier is thus potentially more reliable (see the sketch below).
- Reference: https://arxiv.org/abs/2407.21787
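A minimal version of the “many attempts + verifier” idea, in the spirit of the repeated-sampling paper linked above, can be sketched as follows. `sample_solution` and `verifier_score` are hypothetical stand-ins for a model call and a trained verifier; nothing here is known to be how o1 works internally. A tree-of-thought system would go further and expand partial reasoning steps rather than sampling full solutions:

```python
import random
from collections import Counter

def sample_solution(question: str) -> str:
    """Hypothetical stand-in for one sampled chain of thought + final answer.
    Simulates a model that is right ~40% of the time and whose errors cluster on '5'."""
    return random.choices(["6", "5", "4"], weights=[0.40, 0.45, 0.15])[0]

def verifier_score(question: str, answer: str) -> float:
    """Hypothetical trained verifier; faked here with a noisy correctness check."""
    return 0.9 if answer == "6" else 0.2

def majority_vote(answers):
    return Counter(answers).most_common(1)[0][0]

def best_of_n(question: str, n: int = 16):
    candidates = [sample_solution(question) for _ in range(n)]
    voted = majority_vote(candidates)                                   # fails when errors cluster
    verified = max(candidates, key=lambda a: verifier_score(question, a))  # verifier picks the best
    return voted, verified

random.seed(0)
print(best_of_n("2 + 2 * 2 = ?"))  # often ('5', '6'): voting picks the clustered error,
                                   # while the verifier recovers the correct answer "6"
```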
Key Points on o1
1. Small yet Powerful: Despite reportedly being smaller than GPT-4, o1 excels in math and coding tasks by generating a hidden chain of thought before providing an answer.
2. Reinforcement Learning Focus: Goes beyond standard RLHF, possibly integrating true RL methods (self-play, multi-agent debate) to refine its reasoning.