[https://arxiv.org/pdf/2501.19393]
s1 introduces a simple yet effective method to boost model reasoning by scaling test-time compute. Its key innovation is a technique called budget forcing that regulates the length of the model’s reasoning process.
Scaling Test-Time Compute: Sequential vs. Parallel
1. Sequential Scaling via Budget Forcing (proposed method)
The Idea:
- Budget forcing intercepts the model’s attempt to finish reasoning. When it generates an end-of-thinking token, that token is replaced (e.g., with “Wait”), forcing the model to continue until a set token limit is reached, after which the final answer is produced.
How It Works:
Prevent Early Stopping: When the model tries to stop thinking before reaching its token limit, we catch the end-of-thinking token and swap it with something like “Wait” so it keeps going.
Final Answer Trigger: After using up the token budget, add a marker like “Final Answer:” to tell the model to stop reasoning and give its final response.
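Below is a minimal sketch of how budget forcing could be wired into a decoding loop. The `generate_fn` helper, the `</think>` delimiter, and the whitespace-based token count are illustrative assumptions, not the authors' implementation.
```python
# Minimal budget-forcing sketch (hypothetical helper names, not the authors' code).
# `generate_fn(prompt, stop, max_new_tokens)` is assumed to return only the newly
# generated text, halting either at the `stop` string or at the token cap.

END_OF_THINKING = "</think>"         # assumed end-of-thinking delimiter
WAIT = "Wait"                        # appended whenever the model tries to stop early
ANSWER_TRIGGER = "\nFinal Answer:"   # marker that switches the model to answering


def budget_forced_generate(generate_fn, question, thinking_budget, max_answer_tokens=256):
    """Keep the model reasoning until roughly `thinking_budget` tokens are spent."""
    trace, tokens_used = "", 0
    while tokens_used < thinking_budget:
        chunk = generate_fn(
            question + trace,
            stop=END_OF_THINKING,
            max_new_tokens=thinking_budget - tokens_used,
        )
        if not chunk:                          # nothing generated; avoid looping forever
            break
        trace += chunk
        tokens_used += len(chunk.split())      # crude token count, fine for a sketch
        if tokens_used < thinking_budget:
            trace += WAIT                      # suppress the early stop and keep thinking
    # Budget exhausted (or generation stalled): trigger the final answer.
    answer = generate_fn(question + trace + ANSWER_TRIGGER, stop=None,
                         max_new_tokens=max_answer_tokens)
    return trace, answer.strip()
```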
2. Parallel Scaling via Majority Voting
- Generate multiple candidate answers independently and aggregate the results, typically by majority voting: pick the most common answer. This spends more parallel compute rather than extending any single reasoning path.
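For comparison, the parallel baseline is just independent sampling plus a vote. A minimal sketch, reusing the same assumed `generate_fn` helper as in the budget-forcing sketch above:
```python
from collections import Counter


def majority_vote(generate_fn, question, n_samples=16, max_new_tokens=512):
    """Parallel scaling baseline: sample answers independently, keep the most common one."""
    answers = [
        generate_fn(question, stop=None, max_new_tokens=max_new_tokens).strip()
        for _ in range(n_samples)   # sequential here for clarity; batch or parallelize in practice
    ]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes
```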
As shown in Figure 4 of the paper,
- (a) Sequential Scaling with Budget Forcing: The model is prevented from stopping its reasoning. If it tries to end, we append “Wait,” effectively making it think 2×, 4×, or 6× longer. Performance improves as the model reasons more, but eventually flattens out (saturation).
- (b) Parallel Scaling via Majority Voting: Run the base model multiple times—2, 4, 8, 16, 32, or 64 parallel samples—and use majority voting to decide on the final output. Although more samples can help, it does not always match the gains from sequential scaling.
Fine-Tuning with s1K
Curating the s1K Dataset
1. Initial Pool (59K): The authors start with 59,000 questions spanning math, science, and logic. They discard duplicates, poorly formatted samples, and trivial problems (solvable by smaller models).
2. Filtering Criteria:
- Quality: Remove questions with formatting errors or unclear solutions, keeping only well-structured problems.
- Difficulty: Exclude easy questions by checking whether Qwen2.5-7B or Qwen2.5-32B could already solve them; longer, multi-step solutions also signal higher difficulty (see the sketch after this list).
- Diversity: Ensure coverage across various domains (e.g., geometry, physics, number theory) to create a broad, representative set of problems.
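A hedged sketch of the difficulty filter described above: drop any question that a reference model already answers correctly. `solve_fn` and `is_correct_fn` are hypothetical helpers standing in for model inference and answer grading; they are not part of the paper's code.
```python
def difficulty_filter(questions, solve_fn, is_correct_fn,
                      reference_models=("Qwen2.5-7B-Instruct", "Qwen2.5-32B-Instruct")):
    """Keep only questions that none of the reference models solves correctly."""
    hard = []
    for q in questions:
        if any(is_correct_fn(q, solve_fn(model, q["prompt"])) for model in reference_models):
            continue  # a reference model already solves it -> too easy, drop it
        hard.append(q)
    return hard
```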
Data Format for Supervised Fine-Tuning (SFT)
• Each sample includes:
- Question: The prompt or problem statement.
- Reasoning Trace: A detailed, multi-step chain of thought generated or validated for correctness.
- Final Answer: The concise solution, typically a short string or numeric result.
• During fine-tuning, the model sees the Question as input and is trained to generate the Reasoning Trace followed by the Final Answer, so that at inference time it produces its own chain of thought before answering.
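One plausible way to assemble such a training example is shown below; the field names and the `<think>` delimiters are assumptions for illustration, and the released s1K schema may differ.
```python
def build_sft_example(sample):
    """Turn one s1K-style record into a (prompt, completion) pair for SFT."""
    prompt = sample["question"]
    completion = (
        "<think>\n" + sample["reasoning_trace"] + "\n</think>\n"
        + "Final Answer: " + sample["final_answer"]
    )
    return {"prompt": prompt, "completion": completion}
```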
Fine-Tuning Process & Resource Efficiency
- The base model is Qwen2.5-32B-Instruct.
- They perform supervised fine-tuning (SFT) on the 1,000 s1K samples (question + reasoning trace + answer).
- This entire process takes only 26 minutes on 16 H100 GPUs, showcasing that even a small, carefully curated dataset can significantly enhance reasoning performance with minimal training time.
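As a rough illustration of the objective (not the authors' training script), a single SFT step with a Hugging Face-style causal LM could look like the following; in practice examples are batched, padded, and sharded across GPUs, and whether prompt tokens are masked out of the loss is a detail not assumed here.
```python
def sft_step(model, tokenizer, example, optimizer, device="cuda"):
    """One supervised fine-tuning step on a single (prompt, completion) example."""
    text = example["prompt"] + "\n" + example["completion"] + tokenizer.eos_token
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    # Causal-LM loss: the model shifts the labels internally for next-token prediction.
    out = model(input_ids=input_ids, labels=input_ids)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```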
Key Points on s1
• Single-Path Test-Time Scaling: A simple trick, preventing the model from stopping its reasoning too early, enables effective scaling along a single reasoning chain and outperforms parallel sampling in the paper's experiments.
• Efficient Data: The s1K dataset shows how a small, high-quality set of questions and model-generated reasoning traces can significantly boost reasoning performance.
Comment: Automated, high-quality data generation without human annotations is a promising trend. Higher-quality fine-tuning data and more effective objectives (e.g., GRPO) may be the way forward rather than RLHF; RLHF is not what we need. It is also worth noting that the success of s1K might be partly due to using the model's own reasoning paths, so when generating high-quality fine-tuning data, model-specific factors could be key to future advances.


