DeepSeekMath-V2
The Full Iterative Cycle — Every Input, Every Output, Every Transformation
The Cast of Characters (Models)
Before the cycle, understand who the players are:
Symbol | Name          | What it does
πφ     | Verifier      | Takes (problem, proof) → outputs critique + score
πη     | Meta-Verifier | Takes (problem, proof, critique) → outputs a quality score for the critique
πθ     | Generator     | Takes (problem) → outputs proof
All three are initialized from the same base: DeepSeek-V3.2-Exp-SFT (a model already SFT'd on math and code reasoning data). They diverge through separate RL training.
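To keep the data flow straight, here is a minimal sketch of the three roles as Python signatures. The function names and type hints are illustrative assumptions, not from the paper; in practice each role is the same LLM architecture prompted with the appropriate rubrics.

```python
# Illustrative signatures only; each role is really an LLM call with the right prompt and rubrics.
from typing import Tuple

def generator(problem: str) -> str:
    """pi_theta: problem -> proof."""
    ...

def verifier(problem: str, proof: str) -> Tuple[str, float]:
    """pi_phi: (problem, proof) -> (critique, score in {0, 0.5, 1})."""
    ...

def meta_verifier(problem: str, proof: str, critique: str) -> float:
    """pi_eta: (problem, proof, critique) -> quality score of the critique, in [0, 1]."""
    ...
```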
PHASE 1 — Build the Verifier πφ
Step 1A — Cold Start Data Construction
INPUT:
- 17,503 AoPS contest problems (𝒟ₚ)
- DeepSeek-V3.2-Exp-Thinking (a weak, pre-optimization model)
- Human math experts
PROCESS:
1. Feed each problem to the weak model → generate candidate proofs
(the model is prompted through multiple rounds of self-refinement, since it is error-prone)
2. Sample a subset of (problem, proof) pairs across diverse types
3. Human experts score each proof: s ∈ {0, 0.5, 1}
OUTPUT:
𝒟ᵥ = { (Xᵢ, Yᵢ, sᵢ) }
Each item = one problem, one proof, one expert score
This is your ground truth for what good vs. bad proofs look like.
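For concreteness, here is a minimal sketch of one record in 𝒟ᵥ; the field names and the toy problem are illustrative, not taken from the paper.

```python
# One cold-start record (X_i, Y_i, s_i); field names and the example are illustrative.
from dataclasses import dataclass

@dataclass
class VerificationExample:
    problem: str         # X_i: contest problem statement
    proof: str           # Y_i: candidate proof from the weak early model
    expert_score: float  # s_i in {0, 0.5, 1}, assigned by a human expert

record = VerificationExample(
    problem="Prove that the sum of two even integers is even.",
    proof="Write a = 2m and b = 2n. Then a + b = 2(m + n), which is even.",
    expert_score=1.0,
)
```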
Step 1B — Train Verifier v1 (Basic RL)
INPUT:
- Base model: DeepSeek-V3.2-Exp-SFT
- Dataset 𝒟ᵥ = { (problem, proof, expert_score) }
- Rubrics 𝓘ᵥ (evaluation guidelines, in the prompt)
RL TRAINING LOOP (GRPO):
For each training sample (X, Y, s):
1. Feed (problem X, proof Y, rubrics 𝓘ᵥ) into the model
2. Model generates response V' containing:
- Natural language critique ("The issue is that step 3 skips...")
- A predicted score s' ∈ {0, 0.5, 1} inside \boxed{}
3. Compute reward:
R_format = 1 if output has required structure, else 0
R_score = 1 - |s' - s| ← continuous, 0 to 1
R_total = R_format × R_score ← gated: no structure = no reward
4. GRPO updates model weights to increase probability of high-reward outputs
OUTPUT:
πφ (v1) — a verifier that predicts correct scores
⚠️ PROBLEM: it may hallucinate fake issues to justify its scores
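A minimal sketch of the gated reward from this step, assuming the predicted score is reported inside \boxed{...}. The parser and function names are my own, and the real structure check is richer than just finding the box.

```python
import re

def extract_boxed_score(response: str):
    """Pull the predicted score out of \\boxed{...}; returns None if missing (toy parser)."""
    m = re.search(r"\\boxed\{\s*(0(?:\.5)?|1(?:\.0)?)\s*\}", response)
    return float(m.group(1)) if m else None

def verifier_reward_v1(response: str, expert_score: float) -> float:
    """R_total = R_format * R_score, as described above (simplified)."""
    predicted = extract_boxed_score(response)
    r_format = 1.0 if predicted is not None else 0.0  # stand-in for the full structure check
    if not r_format:
        return 0.0                                    # gated: no structure, no reward
    r_score = 1.0 - abs(predicted - expert_score)     # continuous, 0 to 1
    return r_format * r_score
```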
Step 1C — Train the Meta-Verifier πη
INPUT:
- πφ (v1) — the verifier just trained above
- 𝒟ᵥ — the same problems and proofs
- Human experts (again)
PROCESS:
1. Run πφ (v1) on each (X, Y) → produces critique Vᵢ
2. Human experts score the QUALITY of each critique: ms ∈ {0, 0.5, 1}
(Not "was the proof score right?" but "were the identified issues real
and do they actually justify the score?")
3. This creates 𝒟ₘᵥ = { (Xᵢ, Yᵢ, Vᵢ, msᵢ) }
RL TRAINING (same GRPO structure):
Feed (problem X, proof Y, critique V, meta-rubrics 𝓘ₘᵥ) into a fresh model
Model outputs: meta-critique + meta quality score
Reward = same R_format × R_score structure, but the target score is now the expert meta quality score ms
OUTPUT:
πη — a meta-verifier that outputs R_meta ∈ [0, 1]
This is the model that catches hallucinated issues
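A sketch of how 𝒟ₘᵥ is assembled, assuming the verifier and the expert labeling step are available as callables; all names here are placeholders.

```python
# Build D_mv from D_v: critique each proof with verifier v1, then have humans grade the critique.
def build_meta_dataset(d_v, verifier_v1, expert_label_critique):
    d_mv = []
    for problem, proof, _expert_score in d_v:
        critique, _predicted = verifier_v1(problem, proof)    # run the trained verifier v1
        ms = expert_label_critique(problem, proof, critique)  # human: ms in {0, 0.5, 1}
        d_mv.append((problem, proof, critique, ms))
    return d_mv
```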
Step 1D — Retrain Enhanced Verifier πφ (v2)
INPUT:
- Base model: DeepSeek-V3.2-Exp-SFT (fresh start)
- 𝒟ᵥ AND 𝒟ₘᵥ (both datasets)
- Trained meta-verifier πη (now frozen, used only as a reward signal)
RL TRAINING:
For each (X, Y, s) from 𝒟ᵥ:
1. Model generates critique V' and predicted score s'
2. Compute rewards:
R_format = structure check (same as before)
R_score = 1 - |s' - s|
R_meta = πη( X, Y, V' ) → quality score of the critique
R_total = R_format × R_score × R_meta ← NEW: meta gates everything
3. GRPO update
For each (X, Y, V, ms) from 𝒟ₘᵥ:
Same training, but now the model also learns to DO meta-verification
OUTPUT:
πφ (v2) — one model that can do BOTH:
• Verify proofs (score 0/0.5/1 with real, faithful reasoning)
• Meta-verify critiques (score the quality of any analysis)
Critique quality score: 0.85 → 0.96 on the validation set
Proof score accuracy: unchanged ✓
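The enhanced reward can be sketched the same way, now with the frozen meta-verifier as an extra multiplicative gate. Names are placeholders and the format check is reduced to "was a score parsed at all", as in the Step 1B sketch.

```python
def verifier_reward_v2(problem, proof, critique, predicted_score, expert_score, meta_verifier):
    """R_total = R_format * R_score * R_meta (simplified sketch)."""
    if predicted_score is None:                           # R_format gate: no structure, no reward
        return 0.0
    r_score = 1.0 - abs(predicted_score - expert_score)   # agreement with the expert label
    r_meta = meta_verifier(problem, proof, critique)       # in [0, 1]: are the cited issues real?
    return r_score * r_meta                                # a hallucinated critique earns ~0
```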
PHASE 2 — Train the Generator πθ
INPUT:
- Base model: DeepSeek-V3.2-Exp-SFT (fresh start again)
- Problem set 𝒟ₚ (all 17,503 AoPS problems)
- Trained verifier πφ (v2) — now FROZEN, used only as reward signal
RL TRAINING (GRPO):
For each problem Xᵢ:
1. Generator produces proof Yᵢ
2. Feed (Xᵢ, Yᵢ) to frozen πφ → get score R_Y ∈ {0, 0.5, 1}
3. R_Y is the entire reward signal
4. GRPO updates generator to produce higher-scoring proofs
OUTPUT:
πθ (v1) — a proof generator trained to write proofs
that satisfy the verifier's rubrics
Key property: reward signal exists for ALL proof problems,
not just ones with numerical answers
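A simplified sketch of one group rollout for the generator, with the frozen verifier as the only reward source. The group size and the plain group-mean baseline are assumptions for illustration (GRPO also normalizes advantages by the group standard deviation).

```python
def generator_group_rollout(problem, generator, frozen_verifier, group_size=8):
    """Sample a group of proofs and score them with the frozen verifier (illustrative)."""
    proofs = [generator(problem) for _ in range(group_size)]    # G candidate proofs
    rewards = [frozen_verifier(problem, y)[1] for y in proofs]  # R_Y in {0, 0.5, 1}
    baseline = sum(rewards) / len(rewards)                      # group mean as baseline
    advantages = [r - baseline for r in rewards]                # group-relative advantages
    return proofs, advantages                                   # used by the policy-gradient update
```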
PHASE 3 — Self-Verification: Generator Internalizes the Verifier
This is where it gets deep. Instead of always calling the external verifier, the generator is now trained to apply verification inside its own reasoning.
INPUT:
- πθ (v1) — trained generator
- πφ (v2) — trained verifier
- Problem set 𝒟ₚ
TRAINING SETUP:
The model is trained on trajectories of the form:
[Draft proof] → [Self-critique] → [Revised proof] → [Self-critique] → ... → [Final proof]
The reward is assigned on the FINAL proof only.
So the model learns: spending compute on self-critique is worth it
if the final proof improves.
WHAT THE MODEL LEARNS TO DO (at inference):
1. Write a draft proof for problem X
2. Apply verifier logic to its own draft:
"Step 3 assumes Y is bounded, but I never proved that..."
3. Revise the proof to fix the identified issue
4. Check again: "Is there anything else wrong?"
5. Repeat until it can no longer find any issues
6. Output the final proof
OUTPUT:
πθ (v2) — a generator that can self-verify and self-refine
The model now "knows" its reward function and can deliberately
optimize toward it, not through random sampling but through
targeted self-correction
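The behavior the model internalizes can be written out as an explicit loop. This is a sketch of the behavior, not literal code the model runs; the helper names and the round cap are assumptions.

```python
def prove_with_self_refinement(problem, draft, self_critique, revise, max_rounds=4):
    """Draft -> critique own draft -> revise -> repeat until no issues remain (illustrative)."""
    proof = draft(problem)                       # 1. write a draft proof
    for _ in range(max_rounds):
        issues = self_critique(problem, proof)   # 2./4. apply verifier logic to own work
        if not issues:
            break                                # 5. nothing left to fix
        proof = revise(problem, proof, issues)   # 3. targeted correction of the found issues
    return proof                                 # 6. final proof; only this earns reward in training
```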
PHASE 4 — Maintaining the Gap: Scaling Verification Compute
As πθ gets stronger, the proofs it generates become harder to evaluate correctly. The verifier's training data (from the weak early generator) no longer covers the kinds of subtle near-correct proofs the better generator now produces.
PROBLEM:
Strong generator produces hard proofs
→ Verifier hasn't seen this difficulty level
→ Verifier starts making wrong calls
→ Bad reward signal → generator training degrades
SOLUTION — Auto-labeling hard proofs:
INPUT:
- New hard proofs from the improved πθ
- πφ (v2) — current verifier
PROCESS:
For each hard new proof Y on problem X:
Run πφ many times (scale compute)
→ get many score samples {s₁, s₂, s₃, ...}
→ aggregate into a robust consensus score s*
This gives you (X, Y, s*) without human annotation
OUTPUT:
New expanded 𝒟ᵥ with hard-to-verify proofs included
→ Retrain πφ → get πφ (v3) that handles harder proofs
→ Use πφ (v3) to train πθ further → πθ (v3)
→ Repeat
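A minimal sketch of the auto-labeling step, assuming the consensus is a simple majority vote over repeated verifier samples; the sample count and the agreement threshold idea are illustrative, and the paper's aggregation may differ.

```python
from collections import Counter

def consensus_label(problem, proof, verifier, n_samples=32):
    """Grade one hard proof many times and keep the most frequent score s* (illustrative)."""
    scores = [verifier(problem, proof)[1] for _ in range(n_samples)]  # scale verification compute
    score, count = Counter(scores).most_common(1)[0]                  # consensus score s*
    agreement = count / n_samples                                     # confidence in the label
    return score, agreement                                           # keep only high-agreement labels
```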
The Complete Flywheel, All Together
┌─────────────────────────────┐
│ HUMAN EXPERTISE (once)      │
│ Score proofs + critiques    │
│ 𝒟ᵥ and 𝒟ₘᵥ created          │
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│ PHASE 1: BUILD VERIFIER     │
│ v1 RL → Meta-verifier RL    │
│ v2 RL (with R_meta signal)  │
│ → πφ: accurate + faithful   │
└──────────────┬──────────────┘
               │ reward signal
┌──────────────▼──────────────┐
│ PHASE 2: TRAIN GENERATOR    │
│ πθ RL using πφ as reward    │
│ → can write good proofs     │
└──────────────┬──────────────┘
               │ internalize verifier
┌──────────────▼──────────────┐
│ PHASE 3: SELF-VERIFICATION  │
│ Train on refine trajectories│
│ → πθ self-critiques + fixes │
└──────────────┬──────────────┘
               │ generator now stronger
┌──────────────▼──────────────┐
│ PHASE 4: SCALE VERIFY       │
│ New hard proofs auto-labeled│
│ → retrain πφ on harder data │
└──────────────┬──────────────┘
               │ verifier now stronger
               └──► back to PHASE 2
Each lap around this loop makes both the verifier and the generator stronger. Human annotation is needed only once, to bootstrap. After that, the system is self-sustaining.