DeepSeekMath-V2
The Full Iterative Cycle — Every Input, Every Output, Every Transformation
The Cast of Characters (Models)
Before the cycle, understand who the players are:
Symbol | Name          | What it does
πφ     | Verifier      | Takes (problem, proof) → outputs critique + score
πη     | Meta-Verifier | Takes (problem, proof, critique) → outputs a quality score for the critique
πθ     | Generator     | Takes (problem) → outputs proof
All three are initialized from the same base: DeepSeek-V3.2-Exp-SFT (a model already SFT'd on math and code reasoning data). They diverge through separate RL training.
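To keep the data flow straight, here is a minimal sketch of the three roles as Python signatures. The function names and type hints are illustrative assumptions, not from the paper; in practice each role is the same LLM architecture prompted with the appropriate rubrics.

```python
# Illustrative signatures only; each role is really an LLM call with the right prompt and rubrics.
from typing import Tuple

def generator(problem: str) -> str:
    """pi_theta: problem -> proof."""
    ...

def verifier(problem: str, proof: str) -> Tuple[str, float]:
    """pi_phi: (problem, proof) -> (critique, score in {0, 0.5, 1})."""
    ...

def meta_verifier(problem: str, proof: str, critique: str) -> float:
    """pi_eta: (problem, proof, critique) -> quality score of the critique, in [0, 1]."""
    ...
```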
PHASE 1 — Build the Verifier πφ
Step 1A — Cold Start Data Construction
INPUT:
- 17,503 AoPS contest problems (𝒟ₚ)
- DeepSeek-V3.2-Exp-Thinking (a weak, pre-optimization model)
- Human math experts
PROCESS:
1. Feed each problem to the weak model → generate candidate proofs
(the model is prompted through multiple rounds of self-refinement, since it is error-prone)
2. Sample a subset of (problem, proof) pairs across diverse types
3. Human experts score each proof: s ∈ {0, 0.5, 1}
OUTPUT:
𝒟ᵥ = { (Xᵢ, Yᵢ, sᵢ) }
Each item = one problem, one proof, one expert score
This is your ground truth for what good vs. bad proofs look like.
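For concreteness, here is a minimal sketch of one record in 𝒟ᵥ; the field names and the toy problem are illustrative, not taken from the paper.

```python
# One cold-start record (X_i, Y_i, s_i); field names and the example are illustrative.
from dataclasses import dataclass

@dataclass
class VerificationExample:
    problem: str         # X_i: contest problem statement
    proof: str           # Y_i: candidate proof from the weak early model
    expert_score: float  # s_i in {0, 0.5, 1}, assigned by a human expert

record = VerificationExample(
    problem="Prove that the sum of two even integers is even.",
    proof="Write a = 2m and b = 2n. Then a + b = 2(m + n), which is even.",
    expert_score=1.0,
)
```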
Step 1B — Train Verifier v1 (Basic RL)
INPUT:
- Base model: DeepSeek-V3.2-Exp-SFT
- Dataset 𝒟ᵥ = { (problem, proof, expert_score) }
- Rubrics 𝓘ᵥ (evaluation guidelines, in the prompt)
RL TRAINING LOOP (GRPO):
For each training sample (X, Y, s):
1. Feed (problem X, proof Y, rubrics 𝓘ᵥ) into the model
2. Model generates response V' containing:
- Natural language critique ("The issue is that step 3 skips...")
- A predicted score s' ∈ {0, 0.5, 1} inside \boxed{}
3. Compute reward:
R_format = 1 if output has required structure, else 0
R_score = 1 - |s' - s| ← continuous, 0 to 1
R_total = R_format × R_score ← gated: no structure = no reward
4. GRPO updates model weights to increase probability of high-reward outputs
OUTPUT:
πφ (v1) — a verifier that predicts correct scores
⚠️ PROBLEM: it may hallucinate fake issues to justify its scores
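A minimal sketch of the gated reward from this step, assuming the predicted score is reported inside \boxed{...}. The parser and function names are my own, and the real structure check is richer than just finding the box.

```python
import re

def extract_boxed_score(response: str):
    """Pull the predicted score out of \\boxed{...}; returns None if missing (toy parser)."""
    m = re.search(r"\\boxed\{\s*(0(?:\.5)?|1(?:\.0)?)\s*\}", response)
    return float(m.group(1)) if m else None

def verifier_reward_v1(response: str, expert_score: float) -> float:
    """R_total = R_format * R_score, as described above (simplified)."""
    predicted = extract_boxed_score(response)
    r_format = 1.0 if predicted is not None else 0.0  # stand-in for the full structure check
    if not r_format:
        return 0.0                                    # gated: no structure, no reward
    r_score = 1.0 - abs(predicted - expert_score)     # continuous, 0 to 1
    return r_format * r_score
```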
Step 1C — Train the Meta-Verifier πη
INPUT:
- πφ (v1) — the verifier just trained above
- 𝒟ᵥ — the same problems and proofs
- Human experts (again)
PROCESS:
1. Run πφ (v1) on each (X, Y) → produces critique Vᵢ
2. Human experts score the QUALITY of each critique: ms ∈ {0, 0.5, 1}
(Not "was the proof score right?" but "were the identified issues real
and do they actually justify the score?")
3. This creates 𝒟ₘᵥ = { (Xᵢ, Yᵢ, Vᵢ, msᵢ) }
RL TRAINING (same GRPO structure):
Feed (problem X, proof Y, critique V, meta-rubrics 𝓘ₘᵥ) into a fresh model
Model outputs: meta-critique + meta quality score
Reward = same R_format × R_score structure, but the target score is now the expert meta quality score ms
OUTPUT:
πη — a meta-verifier that outputs R_meta ∈ [0, 1]
This is the model that catches hallucinated issues
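A sketch of how 𝒟ₘᵥ is assembled, assuming the verifier and the expert labeling step are available as callables; all names here are placeholders.

```python
# Build D_mv from D_v: critique each proof with verifier v1, then have humans grade the critique.
def build_meta_dataset(d_v, verifier_v1, expert_label_critique):
    d_mv = []
    for problem, proof, _expert_score in d_v:
        critique, _predicted = verifier_v1(problem, proof)    # run the trained verifier v1
        ms = expert_label_critique(problem, proof, critique)  # human: ms in {0, 0.5, 1}
        d_mv.append((problem, proof, critique, ms))
    return d_mv
```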
Step 1D — Retrain Enhanced Verifier πφ (v2)
INPUT:
- Base model: DeepSeek-V3.2-Exp-SFT (fresh start)
- 𝒟ᵥ AND 𝒟ₘᵥ (both datasets)
- Trained meta-verifier πη (now frozen, used only as a reward signal)
RL TRAINING:
For each (X, Y, s) from 𝒟ᵥ:
1. Model generates critique V' and predicted score s'
2. Compute rewards:
R_format = structure check (same as before)
R_score = 1 - |s' - s|
R_meta = πη( X, Y, V' ) → quality score of the critique
R_total = R_format × R_score × R_meta ← NEW: meta gates everything
3. GRPO update
For each (X, Y, V, ms) from 𝒟ₘᵥ:
Same training, but now the model also learns to DO meta-verification
OUTPUT:
πφ (v2) — one model that can do BOTH:
• Verify proofs (score 0/0.5/1 with real, faithful reasoning)
• Meta-verify critiques (score the quality of any analysis)
Critique quality score: 0.85 → 0.96 on the validation set
Proof score accuracy: unchanged ✓
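The enhanced reward can be sketched the same way, now with the frozen meta-verifier as an extra multiplicative gate. Names are placeholders and the format check is reduced to "was a score parsed at all", as in the Step 1B sketch.

```python
def verifier_reward_v2(problem, proof, critique, predicted_score, expert_score, meta_verifier):
    """R_total = R_format * R_score * R_meta (simplified sketch)."""
    if predicted_score is None:                           # R_format gate: no structure, no reward
        return 0.0
    r_score = 1.0 - abs(predicted_score - expert_score)   # agreement with the expert label
    r_meta = meta_verifier(problem, proof, critique)       # in [0, 1]: are the cited issues real?
    return r_score * r_meta                                # a hallucinated critique earns ~0
```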
PHASE 2 — Train the Generator πθ
INPUT:
- Base model: DeepSeek-V3.2-Exp-SFT (fresh start again)
- Problem set 𝒟ₚ (all 17,503 AoPS problems)
- Trained verifier πφ (v2) — now FROZEN, used only as reward signal
RL TRAINING (GRPO):
For each problem Xᵢ:
1. Generator produces proof Yᵢ
2. Feed (Xᵢ, Yᵢ) to frozen πφ → get score R_Y ∈ {0, 0.5, 1}
3. R_Y is the entire reward signal
4. GRPO updates generator to produce higher-scoring proofs
OUTPUT:
πθ (v1) — a proof generator trained to write proofs
that satisfy the verifier's rubrics
Key property: reward signal exists for ALL proof problems,
not just ones with numerical answers
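A simplified sketch of one group rollout for the generator, with the frozen verifier as the only reward source. The group size and the plain group-mean baseline are assumptions for illustration (GRPO also normalizes advantages by the group standard deviation).

```python
def generator_group_rollout(problem, generator, frozen_verifier, group_size=8):
    """Sample a group of proofs and score them with the frozen verifier (illustrative)."""
    proofs = [generator(problem) for _ in range(group_size)]    # G candidate proofs
    rewards = [frozen_verifier(problem, y)[1] for y in proofs]  # R_Y in {0, 0.5, 1}
    baseline = sum(rewards) / len(rewards)                      # group mean as baseline
    advantages = [r - baseline for r in rewards]                # group-relative advantages
    return proofs, advantages                                   # used by the policy-gradient update
```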
PHASE 3 — Self-Verification: Generator Internalizes the Verifier
This is where it gets deep. Instead of always calling the external verifier, the generator is now trained to apply verification inside its own reasoning.
INPUT:
- πθ (v1) — trained generator
- πφ (v2) — trained verifier
- Problem set 𝒟ₚ
TRAINING SETUP:
The model is trained on trajectories of the form:
[Draft proof] → [Self-critique] → [Revised proof] → [Self-critique] → ... → [Final proof]
The reward is assigned on the FINAL proof only.
So the model learns: spending compute on self-critique is worth it
if the final proof improves.
WHAT THE MODEL LEARNS TO DO (at inference):
1. Write a draft proof for problem X
2. Apply verifier logic to its own draft:
"Step 3 assumes Y is bounded, but I never proved that..."
3. Revise the proof to fix the identified issue
4. Check again: "Is there anything else wrong?"
5. Repeat until it can no longer find any issues
6. Output the final proof
OUTPUT:
πθ (v2) — a generator that can self-verify and self-refine
The model now "knows" its reward function and can deliberately
optimize toward it, not through random sampling but through
targeted self-correction
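The behavior the model internalizes can be written out as an explicit loop. This is a sketch of the behavior, not literal code the model runs; the helper names and the round cap are assumptions.

```python
def prove_with_self_refinement(problem, draft, self_critique, revise, max_rounds=4):
    """Draft -> critique own draft -> revise -> repeat until no issues remain (illustrative)."""
    proof = draft(problem)                       # 1. write a draft proof
    for _ in range(max_rounds):
        issues = self_critique(problem, proof)   # 2./4. apply verifier logic to own work
        if not issues:
            break                                # 5. nothing left to fix
        proof = revise(problem, proof, issues)   # 3. targeted correction of the found issues
    return proof                                 # 6. final proof; only this earns reward in training
```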
PHASE 4 — Maintaining the Gap: Scaling Verification Compute
As πθ gets stronger, the proofs it generates become harder to evaluate correctly. The verifier's training data (from the weak early generator) no longer covers the kinds of subtle near-correct proofs the better generator now produces.
PROBLEM:
Strong generator produces hard proofs
→ Verifier hasn't seen this difficulty level
→ Verifier starts making wrong calls
→ Bad reward signal → generator training degrades
SOLUTION — Auto-labeling hard proofs:
INPUT:
- New hard proofs from the improved πθ
- πφ (v2) — current verifier
PROCESS:
For each hard new proof Y on problem X:
Run πφ many times (scale compute)
→ get many score samples {s₁, s₂, s₃, ...}
→ aggregate into a robust consensus score s*
This gives you (X, Y, s*) without human annotation
OUTPUT:
New expanded 𝒟ᵥ with hard-to-verify proofs included
→ Retrain πφ → get πφ (v3) that handles harder proofs
→ Use πφ (v3) to train πθ further → πθ (v3)
→ Repeat
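A minimal sketch of the auto-labeling step, assuming the consensus is a simple majority vote over repeated verifier samples; the sample count and the agreement threshold idea are illustrative, and the paper's aggregation may differ.

```python
from collections import Counter

def consensus_label(problem, proof, verifier, n_samples=32):
    """Grade one hard proof many times and keep the most frequent score s* (illustrative)."""
    scores = [verifier(problem, proof)[1] for _ in range(n_samples)]  # scale verification compute
    score, count = Counter(scores).most_common(1)[0]                  # consensus score s*
    agreement = count / n_samples                                     # confidence in the label
    return score, agreement                                           # keep only high-agreement labels
```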
The Complete Flywheel, All Together
┌─────────────────────────────┐
│ HUMAN EXPERTISE (once)      │
│ Score proofs + critiques    │
│ 𝒟ᵥ and 𝒟ₘᵥ created          │
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│ PHASE 1: BUILD VERIFIER     │
│ v1 RL → Meta-verifier RL    │
│ v2 RL (with R_meta signal)  │
│ → πφ: accurate + faithful   │
└──────────────┬──────────────┘
               │ reward signal
┌──────────────▼──────────────┐
│ PHASE 2: TRAIN GENERATOR    │
│ πθ RL using πφ as reward    │
│ → can write good proofs     │
└──────────────┬──────────────┘
               │ internalize verifier
┌──────────────▼──────────────┐
│ PHASE 3: SELF-VERIFICATION  │
│ Train on refine trajectories│
│ → πθ self-critiques + fixes │
└──────────────┬──────────────┘
               │ generator now stronger
┌──────────────▼──────────────┐
│ PHASE 4: SCALE VERIFY       │
│ New hard proofs auto-labeled│
│ → retrain πφ on harder data │
└──────────────┬──────────────┘
               │ verifier now stronger
               └──► back to PHASE 2
Each lap around this loop makes both the verifier and the generator stronger. Human annotation is needed only once, to bootstrap. After that, the system is self-sustaining.