🎯 CoR: Chain of Reward

Endogenous Self-Evaluation for Sample-Efficient Reasoning

Key Insight

Rather than rewarding only correct answers, CoR also rewards the model for accurate self-assessment. The model emits self-ratings while it reasons and learns to calibrate its confidence against the actual outcome, which is what makes self-correction possible.

Core Formula

R(c) = R_ext + λ·R_int + μ·R_improve + ν·R_converge
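
The formula is a weighted sum of four reward terms, which can be sketched in a few lines of Python. This is only an illustration: the `RewardTerms` class, the `total_reward` function, the readings of each term given in the comments, and all numeric values are assumptions made for this example, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RewardTerms:
    """The four components named in the formula (hypothetical container)."""
    ext: float       # R_ext: assumed to be the external, answer-based reward
    int_: float      # R_int: assumed to be the intrinsic, self-evaluation reward
    improve: float   # R_improve: term named in the formula, not defined here
    converge: float  # R_converge: term named in the formula, not defined here


def total_reward(t: RewardTerms, lam: float, mu: float, nu: float) -> float:
    """Compute R(c) = R_ext + lam * R_int + mu * R_improve + nu * R_converge."""
    return t.ext + lam * t.int_ + mu * t.improve + nu * t.converge


# Placeholder values chosen only to make the arithmetic easy to follow.
terms = RewardTerms(ext=1.0, int_=0.5, improve=0.25, converge=0.0)
reward = total_reward(terms, lam=0.5, mu=0.5, nu=0.5)
print(reward)  # 1.0 + 0.25 + 0.125 + 0.0 = 1.375
```

Note that R_ext carries an implicit weight of 1, so λ, μ, and ν set how much each additional term counts relative to the external reward.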

Results: 800× Sample Efficiency

| Model | Training samples | AIME24 | MATH500 | GPQA |
|---|---|---|---|---|
| o1-preview | N.A. | 44.6 | 85.5 | 73.3 |
| r1-distill | 800K | 72.6 | 94.3 | 62.1 |
| CoR-32B (Ours) | 1K | 56.7 | 93.0 | 59.6 |
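
The 800× figure in the heading follows from the sample counts in the table (800K for r1-distill versus 1K for CoR-32B). The short Python check below reproduces that ratio and the score gaps between the two models. The `parse_count` helper is hypothetical and written only for this example; the numbers are copied from the table.

```python
def parse_count(text: str) -> int:
    """Parse counts such as '800K' or '1K' into integers (hypothetical helper)."""
    suffixes = {"K": 1_000, "M": 1_000_000}
    text = text.strip().upper()
    if text and text[-1] in suffixes:
        return int(float(text[:-1]) * suffixes[text[-1]])
    return int(text)


# Values copied from the results table.
r1_distill = {"samples": "800K", "AIME24": 72.6, "MATH500": 94.3, "GPQA": 62.1}
cor_32b = {"samples": "1K", "AIME24": 56.7, "MATH500": 93.0, "GPQA": 59.6}

# How many times more samples r1-distill used than CoR-32B.
ratio = parse_count(r1_distill["samples"]) // parse_count(cor_32b["samples"])
print(ratio)  # 800

# Score gap (r1-distill minus CoR-32B) on each benchmark, rounded to one decimal.
gaps = {
    name: round(r1_distill[name] - cor_32b[name], 1)
    for name in ("AIME24", "MATH500", "GPQA")
}
print(gaps)
```

Read together, the ratio and the gaps show the trade-off the table reports: CoR-32B uses 800 times fewer samples and trails r1-distill on all three benchmarks, most visibly on AIME24.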

Paper