Endogenous Self-Evaluation for Sample-Efficient Reasoning
Rather than rewarding only correct answers, we also reward the model for accurate self-assessment. The model generates self-ratings during reasoning and learns to calibrate its confidence, which enables self-correction.
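
To make the idea concrete, here is a minimal sketch of a reward that combines answer correctness with a calibration term for the self-rating. It illustrates the general idea only and is not the training objective used here: the function name `self_eval_reward`, the Brier-style calibration term, the `[0, 1]` rating scale, and the `weight` parameter are all assumptions.

```python
def self_eval_reward(is_correct: bool, self_rating: float, weight: float = 0.5) -> float:
    """Hypothetical reward: answer correctness plus a calibration bonus.

    self_rating is the model's stated confidence in its own answer,
    assumed here to lie in [0, 1]. weight is an assumed mixing factor.
    """
    if not 0.0 <= self_rating <= 1.0:
        raise ValueError("self_rating must lie in [0, 1]")
    outcome = 1.0 if is_correct else 0.0
    # Brier-style term (an assumption): 1.0 when the rating matches the
    # outcome exactly, 0.0 when the rating is maximally wrong.
    calibration = 1.0 - (self_rating - outcome) ** 2
    return (1.0 - weight) * outcome + weight * calibration
```

Under these assumptions, a wrong answer that the model rates as wrong (`self_eval_reward(False, 0.0)`) earns more than a wrong answer it rates as certainly right (`self_eval_reward(False, 1.0)`), so the signal favors honest self-ratings even when the answer is incorrect.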

| Model | Training samples | AIME24 | MATH500 | GPQA |
|---|---|---|---|---|
| o1-preview | N.A. | 44.6 | 85.5 | 73.3 |
| r1-distill | 800K | 72.6 | 94.3 | 62.1 |
| CoR-32B (Ours) | 1K | 56.7 | 93.0 | 59.6 |