Theory
RLVR research foundations
RLVR: Reinforcement Learning with Verifiable Rewards
LLM post-training approaches by feedback source:
| Approach | Feedback Source |
|---|---|
| RLHF | Human preferences |
| RLAIF | AI preferences |
| RLVR | Automated verification |
RLVR uses verifiable outcomes as reward signals (a verifier sketch follows this list):
- Does the code compile?
- Does it execute correctly?
- Does it evade detection?
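As a concrete illustration, here is a minimal sketch of such a verifier in Python. It scores one candidate at a time and assumes each sample is a Python source string; the graduated reward values (0.0/0.3/0.6/1.0) and the `is_flagged` detector are illustrative assumptions, not part of the original design.

```python
import subprocess
import sys


def is_flagged(output: bytes) -> bool:
    """Hypothetical detector standing in for the 'does it evade detection?' check."""
    return b"DETECTED" in output  # placeholder heuristic, assumed for this sketch


def verify(sample: str, timeout_s: int = 5) -> float:
    """Graduated verifiable reward in [0, 1]: compile, then run, then detection check."""
    # Check 1: does the code compile (parse)?
    try:
        compile(sample, "<sample>", "exec")
    except SyntaxError:
        return 0.0
    # Check 2: does it execute correctly? (no sandboxing in this sketch)
    try:
        proc = subprocess.run(
            [sys.executable, "-c", sample],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.3
    if proc.returncode != 0:
        return 0.3
    # Check 3: does it evade the (hypothetical) detector?
    return 0.6 if is_flagged(proc.stdout) else 1.0
```

Because the checks are automated functions of the sample alone, the reward requires no human annotation, which is what the advantages below rely on.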
Advantages
- Objective: binary or graduated outcomes, no subjective judgment
- Scalable: no human annotation required
- Domain-specific: the verifier encodes the exact constraints of the target task
- Deterministic: the same input always yields the same reward
RAFT Algorithm
RAFT (Reward rAnked FineTuning) is the algorithm we implement:
```
For cycle = 1 to N:
    1. samples  = generate(model, prompts, k)
    2. rewards  = verify(samples)
    3. filtered = rank_and_filter(samples, rewards, threshold)
    4. model    = fine_tune(model, filtered)
```
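Written out as a compact Python sketch, the loop looks like the following; `generate_fn` and `fine_tune_fn` are hypothetical stand-ins for whatever generation and fine-tuning backend is in use, and the defaults for `k`, `threshold`, and `n_cycles` are assumptions for illustration.

```python
from typing import Callable, List, Tuple


def raft(
    prompts: List[str],
    generate_fn: Callable[[str, int], List[str]],  # hypothetical: k completions per prompt
    verify_fn: Callable[[str], float],             # verifiable reward, e.g. verify() above
    fine_tune_fn: Callable[[List[str]], None],     # hypothetical: SFT step on kept samples
    k: int = 8,
    threshold: float = 1.0,
    n_cycles: int = 3,
) -> None:
    """RAFT: generate -> verify -> rank/filter -> fine-tune, repeated for n_cycles."""
    for _ in range(n_cycles):
        # 1. Sample k candidates per prompt from the current model.
        samples = [s for p in prompts for s in generate_fn(p, k)]
        # 2. Score every candidate with the verifiable reward.
        scored: List[Tuple[str, float]] = [(s, verify_fn(s)) for s in samples]
        # 3. Rank by reward and keep only candidates at or above the threshold.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        filtered = [s for s, r in scored if r >= threshold]
        # 4. Fine-tune on the surviving samples (rejection-sampled SFT).
        if filtered:
            fine_tune_fn(filtered)
```

Only one model is ever trained and scoring is plain function evaluation, which is why RAFT sits at 1× model copies in the comparison table below.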
Key Insight
RAFT is essentially iterated rejection sampling (a filter sketch follows this list):
- Generate many candidates
- Keep only the good ones
- Learn from successes
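One common way to realize the rank-and-filter step is a percentile cut rather than a fixed reward threshold: keep only the top fraction of candidates each cycle. The sketch below is one such filter; the `keep_frac` default of 1/8 is an assumption chosen so that, with k = 8 samples per prompt, roughly the single best candidate per prompt survives.

```python
def rank_and_filter(
    samples: list[str],
    rewards: list[float],
    keep_frac: float = 0.125,  # assumed: 1/k keeps ~the best sample per prompt
) -> list[str]:
    """Rejection-sampling filter: keep the top keep_frac of candidates by reward."""
    n_keep = max(1, int(len(samples) * keep_frac))
    ranked = sorted(zip(samples, rewards), key=lambda pair: pair[1], reverse=True)
    return [s for s, _ in ranked[:n_keep]]
```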
Comparison
| Method | Model copies in memory | Stability | Training speed |
|---|---|---|---|
| PPO | 2-4× | Medium | Slow |
| GRPO | 2× | High | Medium |
| RAFT | 1× | High | Fast |
References
- RAFT: Dong et al., “RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment” (2023)
- RLVR: Lambert et al., “Tülu 3: Pushing Frontiers in Open Language Model Post-Training” (2024), which introduces reinforcement learning with verifiable rewards (RLVR)