Theory
RLVR research foundations
RLVR: Reinforcement Learning with Verifiable Rewards
LLM post-training approaches by feedback source:
| Approach | Feedback Source |
|---|---|
| RLHF | Human preferences |
| RLAIF | AI preferences |
| RLVR | Automated verification |
RLVR uses verifiable outcomes as reward signals (a verifier sketch follows this list):
- Does the code compile?
- Does it execute correctly?
- Does it evade detection?
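As a concrete illustration, here is a minimal sketch of such a verifier in Python. It scores one candidate at a time and assumes each sample is a Python source string; the graduated reward values (0.0/0.3/0.6/1.0) and the `is_flagged` detector are illustrative assumptions, not part of the original design.

```python
import subprocess
import sys


def is_flagged(output: bytes) -> bool:
    """Hypothetical detector standing in for the 'does it evade detection?' check."""
    return b"DETECTED" in output  # placeholder heuristic, assumed for this sketch


def verify(sample: str, timeout_s: int = 5) -> float:
    """Graduated verifiable reward in [0, 1]: compile, then run, then detection check."""
    # Check 1: does the code compile (parse)?
    try:
        compile(sample, "<sample>", "exec")
    except SyntaxError:
        return 0.0
    # Check 2: does it execute correctly? (no sandboxing in this sketch)
    try:
        proc = subprocess.run(
            [sys.executable, "-c", sample],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.3
    if proc.returncode != 0:
        return 0.3
    # Check 3: does it evade the (hypothetical) detector?
    return 0.6 if is_flagged(proc.stdout) else 1.0
```

Because the checks are automated functions of the sample alone, the reward requires no human annotation, which is what the advantages below rely on.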
Advantages
- Objective: binary or graduated outcomes, no subjective judgment
- Scalable: no human annotation required
- Domain-specific: the verifier encodes the exact constraints of the target task
- Deterministic: the same input always yields the same reward
RAFT Algorithm
RAFT (Reward rAnked FineTuning) is the algorithm we implement:
```
For cycle = 1 to N:
    1. samples  = generate(model, prompts, k)
    2. rewards  = verify(samples)
    3. filtered = rank_and_filter(samples, rewards, threshold)
    4. model    = fine_tune(model, filtered)
```
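Written out as a compact Python sketch, the loop looks like the following; `generate_fn` and `fine_tune_fn` are hypothetical stand-ins for whatever generation and fine-tuning backend is in use, and the defaults for `k`, `threshold`, and `n_cycles` are assumptions for illustration.

```python
from typing import Callable, List, Tuple


def raft(
    prompts: List[str],
    generate_fn: Callable[[str, int], List[str]],  # hypothetical: k completions per prompt
    verify_fn: Callable[[str], float],             # verifiable reward, e.g. verify() above
    fine_tune_fn: Callable[[List[str]], None],     # hypothetical: SFT step on kept samples
    k: int = 8,
    threshold: float = 1.0,
    n_cycles: int = 3,
) -> None:
    """RAFT: generate -> verify -> rank/filter -> fine-tune, repeated for n_cycles."""
    for _ in range(n_cycles):
        # 1. Sample k candidates per prompt from the current model.
        samples = [s for p in prompts for s in generate_fn(p, k)]
        # 2. Score every candidate with the verifiable reward.
        scored: List[Tuple[str, float]] = [(s, verify_fn(s)) for s in samples]
        # 3. Rank by reward and keep only candidates at or above the threshold.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        filtered = [s for s, r in scored if r >= threshold]
        # 4. Fine-tune on the surviving samples (rejection-sampled SFT).
        if filtered:
            fine_tune_fn(filtered)
```

Only one model is ever trained and scoring is plain function evaluation, which is why RAFT sits at 1× model copies in the comparison table below.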
Key Insight
RAFT is essentially iterated rejection sampling (a filter sketch follows this list):
- Generate many candidates
- Keep only the good ones
- Learn from successes
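One common way to realize the rank-and-filter step is a percentile cut rather than a fixed reward threshold: keep only the top fraction of candidates each cycle. The sketch below is one such filter; the `keep_frac` default of 1/8 is an assumption chosen so that, with k = 8 samples per prompt, roughly the single best candidate per prompt survives.

```python
def rank_and_filter(
    samples: list[str],
    rewards: list[float],
    keep_frac: float = 0.125,  # assumed: 1/k keeps ~the best sample per prompt
) -> list[str]:
    """Rejection-sampling filter: keep the top keep_frac of candidates by reward."""
    n_keep = max(1, int(len(samples) * keep_frac))
    ranked = sorted(zip(samples, rewards), key=lambda pair: pair[1], reverse=True)
    return [s for s, _ in ranked[:n_keep]]
```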
Comparison
| Method | Model copies in memory | Stability | Training speed |
|---|---|---|---|
| PPO | 2-4× | Medium | Slow |
| GRPO | 2× | High | Medium |
| RAFT | 1× | High | Fast |
References
- RAFT: Dong et al., “RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment” (2023)
- RLVR: Lambert et al., “Tülu 3: Pushing Frontiers in Open Language Model Post-Training” (2024), which introduces reinforcement learning with verifiable rewards (RLVR)