Theory

RLVR research foundations

RLVR: Reinforcement Learning with Verifiable Rewards

Traditional LLM training approaches:

Approach    Feedback Source
RLHF        Human preferences
RLAIF       AI preferences
RLVR        Automated verification

RLVR uses verifiable outcomes as reward signals (a verifier sketch follows this list):

  • Does the code compile?
  • Does it execute correctly?
  • Does it evade detection?
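
A minimal sketch of such a verifier in Python, covering only the first two checks above; the function name, the graduated 0 / 0.5 / 1 reward scale, and the timeout are illustrative assumptions rather than part of any particular framework:

import subprocess
import sys
import tempfile

def verifier_reward(code: str, timeout: float = 5.0) -> float:
    """Illustrative graduated reward: 0.0 = does not compile, 0.5 = compiles
    but fails or times out at runtime, 1.0 = executes cleanly."""
    # "Does the code compile?" -> syntax check without executing anything.
    try:
        compile(code, "<candidate>", "exec")
    except SyntaxError:
        return 0.0
    # "Does it execute correctly?" -> run it in a subprocess, check the exit code.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.5
    return 1.0 if result.returncode == 0 else 0.5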

Advantages

  1. Objective — Binary/graduated outcomes, no subjective judgment
  2. Scalable — No human annotation required
  3. Domain-specific — Captures exact constraints
  4. Deterministic — Same input → same reward

RAFT Algorithm

RAFT (Reward rAnked FineTuning) is the algorithm we implement:

for cycle in range(N):                                        # N RAFT cycles
    samples = generate(model, prompts, k)                     # 1. draw k candidates per prompt
    rewards = verify(samples)                                 # 2. score each sample with the verifier
    filtered = rank_and_filter(samples, rewards, threshold)   # 3. rank by reward, keep the best
    model = fine_tune(model, filtered)                        # 4. supervised fine-tune on the survivors
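
The helper names in the loop above (generate, verify, rank_and_filter, fine_tune) are placeholders. As one possible sketch of the step that is specific to RAFT, the filter below keeps the single best of the k candidates per prompt, assuming each sample is a dict that records its originating prompt; the signature and the best-of-k choice are illustrative assumptions:

def rank_and_filter(samples, rewards, threshold):
    """Keep, per prompt, the single highest-reward sample, and only if it
    clears the reward threshold (illustrative RAFT-style selection)."""
    best = {}  # prompt -> (reward, sample)
    for sample, reward in zip(samples, rewards):
        prompt = sample["prompt"]  # assumed sample structure
        if reward >= threshold and (prompt not in best or reward > best[prompt][0]):
            best[prompt] = (reward, sample)
    return [sample for _, sample in best.values()]

The threshold simply drops prompts where even the best candidate fails verification, so they contribute nothing to the next fine-tuning round.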

Key Insight

RAFT is essentially iterated rejection sampling (a toy illustration follows this list):

  • Generate many candidates
  • Keep only the good ones
  • Learn from successes
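
Seen this way, one cycle is just a filter over sampled candidates. A toy illustration (the sample names, rewards, and the 0.5 threshold are made up):

# Keep only candidates whose verifier reward clears the bar; the survivors
# become the fine-tuning set for the next cycle.
candidates = ["sample_a", "sample_b", "sample_c", "sample_d"]
rewards = [1.0, 0.0, 1.0, 0.0]  # made-up verifier outcomes
threshold = 0.5
accepted = [s for s, r in zip(candidates, rewards) if r >= threshold]
print(accepted)  # ['sample_a', 'sample_c']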

Comparison

Method    Models    Stability    Speed
PPO       2-4×      Medium       Slow
GRPO                High         Medium
RAFT                High         Fast

References

  1. RAFT: Dong et al., “RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment” (2023)
  2. RLVR: Lambert et al., “Tülu 3: Pushing Frontiers in Open Language Model Post-Training” (2024), which introduces reinforcement learning with verifiable rewards (RLVR)