# RAFT Training

Reward-rAnked Fine-Tuning with verification feedback.
## Overview
RAFT (Reward-rAnked Fine-Tuning) iteratively improves the model using verification feedback: each cycle generates candidate samples, verifies them, keeps the highest-reward samples, and fine-tunes on them.
```
┌──────────────────────────────────────────────────────────────┐
│                          RAFT CYCLE                          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. GENERATE ───► 2. VERIFY ───► 3. FILTER ───► 4. TRAIN     │
│       │               │              │              │        │
│       ▼               ▼              ▼              ▼        │
│   N samples       Compile +      Keep top      Fine-tune     │
│   per prompt      Detect         K% by         on kept       │
│                                  reward        samples       │
│                                                              │
│  ◄──────────────────────── REPEAT ────────────────────────►  │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
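To make the cycle concrete, here is a minimal Python sketch of one RAFT run. The injected callables (`generate`, `verify`, `fine_tune`) are placeholders for the trainer's actual generation, verification, and fine-tuning steps, not code from this repository:

```python
def raft(model, prompts, generate, verify, fine_tune,
         cycles=6, samples_per_prompt=8, keep_top_percent=0.5):
    """Sketch of the RAFT loop; generate/verify/fine_tune are injected callables."""
    for _ in range(cycles):
        # 1. GENERATE: N samples per prompt
        samples = [s for p in prompts
                   for s in generate(model, p, n=samples_per_prompt)]
        # 2. VERIFY: compile + detection checks produce a scalar reward per sample
        scored = [(sample, verify(sample)) for sample in samples]
        # 3. FILTER: keep the top K% of samples by reward
        scored.sort(key=lambda pair: pair[1], reverse=True)
        kept = [sample for sample, _ in scored[: int(len(scored) * keep_top_percent)]]
        # 4. TRAIN: fine-tune on the kept samples, then repeat
        model = fine_tune(model, kept)
    return model
```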
## Running RAFT
```bash
python malagent/training/raft_trainer.py \
    --config configs/raft_config.yaml \
    --mode elastic \
    --cycles 6
```
## Configuration
```yaml
raft:
  model:
    path: "output/sft/final"
    base: "Qwen/Qwen2.5-Coder-7B"
  prompts:
    path: "data/prompts/mvr_prompt_v2.jsonl"
  generation:
    samples_per_prompt: 8
    temperature: 0.7
    max_new_tokens: 2048
  filtering:
    reward_threshold: 0.5
    keep_top_percent: 0.5
  training:
    epochs: 1
    batch_size: 2
    gradient_accumulation: 16
    learning_rate: 5e-5
  cycles: 6
```
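For illustration only, a config of this shape could be read with PyYAML roughly as follows; the real trainer may load it differently:

```python
import yaml

# Illustrative config loading; field names follow the YAML above.
with open("configs/raft_config.yaml") as f:
    cfg = yaml.safe_load(f)["raft"]

samples_per_prompt = cfg["generation"]["samples_per_prompt"]  # 8
keep_top_percent = cfg["filtering"]["keep_top_percent"]       # 0.5
cycles = cfg["cycles"]                                        # 6
```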
## Reward Structure
### MVR Mode (Binary)
| Outcome | Reward |
|---|---|
| Compiles | 1.0 |
| Fails | 0.0 |
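As a sketch, the binary reward reduces to a single check on the verification result (the `compiles` flag is an assumed input from the verifier, not an identifier from the codebase):

```python
def mvr_reward(compiles: bool) -> float:
    # MVR mode: 1.0 if the sample compiles, 0.0 otherwise.
    return 1.0 if compiles else 0.0
```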
### Elastic Mode (Graduated)
| Outcome | Reward |
|---|---|
| Format fail | 0.0 |
| Compile fail | 0.1 |
| Crash | 0.2 |
| Critical alert | 0.5 |
| High alert | 0.6 |
| Medium alert | 0.7 |
| Low alert | 0.8 |
| No detection | 1.0 |
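The graduated scale maps naturally onto a lookup table; the outcome keys below are illustrative labels for the rows above, not identifiers from the codebase:

```python
# Elastic mode rewards keyed by verification outcome (labels are illustrative).
ELASTIC_REWARDS = {
    "format_fail":    0.0,
    "compile_fail":   0.1,
    "crash":          0.2,
    "critical_alert": 0.5,
    "high_alert":     0.6,
    "medium_alert":   0.7,
    "low_alert":      0.8,
    "no_detection":   1.0,
}

def elastic_reward(outcome: str) -> float:
    return ELASTIC_REWARDS[outcome]
```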
## Training Progress
Training progress varies based on model, prompts, and configuration. Monitor these metrics across cycles:
- Compiled: Number of samples that compiled successfully
- Rate: Percentage of samples that compiled
- Loss: Training loss (should decrease over cycles)
When to stop (a minimal check is sketched below):
- The compile rate decreases for 2+ consecutive cycles
- The loss starts increasing

The optimal stopping point varies by experiment, so judge by these signals each cycle rather than by a fixed cycle count.
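As a minimal sketch of that stopping rule, assuming per-cycle compile rates and losses are collected into plain lists (the history format is an assumption, not the trainer's actual logging):

```python
def should_stop(compile_rates, losses, patience=2):
    """Stop when the compile rate has dropped for `patience` consecutive
    cycles, or when the training loss has started increasing."""
    rate_dropping = (
        len(compile_rates) > patience
        and all(compile_rates[i] < compile_rates[i - 1] for i in range(-patience, 0))
    )
    loss_rising = len(losses) >= 2 and losses[-1] > losses[-2]
    return rate_dropping or loss_rising

# Example: the compile rate fell in each of the last two cycles -> stop.
# should_stop([0.40, 0.55, 0.52, 0.48], [1.2, 0.9, 0.8, 0.7])  # True
```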