RAFT Training

Reward-rAnked Fine-Tuning with verification feedback

Overview

RAFT (Reward-rAnked Fine-Tuning) iteratively improves the model using verification feedback.

┌────────────────────────────────────────────────────────────┐
│                      RAFT CYCLE                            │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  1. GENERATE ───► 2. VERIFY ───► 3. FILTER ───► 4. TRAIN  │
│        │              │              │              │      │
│        ▼              ▼              ▼              ▼      │
│   N samples      Compile +       Keep top       Fine-tune  │
│   per prompt     Detect          K% by           on kept   │
│                                  reward          samples   │
│                                                            │
│  ◄────────────────── REPEAT ───────────────────────────►  │
│                                                            │
└────────────────────────────────────────────────────────────┘
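
The four steps map directly onto a loop. Below is a minimal sketch of one cycle; generate, verify, and fine_tune are hypothetical placeholders rather than functions from the codebase, and cfg mirrors the configuration shown further down.

def raft_cycle(model, prompts, cfg):
    # generate(), verify(), fine_tune() are placeholders (see note above)
    gen = cfg["generation"]
    flt = cfg["filtering"]

    # 1. GENERATE: sample N candidates per prompt
    samples = [
        (prompt, generate(model, prompt,
                          temperature=gen["temperature"],
                          max_new_tokens=gen["max_new_tokens"]))
        for prompt in prompts
        for _ in range(gen["samples_per_prompt"])
    ]

    # 2. VERIFY: compile and run detection, producing a scalar reward
    scored = [(prompt, sample, verify(sample)) for prompt, sample in samples]

    # 3. FILTER: drop rewards below the threshold, keep the top K% by reward
    scored = [s for s in scored if s[2] >= flt["reward_threshold"]]
    scored.sort(key=lambda s: s[2], reverse=True)
    kept = scored[: int(len(scored) * flt["keep_top_percent"])]

    # 4. TRAIN: fine-tune on the kept (prompt, sample) pairs
    fine_tune(model, [(prompt, sample) for prompt, sample, _ in kept])
    return model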

Running RAFT

python malagent/training/raft_trainer.py \
    --config configs/raft_config.yaml \
    --mode elastic \
    --cycles 6
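
The --mode flag selects the reward structure described under Reward Structure below (elastic here; mvr would select the binary scheme), and --cycles sets the number of generate/verify/filter/train iterations, matching the cycles key in the config.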

Configuration

raft:
  model:
    path: "output/sft/final"
    base: "Qwen/Qwen2.5-Coder-7B"
  
  prompts:
    path: "data/prompts/mvr_prompt_v2.jsonl"
  
  generation:
    samples_per_prompt: 8
    temperature: 0.7
    max_new_tokens: 2048
  
  filtering:
    reward_threshold: 0.5
    keep_top_percent: 0.5
  
  training:
    epochs: 1
    batch_size: 2
    gradient_accumulation: 16
    learning_rate: 5e-5
  
  cycles: 6
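
Two values worth noting: keep_top_percent is a fraction (0.5 keeps the top 50% of surviving samples), and the effective batch size is batch_size × gradient_accumulation = 32 sequences per optimizer step. A quick sketch of reading these values, assuming PyYAML and the config path passed on the command line:

import yaml

with open("configs/raft_config.yaml") as f:
    cfg = yaml.safe_load(f)["raft"]

# A fraction, not a percentage: 0.5 means "keep the top 50%"
keep_fraction = cfg["filtering"]["keep_top_percent"]

# Effective batch size per optimizer step: 2 * 16 = 32
effective_batch = (cfg["training"]["batch_size"]
                   * cfg["training"]["gradient_accumulation"])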

Reward Structure

MVR Mode (Binary)

Outcome        Reward
Compiles       1.0
Fails          0.0

Elastic Mode (Graduated)

Outcome          Reward
Format fail      0.0
Compile fail     0.1
Crash            0.2
Critical alert   0.5
High alert       0.6
Medium alert     0.7
Low alert        0.8
No detection     1.0
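
The elastic table is naturally a lookup. A sketch follows, assuming verification reports one of these outcome labels; the exact strings are assumptions, not identifiers from the codebase.

ELASTIC_REWARDS = {
    "format_fail":    0.0,
    "compile_fail":   0.1,
    "crash":          0.2,
    "critical_alert": 0.5,
    "high_alert":     0.6,
    "medium_alert":   0.7,
    "low_alert":      0.8,
    "no_detection":   1.0,
}

def reward(outcome: str, mode: str = "elastic") -> float:
    # MVR collapses the table to binary: 1.0 if the sample compiled
    # (anything past format/compile failure), 0.0 otherwise.
    if mode == "mvr":
        return 0.0 if outcome in ("format_fail", "compile_fail") else 1.0
    return ELASTIC_REWARDS[outcome]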

Training Progress

Training progress varies with the model, prompt set, and configuration. Monitor these metrics across cycles:

  • Compiled: Number of samples that compiled successfully
  • Rate: Percentage of samples that compiled
  • Loss: Training loss (should decrease over cycles)

When to stop:

  • Compile rate decreases for 2+ consecutive cycles
  • Loss starts increasing

The optimal stopping point varies by experiment, so treat these as heuristics rather than hard rules.
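
A minimal sketch of the two checks, assuming compile rates and losses are recorded per cycle as plain lists; the function name and signature are illustrative.

def should_stop(compile_rates: list[float], losses: list[float]) -> bool:
    # Compile rate decreased for 2+ consecutive cycles (needs 3 data points)
    if (len(compile_rates) >= 3
            and compile_rates[-1] < compile_rates[-2] < compile_rates[-3]):
        return True
    # Loss started increasing
    if len(losses) >= 2 and losses[-1] > losses[-2]:
        return True
    return False

# e.g. should_stop([0.60, 0.55, 0.50], [1.2, 1.1]) -> True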