RAFT Training

Reward-rAnked Fine-Tuning with verification feedback

Overview

RAFT (Reward-rAnked Fine-Tuning) iteratively improves the model using verification feedback.

┌────────────────────────────────────────────────────────────┐
│                      RAFT CYCLE                            │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  1. GENERATE ───► 2. VERIFY ───► 3. FILTER ───► 4. TRAIN  │
│        │              │              │              │      │
│        ▼              ▼              ▼              ▼      │
│   N samples      Compile +       Keep top       Fine-tune  │
│   per prompt     Detect          K% by           on kept   │
│                                  reward          samples   │
│                                                            │
│  ◄────────────────── REPEAT ───────────────────────────►  │
│                                                            │
└────────────────────────────────────────────────────────────┘
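
The four steps map directly onto a loop. Below is a minimal sketch of one cycle; generate, verify, and fine_tune are hypothetical placeholders rather than functions from the codebase, and cfg mirrors the configuration shown further down.

def raft_cycle(model, prompts, cfg):
    # generate(), verify(), fine_tune() are placeholders (see note above)
    gen = cfg["generation"]
    flt = cfg["filtering"]

    # 1. GENERATE: sample N candidates per prompt
    samples = [
        (prompt, generate(model, prompt,
                          temperature=gen["temperature"],
                          max_new_tokens=gen["max_new_tokens"]))
        for prompt in prompts
        for _ in range(gen["samples_per_prompt"])
    ]

    # 2. VERIFY: compile and run detection, producing a scalar reward
    scored = [(prompt, sample, verify(sample)) for prompt, sample in samples]

    # 3. FILTER: drop rewards below the threshold, keep the top K% by reward
    scored = [s for s in scored if s[2] >= flt["reward_threshold"]]
    scored.sort(key=lambda s: s[2], reverse=True)
    kept = scored[: int(len(scored) * flt["keep_top_percent"])]

    # 4. TRAIN: fine-tune on the kept (prompt, sample) pairs
    fine_tune(model, [(prompt, sample) for prompt, sample, _ in kept])
    return model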

Running RAFT

python malagent/training/raft_trainer.py \
    --config configs/raft_config.yaml \
    --mode elastic \
    --cycles 6
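
The --mode flag selects the reward structure described under Reward Structure below (elastic here; mvr would select the binary scheme), and --cycles sets the number of generate/verify/filter/train iterations, matching the cycles key in the config.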

Configuration

raft:
  model:
    path: "output/sft/final"
    base: "Qwen/Qwen2.5-Coder-7B"
  
  prompts:
    path: "data/prompts/mvr_prompt_v2.jsonl"
  
  generation:
    samples_per_prompt: 8
    temperature: 0.7
    max_new_tokens: 2048
  
  filtering:
    reward_threshold: 0.5
    keep_top_percent: 0.5
  
  training:
    epochs: 1
    batch_size: 2
    gradient_accumulation: 16
    learning_rate: 5e-5
  
  cycles: 6
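
Two values worth noting: keep_top_percent is a fraction (0.5 keeps the top 50% of surviving samples), and the effective batch size is batch_size × gradient_accumulation = 32 sequences per optimizer step. A quick sketch of reading these values, assuming PyYAML and the config path passed on the command line:

import yaml

with open("configs/raft_config.yaml") as f:
    cfg = yaml.safe_load(f)["raft"]

# A fraction, not a percentage: 0.5 means "keep the top 50%"
keep_fraction = cfg["filtering"]["keep_top_percent"]

# Effective batch size per optimizer step: 2 * 16 = 32
effective_batch = (cfg["training"]["batch_size"]
                   * cfg["training"]["gradient_accumulation"])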

Reward Structure

MVR Mode (Binary)

Outcome        Reward
Compiles       1.0
Fails          0.0

Elastic Mode (Graduated)

Outcome          Reward
Format fail      0.0
Compile fail     0.1
Crash            0.2
Critical alert   0.5
High alert       0.6
Medium alert     0.7
Low alert        0.8
No detection     1.0
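
The elastic table is naturally a lookup. A sketch follows, assuming verification reports one of these outcome labels; the exact strings are assumptions, not identifiers from the codebase.

ELASTIC_REWARDS = {
    "format_fail":    0.0,
    "compile_fail":   0.1,
    "crash":          0.2,
    "critical_alert": 0.5,
    "high_alert":     0.6,
    "medium_alert":   0.7,
    "low_alert":      0.8,
    "no_detection":   1.0,
}

def reward(outcome: str, mode: str = "elastic") -> float:
    # MVR collapses the table to binary: 1.0 if the sample compiled
    # (anything past format/compile failure), 0.0 otherwise.
    if mode == "mvr":
        return 0.0 if outcome in ("format_fail", "compile_fail") else 1.0
    return ELASTIC_REWARDS[outcome]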

Training Progress

Training progress varies with the model, prompt set, and configuration. Monitor these metrics across cycles:

  • Compiled: Number of samples that compiled successfully
  • Rate: Percentage of samples that compiled
  • Loss: Training loss (should decrease over cycles)

When to stop:

  • Compile rate decreases for 2+ consecutive cycles
  • Loss starts increasing

The optimal stopping point varies by experiment, so treat these as heuristics rather than hard rules.
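
A minimal sketch of the two checks, assuming compile rates and losses are recorded per cycle as plain lists; the function name and signature are illustrative.

def should_stop(compile_rates: list[float], losses: list[float]) -> bool:
    # Compile rate decreased for 2+ consecutive cycles (needs 3 data points)
    if (len(compile_rates) >= 3
            and compile_rates[-1] < compile_rates[-2] < compile_rates[-3]):
        return True
    # Loss started increasing
    if len(losses) >= 2 and losses[-1] > losses[-2]:
        return True
    return False

# e.g. should_stop([0.60, 0.55, 0.50], [1.2, 1.1]) -> True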