Pipeline

Complete RLVR training pipeline

Overview

malagent uses a two-phase training approach:

┌─────────────────────────────────────────────────────────────────────┐
│                        TRAINING PIPELINE                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   PHASE 1   │    │   PHASE 2   │    │        OUTPUT           │  │
│  │    SFT      │───▶│    RAFT     │───▶│   Fine-tuned Model      │  │
│  │  Baseline   │    │  Refinement │    │   + Checkpoints         │  │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
│                                                                     │
│  Duration: ~20h       Duration: ~4-8h/cycle × 6 cycles              │
│  Compile: ~15%        Compile: ~45% (peak)                          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Phase 1: Supervised Fine-Tuning (SFT)

Establishes baseline capability on Windows systems programming.

Purpose

  • Teach the model Windows API patterns
  • Establish C++ syntax baseline
  • Prime for technique-specific generation

Expected Results

Metric         Before SFT    After SFT
Compile Rate   ~5%           ~15-25%
Loss           1.2           0.3-0.4
Duration       -             ~20 hours

Phase 2: RAFT Training

Iterative refinement using verification feedback.

The RAFT Loop

For cycle = 1 to N:
    1. samples = generate(model, prompts, k)
    2. rewards = verify(samples)
    3. filtered = rank_and_filter(samples, rewards, threshold)
    4. model = fine_tune(model, filtered)
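
The loop above can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: `generate`, `verify`, and `fine_tune` are hypothetical callables supplied by the caller, and the assumption that `rank_and_filter` keeps every sample whose reward is at or above `threshold`, ordered best first, is ours.

```python
from typing import List, Sequence, Tuple


def rank_and_filter(
    samples: Sequence[str],
    rewards: Sequence[float],
    threshold: float,
) -> List[str]:
    """Keep samples whose reward meets the threshold, highest reward first."""
    if len(samples) != len(rewards):
        raise ValueError("samples and rewards must have the same length")
    scored: List[Tuple[float, str]] = [
        (r, s) for s, r in zip(samples, rewards) if r >= threshold
    ]
    # Sort on the reward only; ties keep their original order (stable sort).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored]


def raft_loop(model, prompts, generate, verify, fine_tune, *, cycles, k, threshold):
    """Run generate -> verify -> filter -> fine-tune for `cycles` rounds.

    All three callables are placeholders (assumed signatures):
      generate(model, prompts, k) -> list of samples
      verify(samples)             -> list of float rewards
      fine_tune(model, samples)   -> updated model
    """
    for _ in range(cycles):
        samples = generate(model, prompts, k)
        rewards = verify(samples)
        filtered = rank_and_filter(samples, rewards, threshold)
        if filtered:  # skip the update when nothing passes the filter
            model = fine_tune(model, filtered)
    return model
```

Passing the three stages in as arguments keeps the loop independent of how rewards are computed, so the same skeleton applies to either training mode described below.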

Expected Results

Cycle            Compile Rate   Improvement (points vs. previous row)
SFT Baseline     15%            -
Cycle 1          28%            +13
Cycle 3          40%            +12
Cycle 6 (Peak)   47%            +7

Training Modes

MVR Mode (Minimum Viable Reward)

Uses compilation-only verification:

  • Faster iteration (~30 min/cycle)
  • No Elastic infrastructure needed

python malagent/training/raft_trainer.py \
    --config configs/raft_config.yaml \
    --mode mvr

Elastic Mode (Full Pipeline)

Uses Elastic Security for detection:

  • Graduated rewards from alert severity
  • Requires full infrastructure

python malagent/training/raft_trainer.py \
    --config configs/raft_config.yaml \
    --mode elastic

When to Stop

Signs to stop training:

  • Plateau: < 2% improvement per cycle
  • Degradation: Compile rate drops
  • Overfitting: Validation loss increases

Training is typically optimal at cycle 5-6.
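
The three stop signals can be checked mechanically after each cycle. The sketch below is one possible reading, not part of the project: the function name `should_stop` is hypothetical, "< 2% improvement" is assumed to mean fewer than 2 percentage points gained over the previous cycle, and validation loss is assumed to be logged once per cycle.

```python
from typing import Optional, Sequence


def should_stop(
    compile_rates: Sequence[float],
    val_losses: Optional[Sequence[float]] = None,
    min_gain: float = 2.0,
) -> Optional[str]:
    """Return the reason to stop training, or None to continue.

    compile_rates: compile rate in percent after each cycle, baseline first.
    val_losses:    optional validation loss after each cycle.
    min_gain:      minimum per-cycle gain in percentage points (assumed unit).
    """
    if len(compile_rates) >= 2:
        gain = compile_rates[-1] - compile_rates[-2]
        if gain < 0:
            return "degradation"  # compile rate dropped
        if gain < min_gain:
            return "plateau"  # improvement below the threshold
    if val_losses is not None and len(val_losses) >= 2:
        if val_losses[-1] > val_losses[-2]:
            return "overfitting"  # validation loss increased
    return None


# Example with made-up per-cycle numbers:
history = [40.0, 41.0]
print(should_stop(history))  # plateau
```

Comparing only the last two cycles is sensitive to noise; averaging over a short window before deciding would be a reasonable variation.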