Pipeline
Complete RLVR (Reinforcement Learning with Verifiable Rewards) training pipeline
Overview
malagent uses a two-phase training approach:
┌─────────────────────────────────────────────────────────────────────┐
│                          TRAINING PIPELINE                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   PHASE 1   │    │   PHASE 2   │    │         OUTPUT          │  │
│  │     SFT     │───▶│    RAFT     │───▶│    Fine-tuned Model     │  │
│  │  Baseline   │    │ Refinement  │    │     + Checkpoints       │  │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
│                                                                     │
│  Duration: ~20h     Duration: ~4-8h/cycle × 6 cycles                │
│  Compile:  ~15%     Compile:  ~47% (peak)                           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Phase 1: Supervised Fine-Tuning (SFT)
Establishes a baseline capability in Windows systems programming.
Purpose
- Teach the model Windows API patterns
- Establish a C++ syntax baseline
- Prime the model for technique-specific generation (a minimal sketch follows this list)
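A minimal sketch of what this phase can look like, assuming a Hugging Face causal LM and a JSONL file of prompt/completion pairs. The model name, file name, and hyperparameters here are illustrative assumptions, not malagent's actual training code:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "codellama/CodeLlama-7b-hf"  # assumption: any causal code LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(example):
    # Concatenate prompt and completion into a single training sequence.
    text = example["prompt"] + example["completion"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=2048)

train_set = load_dataset("json", data_files="sft_data.jsonl")["train"].map(tokenize)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints/sft", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=train_set,
    # mlm=False gives standard next-token (causal) language-modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()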
Expected Results
| Metric | Before SFT | After SFT |
|---|---|---|
| Compile Rate | ~5% | ~15-25% |
| Loss | 1.2 | 0.3-0.4 |
| Duration | - | ~20 hours |
Phase 2: RAFT Training
Iterative reward-ranked fine-tuning (RAFT) refinement driven by verification feedback.
The RAFT Loop
for cycle in range(1, num_cycles + 1):
    samples = generate(model, prompts, k)                    # k completions per prompt
    rewards = verify(samples)                                # score each sample
    filtered = rank_and_filter(samples, rewards, threshold)  # keep high-reward samples
    model = fine_tune(model, filtered)                       # fine-tune on the survivors
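A minimal sketch of the rank_and_filter step, assuming rewards are plain floats (an illustrative implementation, not necessarily malagent's):

def rank_and_filter(samples, rewards, threshold):
    # Pair samples with rewards, drop everything below the threshold,
    # and sort best-first so a later dataset-size cap keeps the
    # strongest examples.
    scored = sorted(zip(samples, rewards), key=lambda sr: sr[1], reverse=True)
    return [sample for sample, reward in scored if reward >= threshold]

Under binary compile-only rewards (0.0 or 1.0), any positive threshold reduces this to keeping exactly the samples that compiled.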
Expected Results
| Cycle | Compile Rate | Improvement |
|---|---|---|
| SFT Baseline | 15% | - |
| Cycle 1 | 28% | +13 pts |
| Cycle 3 | 40% | +12 pts |
| Cycle 6 (peak) | 47% | +7 pts |
Training Modes
MVR Mode (Minimum Viable Reward)
Uses compilation-only verification:
- Faster iteration (~30 min/cycle)
- No Elastic infrastructure needed
python malagent/training/raft_trainer.py \
--config configs/raft_config.yaml \
--mode mvr
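In this mode the verifier is just a compile check. A sketch of such a reward function, assuming samples are C++ source strings and a MinGW-w64 cross-compiler is on PATH; the function name and compiler choice are assumptions, not malagent's actual verifier:

import os
import subprocess
import tempfile

def compile_reward(source: str) -> float:
    # Binary MVR reward: 1.0 if the sample compiles cleanly, else 0.0.
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "sample.cpp")
        with open(src_path, "w") as f:
            f.write(source)
        try:
            result = subprocess.run(
                ["x86_64-w64-mingw32-g++", src_path,
                 "-o", os.path.join(tmp, "sample.exe")],
                capture_output=True, timeout=60,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # a hung compile counts as failure
    return 1.0 if result.returncode == 0 else 0.0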
Elastic Mode (Full Pipeline)
Uses Elastic Security detection results for verification:
- Graduated rewards derived from alert severity
- Requires the full infrastructure
python malagent/training/raft_trainer.py \
--config configs/raft_config.yaml \
--mode elastic
When to Stop
Signs that it is time to stop training:
- Plateau: less than 2 points of compile-rate improvement per cycle
- Degradation: the compile rate drops between cycles
- Overfitting: validation loss increases
Results are typically optimal around cycle 5-6; a minimal sketch of these checks follows.
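These signals are straightforward to check between cycles. A sketch assuming per-cycle histories with the newest entry last; the function name and the 2-point cutoff mirror the rules above and are not an existing malagent API:

def should_stop(compile_rates, val_losses, min_gain=0.02):
    # True when any stop signal fires: plateau, degradation, or overfitting.
    if len(compile_rates) < 2:
        return False
    gain = compile_rates[-1] - compile_rates[-2]
    plateau = 0 <= gain < min_gain  # improving, but by < 2 points
    degradation = gain < 0          # compile rate dropped
    overfitting = len(val_losses) >= 2 and val_losses[-1] > val_losses[-2]
    return plateau or degradation or overfitting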