Pipeline
Complete RLVR (Reinforcement Learning with Verifiable Rewards) training pipeline
Overview
malagent uses a two-phase training approach:
┌─────────────────────────────────────────────────────────────────────┐
│                          TRAINING PIPELINE                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   PHASE 1   │    │   PHASE 2   │    │         OUTPUT          │  │
│  │     SFT     │───▶│    RAFT     │───▶│    Fine-tuned Model     │  │
│  │  Baseline   │    │ Refinement  │    │     + Checkpoints       │  │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
│                                                                     │
│  Duration: ~20h     Duration: ~4-8h/cycle × 6 cycles                │
│  Compile:  ~15%     Compile:  ~47% (peak)                           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Phase 1: Supervised Fine-Tuning (SFT)
Establishes a baseline capability in Windows systems programming.
Purpose
- Teach the model Windows API patterns
- Establish a C++ syntax baseline
- Prime the model for technique-specific generation (a minimal sketch follows this list)
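A minimal sketch of what this phase can look like, assuming a Hugging Face causal LM and a JSONL file of prompt/completion pairs. The model name, file name, and hyperparameters here are illustrative assumptions, not malagent's actual training code:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "codellama/CodeLlama-7b-hf"  # assumption: any causal code LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(example):
    # Concatenate prompt and completion into a single training sequence.
    text = example["prompt"] + example["completion"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=2048)

train_set = load_dataset("json", data_files="sft_data.jsonl")["train"].map(tokenize)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints/sft", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=train_set,
    # mlm=False gives standard next-token (causal) language-modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()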
Expected Results
| Metric | Before SFT | After SFT |
|---|---|---|
| Compile Rate | ~5% | ~15-25% |
| Loss | 1.2 | 0.3-0.4 |
| Duration | - | ~20 hours |
Phase 2: RAFT Training
Iterative reward-ranked fine-tuning (RAFT) refinement driven by verification feedback.
The RAFT Loop
for cycle in range(1, num_cycles + 1):
    samples = generate(model, prompts, k)                    # k completions per prompt
    rewards = verify(samples)                                # score each sample
    filtered = rank_and_filter(samples, rewards, threshold)  # keep high-reward samples
    model = fine_tune(model, filtered)                       # fine-tune on the survivors
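A minimal sketch of the rank_and_filter step, assuming rewards are plain floats (an illustrative implementation, not necessarily malagent's):

def rank_and_filter(samples, rewards, threshold):
    # Pair samples with rewards, drop everything below the threshold,
    # and sort best-first so a later dataset-size cap keeps the
    # strongest examples.
    scored = sorted(zip(samples, rewards), key=lambda sr: sr[1], reverse=True)
    return [sample for sample, reward in scored if reward >= threshold]

Under binary compile-only rewards (0.0 or 1.0), any positive threshold reduces this to keeping exactly the samples that compiled.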
Expected Results
| Cycle | Compile Rate | Improvement |
|---|---|---|
| SFT Baseline | 15% | - |
| Cycle 1 | 28% | +13 pts |
| Cycle 3 | 40% | +12 pts |
| Cycle 6 (peak) | 47% | +7 pts |
Training Modes
MVR Mode (Minimum Viable Reward)
Uses compilation-only verification:
- Faster iteration (~30 min/cycle)
- No Elastic infrastructure needed
python malagent/training/raft_trainer.py \
--config configs/raft_config.yaml \
--mode mvr
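In this mode the verifier is just a compile check. A sketch of such a reward function, assuming samples are C++ source strings and a MinGW-w64 cross-compiler is on PATH; the function name and compiler choice are assumptions, not malagent's actual verifier:

import os
import subprocess
import tempfile

def compile_reward(source: str) -> float:
    # Binary MVR reward: 1.0 if the sample compiles cleanly, else 0.0.
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "sample.cpp")
        with open(src_path, "w") as f:
            f.write(source)
        try:
            result = subprocess.run(
                ["x86_64-w64-mingw32-g++", src_path,
                 "-o", os.path.join(tmp, "sample.exe")],
                capture_output=True, timeout=60,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # a hung compile counts as failure
    return 1.0 if result.returncode == 0 else 0.0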
Elastic Mode (Full Pipeline)
Uses Elastic Security detection results for verification:
- Graduated rewards derived from alert severity
- Requires the full infrastructure
python malagent/training/raft_trainer.py \
--config configs/raft_config.yaml \
--mode elastic
When to Stop
Signs that it is time to stop training:
- Plateau: less than 2 points of compile-rate improvement per cycle
- Degradation: the compile rate drops between cycles
- Overfitting: validation loss increases
Results are typically optimal around cycle 5-6; a minimal sketch of these checks follows.
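These signals are straightforward to check between cycles. A sketch assuming per-cycle histories with the newest entry last; the function name and the 2-point cutoff mirror the rules above and are not an existing malagent API:

def should_stop(compile_rates, val_losses, min_gain=0.02):
    # True when any stop signal fires: plateau, degradation, or overfitting.
    if len(compile_rates) < 2:
        return False
    gain = compile_rates[-1] - compile_rates[-2]
    plateau = 0 <= gain < min_gain  # improving, but by < 2 points
    degradation = gain < 0          # compile rate dropped
    overfitting = len(val_losses) >= 2 and val_losses[-1] > val_losses[-2]
    return plateau or degradation or overfitting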