MVR Mode
Minimum Viable Reward for fast iteration
What is MVR Mode?
MVR (Minimum Viable Reward) uses compilation success as the only reward signal, skipping Elastic detection.
Standard Mode: Generate → Compile → Execute → Detect → Reward
MVR Mode: Generate → Compile → Reward
When to Use MVR
Use MVR For
- Initial development — Test prompt quality
- Quick iteration — 30 min cycles vs 4-8 hours
- No Elastic setup — Simpler infrastructure
- Curriculum phase 1 — Establish compilation baseline
Use Elastic For
- Full training — Graduated rewards
- Evasion focus — Detection-aware training
- Final cycles — After compilation baseline established
Running MVR
python malagent/training/raft_trainer.py \
--config configs/raft_config.yaml \
--mode mvr \
--cycles 6
Reward Structure
| Outcome | Reward |
|---|---|
| Compiles successfully | 1.0 |
| Compile error | 0.0 |
Performance Comparison
| Metric | MVR Mode | Elastic Mode |
|---|---|---|
| Cycle time | ~30 min | ~4-8 hours |
| Infrastructure | DEVBOX only | Full stack |
| Reward signal | Binary | Graduated (8 levels) |
| Focus | Compilation | Evasion |
Curriculum Learning
Recommended training sequence:
Phase 1: MVR (Cycles 1-3)
python malagent/training/raft_trainer.py \
--mode mvr \
--cycles 3
Goal: Reach ~40% compile rate
Phase 2: Elastic (Cycles 4-6)
python malagent/training/raft_trainer.py \
--mode elastic \
--cycles 3 \
--resume
Goal: Improve evasion while maintaining compile rate