MVR Mode

Minimum Viable Reward for fast iteration

What is MVR Mode?

MVR (Minimum Viable Reward) uses compilation success as the only reward signal, skipping Elastic detection.

Standard Mode:  Generate → Compile → Execute → Detect → Reward
MVR Mode:       Generate → Compile → Reward

When to Use MVR

Use MVR For

  • Initial development — Test prompt quality
  • Quick iteration — 30 min cycles vs 4-8 hours
  • No Elastic setup — Simpler infrastructure
  • Curriculum phase 1 — Establish compilation baseline

Use Elastic For

  • Full training — Graduated rewards
  • Evasion focus — Detection-aware training
  • Final cycles — After compilation baseline established

Running MVR

python malagent/training/raft_trainer.py \
    --config configs/raft_config.yaml \
    --mode mvr \
    --cycles 6

Reward Structure

OutcomeReward
Compiles successfully1.0
Compile error0.0

Performance Comparison

MetricMVR ModeElastic Mode
Cycle time~30 min~4-8 hours
InfrastructureDEVBOX onlyFull stack
Reward signalBinaryGraduated (8 levels)
FocusCompilationEvasion

Curriculum Learning

Recommended training sequence:

Phase 1: MVR (Cycles 1-3)

python malagent/training/raft_trainer.py \
    --mode mvr \
    --cycles 3

Goal: Reach ~40% compile rate

Phase 2: Elastic (Cycles 4-6)

python malagent/training/raft_trainer.py \
    --mode elastic \
    --cycles 3 \
    --resume

Goal: Improve evasion while maintaining compile rate