Graduated Rewards

Detection severity as training signal

The Problem with Binary Rewards

Traditional approaches use binary signals:

Outcome     Reward
Evaded      1.0
Detected    0.0

This loses information:

  • All detections treated equally
  • No gradient for improvement
  • Critical vs. low severity ignored
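
For comparison, a minimal sketch of a binary reward function (the outcome label and function name here are illustrative, not malagent's actual interface):

  def binary_reward(outcome: str) -> float:
      # Every detection collapses to 0.0, no matter how close the
      # sample came to evading, so the model gets no gradient.
      return 1.0 if outcome == "evaded" else 0.0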

Graduated Reward Design

malagent uses detection alert severity as a graded training signal:

Complete Mapping

Outcome              Reward   Training Signal
Format failure       0.0      No valid code
Compile error        0.1      Invalid C++
Execution crash      0.2      Runs but crashes
Critical detection   0.5      Worst evasion
High detection       0.6      Poor evasion
Medium detection     0.7      Moderate evasion
Low detection        0.8      Near success
Evaded               1.0      Success
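
A minimal sketch of this mapping as a Python lookup table; the outcome labels and function name are illustrative, not malagent's actual interface:

  # Hypothetical outcome labels; reward values match the table above.
  GRADUATED_REWARDS = {
      "format_failure":     0.0,  # no valid code
      "compile_error":      0.1,  # invalid C++
      "execution_crash":    0.2,  # runs but crashes
      "critical_detection": 0.5,  # worst evasion
      "high_detection":     0.6,  # poor evasion
      "medium_detection":   0.7,  # moderate evasion
      "low_detection":      0.8,  # near success
      "evaded":             1.0,  # success
  }

  def graduated_reward(outcome: str) -> float:
      """Map an evaluation outcome to its training reward."""
      return GRADUATED_REWARDS[outcome]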

Why This Design?

Threshold at 0.5

RAFT Filtering:
  reward >= 0.5  →  Keep for training
  reward < 0.5   →  Discard

This means:

  • Every sample that runs to completion (reward 0.5–1.0) trains the model
  • Even detected code provides signal
  • Only format failures, compile errors, and crashes are discarded
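
A sketch of the RAFT-style filter implied by this threshold, assuming each sample is a dict with a "reward" key (a hypothetical record layout, not malagent's actual data model):

  RAFT_THRESHOLD = 0.5

  def filter_for_training(samples: list[dict]) -> list[dict]:
      # Keeps every detection (0.5-0.8) and every evasion (1.0);
      # drops format failures, compile errors, and crashes (0.0-0.2).
      return [s for s in samples if s["reward"] >= RAFT_THRESHOLD]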

Compile Failures Get 0.1

Compile failures receive a small positive reward (0.1 rather than 0.0) so that producing well-formed code output ranks above producing no code at all, preserving a gradient that nudges the model toward generating compilable C++.

Impact on Training

With Binary Rewards

Cycle 1:
  Evaded: 50 samples (reward=1.0)
  Detected: 700 samples (reward=0.0)
  Training data: 50 samples

With Graduated Rewards

Cycle 1:
  Evaded: 50 samples (reward=1.0)
  Low: 100 samples (reward=0.8)
  Medium: 200 samples (reward=0.7)
  High: 250 samples (reward=0.6)
  Critical: 150 samples (reward=0.5)
  Training data: 750 samples

15× more training data!
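
The 15× figure follows directly from the counts above; a quick sketch of the comparison (the counts are the illustrative cycle-1 numbers, not measured results):

  # Illustrative cycle-1 counts from the example above.
  counts = {
      "evaded": 50,               # reward 1.0
      "low_detection": 100,       # reward 0.8
      "medium_detection": 200,    # reward 0.7
      "high_detection": 250,      # reward 0.6
      "critical_detection": 150,  # reward 0.5
  }

  binary_kept = counts["evaded"]          # binary rewards keep evasions only
  graduated_kept = sum(counts.values())   # graduated rewards keep everything >= 0.5

  print(binary_kept, graduated_kept, graduated_kept / binary_kept)  # 50 750 15.0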

Recommendations

  1. Start with defaults — They’re balanced for general use
  2. Monitor distribution — Check reward histogram each cycle
  3. Adjust if needed — If most samples cluster at one level, rebalance the reward mapping
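
For the "monitor distribution" step, a minimal sketch of a per-cycle reward histogram using collections.Counter (the rewards list is assumed to come from the cycle's evaluation results):

  from collections import Counter

  def reward_histogram(rewards: list[float]) -> Counter:
      """Count how many samples landed at each reward level."""
      return Counter(round(r, 1) for r in rewards)

  # Example: if most of the mass sits at a single level, rebalance.
  hist = reward_histogram([1.0, 0.8, 0.7, 0.7, 0.6, 0.5, 0.1])
  for level in sorted(hist, reverse=True):
      print(f"reward {level:.1f}: {hist[level]} samples")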