# Graduated Rewards
Detection severity as training signal
## The Problem with Binary Rewards

Traditional approaches use a binary reward signal:
| Outcome | Reward |
|---|---|
| Evaded | 1.0 |
| Detected | 0.0 |
This loses information:

- All detections are treated equally
- No gradient for incremental improvement
- Critical vs. low-severity distinctions are ignored
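To make the information loss concrete, here is a minimal sketch of the binary scheme in Python. The `binary_reward` helper and the string outcome labels are illustrative, not part of malagent's API:

```python
def binary_reward(outcome: str) -> float:
    """Binary scheme: 1.0 only for a full evasion, 0.0 for everything else."""
    return 1.0 if outcome == "evaded" else 0.0

# A critical detection and a near-miss low-severity detection look identical,
# so neither provides a usable learning signal.
assert binary_reward("critical_detection") == binary_reward("low_detection") == 0.0
```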
## Graduated Reward Design

malagent uses alert severity as a gradient:
### Complete Mapping
| Outcome | Reward | Training Signal |
|---|---|---|
| Format failure | 0.0 | No valid code |
| Compile error | 0.1 | Invalid C++ |
| Execution crash | 0.2 | Runs but crashes |
| Critical detection | 0.5 | Worst evasion |
| High detection | 0.6 | Poor evasion |
| Medium detection | 0.7 | Moderate evasion |
| Low detection | 0.8 | Near success |
| Evaded | 1.0 | Success |
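The mapping can be expressed as a plain lookup table. In the sketch below the outcome labels and the `graduated_reward` helper are assumed names, but the reward values come directly from the table above:

```python
# Reward per outcome, taken from the Complete Mapping table above.
# Outcome labels are illustrative; use whatever identifiers the
# evaluation harness actually reports.
GRADUATED_REWARDS = {
    "format_failure":     0.0,  # no valid code
    "compile_error":      0.1,  # invalid C++
    "execution_crash":    0.2,  # runs but crashes
    "critical_detection": 0.5,  # worst evasion
    "high_detection":     0.6,  # poor evasion
    "medium_detection":   0.7,  # moderate evasion
    "low_detection":      0.8,  # near success
    "evaded":             1.0,  # success
}

def graduated_reward(outcome: str) -> float:
    """Map an evaluation outcome to its scalar training reward."""
    return GRADUATED_REWARDS[outcome]
```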
## Why This Design?

### Threshold at 0.5

RAFT filtering:

- reward >= 0.5 → keep for training
- reward < 0.5 → discard

This means:

- All code that executes without crashing trains the model
- Even detected code provides signal
- Only non-functional samples (format failures, compile errors, and crashes) are discarded
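A sketch of the RAFT-style filter, assuming each sample is a dict carrying a precomputed `reward` field (a hypothetical structure, not malagent's actual data model):

```python
RAFT_KEEP_THRESHOLD = 0.5  # reward >= 0.5 is kept for fine-tuning

def raft_filter(samples: list[dict]) -> list[dict]:
    """Keep only samples whose reward clears the threshold.

    Format failures (0.0), compile errors (0.1), and crashes (0.2) fall
    below 0.5 and are dropped; every sample that ran and was scored
    against the detector (0.5-0.8) plus every evasion (1.0) is kept.
    """
    return [s for s in samples if s["reward"] >= RAFT_KEEP_THRESHOLD]
```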
### Compile Failures Get 0.1

A small positive reward for valid C++ syntax separates compile errors (0.1) from format failures (0.0) and encourages the model to at least generate compilable code.
## Impact on Training

### With Binary Rewards

Cycle 1:

- Evaded: 50 samples (reward=1.0)
- Detected: 700 samples (reward=0.0)
- Training data: 50 samples
### With Graduated Rewards

Cycle 1:

- Evaded: 50 samples (reward=1.0)
- Low: 100 samples (reward=0.8)
- Medium: 200 samples (reward=0.7)
- High: 250 samples (reward=0.6)
- Critical: 150 samples (reward=0.5)
- Training data: 750 samples

15× more training data!
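The arithmetic behind that figure, using the Cycle 1 counts above:

```python
# Cycle 1 outcome counts from the example above, keyed by graduated reward.
cycle_counts = {1.0: 50, 0.8: 100, 0.7: 200, 0.6: 250, 0.5: 150}

binary_kept = cycle_counts[1.0]  # a binary filter keeps only the evasions
graduated_kept = sum(n for r, n in cycle_counts.items() if r >= 0.5)

print(binary_kept, graduated_kept, graduated_kept / binary_kept)  # 50 750 15.0
```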
## Recommendations
- Start with defaults — They’re balanced for general use
- Monitor distribution — Check the reward histogram each cycle (see the sketch after this list)
- Adjust if needed — If most samples at one level, rebalance
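For the distribution check, a minimal histogram sketch, assuming the same hypothetical sample dicts used above:

```python
from collections import Counter

def reward_histogram(samples: list[dict]) -> Counter:
    """Count how many samples landed at each reward level this cycle."""
    return Counter(s["reward"] for s in samples)

# Example: if almost all mass sits at one level, consider rebalancing.
print(reward_histogram([{"reward": 0.7}, {"reward": 0.7}, {"reward": 1.0}]))
# Counter({0.7: 2, 1.0: 1})
```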