# Graduated Rewards
Detection severity as training signal
## The Problem with Binary Rewards

Traditional approaches use a binary reward signal:
| Outcome | Reward |
|---|---|
| Evaded | 1.0 |
| Detected | 0.0 |
This loses information:

- All detections are treated equally
- No gradient for incremental improvement
- Critical vs. low-severity distinctions are ignored
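To make the information loss concrete, here is a minimal sketch of the binary scheme in Python. The `binary_reward` helper and the string outcome labels are illustrative, not part of malagent's API:

```python
def binary_reward(outcome: str) -> float:
    """Binary scheme: 1.0 only for a full evasion, 0.0 for everything else."""
    return 1.0 if outcome == "evaded" else 0.0

# A critical detection and a near-miss low-severity detection look identical,
# so neither provides a usable learning signal.
assert binary_reward("critical_detection") == binary_reward("low_detection") == 0.0
```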
## Graduated Reward Design

malagent uses alert severity as a gradient:
### Complete Mapping
| Outcome | Reward | Training Signal |
|---|---|---|
| Format failure | 0.0 | No valid code |
| Compile error | 0.1 | Invalid C++ |
| Execution crash | 0.2 | Runs but crashes |
| Critical detection | 0.5 | Worst evasion |
| High detection | 0.6 | Poor evasion |
| Medium detection | 0.7 | Moderate evasion |
| Low detection | 0.8 | Near success |
| Evaded | 1.0 | Success |
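The mapping can be expressed as a plain lookup table. In the sketch below the outcome labels and the `graduated_reward` helper are assumed names, but the reward values come directly from the table above:

```python
# Reward per outcome, taken from the Complete Mapping table above.
# Outcome labels are illustrative; use whatever identifiers the
# evaluation harness actually reports.
GRADUATED_REWARDS = {
    "format_failure":     0.0,  # no valid code
    "compile_error":      0.1,  # invalid C++
    "execution_crash":    0.2,  # runs but crashes
    "critical_detection": 0.5,  # worst evasion
    "high_detection":     0.6,  # poor evasion
    "medium_detection":   0.7,  # moderate evasion
    "low_detection":      0.8,  # near success
    "evaded":             1.0,  # success
}

def graduated_reward(outcome: str) -> float:
    """Map an evaluation outcome to its scalar training reward."""
    return GRADUATED_REWARDS[outcome]
```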
## Why This Design?

### Threshold at 0.5

RAFT filtering:

- reward >= 0.5 → keep for training
- reward < 0.5 → discard

This means:

- All code that executes without crashing trains the model
- Even detected code provides signal
- Only non-functional samples (format failures, compile errors, and crashes) are discarded
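A sketch of the RAFT-style filter, assuming each sample is a dict carrying a precomputed `reward` field (a hypothetical structure, not malagent's actual data model):

```python
RAFT_KEEP_THRESHOLD = 0.5  # reward >= 0.5 is kept for fine-tuning

def raft_filter(samples: list[dict]) -> list[dict]:
    """Keep only samples whose reward clears the threshold.

    Format failures (0.0), compile errors (0.1), and crashes (0.2) fall
    below 0.5 and are dropped; every sample that ran and was scored
    against the detector (0.5-0.8) plus every evasion (1.0) is kept.
    """
    return [s for s in samples if s["reward"] >= RAFT_KEEP_THRESHOLD]
```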
### Compile Failures Get 0.1

A small positive reward for valid C++ syntax separates compile errors (0.1) from format failures (0.0) and encourages the model to at least generate compilable code.
## Impact on Training

### With Binary Rewards

Cycle 1:

- Evaded: 50 samples (reward=1.0)
- Detected: 700 samples (reward=0.0)
- Training data: 50 samples
### With Graduated Rewards

Cycle 1:

- Evaded: 50 samples (reward=1.0)
- Low: 100 samples (reward=0.8)
- Medium: 200 samples (reward=0.7)
- High: 250 samples (reward=0.6)
- Critical: 150 samples (reward=0.5)
- Training data: 750 samples

15× more training data!
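The arithmetic behind that figure, using the Cycle 1 counts above:

```python
# Cycle 1 outcome counts from the example above, keyed by graduated reward.
cycle_counts = {1.0: 50, 0.8: 100, 0.7: 200, 0.6: 250, 0.5: 150}

binary_kept = cycle_counts[1.0]  # a binary filter keeps only the evasions
graduated_kept = sum(n for r, n in cycle_counts.items() if r >= 0.5)

print(binary_kept, graduated_kept, graduated_kept / binary_kept)  # 50 750 15.0
```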
## Recommendations
- Start with defaults — They’re balanced for general use
- Monitor distribution — Check the reward histogram each cycle (see the sketch after this list)
- Adjust if needed — If most samples at one level, rebalance
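For the distribution check, a minimal histogram sketch, assuming the same hypothetical sample dicts used above:

```python
from collections import Counter

def reward_histogram(samples: list[dict]) -> Counter:
    """Count how many samples landed at each reward level this cycle."""
    return Counter(s["reward"] for s in samples)

# Example: if almost all mass sits at one level, consider rebalancing.
print(reward_histogram([{"reward": 0.7}, {"reward": 0.7}, {"reward": 1.0}]))
# Counter({0.7: 2, 1.0: 1})
```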