# RLVR research foundations
## Overview
This section covers the theoretical foundations of applying Reinforcement Learning with Verifiable Rewards (RLVR) to security research domains.
## Pages
- Theory — RLVR concepts and the RAFT algorithm
- Reward Configuration — Configurable reward signals for training
## Background
RLVR is an alternative to RLHF (reinforcement learning from human feedback) and RLAIF (reinforcement learning from AI feedback). When an outcome can be objectively verified—the code compiles, the tests pass, the security constraints are satisfied—the verification result can serve directly as the training signal, with no learned reward model in the loop.
malagent supports configurable reward modes (binary or graduated) and verification modes (compile-only or full EDR detection) to accommodate different training objectives.
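One way such a configuration could be modeled is sketched below. The class and function names (`RewardConfig`, `score`, and the enum values) are illustrative assumptions, not malagent's actual API; only the two mode axes (binary vs. graduated reward, compile-only vs. full verification) come from the text above.

```python
# Hypothetical sketch of a two-axis reward configuration.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RewardMode(Enum):
    BINARY = "binary"          # all-or-nothing reward
    GRADUATED = "graduated"    # partial credit per stage passed

class VerificationMode(Enum):
    COMPILE_ONLY = "compile-only"
    FULL = "full"              # compile check plus EDR detection check

@dataclass
class RewardConfig:
    reward_mode: RewardMode
    verification_mode: VerificationMode

def score(cfg: RewardConfig, compiled: bool,
          evaded_edr: Optional[bool] = None) -> float:
    # Collect the verification stages this configuration cares about.
    stages = [compiled]
    if cfg.verification_mode is VerificationMode.FULL:
        stages.append(bool(evaded_edr))
    # Binary mode requires every stage; graduated mode gives
    # proportional credit for the stages that passed.
    if cfg.reward_mode is RewardMode.BINARY:
        return 1.0 if all(stages) else 0.0
    return sum(stages) / len(stages)
```

Under this sketch, a candidate that compiles but is detected scores 0.5 in graduated/full mode and 0.0 in binary/full mode, which is the kind of distinction a graduated signal is meant to preserve.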