Research

RLVR research foundations

Overview

This section explores the theoretical foundations of applying Reinforcement Learning with Verifiable Rewards (RLVR) to security research domains.

Pages

Background

RLVR is an alternative to RLHF (reinforcement learning from human feedback) and RLAIF (reinforcement learning from AI feedback). When an outcome can be objectively verified—the code compiles, tests pass, security constraints hold—the verification result can serve directly as the training signal, with no learned reward model in the loop.
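The idea of mapping verification outcomes directly to rewards can be sketched as follows. This is an illustrative example only: the `VerificationResult` shape, the function names, and the specific weightings are assumptions for the sketch, not part of any real interface described here.

```python
# Sketch of an RLVR-style reward: objective verification outcomes are mapped
# directly to scalar rewards, with no learned reward model in the loop.
# The VerificationResult fields and weightings below are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class VerificationResult:
    compiled: bool       # did the candidate build?
    tests_passed: int    # objective checks that succeeded
    tests_total: int     # objective checks that were run


def binary_reward(result: VerificationResult) -> float:
    """All-or-nothing: 1.0 only if every check passes, else 0.0."""
    if result.compiled and result.tests_passed == result.tests_total:
        return 1.0
    return 0.0


def graduated_reward(result: VerificationResult) -> float:
    """Partial credit: a small reward for compiling, the rest scaled by pass rate."""
    if not result.compiled:
        return 0.0
    return 0.2 + 0.8 * (result.tests_passed / result.tests_total)
```

A binary reward gives a cleaner optimization target but a sparser signal; a graduated reward provides denser feedback early in training at the cost of potentially rewarding partial solutions.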

malagent supports configurable reward modes (binary or graduated) and verification modes (compile-only or full EDR detection) to accommodate different training objectives.