malagent

Exploring RLVR for Security Research

Status: Research Preview
License: Apache 2.0
Notice: This project is for authorized security research only. Use responsibly and only on systems you own or have explicit permission to test.

What is RLVR?

Reinforcement Learning with Verifiable Rewards (RLVR) is an approach to training language models that uses automated verification as the reward signal, rather than human preferences (RLHF) or AI preferences (RLAIF).

The key insight: when outcomes can be objectively verified (code compiles, tests pass, security constraints satisfied), we can use these verification results directly as training signals.
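As a toy illustration of that idea, the sketch below scores a candidate program by simply executing it: a clean exit (all inline checks pass) earns reward 1.0, anything else earns 0.0. This is a generic binary verifier reward, not malagent's actual implementation; the function name and 30-second timeout are invented for the example.

```python
import os
import subprocess
import sys
import tempfile

def verify_reward(candidate_source: str) -> float:
    """Binary verifier reward: 1.0 if the candidate program runs to
    completion (all inline assertions pass), 0.0 otherwise.
    Illustrative sketch only, not malagent's verifier."""
    # Write the candidate to a temp file so it runs in a fresh interpreter.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_source)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=30)
        # Any assertion failure or crash produces a non-zero exit code.
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.remove(path)

print(verify_reward("assert 1 + 1 == 2"))  # → 1.0
print(verify_reward("assert 1 + 1 == 3"))  # → 0.0
```

Because the reward comes from an objective check rather than a preference model, the same signal can be recomputed deterministically for any rollout.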

malagent explores applying RLVR concepts to security research domains where verification is inherently possible through detection systems.

Configurable Rewards

malagent supports flexible reward configuration for different training objectives.

Choose between binary pass/fail signals and graduated rewards based on detection severity, with verification modes ranging from compile-only checks to full EDR detection.
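Schematically, such a configuration might look like the following. All class, field, and mode names here are hypothetical, invented for illustration; malagent's actual schema is described in Reward Configuration.

```python
from dataclasses import dataclass
from enum import Enum

class RewardMode(Enum):
    BINARY = "binary"          # flat pass/fail signal
    GRADUATED = "graduated"    # scaled by detection severity

class VerificationMode(Enum):
    COMPILE_ONLY = "compile_only"  # weakest check: artifact builds
    FULL_EDR = "full_edr"          # strongest check: full EDR detection

@dataclass
class RewardConfig:
    # Hypothetical defaults, chosen only for this sketch.
    reward_mode: RewardMode = RewardMode.BINARY
    verification: VerificationMode = VerificationMode.COMPILE_ONLY

cfg = RewardConfig(reward_mode=RewardMode.GRADUATED)
print(cfg.reward_mode.value)  # → graduated
```

Separating the reward shape (binary vs. graduated) from the verification depth lets the two be varied independently across training objectives.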

See Reward Configuration for details.

Research

Access

Full documentation, implementation details, and training infrastructure guides are available for authorized research collaborators.

GitHub Repository

References