Applying RLVR to EDR Evasion Code Generation
A proof-of-concept exploring graduated reward signals from Elastic Security detection for training code generation models.
Overview
malagent explores whether RLVR (Reinforcement Learning with Verifiable Rewards) can be applied to EDR evasion code generation. Instead of a binary detected/undetected signal, this work uses graduated rewards derived from Elastic Security alert severity, providing richer feedback for training.
The hypothesis: a more granular reward signal (distinguishing, say, a critical-severity detection from a low-severity one) might enable more effective learning than simple pass/fail verification.
Verification Pipeline
- Compilation — Does the generated C++ code compile with MSVC?
- Execution — Does the binary run without crashing on a Windows VM?
- Detection — What alerts (if any) does Elastic Security generate?
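A minimal sketch of how these three checks might compose into a single verification outcome; `compile_remote`, `run_on_vm`, and `query_alerts` are hypothetical placeholders for the project's SSH, VM, and Elastic integrations, not its actual API:

```python
from dataclasses import dataclass, field

@dataclass
class VerificationResult:
    compiled: bool
    executed: bool
    alerts: list = field(default_factory=list)  # Elastic alerts; empty if none fired

def verify(source: str) -> VerificationResult:
    """Run one generated C++ sample through the three-stage pipeline."""
    binary = compile_remote(source)   # hypothetical: MSVC over SSH, None on failure
    if binary is None:
        return VerificationResult(compiled=False, executed=False)
    if not run_on_vm(binary):         # hypothetical: run on a monitored Windows VM
        return VerificationResult(compiled=True, executed=False)
    alerts = query_alerts(binary)     # hypothetical: poll Elastic for alerts on this run
    return VerificationResult(compiled=True, executed=True, alerts=alerts)
```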
Architecture
The system uses four components, coordinated by the training host: the training host itself (sample generation and fine-tuning), a Windows build server reached over SSH (MSVC compilation), Windows execution VMs running Elastic monitoring, and an Elastic Security deployment for alert queries.
Training flow: Generate samples → Compile via SSH → Transfer to VMs → Execute with Elastic monitoring → Query alerts → Calculate rewards → Filter and train.
Reward Structure
The graduated reward signal maps detection outcomes to training signals:
| Outcome | Reward | Description |
|---|---|---|
| Format failure | 0.0 | No valid C++ code extracted from completion |
| Compile failure | 0.1 | Code extracted but MSVC compilation failed |
| Execution failure | 0.2 | Binary crashed before detection could occur |
| Critical detection | 0.5 | Highest severity alert triggered |
| High detection | 0.6 | High severity alert |
| Medium detection | 0.7 | Medium severity alert |
| Low detection | 0.8 | Low severity alert |
| Evaded | 1.0 | No detection alerts generated |
Additional penalties are applied for multiple rule triggers, behavioral detections (EQL/ML rules), and high risk scores.
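Translated into code, the mapping might look like the sketch below. The `VerificationResult` shape matches the pipeline sketch earlier; the penalty weights and the alert-dict fields (`severity`, `rule_type`) are illustrative assumptions, not the project's actual values:

```python
# Base reward set by the worst-severity alert, per the table above.
SEVERITY_REWARD = {"critical": 0.5, "high": 0.6, "medium": 0.7, "low": 0.8}

def reward(result) -> float:
    """Map a verification outcome to a scalar reward in [0, 1]."""
    if result is None:                 # format failure: no valid C++ extracted
        return 0.0
    if not result.compiled:
        return 0.1
    if not result.executed:
        return 0.2
    if not result.alerts:              # evaded: no alerts fired
        return 1.0
    # Most severe alert sets the base reward.
    base = min(SEVERITY_REWARD[a["severity"]] for a in result.alerts)
    # Illustrative penalties for multiple rule hits and behavioral (EQL/ML) rules.
    base -= 0.02 * (len(result.alerts) - 1)
    if any(a.get("rule_type") in ("eql", "machine_learning") for a in result.alerts):
        base -= 0.05
    return max(base, 0.0)
```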
Methodology
Phase 1: Supervised Fine-Tuning
Qwen2.5-Coder-7B is fine-tuned on Windows systems programming examples to establish baseline competence with Windows APIs, PE structures, and C++ patterns. This phase ran for 19.5 hours, reducing training loss from 1.16 to 0.32.
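Concretely, this phase amounts to standard supervised fine-tuning; a minimal sketch using Hugging Face TRL, where the dataset path and "text"-field format are assumptions and the project's actual trainer and hyperparameters may differ:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL of Windows systems-programming examples with a "text" field.
dataset = load_dataset("json", data_files="data/sft_windows.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-7B",   # base model named above
    train_dataset=dataset,
    args=SFTConfig(output_dir="checkpoints/sft"),
)
trainer.train()
```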
Phase 2: RAFT Training
Iterative refinement using the RAFT algorithm (Dong et al., 2023): each cycle samples multiple completions per prompt, scores them with the verification pipeline, keeps only samples whose reward clears a threshold, and fine-tunes on the filtered set before sampling again.
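A schematic of one cycle, reusing the `verify` and `reward` sketches above; `model.generate`, `model.fine_tune`, and `extract_cpp` are illustrative stand-ins for the actual generation, training, and code-extraction steps:

```python
def raft_cycle(model, prompts, k=8, threshold=0.5):
    """One RAFT iteration: sample k completions per prompt, score, filter, train."""
    kept = []
    for prompt in prompts:
        best, best_r = None, 0.0
        for _ in range(k):
            completion = model.generate(prompt)
            src = extract_cpp(completion)            # None if no valid C++ block
            r = reward(verify(src)) if src else 0.0  # format failure scores 0.0
            if r > best_r:
                best, best_r = completion, r
        if best is not None and best_r >= threshold:
            kept.append((prompt, best))              # retain only high-reward samples
    model.fine_tune(kept)                            # offline SFT step on filtered set
    return kept
```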
MVR Mode
For faster iteration, MVR (Minimum Viable Reward) mode skips Elastic verification entirely and uses compilation success as the sole reward signal. This enables curriculum learning: establish a compilation baseline before introducing detection penalties.
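In MVR mode the verifier reduces to a compile check; a sketch, reusing the hypothetical helpers above:

```python
def mvr_reward(completion: str) -> float:
    """MVR mode: compilation success is the entire reward signal."""
    src = extract_cpp(completion)
    if src is None:
        return 0.0                                   # format failure
    return 1.0 if compile_remote(src) is not None else 0.1
```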
Results
Training progressed through 8 RAFT cycles. Cycle 6 achieved peak performance; subsequent cycles showed degradation.
Technique Coverage
| Category | Techniques | Status |
|---|---|---|
| PE Basics | DOS/NT headers, export parsing, sections | 100% compile |
| Syscalls | Direct, indirect, Hell's Gate, Halo's Gate | 40-57% compile |
| Injection | Hollowing, thread hijack, APC | Mixed |
| Memory | VirtualAlloc, RWX avoidance, sections | 100% compile |
| Unhooking | ntdll refresh, syscall restore | Improving |
Observations
- Emergent curriculum learning: The sorted dataset structure enabled natural progression from simple to complex techniques.
- Capability frontier expansion: Model improved from 97 to 111 successful prompts (+14%) in a single cycle.
- Diminishing returns: Performance degraded after Cycle 6 (29.3% → 23%), suggesting need for prompt diversity or learning rate decay.
- GRPO limitations: Online RL methods proved impractical on this hardware; offline RAFT was ~10x faster.
Limitations
- Elastic detection rules represent a single point-in-time signal; real EDR behavior varies across vendors and configurations.
- Compilation and execution success don't guarantee functional malware — only that the code runs without crashing.
- Detection latency (60-120s per sample) significantly impacts training throughput.
- Dataset is limited to specific technique categories; generalization to novel techniques is unclear.
- Results are from a single hardware configuration and may not transfer to other setups.
- The approach may be learning surface-level patterns that satisfy Elastic's specific rules rather than general evasion principles.
- Post-Cycle 6 degradation suggests the training process needs further tuning (prompt diversity, learning rate schedules, early stopping).
Documentation
Installation
```bash
git clone https://github.com/professor-moody/malagent.git
cd malagent/toolbox && ./build.sh
toolbox create malagent --image localhost/malagent:latest
toolbox enter malagent

# Configure build server and Elastic
cp configs/elastic_verifier.yaml.example configs/elastic_verifier.yaml

# Run training (MVR mode for compilation-only)
malagent raft train --config configs/raft_config.yaml --mode mvr

# Run with Elastic detection
malagent raft train --config configs/elastic_verifier.yaml --mode elastic
```
Related Work
- halo-forge — General-purpose RLVR framework for code generation
- RAFT Paper — "RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment" (Dong et al., 2023)