Applying RLVR to EDR Evasion Code Generation
A proof-of-concept exploring graduated reward signals from Elastic Security detection for training code generation models.
Overview
malagent explores whether RLVR (Reinforcement Learning with Verifiable Rewards) can be applied to EDR evasion code generation. Instead of a binary detected/undetected signal, this work uses graduated rewards derived from Elastic Security alert severity, providing richer feedback for training.
The hypothesis: a more granular reward signal (distinguishing, say, a critical-severity detection from a low-severity one) might enable more effective learning than simple pass/fail verification.
Verification Pipeline
- Compilation — Does the generated C++ code compile with MSVC?
- Execution — Does the binary run without crashing on a Windows VM?
- Detection — What alerts (if any) does Elastic Security generate?
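A minimal sketch of how these three checks might compose into a single verification outcome; `compile_remote`, `run_on_vm`, and `query_alerts` are hypothetical placeholders for the project's SSH, VM, and Elastic integrations, not its actual API:

```python
from dataclasses import dataclass, field

@dataclass
class VerificationResult:
    compiled: bool
    executed: bool
    alerts: list = field(default_factory=list)  # Elastic alerts; empty if none fired

def verify(source: str) -> VerificationResult:
    """Run one generated C++ sample through the three-stage pipeline."""
    binary = compile_remote(source)   # hypothetical: MSVC over SSH, None on failure
    if binary is None:
        return VerificationResult(compiled=False, executed=False)
    if not run_on_vm(binary):         # hypothetical: run on a monitored Windows VM
        return VerificationResult(compiled=True, executed=False)
    alerts = query_alerts(binary)     # hypothetical: poll Elastic for alerts on this run
    return VerificationResult(compiled=True, executed=True, alerts=alerts)
```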
Architecture
The system uses four components, coordinated by the training host: the training host itself (sample generation and fine-tuning), a Windows build server reached over SSH (MSVC compilation), Windows execution VMs running Elastic monitoring, and an Elastic Security deployment for alert queries.
Training flow: Generate samples → Compile via SSH → Transfer to VMs → Execute with Elastic monitoring → Query alerts → Calculate rewards → Filter and train.
Reward Structure
The graduated reward signal maps detection outcomes to training signals:
| Outcome | Reward | Description |
|---|---|---|
| Format failure | 0.0 | No valid C++ code extracted from completion |
| Compile failure | 0.1 | Code extracted but MSVC compilation failed |
| Execution failure | 0.2 | Binary crashed before detection could occur |
| Critical detection | 0.5 | Highest severity alert triggered |
| High detection | 0.6 | High severity alert |
| Medium detection | 0.7 | Medium severity alert |
| Low detection | 0.8 | Low severity alert |
| Evaded | 1.0 | No detection alerts generated |
Additional penalties are applied for multiple rule triggers, behavioral detections (EQL/ML rules), and high risk scores.
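Translated into code, the mapping might look like the sketch below. The `VerificationResult` shape matches the pipeline sketch earlier; the penalty weights and the alert-dict fields (`severity`, `rule_type`) are illustrative assumptions, not the project's actual values:

```python
# Base reward set by the worst-severity alert, per the table above.
SEVERITY_REWARD = {"critical": 0.5, "high": 0.6, "medium": 0.7, "low": 0.8}

def reward(result) -> float:
    """Map a verification outcome to a scalar reward in [0, 1]."""
    if result is None:                 # format failure: no valid C++ extracted
        return 0.0
    if not result.compiled:
        return 0.1
    if not result.executed:
        return 0.2
    if not result.alerts:              # evaded: no alerts fired
        return 1.0
    # Most severe alert sets the base reward.
    base = min(SEVERITY_REWARD[a["severity"]] for a in result.alerts)
    # Illustrative penalties for multiple rule hits and behavioral (EQL/ML) rules.
    base -= 0.02 * (len(result.alerts) - 1)
    if any(a.get("rule_type") in ("eql", "machine_learning") for a in result.alerts):
        base -= 0.05
    return max(base, 0.0)
```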
Methodology
Phase 1: Supervised Fine-Tuning
Qwen2.5-Coder-7B is fine-tuned on Windows systems programming examples to establish baseline competence with Windows APIs, PE structures, and C++ patterns. This phase ran for 19.5 hours, reducing training loss from 1.16 to 0.32.
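Concretely, this phase amounts to standard supervised fine-tuning; a minimal sketch using Hugging Face TRL, where the dataset path and "text"-field format are assumptions and the project's actual trainer and hyperparameters may differ:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL of Windows systems-programming examples with a "text" field.
dataset = load_dataset("json", data_files="data/sft_windows.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-7B",   # base model named above
    train_dataset=dataset,
    args=SFTConfig(output_dir="checkpoints/sft"),
)
trainer.train()
```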
Phase 2: RAFT Training
Iterative refinement using the RAFT algorithm (Dong et al., 2023): each cycle samples multiple completions per prompt, scores them with the verification pipeline, keeps only samples whose reward clears a threshold, and fine-tunes on the filtered set before sampling again.
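A schematic of one cycle, reusing the `verify` and `reward` sketches above; `model.generate`, `model.fine_tune`, and `extract_cpp` are illustrative stand-ins for the actual generation, training, and code-extraction steps:

```python
def raft_cycle(model, prompts, k=8, threshold=0.5):
    """One RAFT iteration: sample k completions per prompt, score, filter, train."""
    kept = []
    for prompt in prompts:
        best, best_r = None, 0.0
        for _ in range(k):
            completion = model.generate(prompt)
            src = extract_cpp(completion)            # None if no valid C++ block
            r = reward(verify(src)) if src else 0.0  # format failure scores 0.0
            if r > best_r:
                best, best_r = completion, r
        if best is not None and best_r >= threshold:
            kept.append((prompt, best))              # retain only high-reward samples
    model.fine_tune(kept)                            # offline SFT step on filtered set
    return kept
```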
MVR Mode
For faster iteration, MVR (Minimum Viable Reward) mode skips Elastic verification entirely and uses compilation success as the sole reward signal. This enables curriculum learning: establish a compilation baseline before introducing detection penalties.
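In MVR mode the verifier reduces to a compile check; a sketch, reusing the hypothetical helpers above:

```python
def mvr_reward(completion: str) -> float:
    """MVR mode: compilation success is the entire reward signal."""
    src = extract_cpp(completion)
    if src is None:
        return 0.0                                   # format failure
    return 1.0 if compile_remote(src) is not None else 0.1
```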
Results
Training progressed through 8 RAFT cycles. Cycle 6 achieved peak performance; subsequent cycles showed degradation.
Technique Coverage
| Category | Techniques | Status |
|---|---|---|
| PE Basics | DOS/NT headers, export parsing, sections | 100% compile |
| Syscalls | Direct, indirect, Hell's Gate, Halo's Gate | 40-57% compile |
| Injection | Hollowing, thread hijack, APC | Mixed |
| Memory | VirtualAlloc, RWX avoidance, sections | 100% compile |
| Unhooking | ntdll refresh, syscall restore | Improving |
Observations
- Emergent curriculum learning: The sorted dataset structure enabled natural progression from simple to complex techniques.
- Capability frontier expansion: Model improved from 97 to 111 successful prompts (+14%) in a single cycle.
- Diminishing returns: Performance degraded after Cycle 6 (29.3% → 23%), suggesting need for prompt diversity or learning rate decay.
- GRPO limitations: Online RL methods proved impractical on this hardware; offline RAFT was ~10x faster.
Limitations
- Elastic detection rules represent a single point-in-time signal; real EDR behavior varies across vendors and configurations.
- Compilation and execution success don't guarantee functional malware — only that the code runs without crashing.
- Detection latency (60-120s per sample) significantly impacts training throughput.
- Dataset is limited to specific technique categories; generalization to novel techniques is unclear.
- Results are from a single hardware configuration and may not transfer to other setups.
- The approach may be learning surface-level patterns that satisfy Elastic's specific rules rather than general evasion principles.
- Post-Cycle 6 degradation suggests the training process needs further tuning (prompt diversity, learning rate schedules, early stopping).
Documentation
Installation
```bash
git clone https://github.com/professor-moody/malagent.git
cd malagent/toolbox && ./build.sh
toolbox create malagent --image localhost/malagent:latest
toolbox enter malagent

# Configure build server and Elastic
cp configs/elastic_verifier.yaml.example configs/elastic_verifier.yaml

# Run training (MVR mode for compilation-only)
malagent raft train --config configs/raft_config.yaml --mode mvr

# Run with Elastic detection
malagent raft train --config configs/elastic_verifier.yaml --mode elastic
```
Related Work
- halo-forge — General-purpose RLVR framework for code generation
- RAFT Paper — "RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment" (Dong et al., 2023)