Applying RLVR to EDR Evasion Code Generation

A proof-of-concept exploring graduated reward signals from Elastic Security detection for training code generation models.

Status: Experimental (Cycle 6 peak)
Hardware: AMD Strix Halo (gfx1151)
License: Apache 2.0
Notice: This project is for authorized security research only. Use responsibly and only on systems you own or have explicit permission to test.

Overview

malagent explores whether RLVR (Reinforcement Learning with Verifiable Rewards) can be applied to EDR evasion code generation. Rather than relying on a binary detected/undetected signal, this work uses graduated rewards derived from Elastic Security alert severity, which gives the training process richer feedback.

The hypothesis: more granular reward signals (critical vs. low severity detection) might enable more effective learning than simple pass/fail verification.

Verification Pipeline

  1. Compilation — Does the generated C++ code compile with MSVC?
  2. Execution — Does the binary run without crashing on a Windows VM?
  3. Detection — What alerts (if any) does Elastic Security generate?

Architecture

The system uses four components coordinated by the training host:

┌─────────────────────────────────────────────────────────────────────────────┐
│                           MALAGENT INFRASTRUCTURE                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  STRIX HALO (Training Host)            DEVBOX (Build Server)                │
│  ─────────────────────────────         ─────────────────────                │
│  AMD gfx1151 / 128GB unified           Windows 11 / MSVC                    │
│  ROCm 7 / Qwen2.5-Coder-7B             SSH compilation endpoint             │
│                                                                             │
│  ELASTIC CONTROLLER                    WINDOWS VM POOL                      │
│  ─────────────────────────────         ─────────────────────                │
│  Docker / ES / Kibana / Fleet          Proxmox VMs with Elastic Agent       │
│  Detection rules & alerts              Sample execution environment         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Training flow: Generate samples → Compile via SSH → Transfer to VMs → Execute with Elastic monitoring → Query alerts → Calculate rewards → Filter and train.

Reward Structure

The graduated reward signal maps detection outcomes to training signals:

Outcome              Reward   Description
Format failure       0.0      No valid C++ code extracted from completion
Compile failure      0.1      Code extracted but MSVC compilation failed
Execution failure    0.2      Binary crashed before detection could occur
Critical detection   0.5      Highest severity alert triggered
High detection       0.6      High severity alert
Medium detection     0.7      Medium severity alert
Low detection        0.8      Low severity alert
Evaded               1.0      No detection alerts generated

Additional penalties are applied for multiple rule triggers, behavioral detections (EQL/ML rules), and high risk scores.
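
As a rough illustration, the base values in the table above can be expressed as a lookup. This is a minimal sketch, not the project's implementation: the names `Outcome`, `BASE_REWARD`, and `base_reward` are hypothetical, and the additional penalties are left out because their magnitudes are not stated here.

```python
from enum import Enum


class Outcome(Enum):
    """Verification outcomes, named after the rows of the reward table (hypothetical names)."""
    FORMAT_FAILURE = "format_failure"
    COMPILE_FAILURE = "compile_failure"
    EXECUTION_FAILURE = "execution_failure"
    CRITICAL_DETECTION = "critical_detection"
    HIGH_DETECTION = "high_detection"
    MEDIUM_DETECTION = "medium_detection"
    LOW_DETECTION = "low_detection"
    EVADED = "evaded"


# Base reward per outcome, copied from the table. Penalties for multiple
# rule triggers, behavioral detections, and high risk scores are omitted
# because the text does not give their values.
BASE_REWARD = {
    Outcome.FORMAT_FAILURE: 0.0,
    Outcome.COMPILE_FAILURE: 0.1,
    Outcome.EXECUTION_FAILURE: 0.2,
    Outcome.CRITICAL_DETECTION: 0.5,
    Outcome.HIGH_DETECTION: 0.6,
    Outcome.MEDIUM_DETECTION: 0.7,
    Outcome.LOW_DETECTION: 0.8,
    Outcome.EVADED: 1.0,
}


def base_reward(outcome: Outcome) -> float:
    """Return the base reward for an outcome, before any penalties."""
    return BASE_REWARD[outcome]
```

Note the gap between 0.2 and 0.5: any sample that compiles and runs scores well above one that fails earlier in the pipeline, whatever the alert severity.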

Methodology

Phase 1: Supervised Fine-Tuning

Qwen2.5-Coder-7B is fine-tuned on Windows systems programming examples to establish baseline competence with Windows APIs, PE structures, and C++ patterns. This phase ran for 19.5 hours, during which the loss fell from 1.16 to 0.32.

Phase 2: RAFT Training

Iterative refinement using the RAFT algorithm:

For each cycle:

  1. Generate N samples per prompt (batch inference)
  2. Compile all samples on DEVBOX via SSH
  3. Transfer compiled binaries to VM pool
  4. Execute with sample_id correlation markers
  5. Poll Elastic API for alerts (60-120s detection latency)
  6. Calculate graduated rewards from alert severity
  7. Filter samples with reward >= 0.5
  8. Fine-tune on filtered samples
  9. Repeat with updated model
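
The filtering step (step 7) is the part of RAFT that decides which samples reach the next fine-tuning round. A minimal sketch of that step alone, assuming each sample already carries its computed reward; the `Sample` type and `filter_for_training` function are hypothetical names, not the project's API:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One generated completion with its computed reward (hypothetical type)."""
    prompt: str
    completion: str
    reward: float


def filter_for_training(samples: List[Sample], threshold: float = 0.5) -> List[Sample]:
    """Keep only samples whose reward meets the threshold (0.5 in the text)."""
    return [s for s in samples if s.reward >= threshold]


# Usage with placeholder samples
batch = [
    Sample("prompt A", "completion 1", 0.1),
    Sample("prompt A", "completion 2", 0.5),
    Sample("prompt B", "completion 3", 0.8),
]
kept = filter_for_training(batch)
print([s.completion for s in kept])
```

With a threshold of 0.5 and the base rewards listed earlier, every sample that compiled and executed is retained unless the additional penalties push its reward below the cutoff.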

MVR Mode

For faster iteration, MVR (Minimum Viable Reward) mode skips Elastic verification entirely and uses compilation success as the sole reward signal. This enables a form of curriculum learning: establish a compilation baseline first, then introduce detection penalties.

Results

Training progressed through 8 RAFT cycles. Cycle 6 achieved peak performance; subsequent cycles showed degradation.

Metric                        Value
Peak compilation (Cycle 6)    46.7%
Post-SFT baseline             52%
Techniques at 100% compile    10/15
Cycle 8 (degraded)            23%

Technique Coverage

Category    Techniques                                    Status
PE Basics   DOS/NT headers, export parsing, sections      100% compile
Syscalls    Direct, indirect, Hell's Gate, Halo's Gate    40-57% compile
Injection   Hollowing, thread hijack, APC                 Mixed
Memory      VirtualAlloc, RWX avoidance, sections         100% compile
Unhooking   ntdll refresh, syscall restore                Improving

Observations

Limitations

Documentation

Installation

git clone https://github.com/professor-moody/malagent.git
cd malagent/toolbox && ./build.sh

toolbox create malagent --image localhost/malagent:latest
toolbox enter malagent

# Configure build server and Elastic
cp configs/elastic_verifier.yaml.example configs/elastic_verifier.yaml

# Run training (MVR mode for compilation-only)
malagent raft train --config configs/raft_config.yaml --mode mvr

# Run with Elastic detection
malagent raft train --config configs/elastic_verifier.yaml --mode elastic

Related Work