SFT Phase

Supervised fine-tuning to establish a baseline

Purpose

SFT (Supervised Fine-Tuning) establishes baseline capability before RAFT refinement. The model learns Windows API patterns, C++ syntax, and technique-specific code structures.

Goals

  1. Windows API proficiency — Correct use of handles, types, and calling conventions
  2. C++ syntax baseline — Valid, compilable code structure
  3. Technique vocabulary — Familiarity with syscalls, injection, and evasion patterns

Without SFT

Starting RAFT from a raw base model results in:

  • Very low compile rates (~5%)
  • Poor code structure
  • Slow convergence
  • Wasted compute on invalid samples

Memory Requirements

| Model Size | Weights (bf16) | Optimizer (AdamW) | Gradients | Total  |
|------------|----------------|-------------------|-----------|--------|
| 3B         | ~6GB           | ~12GB             | ~6GB      | ~28GB  |
| 7B         | ~14GB          | ~28GB             | ~14GB     | ~60GB  |
| 14B        | ~28GB          | ~56GB             | ~28GB     | ~120GB |

Strix Halo (128GB unified): Comfortable for 7B, possible for 14B with offloading.
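
The totals follow a simple back-of-envelope rule: bf16 stores 2 bytes per parameter for weights and again for gradients, and AdamW keeps two moment tensors on top (the table implies they are held in the same 2-byte precision; fp32 moments would roughly double that column). Activations and framework overhead make up the remainder. A minimal sketch of that arithmetic, for illustration only:

def full_finetune_memory_gb(params_billions: float) -> float:
    bytes_per_param = 2                       # bf16: 2 bytes per parameter
    weights = params_billions * bytes_per_param
    gradients = weights                       # one gradient per weight
    optimizer = 2 * weights                   # AdamW first + second moments
    return weights + gradients + optimizer    # GB, before activations/overhead

print(full_finetune_memory_gb(7))             # 56.0; the table lists ~60GB once overhead is added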

Running SFT

Basic Usage

python malagent/sft/trainer.py \
    --model Qwen/Qwen2.5-Coder-7B \
    --data data/sft_examples.jsonl \
    --output output/sft \
    --epochs 3

With Custom Config

python malagent/sft/trainer.py \
    --config configs/sft_config.yaml

QLoRA Configuration

QLoRA (Quantized Low-Rank Adaptation) enables efficient fine-tuning by keeping the base model frozen in 4-bit quantized form and training only small low-rank adapter matrices on top of it.

Configuration Explained

lora:
  r: 64              # Rank: dimensionality of adapter matrices
  alpha: 128         # Scaling factor: typically 2×r
  dropout: 0.05      # Regularization during training
  target_modules:    # Which layers to adapt
    - q_proj         # Query projection (attention)
    - k_proj         # Key projection (attention)
    - v_proj         # Value projection (attention)
    - o_proj         # Output projection (attention)
    - gate_proj      # MLP gating
    - up_proj        # MLP up-projection
    - down_proj      # MLP down-projection
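
In code, this block corresponds to the standard PEFT and bitsandbytes configuration objects. The sketch below shows the assumed mapping (the library APIs are real; the surrounding trainer wiring in malagent may differ):

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantized base weights: the "Q" in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapters: mirrors the lora: block above
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)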

Parameter Choices

| Parameter | Low | Default | High | Notes                               |
|-----------|-----|---------|------|-------------------------------------|
| r         | 8   | 64      | 256  | Higher = more capacity, more memory |
| alpha     | 16  | 128     | 512  | Typically 2×r                       |
| dropout   | 0.0 | 0.05    | 0.1  | Higher = more regularization        |

Why These Defaults?

  • r=64: Good balance between capacity and efficiency. Lower ranks (8-16) may underfit complex code patterns.
  • alpha=128: LoRA scales the adapter update by alpha/r, so the standard 2× ratio keeps update magnitudes comparable across ranks.
  • All projection layers: Code generation benefits from adapting both the attention and MLP projections, not attention alone.

Trainable Parameters

Full 7B model:      7,000,000,000 parameters
With LoRA (r=64):      41,943,040 parameters (0.6%)
Memory savings:              ~85%
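
To verify the trainable-parameter count for a given rank, PEFT can report it directly. A quick check (assumed usage; exact counts depend on the model architecture):

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B", torch_dtype=torch.bfloat16, trust_remote_code=True
)
peft_model = get_peft_model(model, LoraConfig(
    r=64, lora_alpha=128, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
))
peft_model.print_trainable_parameters()  # reports trainable vs. total parameters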

Full Configuration

# configs/sft_config.yaml
sft:
  model:
    name: "Qwen/Qwen2.5-Coder-7B"
    trust_remote_code: true
    torch_dtype: "bfloat16"
    attn_implementation: "flash_attention_2"
  
  data:
    path: "data/sft_examples.jsonl"
    max_length: 2048
    train_split: 0.9
    shuffle: true
    seed: 42
  
  lora:
    r: 64
    alpha: 128
    dropout: 0.05
    bias: "none"
    task_type: "CAUSAL_LM"
    target_modules:
      - q_proj
      - k_proj
      - v_proj
      - o_proj
      - gate_proj
      - up_proj
      - down_proj
  
  training:
    epochs: 3
    batch_size: 2
    gradient_accumulation_steps: 16
    learning_rate: 2e-5
    warmup_ratio: 0.03
    weight_decay: 0.01
    lr_scheduler_type: "cosine"
    bf16: true
    gradient_checkpointing: true
    
    # Checkpointing
    save_steps: 500
    save_total_limit: 3
    logging_steps: 10
    eval_steps: 500
    
    # Early stopping
    early_stopping_patience: 3
    early_stopping_threshold: 0.01
  
  output:
    dir: "output/sft"
    push_to_hub: false
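
As a rough sketch of how the training: block maps onto a Hugging Face TrainingArguments object (assumed wiring, not malagent's actual trainer code):

import yaml
from transformers import TrainingArguments

with open("configs/sft_config.yaml") as f:
    cfg = yaml.safe_load(f)["sft"]

t = cfg["training"]
args = TrainingArguments(
    output_dir=cfg["output"]["dir"],
    num_train_epochs=t["epochs"],
    per_device_train_batch_size=t["batch_size"],
    gradient_accumulation_steps=t["gradient_accumulation_steps"],
    learning_rate=float(t["learning_rate"]),  # PyYAML reads a bare "2e-5" as a string, so cast
    warmup_ratio=t["warmup_ratio"],
    weight_decay=t["weight_decay"],
    lr_scheduler_type=t["lr_scheduler_type"],
    bf16=t["bf16"],
    gradient_checkpointing=t["gradient_checkpointing"],
    save_steps=t["save_steps"],
    save_total_limit=t["save_total_limit"],
    logging_steps=t["logging_steps"],
)
# eval_steps and the early-stopping keys map to the evaluation strategy
# and an EarlyStoppingCallback on the Trainer.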

Training Output

SUPERVISED FINE-TUNING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Model: Qwen/Qwen2.5-Coder-7B
  Parameters: 7.07B total, 41.9M trainable (0.6%)
  Memory: ~58GB estimated

Data: data/sft_examples.jsonl
  Total: 550 examples
  Train: 495 (90%)
  Eval: 55 (10%)

Configuration:
  Epochs: 3
  Batch size: 2 × 16 accumulation = 32 effective
  Learning rate: 2e-5 (cosine schedule)

Training...
  [Epoch 1/3] ████████████████████ 100%
    Train loss: 1.16 → 0.72
    Eval loss: 0.78
    Time: 6.5 hours

  [Epoch 2/3] ████████████████████ 100%
    Train loss: 0.72 → 0.45
    Eval loss: 0.52
    Time: 6.5 hours

  [Epoch 3/3] ████████████████████ 100%
    Train loss: 0.45 → 0.32
    Eval loss: 0.34
    Time: 6.5 hours

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Final eval loss: 0.34
Total duration: 19.5 hours
Saved to: output/sft/final

Resuming Training

From Checkpoint

# Resume from specific checkpoint
python malagent/sft/trainer.py \
    --resume-from output/sft/checkpoint-1500

Automatic Resume

# Auto-detect latest checkpoint
python malagent/sft/trainer.py --resume
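
Auto-resume is assumed to rely on the standard checkpoint discovery in transformers, which picks the checkpoint-* directory with the highest step number under the output path; roughly:

import os
from transformers.trainer_utils import get_last_checkpoint

output_dir = "output/sft"
last = get_last_checkpoint(output_dir) if os.path.isdir(output_dir) else None
print(last)  # e.g. output/sft/checkpoint-1500, or None if no checkpoint exists
# The trainer then passes this via trainer.train(resume_from_checkpoint=last)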

Monitoring

Loss Curves

Good training shows:

  • Steady decrease in training loss
  • Eval loss following (not diverging)
  • No sudden spikes

Loss over time:
1.2 ┤
1.0 ┤╲
0.8 ┤ ╲──╮
0.6 ┤    ╲──╮
0.4 ┤       ╲──────
0.2 ┤
    └──────────────────
       Epoch 1  2  3

Warning Signs

| Symptom              | Cause                  | Fix                                          |
|----------------------|------------------------|----------------------------------------------|
| Eval loss increasing | Overfitting            | Reduce epochs, increase dropout              |
| Loss not decreasing  | LR too low/high        | Try 1e-5 or 1e-4                             |
| Spiky loss           | Batch too small        | Increase gradient accumulation               |
| Very slow training   | Checkpointing overhead | Increase save_steps (checkpoint less often)  |

Evaluation After SFT

Before proceeding to RAFT, evaluate the SFT model:

# Quick compile rate check
python malagent/benchmark/evaluate.py \
    --model output/sft/final \
    --prompts data/eval_prompts.jsonl \
    --samples 4

# Expected output
Compile rate: 18.5% (baseline was ~5%)

Expected Results

| Metric       | Before SFT | After SFT | Target |
|--------------|------------|-----------|--------|
| Compile Rate | ~5%        | 15-25%    | >15%   |
| Loss         | 1.2        | 0.3-0.4   | <0.5   |
| Duration     | -          | ~20 hours | -      |

Next Steps

After SFT completes successfully:

  1. Evaluate — Verify compile rate is >15%
  2. Proceed to RAFT — see RAFT Training
  3. Save baseline — Keep SFT checkpoint for comparison