SFT Phase
Supervised fine-tuning to establish baseline
Purpose
SFT (Supervised Fine-Tuning) establishes baseline capability before RAFT refinement. The model learns Windows API patterns, C++ syntax, and technique-specific code structures.
Goals
- Windows API proficiency — Correct use of handles, types, and calling conventions
- C++ syntax baseline — Valid, compilable code structure
- Technique vocabulary — Familiarity with syscalls, injection, and evasion patterns
Without SFT
Starting RAFT from a raw base model results in:
- Very low compile rates (~5%)
- Poor code structure
- Slow convergence
- Wasted compute on invalid samples
Memory Requirements
| Model Size | Weights (bf16) | Optimizer (AdamW) | Gradients | Total |
|---|---|---|---|---|
| 3B | ~6GB | ~12GB | ~6GB | ~28GB |
| 7B | ~14GB | ~28GB | ~14GB | ~60GB |
| 14B | ~28GB | ~56GB | ~28GB | ~120GB |
These figures correspond to full-parameter fine-tuning (optimizer state and gradients for every weight); with the QLoRA setup described below, only the small adapters carry optimizer state, so the practical footprint is much lower. Strix Halo (128GB unified memory): comfortable for 7B, possible for 14B with offloading.
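The arithmetic behind the table is plain bytes-per-parameter. A back-of-envelope check (a sketch: the per-component byte counts are read off the table's columns, and the table's totals leave extra headroom for activations and framework overhead):
# Rough memory arithmetic for full fine-tuning: bf16 weights and gradients at
# 2 bytes/parameter, AdamW state at ~4 bytes/parameter (as the table implies).
# 1e9 parameters x 1 byte is ~1 GB, so billions of parameters map directly to GB.
def full_ft_memory_gb(params_billion: float) -> dict:
    bytes_per_param = {"weights": 2, "optimizer": 4, "gradients": 2}
    return {name: params_billion * b for name, b in bytes_per_param.items()}

for size in (3, 7, 14):
    parts = full_ft_memory_gb(size)
    print(f"{size}B: {parts} -> ~{sum(parts.values()):.0f} GB before overhead")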
Running SFT
Basic Usage
python malagent/sft/trainer.py \
--model Qwen/Qwen2.5-Coder-7B \
--data data/sft_examples.jsonl \
--output output/sft \
--epochs 3
With Custom Config
python malagent/sft/trainer.py \
--config configs/sft_config.yaml
QLoRA Configuration
QLoRA (Quantized Low-Rank Adaptation) keeps the base model frozen in 4-bit precision and trains only small low-rank adapter matrices, cutting trainable parameters and optimizer memory to a fraction of full fine-tuning.
Configuration Explained
lora:
  r: 64                 # Rank: dimensionality of adapter matrices
  alpha: 128            # Scaling factor: typically 2×r
  dropout: 0.05         # Regularization during training
  target_modules:       # Which layers to adapt
    - q_proj            # Query projection (attention)
    - k_proj            # Key projection (attention)
    - v_proj            # Value projection (attention)
    - o_proj            # Output projection (attention)
    - gate_proj         # MLP gating
    - up_proj           # MLP up-projection
    - down_proj         # MLP down-projection
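For orientation, here is a minimal sketch of how a configuration like this maps onto the Hugging Face peft and bitsandbytes APIs. This assumes that stack is what the trainer uses under the hood; malagent/sft/trainer.py may wire it differently.
# Sketch only: 4-bit base model plus LoRA adapters via transformers/peft/bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # frozen base weights quantized to 4-bit (the "Q" in QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B",
    quantization_config=bnb,
    trust_remote_code=True,
)
lora = LoraConfig(
    r=64, lora_alpha=128, lora_dropout=0.05,
    bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # reports trainable vs. total parameter counts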
Parameter Choices
| Parameter | Low | Default | High | Notes |
|---|---|---|---|---|
| r | 8 | 64 | 256 | Higher = more capacity, more memory |
| alpha | 16 | 128 | 512 | Typically 2×r |
| dropout | 0.0 | 0.05 | 0.1 | Higher = more regularization |
Why These Defaults?
- r=64: A good balance between capacity and efficiency; lower ranks (8-16) may underfit complex code patterns.
- alpha=128: The adapter update is scaled by alpha/r, so the standard 2× ratio keeps the effective update magnitude stable as the rank changes (see the sketch after this list).
- All projection layers: Code generation benefits from adapting the MLP projections as well as the attention projections, not attention alone.
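The alpha/r scaling is easiest to see in the standard LoRA forward pass. A generic illustration, not malagent-specific code and with purely illustrative dimensions:
import torch

d_in, d_out, r, alpha = 3584, 3584, 64, 128   # illustrative sizes for one square projection
scaling = alpha / r                           # 2.0 with the defaults above

W = torch.randn(d_out, d_in) * 0.02           # frozen base weight
A = torch.randn(r, d_in) * 0.01               # LoRA "A" (down-projection), trained
B = torch.zeros(d_out, r)                     # LoRA "B" (up-projection), zero-initialized

x = torch.randn(d_in)
h = W @ x + scaling * (B @ (A @ x))           # adapted output; B starts at zero, so behavior is unchanged at init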
Trainable Parameters
Full 7B model: 7,000,000,000 parameters
With LoRA (r=64): 41,943,040 parameters (0.6%)
Memory savings: ~85%
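The adapter count comes from summing r × (d_in + d_out) over every adapted linear layer, so the exact total depends on the base model's layer shapes. A hypothetical helper for the per-layer arithmetic:
# Each adapted Linear(d_in -> d_out) gains an A matrix (r x d_in) and a
# B matrix (d_out x r), i.e. r * (d_in + d_out) trainable parameters.
def lora_param_count(d_in: int, d_out: int, r: int = 64) -> int:
    return r * (d_in + d_out)

# e.g. one square 3584x3584 attention projection at r=64:
print(lora_param_count(3584, 3584))   # 458752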
Full Configuration
# configs/sft_config.yaml
sft:
  model:
    name: "Qwen/Qwen2.5-Coder-7B"
    trust_remote_code: true
    torch_dtype: "bfloat16"
    attn_implementation: "flash_attention_2"
  data:
    path: "data/sft_examples.jsonl"
    max_length: 2048
    train_split: 0.9
    shuffle: true
    seed: 42
  lora:
    r: 64
    alpha: 128
    dropout: 0.05
    bias: "none"
    task_type: "CAUSAL_LM"
    target_modules:
      - q_proj
      - k_proj
      - v_proj
      - o_proj
      - gate_proj
      - up_proj
      - down_proj
  training:
    epochs: 3
    batch_size: 2
    gradient_accumulation_steps: 16
    learning_rate: 2e-5
    warmup_ratio: 0.03
    weight_decay: 0.01
    lr_scheduler_type: "cosine"
    bf16: true
    gradient_checkpointing: true
    # Checkpointing
    save_steps: 500
    save_total_limit: 3
    logging_steps: 10
    eval_steps: 500
    # Early stopping
    early_stopping_patience: 3
    early_stopping_threshold: 0.01
  output:
    dir: "output/sft"
    push_to_hub: false
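If the trainer is built on the Hugging Face Trainer API (an assumption; malagent/sft/trainer.py may differ), the training block corresponds roughly to:
# Sketch: mapping the `training` section onto transformers' TrainingArguments.
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="output/sft",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,        # 2 x 16 = 32 effective batch size
    learning_rate=2e-5,
    warmup_ratio=0.03,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    save_steps=500,
    save_total_limit=3,
    logging_steps=10,
    eval_strategy="steps",                 # "evaluation_strategy" in older transformers releases
    eval_steps=500,
    load_best_model_at_end=True,           # required for early stopping on eval loss
    metric_for_best_model="eval_loss",
)
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=3,
    early_stopping_threshold=0.01,
)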
Training Output
SUPERVISED FINE-TUNING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Model: Qwen/Qwen2.5-Coder-7B
Parameters: 7.07B total, 41.9M trainable (0.6%)
Memory: ~58GB estimated
Data: data/sft_examples.jsonl
Total: 550 examples
Train: 495 (90%)
Eval: 55 (10%)
Configuration:
Epochs: 3
Batch size: 2 × 16 accumulation = 32 effective
Learning rate: 2e-5 (cosine schedule)
Training...
[Epoch 1/3] ████████████████████ 100%
Train loss: 1.16 → 0.72
Eval loss: 0.78
Time: 6.5 hours
[Epoch 2/3] ████████████████████ 100%
Train loss: 0.72 → 0.45
Eval loss: 0.52
Time: 6.5 hours
[Epoch 3/3] ████████████████████ 100%
Train loss: 0.45 → 0.32
Eval loss: 0.34
Time: 6.5 hours
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Final eval loss: 0.34
Total duration: 19.5 hours
Saved to: output/sft/final
Resuming Training
From Checkpoint
# Resume from specific checkpoint
python malagent/sft/trainer.py \
--resume-from output/sft/checkpoint-1500
Automatic Resume
# Auto-detect latest checkpoint
python malagent/sft/trainer.py --resume
Monitoring
Loss Curves
Good training shows:
- Steady decrease in training loss
- Eval loss following (not diverging)
- No sudden spikes
Loss over time:
1.2 ┤
1.0 ┤╲
0.8 ┤ ╲──╮
0.6 ┤ ╲──╮
0.4 ┤ ╲──────
0.2 ┤
└──────────────────
Epoch 1 2 3
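If the underlying trainer is Hugging Face's (an assumption), the raw numbers behind a curve like this live in each checkpoint's trainer_state.json and can be pulled out directly:
# Read train/eval loss history from a saved checkpoint (HF Trainer layout assumed).
import json
from pathlib import Path

state = json.loads(Path("output/sft/checkpoint-1500/trainer_state.json").read_text())
train = [(e["step"], e["loss"]) for e in state["log_history"] if "loss" in e]
evals = [(e["step"], e["eval_loss"]) for e in state["log_history"] if "eval_loss" in e]
print("last train points:", train[-3:])
print("last eval points:", evals[-3:])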
Warning Signs
| Symptom | Cause | Fix |
|---|---|---|
| Eval loss increasing | Overfitting | Reduce epochs, increase dropout |
| Loss not decreasing | LR too low/high | Try 1e-5 or 1e-4 |
| Spiky loss | Batch too small | Increase gradient accumulation |
| Very slow training | Checkpointing overhead | Increase save_steps (save less often) |
Evaluation After SFT
Before proceeding to RAFT, evaluate the SFT model:
# Quick compile rate check
python malagent/benchmark/evaluate.py \
--model output/sft/final \
--prompts data/eval_prompts.jsonl \
--samples 4
# Expected output
Compile rate: 18.5% (baseline was ~5%)
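Compile rate is simply the fraction of generated samples the compiler accepts. A simplified, hypothetical check for intuition (not malagent's actual evaluator; it assumes a g++-style compiler on PATH):
# Hypothetical compile-rate check: attempt to compile each sample, count successes.
import os
import subprocess
import tempfile

def compiles(cpp_source: str, compiler: str = "g++") -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".cpp", delete=False) as f:
        f.write(cpp_source)
        path = f.name
    try:
        result = subprocess.run([compiler, "-c", path, "-o", os.devnull],
                                capture_output=True, timeout=60)
        return result.returncode == 0
    finally:
        os.unlink(path)

samples = ["int main() { return 0; }", "int main( {"]   # one valid, one broken
rate = sum(compiles(s) for s in samples) / len(samples)
print(f"Compile rate: {rate:.1%}")                       # 50.0% for this toy set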
Expected Results
| Metric | Before SFT | After SFT | Target |
|---|---|---|---|
| Compile Rate | ~5% | 15-25% | >15% |
| Loss | 1.2 | 0.3-0.4 | <0.5 |
| Duration | - | ~20 hours | - |
Next Steps
After SFT completes successfully:
- Evaluate — Verify compile rate is >15%
- Proceed to RAFT — continue with the RAFT Training guide
- Save baseline — Keep SFT checkpoint for comparison