SFT Phase
Supervised fine-tuning to establish baseline
Purpose
SFT (Supervised Fine-Tuning) establishes baseline capability before RAFT refinement. The model learns Windows API patterns, C++ syntax, and technique-specific code structures.
Goals
- Windows API proficiency — Correct use of handles, types, and calling conventions
- C++ syntax baseline — Valid, compilable code structure
- Technique vocabulary — Familiarity with syscalls, injection, and evasion patterns
Without SFT
Starting RAFT from a raw base model results in:
- Very low compile rates (~5%)
- Poor code structure
- Slow convergence
- Wasted compute on invalid samples
Memory Requirements
| Model Size | Weights (bf16) | Optimizer (AdamW) | Gradients | Total |
|---|---|---|---|---|
| 3B | ~6GB | ~12GB | ~6GB | ~28GB |
| 7B | ~14GB | ~28GB | ~14GB | ~60GB |
| 14B | ~28GB | ~56GB | ~28GB | ~120GB |
These figures correspond to full-parameter fine-tuning (optimizer state and gradients for every weight); with the QLoRA setup described below, only the small adapters carry optimizer state, so the practical footprint is much lower. Strix Halo (128GB unified memory): comfortable for 7B, possible for 14B with offloading.
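The arithmetic behind the table is plain bytes-per-parameter. A back-of-envelope check (a sketch: the per-component byte counts are read off the table's columns, and the table's totals leave extra headroom for activations and framework overhead):
# Rough memory arithmetic for full fine-tuning: bf16 weights and gradients at
# 2 bytes/parameter, AdamW state at ~4 bytes/parameter (as the table implies).
# 1e9 parameters x 1 byte is ~1 GB, so billions of parameters map directly to GB.
def full_ft_memory_gb(params_billion: float) -> dict:
    bytes_per_param = {"weights": 2, "optimizer": 4, "gradients": 2}
    return {name: params_billion * b for name, b in bytes_per_param.items()}

for size in (3, 7, 14):
    parts = full_ft_memory_gb(size)
    print(f"{size}B: {parts} -> ~{sum(parts.values()):.0f} GB before overhead")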
Running SFT
Basic Usage
python malagent/sft/trainer.py \
--model Qwen/Qwen2.5-Coder-7B \
--data data/sft_examples.jsonl \
--output output/sft \
--epochs 3
With Custom Config
python malagent/sft/trainer.py \
--config configs/sft_config.yaml
QLoRA Configuration
QLoRA (Quantized Low-Rank Adaptation) keeps the base model frozen in 4-bit precision and trains only small low-rank adapter matrices, cutting trainable parameters and optimizer memory to a fraction of full fine-tuning.
Configuration Explained
lora:
  r: 64                 # Rank: dimensionality of adapter matrices
  alpha: 128            # Scaling factor: typically 2×r
  dropout: 0.05         # Regularization during training
  target_modules:       # Which layers to adapt
    - q_proj            # Query projection (attention)
    - k_proj            # Key projection (attention)
    - v_proj            # Value projection (attention)
    - o_proj            # Output projection (attention)
    - gate_proj         # MLP gating
    - up_proj           # MLP up-projection
    - down_proj         # MLP down-projection
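For orientation, here is a minimal sketch of how a configuration like this maps onto the Hugging Face peft and bitsandbytes APIs. This assumes that stack is what the trainer uses under the hood; malagent/sft/trainer.py may wire it differently.
# Sketch only: 4-bit base model plus LoRA adapters via transformers/peft/bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # frozen base weights quantized to 4-bit (the "Q" in QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B",
    quantization_config=bnb,
    trust_remote_code=True,
)
lora = LoraConfig(
    r=64, lora_alpha=128, lora_dropout=0.05,
    bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # reports trainable vs. total parameter counts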
Parameter Choices
| Parameter | Low | Default | High | Notes |
|---|---|---|---|---|
| r | 8 | 64 | 256 | Higher = more capacity, more memory |
| alpha | 16 | 128 | 512 | Typically 2×r |
| dropout | 0.0 | 0.05 | 0.1 | Higher = more regularization |
Why These Defaults?
- r=64: A good balance between capacity and efficiency; lower ranks (8-16) may underfit complex code patterns.
- alpha=128: The adapter update is scaled by alpha/r, so the standard 2× ratio keeps the effective update magnitude stable as the rank changes (see the sketch after this list).
- All projection layers: Code generation benefits from adapting the MLP projections as well as the attention projections, not attention alone.
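The alpha/r scaling is easiest to see in the standard LoRA forward pass. A generic illustration, not malagent-specific code and with purely illustrative dimensions:
import torch

d_in, d_out, r, alpha = 3584, 3584, 64, 128   # illustrative sizes for one square projection
scaling = alpha / r                           # 2.0 with the defaults above

W = torch.randn(d_out, d_in) * 0.02           # frozen base weight
A = torch.randn(r, d_in) * 0.01               # LoRA "A" (down-projection), trained
B = torch.zeros(d_out, r)                     # LoRA "B" (up-projection), zero-initialized

x = torch.randn(d_in)
h = W @ x + scaling * (B @ (A @ x))           # adapted output; B starts at zero, so behavior is unchanged at init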
Trainable Parameters
Full 7B model: 7,000,000,000 parameters
With LoRA (r=64): 41,943,040 parameters (0.6%)
Memory savings: ~85%
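The adapter count comes from summing r × (d_in + d_out) over every adapted linear layer, so the exact total depends on the base model's layer shapes. A hypothetical helper for the per-layer arithmetic:
# Each adapted Linear(d_in -> d_out) gains an A matrix (r x d_in) and a
# B matrix (d_out x r), i.e. r * (d_in + d_out) trainable parameters.
def lora_param_count(d_in: int, d_out: int, r: int = 64) -> int:
    return r * (d_in + d_out)

# e.g. one square 3584x3584 attention projection at r=64:
print(lora_param_count(3584, 3584))   # 458752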
Full Configuration
# configs/sft_config.yaml
sft:
  model:
    name: "Qwen/Qwen2.5-Coder-7B"
    trust_remote_code: true
    torch_dtype: "bfloat16"
    attn_implementation: "flash_attention_2"
  data:
    path: "data/sft_examples.jsonl"
    max_length: 2048
    train_split: 0.9
    shuffle: true
    seed: 42
  lora:
    r: 64
    alpha: 128
    dropout: 0.05
    bias: "none"
    task_type: "CAUSAL_LM"
    target_modules:
      - q_proj
      - k_proj
      - v_proj
      - o_proj
      - gate_proj
      - up_proj
      - down_proj
  training:
    epochs: 3
    batch_size: 2
    gradient_accumulation_steps: 16
    learning_rate: 2e-5
    warmup_ratio: 0.03
    weight_decay: 0.01
    lr_scheduler_type: "cosine"
    bf16: true
    gradient_checkpointing: true
    # Checkpointing
    save_steps: 500
    save_total_limit: 3
    logging_steps: 10
    eval_steps: 500
    # Early stopping
    early_stopping_patience: 3
    early_stopping_threshold: 0.01
  output:
    dir: "output/sft"
    push_to_hub: false
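If the trainer is built on the Hugging Face Trainer API (an assumption; malagent/sft/trainer.py may differ), the training block corresponds roughly to:
# Sketch: mapping the `training` section onto transformers' TrainingArguments.
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="output/sft",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,        # 2 x 16 = 32 effective batch size
    learning_rate=2e-5,
    warmup_ratio=0.03,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    save_steps=500,
    save_total_limit=3,
    logging_steps=10,
    eval_strategy="steps",                 # "evaluation_strategy" in older transformers releases
    eval_steps=500,
    load_best_model_at_end=True,           # required for early stopping on eval loss
    metric_for_best_model="eval_loss",
)
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=3,
    early_stopping_threshold=0.01,
)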
Training Output
SUPERVISED FINE-TUNING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Model: Qwen/Qwen2.5-Coder-7B
Parameters: 7.07B total, 41.9M trainable (0.6%)
Memory: ~58GB estimated
Data: data/sft_examples.jsonl
Total: 550 examples
Train: 495 (90%)
Eval: 55 (10%)
Configuration:
Epochs: 3
Batch size: 2 × 16 accumulation = 32 effective
Learning rate: 2e-5 (cosine schedule)
Training...
[Epoch 1/3] ████████████████████ 100%
Train loss: 1.16 → 0.72
Eval loss: 0.78
Time: 6.5 hours
[Epoch 2/3] ████████████████████ 100%
Train loss: 0.72 → 0.45
Eval loss: 0.52
Time: 6.5 hours
[Epoch 3/3] ████████████████████ 100%
Train loss: 0.45 → 0.32
Eval loss: 0.34
Time: 6.5 hours
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Final eval loss: 0.34
Total duration: 19.5 hours
Saved to: output/sft/final
Resuming Training
From Checkpoint
# Resume from specific checkpoint
python malagent/sft/trainer.py \
--resume-from output/sft/checkpoint-1500
Automatic Resume
# Auto-detect latest checkpoint
python malagent/sft/trainer.py --resume
Monitoring
Loss Curves
Good training shows:
- Steady decrease in training loss
- Eval loss following (not diverging)
- No sudden spikes
Loss over time:
1.2 ┤
1.0 ┤╲
0.8 ┤ ╲──╮
0.6 ┤ ╲──╮
0.4 ┤ ╲──────
0.2 ┤
└──────────────────
Epoch 1 2 3
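If the underlying trainer is Hugging Face's (an assumption), the raw numbers behind a curve like this live in each checkpoint's trainer_state.json and can be pulled out directly:
# Read train/eval loss history from a saved checkpoint (HF Trainer layout assumed).
import json
from pathlib import Path

state = json.loads(Path("output/sft/checkpoint-1500/trainer_state.json").read_text())
train = [(e["step"], e["loss"]) for e in state["log_history"] if "loss" in e]
evals = [(e["step"], e["eval_loss"]) for e in state["log_history"] if "eval_loss" in e]
print("last train points:", train[-3:])
print("last eval points:", evals[-3:])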
Warning Signs
| Symptom | Cause | Fix |
|---|---|---|
| Eval loss increasing | Overfitting | Reduce epochs, increase dropout |
| Loss not decreasing | LR too low/high | Try 1e-5 or 1e-4 |
| Spiky loss | Batch too small | Increase gradient accumulation |
| Very slow training | Checkpointing overhead | Increase save_steps (save less often) |
Evaluation After SFT
Before proceeding to RAFT, evaluate the SFT model:
# Quick compile rate check
python malagent/benchmark/evaluate.py \
--model output/sft/final \
--prompts data/eval_prompts.jsonl \
--samples 4
# Expected output
Compile rate: 18.5% (baseline was ~5%)
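Compile rate is simply the fraction of generated samples the compiler accepts. A simplified, hypothetical check for intuition (not malagent's actual evaluator; it assumes a g++-style compiler on PATH):
# Hypothetical compile-rate check: attempt to compile each sample, count successes.
import os
import subprocess
import tempfile

def compiles(cpp_source: str, compiler: str = "g++") -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".cpp", delete=False) as f:
        f.write(cpp_source)
        path = f.name
    try:
        result = subprocess.run([compiler, "-c", path, "-o", os.devnull],
                                capture_output=True, timeout=60)
        return result.returncode == 0
    finally:
        os.unlink(path)

samples = ["int main() { return 0; }", "int main( {"]   # one valid, one broken
rate = sum(compiles(s) for s in samples) / len(samples)
print(f"Compile rate: {rate:.1%}")                       # 50.0% for this toy set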
Expected Results
| Metric | Before SFT | After SFT | Target |
|---|---|---|---|
| Compile Rate | ~5% | 15-25% | >15% |
| Loss | 1.2 | 0.3-0.4 | <0.5 |
| Duration | - | ~20 hours | - |
Next Steps
After SFT completes successfully:
- Evaluate — Verify compile rate is >15%
- Proceed to RAFT — continue with the RAFT Training guide
- Save baseline — Keep SFT checkpoint for comparison