Troubleshooting
Common issues and solutions
Connection Issues
SSH Connection Refused
Error:
paramiko.ssh_exception.NoValidConnectionsError:
[Errno None] Unable to connect to port 22 on 10.0.0.152
Solutions:
# On Windows DEVBOX - Check SSH service
Get-Service sshd
# If stopped:
Start-Service sshd
Set-Service -Name sshd -StartupType 'Automatic'
# Check firewall
Get-NetFirewallRule -Name *ssh*
# If missing:
New-NetFirewallRule -Name sshd -DisplayName 'OpenSSH Server (sshd)' `
-Enabled True -Direction Inbound -Protocol TCP -Action Allow -LocalPort 22
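If the service and firewall checks pass but the trainer still cannot connect, a quick Python probe separates plain TCP reachability from SSH-level failures. A minimal sketch, with the host, user, and key path taken from the examples above:
import os
import socket
import paramiko

HOST, PORT = "10.0.0.152", 22

# 1. Raw TCP check: if this fails, the problem is routing, the firewall,
#    or a stopped sshd - not authentication.
try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print("TCP port 22 reachable")
except OSError as exc:
    print(f"TCP connect failed: {exc}")

# 2. Full SSH handshake with the same key the trainer uses.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
try:
    client.connect(HOST, port=PORT, username="user",
                   key_filename=os.path.expanduser("~/.ssh/win"), timeout=10)
    print("SSH connection OK")
finally:
    client.close()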
SSH Authentication Failed
Error:
paramiko.ssh_exception.AuthenticationException: Authentication failed.
Solutions:
# Test connection manually
ssh -v -i ~/.ssh/win user@10.0.0.152
# Check key permissions
chmod 600 ~/.ssh/win
chmod 644 ~/.ssh/win.pub
# On Windows, verify authorized_keys
# Must be in: C:\Users\<user>\.ssh\authorized_keys
# For admin users: C:\ProgramData\ssh\administrators_authorized_keys
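To rule out a damaged or mismatched key, a short sketch checks the private key's permissions and prints its public blob so it can be matched against the authorized_keys entry on the devbox (key path as in the example above):
import os
import stat
import paramiko

KEY = os.path.expanduser("~/.ssh/win")

# OpenSSH ignores private keys that are readable by group/other.
mode = stat.S_IMODE(os.stat(KEY).st_mode)
print(f"key mode: {oct(mode)} (expected 0o600)")

# Load the key and print the start of its public base64 blob; this string
# should appear in authorized_keys / administrators_authorized_keys.
for cls in (paramiko.Ed25519Key, paramiko.ECDSAKey, paramiko.RSAKey):
    try:
        key = cls.from_private_key_file(KEY)
        print(cls.__name__, key.get_base64()[:24], "...")
        break
    except paramiko.SSHException:
        continue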
SSH Timeout During Compilation
Error:
socket.timeout: timed out
Solutions:
# Increase timeout in config
windows:
  timeout:
    connection: 30    # Was 10
    compilation: 60   # Was 30
# Check network latency
ping -c 10 10.0.0.152
# Check for slow compilations
ssh user@devbox "time cl /c test.cpp"
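To separate network latency from compile time, time a trivial remote command through the same paramiko path the trainer uses. A sketch, with host, user, and key as in the earlier examples:
import os
import time
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("10.0.0.152", username="user",
               key_filename=os.path.expanduser("~/.ssh/win"),
               timeout=30)                      # windows.timeout.connection

start = time.monotonic()
stdin, stdout, stderr = client.exec_command("echo ok", timeout=60)
stdout.channel.recv_exit_status()               # block until the command finishes
print(f"trivial command round trip: {time.monotonic() - start:.2f}s")
client.close()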
Elastic Connection Failed
Error:
elasticsearch.exceptions.ConnectionError:
ConnectionError(<urllib3.connection.HTTPSConnection>:
Failed to establish a new connection)
Solutions:
# Check Elasticsearch is running
curl -sk -u elastic:password https://localhost:9200/_cluster/health | jq
# Check from training host
curl -sk -u elastic:password https://10.0.20.145:9200/_cluster/health
# Check Docker containers
docker ps | grep -E "(elasticsearch|kibana|fleet)"
# Restart if needed
cd elastic-container && docker compose restart
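The same health check can be run through the Python client the verifier uses (assuming the elasticsearch 8.x client; verify_certs=False mirrors curl's -k for the lab's self-signed certificate):
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://10.0.20.145:9200",
    basic_auth=("elastic", "password"),
    verify_certs=False,
    request_timeout=10,
)
print(es.ping())                          # False if the host is unreachable
if es.ping():
    print(es.cluster.health()["status"])  # green / yellow / red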
Elastic Authentication Failed
Error:
elasticsearch.exceptions.AuthenticationException:
AuthenticationException(401, 'security_exception')
Solutions:
# Reset elastic password
docker exec -it elasticsearch bin/elasticsearch-reset-password -u elastic
# Update config
vim configs/elastic_verifier.yaml
# Set new password
# Test authentication
curl -sk -u elastic:NEW_PASSWORD https://localhost:9200/
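To confirm the updated config and the cluster agree, a short sketch reads the credentials back from the config and authenticates with them. The YAML keys shown are hypothetical; adjust them to the real layout of elastic_verifier.yaml:
import yaml
from elasticsearch import Elasticsearch

cfg = yaml.safe_load(open("configs/elastic_verifier.yaml"))

# Keys below ("elastic", "url", "username", "password") are assumptions.
es = Elasticsearch(
    cfg["elastic"]["url"],
    basic_auth=(cfg["elastic"]["username"], cfg["elastic"]["password"]),
    verify_certs=False,
)
print(es.security.authenticate())   # raises AuthenticationException (401) if the password is stale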
Training Issues
Out of Memory (OOM)
Error:
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 2.00 GiB (GPU 0; 127.98 GiB total capacity;
118.45 GiB already allocated; 1.23 GiB free)
Solutions:
# Reduce batch size
training:
  batch_size: 1                    # Minimum
  gradient_accumulation_steps: 32  # Compensate
  gradient_checkpointing: true     # Trade compute for memory
generation:
  batch_size: 4                    # Reduce from 8
# Add explicit cleanup between phases
import gc
import torch
gc.collect()
torch.cuda.empty_cache()
# Close memory-heavy applications before training
# Cursor IDE uses significant memory
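When tuning these settings, it helps to log GPU memory between the generation and training phases. A small helper, assuming a single visible GPU:
import gc
import torch

def log_gpu_memory(tag: str) -> None:
    # Allocated = tensors currently live; reserved = what the caching
    # allocator is holding from the driver.
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[{tag}] allocated={alloc:.1f} GiB, reserved={reserved:.1f} GiB")

log_gpu_memory("before cleanup")
gc.collect()
torch.cuda.empty_cache()
log_gpu_memory("after cleanup")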
OOM Kills Entire System
Error:
Killed (exit code 137)
# Or system becomes unresponsive
Solutions:
# Monitor memory during training
watch -n 1 "free -h"
# Protect the training process from the OOM killer (writing a negative score requires root)
echo -1000 | sudo tee /proc/$(pgrep -f raft_trainer)/oom_score_adj
# Use swap as backup
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
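Before relying on swap, a minimal watchdog (assuming psutil is installed) gives warning before the kernel's OOM killer steps in, so a run can be paused or checkpointed instead of killed:
import time
import psutil   # assumed available: pip install psutil

THRESHOLD = 90.0   # percent of system RAM

while True:
    used = psutil.virtual_memory().percent
    if used > THRESHOLD:
        print(f"WARNING: system memory at {used:.0f}% - "
              f"reduce batch size or close other applications")
    time.sleep(5)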
Loss Not Decreasing
Symptoms:
- Training loss stays flat
- Eval loss increases
- Very slow progress
Solutions:
# Try different learning rates
training:
  learning_rate: 1e-5     # Lower (try first)
  # or
  # learning_rate: 1e-4   # Higher

# Check warmup
training:
  warmup_ratio: 0.1       # More warmup
# Verify data is loaded correctly
python -c "
from malagent.sft.trainer import SFTTrainer
t = SFTTrainer()
ds = t.load_dataset('data/sft_examples.jsonl')
print(f'Loaded {len(ds[0])} examples')
print(ds[0][0]) # Print first example
"
Very Low Compile Rate (<5%)
Symptoms:
- Almost no samples pass compilation
- Training makes no progress
Solutions:
# Check compiler errors for patterns
grep -h "error C" output/raft/cycle_1/samples.jsonl | \
sed 's/.*error C[0-9]*://' | sort | uniq -c | sort -rn | head -20
# Common patterns:
# - "undeclared identifier" → Missing #include
# - "unresolved external" → Missing library linkage
# - "syntax error" → Bad code structure
# Verify MSVC configuration
compiler:
  vcvars_path: "C:\\Program Files\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Auxiliary\\Build\\vcvars64.bat"
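To dig deeper than the grep pipeline above, the same error tally can be scripted in Python and extended with per-error examples. The compiler_output field name is an assumption; adjust it to the actual samples.jsonl schema:
import json
import re
from collections import Counter

errors = Counter()
with open("output/raft/cycle_1/samples.jsonl") as fh:
    for line in fh:
        sample = json.loads(line)
        # "compiler_output" is a hypothetical field name.
        for code in re.findall(r"error (C\d+)", sample.get("compiler_output", "")):
            errors[code] += 1

for code, count in errors.most_common(20):
    print(f"{count:5d}  {code}")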
Model Loading Failure
Error:
OSError: Can't load tokenizer for 'output/sft/final'.
If you were trying to load it from 'https://huggingface.co/models'...
Solutions:
# Check checkpoint contents
ls -la output/sft/final/
# Should contain: tokenizer.json, tokenizer_config.json, adapter_model.safetensors
# If tokenizer missing, copy from base model
cp ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-Coder-7B/snapshots/*/tokenizer* output/sft/final/
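After copying, loading the tokenizer strictly from the local checkpoint confirms the files are sufficient:
from transformers import AutoTokenizer

# local_files_only avoids silently falling back to the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("output/sft/final", local_files_only=True)
print(type(tok).__name__, tok.vocab_size)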
Checkpoint Loading Mismatch
Error:
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.layers.0.self_attn.q_proj.lora_A.default.weight
Solutions:
# Verify LoRA config matches
cat output/sft/final/adapter_config.json | jq '.r, .lora_alpha'
# Must match current config
# If mismatch, start fresh or update config
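A quick comparison script catches the mismatch before a full load attempt. The configs/sft.yaml path and its lora key are assumptions; point them at the real training config:
import json
import yaml

adapter = json.load(open("output/sft/final/adapter_config.json"))

# "configs/sft.yaml" and the "lora" key are hypothetical.
current = yaml.safe_load(open("configs/sft.yaml"))["lora"]

for key in ("r", "lora_alpha"):
    if adapter[key] != current[key]:
        print(f"mismatch on {key}: checkpoint={adapter[key]} config={current[key]}")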
Verification Issues
No Alerts Appearing
Symptoms:
- All samples get reward=1.0 (evaded)
- But samples should be detected
Solutions:
# Check Elastic Agent status in Fleet
# Kibana → Fleet → Agents → Check status
# Verify detection rules are enabled
# Kibana → Security → Rules → Check enabled count
# Test alert generation manually
curl -sk -u elastic:password \
"https://localhost:9200/.alerts-security.alerts-default/_count" | jq
# Increase detection timeout
detection:
  timeout: 180        # Increase from 120
  poll_interval: 10   # More frequent polling
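The manual alert check can also be scripted so it can be polled while a batch of samples executes (same index and credentials as the curl query above):
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200",
                   basic_auth=("elastic", "password"),
                   verify_certs=False)

# Zero alerts while samples are clearly executing usually means disabled
# rules or an unhealthy Elastic Agent, not successful evasion.
resp = es.count(index=".alerts-security.alerts-default")
print("alerts:", resp["count"])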
Samples Not Executing on VMs
Error:
Execution failed: Binary not found on VM
Solutions:
# Test SSH connection to VM
ssh -i ~/.ssh/win labuser@10.0.20.101 "whoami"
# Check staging directory exists
ssh labuser@vm-ip "if exist C:\MalwareStaging\samples echo OK"
# Check transfer succeeded
ssh labuser@vm-ip "dir C:\MalwareStaging\samples"
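An SFTP listing over the same SSH credentials checks both the transfer and the staging path without worrying about cmd quoting (host, user, and path taken from the examples above):
import os
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("10.0.20.101", username="labuser",
               key_filename=os.path.expanduser("~/.ssh/win"))

sftp = client.open_sftp()
try:
    # FileNotFoundError here means the staging directory was never created
    # or the transfer step failed silently.
    for name in sftp.listdir("C:/MalwareStaging/samples"):
        print(name)
finally:
    sftp.close()
    client.close()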
All Samples Getting Same Reward
Symptoms:
- Every sample gets exactly 0.5 or 0.6
- No variation in rewards
Solutions:
# Check correlation is working
# Add debug logging
import logging
logging.getLogger('malagent.verifiers').setLevel(logging.DEBUG)
# Verify sample IDs are unique
# Check alerts include sample correlation markers
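A one-pass tally of rewards makes the lack of variation obvious. The reward field name and the samples.jsonl path are assumptions; adjust them to the real output layout:
import json
from collections import Counter

rewards = Counter()
with open("output/raft/cycle_1/samples.jsonl") as fh:
    for line in fh:
        # "reward" is a hypothetical field name.
        rewards[json.loads(line).get("reward")] += 1

print(rewards)   # a single key means the verifier is not discriminating between samples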
Detection Timeout Exceeded
Error:
TimeoutError: Detection polling exceeded 120s
Solutions:
# Increase timeout
detection:
  timeout: 180
  rule_delay: 45   # More initial delay
# Check Elasticsearch performance
curl -sk -u elastic:password \
"https://localhost:9200/_cluster/stats?human" | jq '.indices'
Environment Issues
ROCm Not Detected
Error:
RuntimeError: No HIP GPUs are available
Solutions:
# Verify ROCm installation
rocm-smi
# Check permissions
groups # Should include 'video' and 'render'
sudo usermod -aG video,render $USER
# Logout and login again
# Inside toolbox, check device access
ls -la /dev/kfd /dev/dri
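From inside the toolbox, a quick PyTorch check shows whether the ROCm build sees the GPU at all (HIP devices are exposed through the torch.cuda API):
import torch

print(torch.__version__)            # ROCm wheels carry a +rocmX.Y suffix
print(torch.version.hip)            # None on CPU-only or CUDA builds
print(torch.cuda.is_available())    # False -> /dev/kfd, /dev/dri, or group issue
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))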
Wrong PyTorch Version
Error:
ImportError: cannot import name 'flash_attn_func' from 'flash_attn'
Solutions:
# Check PyTorch ROCm build
python -c "import torch; print(torch.version.hip)"
# Reinstall if wrong
pip install torch --index-url https://download.pytorch.org/whl/rocm6.0
Module Not Found
Error:
ModuleNotFoundError: No module named 'malagent'
Solutions:
# Install in development mode
cd /path/to/malagent
pip install -e .
# Verify installation
python -c "from malagent.cli import main; print('OK')"
# Check PYTHONPATH if needed
export PYTHONPATH=/path/to/malagent:$PYTHONPATH
Debug Mode
Enable debug logging for detailed troubleshooting:
# Set environment variable
export MALAGENT_LOG_LEVEL=DEBUG
# Or in code
import logging
logging.basicConfig(level=logging.DEBUG)
logging.getLogger('malagent').setLevel(logging.DEBUG)
Getting Help
If issues persist:
Collect logs:
MALAGENT_LOG_LEVEL=DEBUG python malagent/training/raft_trainer.py 2>&1 | tee debug.log
Check component status:
./toolbox/verify.sh
Open an issue on GitHub with:
- Error message and full traceback
- Configuration (redact passwords)
- Hardware specifications
- Steps to reproduce