Troubleshooting

Common issues and solutions

Connection Issues

SSH Connection Refused

Error:

paramiko.ssh_exception.NoValidConnectionsError: 
  [Errno None] Unable to connect to port 22 on 10.0.0.152

Solutions:

# On Windows DEVBOX - Check SSH service
Get-Service sshd
# If stopped:
Start-Service sshd
Set-Service -Name sshd -StartupType 'Automatic'

# Check firewall
Get-NetFirewallRule -Name *ssh*
# If missing:
New-NetFirewallRule -Name sshd -DisplayName 'OpenSSH Server (sshd)' `
    -Enabled True -Direction Inbound -Protocol TCP -Action Allow -LocalPort 22
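
Before changing services or firewall rules, it can help to confirm from the training host whether port 22 is reachable at all. The following is a minimal sketch using only the Python standard library; the function name `port_is_open` is hypothetical and not part of malagent.

```python
import socket


def port_is_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call (10.0.0.152 is the address from the error message above):
# port_is_open("10.0.0.152", 22)
```

A `False` result with the sshd service running usually points at the firewall or routing rather than at SSH itself.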

SSH Authentication Failed

Error:

paramiko.ssh_exception.AuthenticationException: Authentication failed.

Solutions:

# Test connection manually
ssh -v -i ~/.ssh/win user@10.0.0.152

# Check key permissions
chmod 600 ~/.ssh/win
chmod 644 ~/.ssh/win.pub

# On Windows, verify authorized_keys
# Must be in: C:\Users\<user>\.ssh\authorized_keys
# For admin users: C:\ProgramData\ssh\administrators_authorized_keys

SSH Timeout During Compilation

Error:

socket.timeout: timed out

Solutions:

# Increase timeout in config
windows:
  timeout:
    connection: 30    # Was 10
    compilation: 60   # Was 30

# Check network latency
ping -c 10 10.0.0.152

# Check for slow compilations (timed on the local side; includes SSH overhead)
time ssh user@devbox "cl /c test.cpp"
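
If ICMP is blocked and `ping` gives no answer, TCP handshake time to the SSH port is a rough substitute for checking network latency. This is a sketch under that assumption; `connect_latency_ms` is a hypothetical helper, not a malagent function.

```python
import socket
import time


def connect_latency_ms(host: str, port: int = 22, attempts: int = 5,
                       timeout: float = 3.0) -> list[float]:
    """Time several TCP handshakes to host:port and return latencies in ms."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        # Raises OSError if the host is unreachable or the port is closed.
        with socket.create_connection((host, port), timeout=timeout):
            samples.append((time.perf_counter() - start) * 1000.0)
    return samples


# Example call: connect_latency_ms("10.0.0.152", 22)
```

Consistently low handshake times combined with timeouts suggest the compilation step, not the network, is the slow part.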

Elastic Connection Failed

Error:

elasticsearch.exceptions.ConnectionError: 
  ConnectionError(<urllib3.connection.HTTPSConnection>: 
  Failed to establish a new connection)

Solutions:

# Check Elasticsearch is running
curl -sk -u elastic:password https://localhost:9200/_cluster/health | jq

# Check from training host
curl -sk -u elastic:password https://10.0.20.145:9200/_cluster/health

# Check Docker containers
docker ps | grep -E "(elasticsearch|kibana|fleet)"

# Restart if needed
cd elastic-container && docker compose restart
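
The `_cluster/health` response is JSON with a `status` field of `green`, `yellow`, or `red`. A small sketch of how a script might interpret the output of the curl commands above; `summarize_health` is a hypothetical name, and treating `yellow` as usable is an assumption that fits single-node setups where replicas cannot be assigned.

```python
import json


def summarize_health(raw: str) -> tuple[bool, str]:
    """Parse a _cluster/health response body and report whether it is usable."""
    health = json.loads(raw)
    status = health.get("status", "unknown")
    nodes = health.get("number_of_nodes", 0)
    # Assumption: yellow (unassigned replicas) is acceptable; red is not.
    usable = status in ("green", "yellow")
    return usable, f"status={status} nodes={nodes}"
```

For example, the saved output of the curl command can be passed in as a string: `summarize_health(open("health.json").read())`.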

Elastic Authentication Failed

Error:

elasticsearch.exceptions.AuthenticationException: 
  AuthenticationException(401, 'security_exception')

Solutions:

# Reset elastic password
docker exec -it elasticsearch bin/elasticsearch-reset-password -u elastic

# Update config
vim configs/elastic_verifier.yaml
# Set new password

# Test authentication
curl -sk -u elastic:NEW_PASSWORD https://localhost:9200/

Training Issues

Out of Memory (OOM)

Error:

torch.cuda.OutOfMemoryError: CUDA out of memory. 
Tried to allocate 2.00 GiB (GPU 0; 127.98 GiB total capacity; 
118.45 GiB already allocated; 1.23 GiB free)

Solutions:

# Reduce batch size
training:
  batch_size: 1                    # Minimum
  gradient_accumulation_steps: 32   # Compensate
  gradient_checkpointing: true      # Trade compute for memory

generation:
  batch_size: 4   # Reduce from 8

# Add explicit cleanup between phases
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

# Close memory-heavy applications before training
# Cursor IDE uses significant memory
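
The "Compensate" comment in the config above relies on the usual relationship between per-step batch size and gradient accumulation. A short sketch of that arithmetic; the helper names are hypothetical, and `num_devices` is an assumed parameter for multi-GPU runs.

```python
import math


def effective_batch_size(batch_size: int, gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return batch_size * gradient_accumulation_steps * num_devices


def accumulation_steps_for(target: int, batch_size: int,
                           num_devices: int = 1) -> int:
    """Smallest accumulation step count that reaches the target batch size."""
    return math.ceil(target / (batch_size * num_devices))
```

With the values shown above, `effective_batch_size(1, 32)` is 32, so a run that previously used a larger per-step batch keeps the same effective batch size by raising the accumulation steps accordingly.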

OOM Kills Entire System

Error:

Killed (exit code 137)
# Or system becomes unresponsive

Solutions:

# Monitor memory during training
watch -n 1 "free -h"

# Set OOM score adjustment (protects process; requires root)
echo -1000 | sudo tee /proc/$(pgrep -of raft_trainer)/oom_score_adj

# Use swap as backup
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
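
As an alternative to watching `free -h` by hand, a script can read `/proc/meminfo` and warn when available memory runs low. This is a minimal sketch; both function names are hypothetical, and any warning threshold is for the caller to choose.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {field: value}; sizes are in kB."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_fraction(text: str) -> float:
    """Fraction of total RAM the kernel reports as available."""
    info = parse_meminfo(text)
    return info["MemAvailable"] / info["MemTotal"]


# Example call on a Linux host:
# available_fraction(open("/proc/meminfo").read())
```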

Loss Not Decreasing

Symptoms:

  • Training loss stays flat
  • Eval loss increases
  • Very slow progress

Solutions:

# Try different learning rates
training:
  learning_rate: 1e-5     # Lower (try first)
  # learning_rate: 1e-4   # Or higher

# Check warmup
training:
  warmup_ratio: 0.1    # More warmup

# Verify data is loaded correctly
python -c "
from malagent.sft.trainer import SFTTrainer
t = SFTTrainer()
ds = t.load_dataset('data/sft_examples.jsonl')
print(f'Loaded {len(ds[0])} examples')
print(ds[0][0])  # Print first example
"
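
A flat loss can also come from a malformed data file, which is worth ruling out independently of the trainer. The sketch below checks that every non-blank line of a JSONL file parses as a JSON object; `check_jsonl` is a hypothetical helper and makes no assumption about which fields malagent expects.

```python
import json


def check_jsonl(path: str) -> tuple[int, list[int]]:
    """Return (valid record count, 1-based line numbers that failed to parse)."""
    valid, bad = 0, []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
                continue
            if isinstance(record, dict):
                valid += 1
            else:
                bad.append(lineno)  # valid JSON, but not an object
    return valid, bad


# Example call: check_jsonl("data/sft_examples.jsonl")
```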

Very Low Compile Rate (<5%)

Symptoms:

  • Almost no samples pass compilation
  • Training makes no progress

Solutions:

# Check compiler errors for patterns
grep -h "error C" output/raft/cycle_1/samples.jsonl | \
    sed 's/.*error C[0-9]*://' | sort | uniq -c | sort -rn | head -20

# Common patterns:
# - "undeclared identifier" → Missing #include
# - "unresolved external" → Missing library linkage
# - "syntax error" → Bad code structure

# Verify MSVC configuration
compiler:
  vcvars_path: "C:\\Program Files\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Auxiliary\\Build\\vcvars64.bat"

Model Loading Failure

Error:

OSError: Can't load tokenizer for 'output/sft/final'. 
If you were trying to load it from 'https://huggingface.co/models'...

Solutions:

# Check checkpoint contents
ls -la output/sft/final/
# Should contain: tokenizer.json, tokenizer_config.json, adapter_model.safetensors

# If tokenizer missing, copy from base model
cp ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-Coder-7B/snapshots/*/tokenizer* output/sft/final/

Checkpoint Loading Mismatch

Error:

RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
  size mismatch for base_model.model.layers.0.self_attn.q_proj.lora_A.default.weight

Solutions:

# Verify LoRA config matches
jq '.r, .lora_alpha' output/sft/final/adapter_config.json
# Must match current config

# If mismatch, start fresh or update config
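
The comparison can also be scripted. The sketch below reads `adapter_config.json` and reports every setting that differs from the values you expect; `lora_mismatches` is a hypothetical helper, and the expected values in the example call are placeholders, not malagent defaults.

```python
import json


def lora_mismatches(adapter_config_path: str, expected: dict) -> dict:
    """Return {key: (saved, expected)} for every LoRA setting that differs."""
    with open(adapter_config_path, encoding="utf-8") as fh:
        saved = json.load(fh)
    return {
        key: (saved.get(key), want)
        for key, want in expected.items()
        if saved.get(key) != want
    }


# Example call with placeholder values:
# lora_mismatches("output/sft/final/adapter_config.json",
#                 {"r": 16, "lora_alpha": 32})
```

An empty result means the checked settings agree, so the size mismatch has another cause, such as a different base model.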

Verification Issues

No Alerts Appearing

Symptoms:

  • All samples get reward=1.0 (evaded)
  • But samples should be detected

Solutions:

# Check Elastic Agent status in Fleet
# Kibana → Fleet → Agents → Check status

# Verify detection rules are enabled
# Kibana → Security → Rules → Check enabled count

# Test alert generation manually
curl -sk -u elastic:password \
    "https://localhost:9200/.alerts-security.alerts-default/_count" | jq

# Increase detection timeout
detection:
  timeout: 180      # Increase from 120
  poll_interval: 10  # More frequent polling

Samples Not Executing on VMs

Error:

Execution failed: Binary not found on VM

Solutions:

# Test SSH connection to VM
ssh -i ~/.ssh/win labuser@10.0.20.101 "whoami"

# Check staging directory exists
ssh labuser@vm-ip "if exist C:\MalwareStaging\samples echo OK"

# Check transfer succeeded
ssh labuser@vm-ip "dir C:\MalwareStaging\samples"

All Samples Getting Same Reward

Symptoms:

  • Every sample gets exactly 0.5 or 0.6
  • No variation in rewards

Solutions:

# Check correlation is working
# Add debug logging
import logging
logging.getLogger('malagent.verifiers').setLevel(logging.DEBUG)

# Verify sample IDs are unique
# Check alerts include sample correlation markers

Detection Timeout Exceeded

Error:

TimeoutError: Detection polling exceeded 120s

Solutions:

# Increase timeout
detection:
  timeout: 180
  rule_delay: 45   # More initial delay

# Check Elasticsearch performance
curl -sk -u elastic:password \
    "https://localhost:9200/_cluster/stats?human" | jq '.indices'

Environment Issues

ROCm Not Detected

Error:

RuntimeError: No HIP GPUs are available

Solutions:

# Verify ROCm installation
rocm-smi

# Check permissions
groups  # Should include 'video' and 'render'
sudo usermod -aG video,render $USER
# Logout and login again

# Inside toolbox, check device access
ls -la /dev/kfd /dev/dri
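
The device check can be done from Python as well, which is convenient inside the same environment that later imports PyTorch. A minimal sketch; `device_access` is a hypothetical helper, and the `renderD*` pattern assumes the usual Linux naming of render nodes under `/dev/dri`.

```python
import glob
import os


def device_access(paths=None) -> dict[str, bool]:
    """Map each device path to whether the current user can read and write it."""
    if paths is None:
        # Assumption: render nodes follow the usual /dev/dri/renderD* naming.
        paths = ["/dev/kfd", *sorted(glob.glob("/dev/dri/renderD*"))]
    return {p: os.access(p, os.R_OK | os.W_OK) for p in paths}
```

Any `False` entry usually means the group change has not taken effect yet or the devices were not passed through to the toolbox.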

Wrong PyTorch Version

Error:

ImportError: cannot import name 'flash_attn_func' from 'flash_attn'

Solutions:

# Check PyTorch ROCm build
python -c "import torch; print(torch.version.hip)"

# Reinstall if wrong
pip install torch --index-url https://download.pytorch.org/whl/rocm6.0

Module Not Found

Error:

ModuleNotFoundError: No module named 'malagent'

Solutions:

# Install in development mode
cd /path/to/malagent
pip install -e .

# Verify installation
python -c "from malagent.cli import main; print('OK')"

# Check PYTHONPATH if needed
export PYTHONPATH=/path/to/malagent:$PYTHONPATH

Debug Mode

Enable debug logging for detailed troubleshooting:

# Set environment variable
export MALAGENT_LOG_LEVEL=DEBUG

# Or in code
import logging
logging.basicConfig(level=logging.DEBUG)
logging.getLogger('malagent').setLevel(logging.DEBUG)
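
The two approaches can be combined so that a script honors the environment variable when it is set. This is a sketch only: how malagent itself reads `MALAGENT_LOG_LEVEL` is not shown here, and `configure_logging` is a hypothetical helper.

```python
import logging
import os


def configure_logging(default: str = "INFO") -> int:
    """Set the malagent logger level from MALAGENT_LOG_LEVEL and return it."""
    name = os.environ.get("MALAGENT_LOG_LEVEL", default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        level = logging.INFO  # fall back on unrecognized level names
    logging.basicConfig(level=level)
    logging.getLogger("malagent").setLevel(level)
    return level
```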

Getting Help

If issues persist:

  1. Collect logs:

    MALAGENT_LOG_LEVEL=DEBUG python malagent/training/raft_trainer.py 2>&1 | tee debug.log
    
  2. Check component status:

    ./toolbox/verify.sh
    
  3. Open an issue on GitHub with:

    • Error message and full traceback
    • Configuration (redact passwords)
    • Hardware specifications
    • Steps to reproduce