Dataset Format

Structuring prompts and examples for training

Overview

malagent uses JSONL (JSON Lines) format for all datasets. Each line is a self-contained JSON object.

Prompt Dataset (RAFT)

Used during RAFT training to generate candidate samples.

Format

{"prompt": "Write a C++ function that uses NtAllocateVirtualMemory to allocate executable memory without using VirtualAlloc.", "category": "syscalls", "difficulty": 3}
{"prompt": "Implement Hell's Gate syscall resolution in C++ that finds syscall numbers from ntdll.", "category": "syscalls", "difficulty": 4}

Fields

Field	Type	Required	Description
`prompt`	string	Yes	The instruction for code generation
`category`	string	No	Technique category for metrics
`difficulty`	int	No	Complexity level (1-5)
`technique_id`	string	No	MITRE ATT&CK ID (e.g., T1055)
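The field table above can be turned into a per-record check. The following is a minimal sketch, not malagent's own validator; the function name `check_prompt_record` and the exact error messages are hypothetical, and the sample record is illustrative only.

```python
import json

def check_prompt_record(obj):
    """Return a list of problems found in one prompt record (empty if none)."""
    if not isinstance(obj, dict):
        return ["record is not a JSON object"]
    problems = []
    prompt = obj.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("'prompt' must be a non-empty string")
    category = obj.get("category")
    if category is not None and not isinstance(category, str):
        problems.append("'category' must be a string")
    difficulty = obj.get("difficulty")
    if difficulty is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(difficulty, bool) or not isinstance(difficulty, int):
            problems.append("'difficulty' must be an integer")
        elif not 1 <= difficulty <= 5:
            problems.append("'difficulty' must be between 1 and 5")
    technique_id = obj.get("technique_id")
    if technique_id is not None and not isinstance(technique_id, str):
        problems.append("'technique_id' must be a string")
    return problems

# Illustrative record; only "prompt" is required.
record = json.loads(
    '{"prompt": "Parse the DOS header of a PE file in C++.", '
    '"category": "pe_basics", "difficulty": 1}'
)
print(check_prompt_record(record))  # prints []
```

Optional fields are checked only when present, which mirrors the "Required" column of the table.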

Category Guidelines

Category	Example Techniques	Difficulty
`pe_basics`	DOS header parsing, section enumeration	1
`api_resolution`	GetProcAddress, export table walking	2
`memory`	VirtualAlloc alternatives, section mapping	2-3
`syscalls`	Direct, indirect, Hell’s Gate, Halo’s Gate	3-4
`injection`	Process hollowing, APC, thread hijack	4-5
`evasion`	Unhooking, ETW bypass, AMSI patch	5
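If it helps to keep labels consistent, the guideline ranges above can be encoded as a lookup and used to flag records whose difficulty falls outside the range for their category. This is a sketch under the assumption that the table is advisory; the names `CATEGORY_DIFFICULTY` and `difficulty_matches_category` are hypothetical and not part of malagent.

```python
# Typical difficulty range (low, high) per category, copied from the table above.
CATEGORY_DIFFICULTY = {
    "pe_basics": (1, 1),
    "api_resolution": (2, 2),
    "memory": (2, 3),
    "syscalls": (3, 4),
    "injection": (4, 5),
    "evasion": (5, 5),
}

def difficulty_matches_category(category, difficulty):
    """Return True if difficulty is in the guideline range for the category.

    Categories that are not in the table are accepted as-is.
    """
    if category not in CATEGORY_DIFFICULTY:
        return True
    low, high = CATEGORY_DIFFICULTY[category]
    return low <= difficulty <= high

print(difficulty_matches_category("syscalls", 3))   # prints True
print(difficulty_matches_category("pe_basics", 4))  # prints False
```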

SFT Examples Dataset

Used for supervised fine-tuning to establish baseline capability.

Format

{
  "instruction": "Write a C++ function that uses NtAllocateVirtualMemory to allocate executable memory.",
  "completion": "#include <windows.h>\n#include <winternl.h>\n\ntypedef NTSTATUS (NTAPI* pNtAllocateVirtualMemory)(\n    HANDLE ProcessHandle,\n    PVOID* BaseAddress,\n    ULONG_PTR ZeroBits,\n    PSIZE_T RegionSize,\n    ULONG AllocationType,\n    ULONG Protect\n);\n\nint main() {\n    HMODULE ntdll = GetModuleHandleA(\"ntdll.dll\");\n    pNtAllocateVirtualMemory NtAllocateVirtualMemory = \n        (pNtAllocateVirtualMemory)GetProcAddress(ntdll, \"NtAllocateVirtualMemory\");\n    \n    PVOID baseAddress = NULL;\n    SIZE_T regionSize = 4096;\n    \n    NTSTATUS status = NtAllocateVirtualMemory(\n        GetCurrentProcess(),\n        &baseAddress,\n        0,\n        &regionSize,\n        MEM_COMMIT | MEM_RESERVE,\n        PAGE_EXECUTE_READWRITE\n    );\n    \n    return status == 0 ? 0 : 1;\n}",
  "category": "memory"
}

Fields

Field	Type	Required	Description
`instruction`	string	Yes	The prompt/instruction
`completion`	string	Yes	The expected C++ code output
`category`	string	No	Technique category
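The example record above is pretty-printed for readability; in a JSONL file each record occupies a single line, so the C++ source in `completion` must have its newlines and quotes escaped. A minimal sketch of producing such a line with the standard `json` module follows. The helper name `make_sft_record` is hypothetical, and the sample program is a placeholder.

```python
import json

def make_sft_record(instruction, code, category=None):
    """Serialize one SFT example as a single JSONL line (without trailing newline)."""
    record = {"instruction": instruction, "completion": code}
    if category is not None:
        record["category"] = category
    # json.dumps escapes newlines and double quotes, so the result stays on one line.
    return json.dumps(record)

# Placeholder C++ source containing newlines, quotes and a backslash escape.
code = '#include <stdio.h>\n\nint main() {\n    printf("hello\\n");\n    return 0;\n}'
line = make_sft_record("Write a C++ program that prints hello.", code)

# The serialized record round-trips to the original source.
print(json.loads(line)["completion"] == code)  # prints True
```

Writing records this way avoids hand-escaping C++ string literals and keeps every record on its own line.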

Completion Requirements

Completions should:

Be complete — Full compilable programs with main() or complete functions
Include headers — All necessary #include statements
Use proper types — Windows-specific types (HANDLE, PVOID, etc.)
Handle errors — Basic error checking where appropriate

Example: Good Completion

#include <windows.h>
#include <winternl.h>  // declares NTSTATUS
#include <stdio.h>

typedef NTSTATUS (NTAPI* pNtQuerySystemInformation)(
    ULONG SystemInformationClass,
    PVOID SystemInformation,
    ULONG SystemInformationLength,
    PULONG ReturnLength
);

int main() {
    HMODULE ntdll = GetModuleHandleA("ntdll.dll");
    if (!ntdll) {
        printf("Failed to get ntdll handle\n");
        return 1;
    }
    
    pNtQuerySystemInformation NtQuerySystemInformation = 
        (pNtQuerySystemInformation)GetProcAddress(ntdll, "NtQuerySystemInformation");
    
    if (!NtQuerySystemInformation) {
        printf("Failed to resolve NtQuerySystemInformation\n");
        return 1;
    }
    
    // Use the function...
    return 0;
}

Example: Bad Completion

// Missing headers
// Incomplete function
void doSomething() {
    NtAllocateVirtualMemory(...);  // Won't compile
}

Code Extraction

malagent extracts code from model completions using regular-expression pattern matching.

Extraction Priority

Fenced code blocks (preferred)

```cpp
#include <windows.h>
int main() { return 0; }
```

Naked code (starts with #include or known patterns)

#include <windows.h>
int main() { return 0; }

XML-style tags

<code>
#include <windows.h>
int main() { return 0; }
</code>

Extraction Patterns

EXTRACTION_PATTERNS = [
    # Fenced code blocks (cpp, c++, c)
    r"```(?:cpp|c\+\+|c)?\s*\n(.*?)```",
    
    # Code starting with #include
    r"(#include\s*<[^>]+>.*?)(?=\n\n[A-Z]|\n\n\*|\Z)",
    
    # Code starting with typedef
    r"(typedef\s+.*?(?:int\s+main\s*\([^)]*\)\s*\{.*?\}|\};))",
    
    # XML-style code tags
    r"<code>(.*?)</code>",
]
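One way to apply the patterns above in priority order is sketched below. It rests on two assumptions that the list itself does not state: the patterns are compiled with `re.DOTALL` so that `.*?` can span multiple lines, and the first pattern that matches wins. The function name `extract_code` is hypothetical, and the fence marker is built at runtime only so that this example does not contain a literal fence.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime

# Same patterns as in the list above, in the same order.
EXTRACTION_PATTERNS = [
    FENCE + r"(?:cpp|c\+\+|c)?\s*\n(.*?)" + FENCE,
    r"(#include\s*<[^>]+>.*?)(?=\n\n[A-Z]|\n\n\*|\Z)",
    r"(typedef\s+.*?(?:int\s+main\s*\([^)]*\)\s*\{.*?\}|\};))",
    r"<code>(.*?)</code>",
]

def extract_code(completion):
    """Return the first match of the highest-priority pattern, or None."""
    for pattern in EXTRACTION_PATTERNS:
        # Assumption: DOTALL, so "." also matches newlines inside a code block.
        match = re.search(pattern, completion, re.DOTALL)
        if match:
            return match.group(1).strip()
    return None

reply = (
    "Here is the code:\n\n"
    + FENCE + "cpp\n"
    + "#include <windows.h>\nint main() { return 0; }\n"
    + FENCE + "\n\nIt returns zero."
)
print(extract_code(reply))
```

For the sample reply, the fenced pattern matches first and the surrounding prose is discarded.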

Handling Multiple Blocks

When multiple code blocks exist, malagent uses the longest valid block that:

Contains #include or typedef
Has balanced braces
Ends with } or };
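The selection rule above can be sketched as a filter followed by a length comparison. This is an illustration of the three listed criteria, not malagent's actual implementation; all function names are hypothetical, and the brace check is a naive count that also counts braces inside string literals and comments.

```python
def has_balanced_braces(code):
    """Naive check that every '{' is closed by a later '}'."""
    depth = 0
    for ch in code:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a closing brace appeared before its opener
                return False
    return depth == 0

def is_valid_block(code):
    """Apply the three criteria: marker present, balanced braces, proper ending."""
    stripped = code.strip()
    return (
        ("#include" in stripped or "typedef" in stripped)
        and has_balanced_braces(stripped)
        and stripped.endswith(("}", "};"))
    )

def pick_longest_valid(blocks):
    """Return the longest block that passes is_valid_block, or None."""
    valid = [b.strip() for b in blocks if is_valid_block(b)]
    return max(valid, key=len) if valid else None

blocks = [
    "int x = 1;",                                       # no #include or typedef
    "#include <stdio.h>\nint main() { return 0; }",     # valid
    "#include <stdio.h>\n#include <stdlib.h>\nint main() { if (1) { return 0; }",  # unbalanced
]
print(pick_longest_valid(blocks))
```

The third block is the longest but is rejected for unbalanced braces, so the second block is selected.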

Dataset Preparation Scripts

Convert from Other Formats

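# convert_to_jsonl.py
import csv
import json

def convert_csv_to_jsonl(csv_path, output_path):
    # newline='' lets the csv module handle line breaks inside quoted fields
    with open(csv_path, newline='', encoding='utf-8') as f, \
         open(output_path, 'w', encoding='utf-8') as out:
        reader = csv.DictReader(f)
        for row in reader:
            record = {
                "prompt": row["prompt"],
                # Fall back to "unknown" when the column is missing or empty
                "category": row.get("category") or "unknown",
            }
            out.write(json.dumps(record) + '\n')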

Validate Dataset

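# validate_dataset.py
import json

def validate_prompts(path):
    """Return a list of error messages; an empty list means the file is valid."""
    errors = []
    with open(path, encoding='utf-8') as f:
        for i, line in enumerate(f, 1):
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"Line {i}: invalid JSON - {e}")
                continue
            # Report one error per line instead of stacking related errors
            if not isinstance(obj, dict):
                errors.append(f"Line {i}: expected a JSON object")
            elif "prompt" not in obj:
                errors.append(f"Line {i}: missing 'prompt' field")
            elif not isinstance(obj["prompt"], str):
                errors.append(f"Line {i}: 'prompt' must be a string")
            elif len(obj["prompt"]) < 20:
                errors.append(f"Line {i}: prompt too short")
    return errors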

Dataset Statistics

Track dataset composition for balanced training:

# Count by category
jq -r '.category' data/prompts.jsonl | sort | uniq -c | sort -rn

# Check difficulty distribution
jq -r '.difficulty' data/prompts.jsonl | sort | uniq -c

Example output of the category count:

    156 syscalls
    142 injection
     98 memory
     87 api_resolution
     52 pe_basics
     34 evasion
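Where jq is not available, the same counts can be produced in Python. The sketch below assumes one JSON object per line, as described in the Overview; the function name `count_field` and the sample records are hypothetical.

```python
import json
from collections import Counter

def count_field(lines, field):
    """Count the values of one field across JSONL lines.

    Blank lines are skipped; records without the field are counted under None.
    """
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[json.loads(line).get(field)] += 1
    return counts

# Illustrative records only.
sample = [
    '{"prompt": "example prompt A", "category": "syscalls", "difficulty": 3}',
    '{"prompt": "example prompt B", "category": "memory", "difficulty": 2}',
    '{"prompt": "example prompt C", "category": "syscalls", "difficulty": 4}',
]
for value, n in count_field(sample, "category").most_common():
    print(f"{n:7d} {value}")
```

Because an open file is an iterable of lines, passing the handle returned by `open("data/prompts.jsonl")` in place of `sample` counts the real dataset.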