Dataset Format
Structuring prompts and examples for training
Overview
malagent uses JSONL (JSON Lines) format for all datasets. Each line is a self-contained JSON object.
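As a minimal sketch of what this means for tooling, a dataset can be read one record at a time. The helper name `read_jsonl` is hypothetical and not part of malagent, and skipping blank lines is an assumption rather than documented behavior.

```python
import json
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are ignored
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                # Report the offending line so the file can be fixed by hand
                raise ValueError(f"line {line_no}: invalid JSON ({e})") from e
```

Because the function is a generator, large datasets are never loaded into memory all at once.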
Prompt Dataset (RAFT)
Used during RAFT training to generate candidate samples.
Format
```json
{"prompt": "Write a C++ function that uses NtAllocateVirtualMemory to allocate executable memory without using VirtualAlloc.", "category": "syscalls", "difficulty": 3}
{"prompt": "Implement Hell's Gate syscall resolution in C++ that finds syscall numbers from ntdll.", "category": "syscalls", "difficulty": 4}
```
Fields
| Field | Type | Required | Description |
|---|---|---|---|
| prompt | string | Yes | The instruction for code generation |
| category | string | No | Technique category for metrics |
| difficulty | int | No | Complexity level (1-5) |
| technique_id | string | No | MITRE ATT&CK ID (e.g., T1055) |
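The field table can be turned into a per-record check. The following is a sketch only: `check_prompt_record` is a hypothetical helper, and rejecting out-of-range difficulty values is an assumption based on the stated 1-5 range.

```python
from typing import List


def check_prompt_record(obj: dict) -> List[str]:
    """Return the problems found in one prompt record (empty list if none)."""
    problems = []
    prompt = obj.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        problems.append("'prompt' is required and must be a non-empty string")
    # Optional string fields from the table
    for field in ("category", "technique_id"):
        if field in obj and not isinstance(obj[field], str):
            problems.append(f"'{field}' must be a string")
    if "difficulty" in obj:
        d = obj["difficulty"]
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(d, int) or isinstance(d, bool) or not 1 <= d <= 5:
            problems.append("'difficulty' must be an integer from 1 to 5")
    return problems
```

Returning a list instead of raising on the first problem lets a caller report every issue in a record at once.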
Category Guidelines
| Category | Example Techniques | Difficulty |
|---|---|---|
| pe_basics | DOS header parsing, section enumeration | 1 |
| api_resolution | GetProcAddress, export table walking | 2 |
| memory | VirtualAlloc alternatives, section mapping | 2-3 |
| syscalls | Direct, indirect, Hell’s Gate, Halo’s Gate | 3-4 |
| injection | Process hollowing, APC, thread hijack | 4-5 |
| evasion | Unhooking, ETW bypass, AMSI patch | 5 |
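If the guideline table is treated as a lint rule, a record's difficulty can be compared with the range listed for its category. This is a sketch under that assumption: the table reads as guidance rather than a hard constraint, and both `CATEGORY_DIFFICULTY` and `difficulty_in_range` are hypothetical names.

```python
from typing import Optional

# Difficulty ranges copied from the Category Guidelines table (low, high)
CATEGORY_DIFFICULTY = {
    "pe_basics": (1, 1),
    "api_resolution": (2, 2),
    "memory": (2, 3),
    "syscalls": (3, 4),
    "injection": (4, 5),
    "evasion": (5, 5),
}


def difficulty_in_range(category: str, difficulty: int) -> Optional[bool]:
    """Return True/False for a listed category, or None if it is not in the table."""
    bounds = CATEGORY_DIFFICULTY.get(category)
    if bounds is None:
        return None  # unknown category: nothing to compare against
    low, high = bounds
    return low <= difficulty <= high
```

Returning `None` for unlisted categories keeps custom categories from being flagged as errors.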
SFT Examples Dataset
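Used for supervised fine-tuning to establish baseline capability. The record under Format is pretty-printed for readability; in a JSONL file it occupies a single line.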
Format
```json
{
  "instruction": "Write a C++ function that uses NtAllocateVirtualMemory to allocate executable memory.",
  "completion": "#include <windows.h>\n#include <winternl.h>\n\ntypedef NTSTATUS (NTAPI* pNtAllocateVirtualMemory)(\n HANDLE ProcessHandle,\n PVOID* BaseAddress,\n ULONG_PTR ZeroBits,\n PSIZE_T RegionSize,\n ULONG AllocationType,\n ULONG Protect\n);\n\nint main() {\n HMODULE ntdll = GetModuleHandleA(\"ntdll.dll\");\n pNtAllocateVirtualMemory NtAllocateVirtualMemory = \n (pNtAllocateVirtualMemory)GetProcAddress(ntdll, \"NtAllocateVirtualMemory\");\n \n PVOID baseAddress = NULL;\n SIZE_T regionSize = 4096;\n \n NTSTATUS status = NtAllocateVirtualMemory(\n GetCurrentProcess(),\n &baseAddress,\n 0,\n &regionSize,\n MEM_COMMIT | MEM_RESERVE,\n PAGE_EXECUTE_READWRITE\n );\n \n return status == 0 ? 0 : 1;\n}",
  "category": "memory"
}
```
Fields
| Field | Type | Required | Description |
|---|---|---|---|
| instruction | string | Yes | The prompt/instruction |
| completion | string | Yes | The expected C++ code output |
| category | string | No | Technique category |
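Completions are multi-line source files, so they have to be JSON-escaped before they fit on one JSONL line. A minimal sketch using the standard `json` module follows; the helper name `to_jsonl_line` is hypothetical.

```python
import json
from typing import Optional


def to_jsonl_line(instruction: str, completion: str,
                  category: Optional[str] = None) -> str:
    """Serialize one SFT record as a single JSONL line."""
    record = {"instruction": instruction, "completion": completion}
    if category is not None:
        record["category"] = category  # optional field, omitted when unset
    # json.dumps escapes newlines and quotes, so the result contains no raw line breaks
    return json.dumps(record, ensure_ascii=False)
```

Writing `to_jsonl_line(...) + "\n"` for each record yields a file in the format described above.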
Completion Requirements
Completions should:
- Be complete: Full compilable programs with `main()` or complete functions
- Include headers: All necessary `#include` statements
- Use proper types: Windows-specific types (`HANDLE`, `PVOID`, etc.)
- Handle errors: Basic error checking where appropriate
Example: Good Completion
```cpp
#include <windows.h>
#include <stdio.h>

typedef NTSTATUS (NTAPI* pNtQuerySystemInformation)(
    ULONG SystemInformationClass,
    PVOID SystemInformation,
    ULONG SystemInformationLength,
    PULONG ReturnLength
);

int main() {
    HMODULE ntdll = GetModuleHandleA("ntdll.dll");
    if (!ntdll) {
        printf("Failed to get ntdll handle\n");
        return 1;
    }

    pNtQuerySystemInformation NtQuerySystemInformation =
        (pNtQuerySystemInformation)GetProcAddress(ntdll, "NtQuerySystemInformation");
    if (!NtQuerySystemInformation) {
        printf("Failed to resolve NtQuerySystemInformation\n");
        return 1;
    }

    // Use the function...
    return 0;
}
```
Example: Bad Completion
```cpp
// Missing headers
// Incomplete function
void doSomething() {
    NtAllocateVirtualMemory(...);  // Won't compile
}
```
Code Extraction
malagent extracts code from model completions using pattern matching.
Extraction Priority
1. Fenced code blocks (preferred)

   ````
   ```cpp
   #include <windows.h>
   int main() { return 0; }
   ```
   ````

2. Naked code (starts with `#include` or known patterns)

   ```
   #include <windows.h>
   int main() { return 0; }
   ```

3. XML-style tags

   ```
   <code>
   #include <windows.h>
   int main() { return 0; }
   </code>
   ```
Extraction Patterns
````python
EXTRACTION_PATTERNS = [
    # Fenced code blocks (cpp, c++, c)
    r"```(?:cpp|c\+\+|c)?\s*\n(.*?)```",
    # Code starting with #include
    r"(#include\s*<[^>]+>.*?)(?=\n\n[A-Z]|\n\n\*|\Z)",
    # Code starting with typedef
    r"(typedef\s+.*?(?:int\s+main\s*\([^)]*\)\s*\{.*?\}|\};))",
    # XML-style code tags
    r"<code>(.*?)</code>",
]
````
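One way these patterns could be applied is to try them in order and return the first match. The sketch below rests on two assumptions that the text does not confirm: that the first matching pattern wins, and that the patterns are matched with `re.DOTALL` so `.` can span line breaks. The function `extract_code` is a hypothetical name, and the fence marker is assembled from single backticks only to keep the listing easy to embed.

```python
import re

FENCE = "`" * 3  # three backticks

EXTRACTION_PATTERNS = [
    # Fenced code blocks (cpp, c++, c)
    FENCE + r"(?:cpp|c\+\+|c)?\s*\n(.*?)" + FENCE,
    # Code starting with #include
    r"(#include\s*<[^>]+>.*?)(?=\n\n[A-Z]|\n\n\*|\Z)",
    # Code starting with typedef
    r"(typedef\s+.*?(?:int\s+main\s*\([^)]*\)\s*\{.*?\}|\};))",
    # XML-style code tags
    r"<code>(.*?)</code>",
]


def extract_code(completion: str):
    """Return the code captured by the first matching pattern, or None."""
    for pattern in EXTRACTION_PATTERNS:
        # Assumption: DOTALL, since code blocks span multiple lines
        match = re.search(pattern, completion, re.DOTALL)
        if match:
            return match.group(1).strip()
    return None
```

Ordering the list from most to least explicit means a fenced block takes precedence over looser heuristics when a completion contains both.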
Handling Multiple Blocks
When multiple code blocks exist, malagent uses the longest valid block that:
- Contains `#include` or `typedef`
- Has balanced braces
- Ends with `}` or `};`
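The selection rule above can be sketched as a filter followed by a length comparison. The function names are hypothetical, and the brace check is deliberately naive: it counts characters and does not account for braces inside string literals or comments, which a real implementation may handle differently.

```python
from typing import List, Optional


def braces_balanced(code: str) -> bool:
    """Naive check: every '}' closes an earlier '{' and none are left open."""
    depth = 0
    for ch in code:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # closing brace with no matching opener
    return depth == 0


def is_valid_block(code: str) -> bool:
    """Apply the three criteria listed above to one candidate block."""
    stripped = code.strip()
    has_marker = "#include" in stripped or "typedef" in stripped
    ends_ok = stripped.endswith(("}", "};"))
    return has_marker and braces_balanced(stripped) and ends_ok


def pick_block(blocks: List[str]) -> Optional[str]:
    """Return the longest valid block, or None if no block qualifies."""
    valid = [b for b in blocks if is_valid_block(b)]
    return max(valid, key=len) if valid else None
```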
Dataset Preparation Scripts
Convert from Other Formats
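```python
# convert_to_jsonl.py
import csv
import json


def convert_csv_to_jsonl(csv_path, output_path):
    """Write one JSON object per CSV row, keeping the prompt and category columns."""
    # newline='' is the csv module's recommended mode for files passed to a reader
    with open(csv_path, newline='') as f, open(output_path, 'w') as out:
        reader = csv.DictReader(f)
        for row in reader:
            json.dump({
                "prompt": row["prompt"],
                # Falls back to "unknown" when the CSV has no category column
                "category": row.get("category", "unknown")
            }, out)
            out.write('\n')
```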
Validate Dataset
```python
# validate_dataset.py
import json


def validate_prompts(path):
    """Return a list of human-readable problems found in a prompt dataset."""
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, 1):
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"Line {i}: invalid JSON - {e}")
                continue
            if "prompt" not in obj:
                errors.append(f"Line {i}: missing 'prompt' field")
            # elif: a missing prompt is reported once, not also as "too short"
            elif len(obj["prompt"]) < 20:
                errors.append(f"Line {i}: prompt too short")
    return errors
```
Dataset Statistics
Track dataset composition for balanced training:
```bash
# Count by category
jq -r '.category' data/prompts.jsonl | sort | uniq -c | sort -rn

# Check difficulty distribution
jq -r '.difficulty' data/prompts.jsonl | sort | uniq -c
```
Example output:
```
156 syscalls
142 injection
98 memory
87 api_resolution
52 pe_basics
34 evasion
```
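Where `jq` is not available, the same counts can be produced with the Python standard library. This is a sketch rather than part of malagent, and `count_field` is a hypothetical helper; records that lack the field are counted under `None`, similar to `jq` printing `null`.

```python
import json
from collections import Counter


def count_field(path: str, field: str) -> Counter:
    """Count how often each value of `field` occurs across a JSONL file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                counts[json.loads(line).get(field)] += 1
    return counts
```

For example, `count_field("data/prompts.jsonl", "category").most_common()` lists categories from most to least frequent.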