GitHub: carlkibler/citadel-llm-security-lab
A single-day security sprint that evolved a prompt injection scanner through nine generations, with red team attacks driving each iteration. The repo is an archive — the full history of a security arms race run start-to-finish in one session.
The project was also an experiment in multi-agent security work: Gauss (the defender) architected each version, while separate attack agents (Opus, Sonnet, Flash) probed for bypasses. The red team found a real breach in every version through V7.
The Evolution
V1–V2: Pattern matching hits its ceiling. The initial scanner used regex and entropy checks. Fast, but trivially bypassed: Unicode homoglyphs, character spacing, and semantic synonyms all sailed through. The lesson: character-level analysis cannot catch intent.
V3–V4: LLM-based detection + session memory. Integrating an LLM for intent detection caught what regex missed — but only in isolation. Multi-turn attacks that spread a payload across several conversation turns bypassed per-turn analysis. V4 added stateful session tracking: cumulative risk scores, behavioral baselines, vocabulary drift detection.
V5–V6: Structural analysis. A process substitution attack in V4 led to the "Zero-Ghost" sibling script check — parse the command AST before execution to detect unexpected child processes. V6 added 8-view parallel analysis: analyze the same input through simultaneous transformations (lowercase, stripped punctuation, Unicode normalization, skeleton extraction). +30% detection improvement.
V7: Scientific clarity. The 8-view approach created false positives on legitimate scientific notation — Greek symbols (α, μ, Δ) in actual scientific contexts flagging as homoglyph attacks. V7 added a scientific exception list and skeleton-based detection (strips spacing to catch obfuscated patterns). All tests passed. Behavioral delta analysis (30% distribution shift threshold) added for anomaly detection.
V8: Semantic intent. The real insight from the red team: pattern matching protects input, but tool output is the real attack surface. Workspace files, web fetches, exec results — any of these can carry a payload that reaches the agent's context without passing through input scanning. V8 added LLM-powered semantic intent analysis and a cross-session assembly detector.
V9 (blueprint): JIT tool-output scanning, trust-tiered filesystem partitioning, outcome prediction before tool execution. The design phase — not yet implemented.
Key Findings
The practical takeaway from running this sprint is architectural: prompt injection defenses that only filter user input are defending the wrong perimeter. Once an agent has tool use — file reads, web fetches, shell execution — the attack surface is everything those tools can return.
The arms race also illustrated alert fatigue. V6's high sensitivity generated false positives that would make the system unusable in production. V7's scientific exception handling wasn't a concession to attackers; it was a requirement for real-world deployment. Security systems that cry wolf get disabled.
The full red team reports, attack demonstrations, and scanner implementations are in the repo. The CHRONICLES.md has the narrative history of each generation in detail.