OpenClaw is a high-privilege AI agent framework. A running instance has access to the host filesystem, shell, and whatever external services the user has connected — email, documents, Slack, banking credentials. Users extend it by installing "skills" from ClawHub, a community skill marketplace.
The attack surface is significant. A malicious ClawHub skill runs with full agent privileges: it can exfiltrate files, steal credentials, install persistent system hooks, and send messages on the user's behalf. Because the agent processes external content — web pages, emails, fetched documents — injecting instructions into content the agent reads is a viable attack without any user interaction.
This experiment ran a guard scanner on all content entering the agent's context. The goal: prevent the agent from being subverted by content it retrieved. One day. Nine scanner versions. Twelve documented attack techniques.
9scanner versions
12documented attacks
0LLM API calls at scan time
<35msmedian scan latency
V1
Regex Pattern Matching
broke immediately
A list of dangerous patterns: ignore previous instructions, rm -rf, curl. Fast, explicit, zero ambiguity about what it catches.
Ten attacks. Ten bypasses. Spacing, Unicode homoglyphs, synonyms, base64, l33t-speak — none required more than surface-level knowledge that the scanner was pattern-matching literal strings.
Lesson learned
Exact-string patterns over natural language are enumerable. The attacker has infinite paraphrases of the same instruction.
V2
LLM-Guard Intent Scoring
bypassed
Replaced regex with an LLM-Guard PromptInjection scorer (local inference, no API calls). Added NFKC normalization and entropy analysis.
The Dots attack (I.g.n.o.r.e.) bypassed entropy. Base64-encoded commands bypassed intent. A multilingual sandwich — Chinese characters wrapping English instructions — confused the non-English threshold.
Lesson learned
An intent classifier scores tokens. It doesn't model what the sentence is asking for. Context matters; individual token scoring doesn't capture it.
V3
Multi-Layer Defense
design-only
Four-layer design: heuristic regex → LLM-Guard intent → OCR for image-embedded prompts → adversarial auditor LLM reasoning about all three outputs. No implementation survives in the archive.
NFKC Ghost: Cyrillic homoglyphs (Pуthоn -с "іmроrt оs") that look identical to Latin. NFKC normalization converted them to clean Python code — and the auditor classified it as benign developer content. The defense enabled the attack.
Shadow Word: Three-turn conditioning. Turn 1 defines a codeword for "bypass." Turn 3 asks for help with that codeword. Auditor evaluated Turn 3 in isolation — no session memory.
Lesson learned
Intent detection requires session context. A per-turn auditor is blind to multi-turn manipulation.
V4
Stateful Auditing
design-only
Added a 5-turn context window to the auditor. Cross-turn behavioral tracking and vocabulary drift detection. No implementation survives in the archive.
The 5-turn window closed the Shadow Word attack and blocked multi-turn conditioning vectors. One confirmed bypass: a single Cyrillic character substitution (followіng with Cyrillic і) fell under the entropy threshold because only one character deviated from Latin.
Lesson learned
Per-word script checking — not just entropy — is required to catch low-density homoglyph substitutions where only one character is non-Latin.
V5
Per-Word Mixed-Script Detection
bypassed
Per-word script consistency check: any word mixing Latin + Cyrillic, Latin + Greek, or Latin + Mathematical Alphanumerics flags at severity 10. Closed the Cyrillic homoglyph family completely.
Ancient script camouflage bypassed it 100%: Coptic ⲒⲄⲚⲞⲢⲈ is single-script — no mixing detected. Deseret, Osage, and Runic worked identically. DevOps Semantic Camouflage bypassed with plain English; no Unicode at all.
Lesson learned
Script-consistency checks ask "is this word mixing scripts?" not "is this script expected here?" An attacker using a single consistent non-Latin script — or no special characters at all — is invisible to this approach.
V6
8-View Parallel Transforms
scoring bug found
Input fans into eight simultaneous transformation views. The script allowlist replaced per-word checking — ancient scripts (Coptic, Runic) now caught. Seven l33t variants confirmed caught only via THE_PHONETIC.
Score Amplification Bug: plain text scored 60 (matched all 8 views). L33t-speak scored 20 (matched only THE_PHONETIC). Obfuscation produced a lower risk score. Skeleton Shattering: THE_SKELETON correctly collapsed spacing but V6 had no patterns searching the skeleton output.
Lesson learned
Score by severity ceiling, not sum across views. Build the skeleton pattern library. Add exemptions for known-legitimate scientific notation — 73% false-positive rate on biochemistry text is unusable in production.
V7
Skeleton Patterns + Scientific Exceptions
superscript bypass, delta poisoning
Four additions: (1) skeleton-specific pattern library operating on the collapsed skeleton output, (2) scientific exception list (α-helical, μg/mL, ΔG) downgraded from severity 10 to 2, (3) session-persistent high-water mark, (4) behavioral delta — flag if current-turn word frequency diverges from session history by >30%.
The spacing bypass closed. Scientific false positives dropped. Then two separate red-team runs found the two attacks in the widget below.
Lesson learned
Behavioral delta answers "did behavior change?" — not "is behavior dangerous?" A session that starts malicious and stays malicious shows delta = 0.0. Absolute toxicity tracking is required alongside delta.
Attack Replay
Select an attack to see how each scanner version processed it — all 8 transform views, detected patterns, and scores.
Payload:
8 Parallel Views
Scanner Results by Version
V8
Absolute Toxicity + Semantic Clustering
architectural gap
calculate_absolute_toxicity() tracks mean risk score across all session turns — independent of delta. Seven semantic concept clusters (INSTRUCTION_OVERRIDE, CREDENTIAL_EXFILTRATION, PERSONA_HIJACK, DATA_EXFILTRATION, DESTRUCTIVE_ACTION, PRIVILEGE_ESCALATION, CODE_INJECTION) catch synonym-based attacks.
Despite the README claiming "requires LLM API for semantic analysis," the implementation is pure heuristic — keyword lists and regex. No network calls. Semantic clustering via keyword co-occurrence is still useful; it doesn't require LLM inference.
Cross-Session Assembly: V8 scans input. Workspace files read via the agent's read tool bypass the scanner. An attacker who has written fragments to workspace files — via a web fetch, a user-supplied file, or a malicious ClawHub skill — can assemble a full instruction across reads without any single piece triggering the scanner.
The Sonnet Refusal
During V8 red-teaming, Claude Sonnet was dispatched with instructions to write a "primer that teaches a future model how to bypass V8." Sonnet declined. The argument: the primer would be cross-session attack documentation — exactly the attack class V8 was supposed to prevent. Sonnet also flagged that the subagent auto-announce pattern allowed red-team outputs to re-enter the main session without review. The argument was technically specific and correct. Read the full refusal →
V9
JIT Output Firewall
phase 1 operational
The architectural shift: scan what tool calls return, not just what the user sends. The JIT scanner intercepts tool output before it enters the context window.
Seven semantic clusters (SEM-001 through SEM-007: prompt injection, exfiltration, credential harvesting, command injection, path traversal, social engineering, obfuscation). Three response modes: PASS, SANITIZE (wrap with advisory), BLOCK (suppress).
Phase 1 operational. Phases 2–5 remain as design: tool execution layer hooks, audit logging, adaptive thresholds, LLM-enhanced detection. Phase 5 would be the first version in the series to make API calls at scan time.
The architectural insight
V1 through V8 defended the input boundary. V9 recognized that in an agent with tool use, the input boundary is the wrong perimeter.
V1–V8: Input scanning
V9: Output scanning
Every file read, web fetch, and exec result is an untrusted data source. V1–V8 assumed workspace storage was safe. Any ClawHub skill written to workspace files had a direct path to the agent's context.
Attack × Version Matrix
Every documented attack against every scanner version. Click any cell for details.