BrokenClaw Part 5: GPT-5.4 Edition (Prompt Injection)
GPT-5.4 executes untrusted code from fetched pages despite security countermeasures in place.

Demonstrates RCE in AI agents by bypassing untrusted content tags via fake redirects.
AI developers, Security researchers
Gandalf · Lakera Guard · PromptInject
GPT-5.4 executes untrusted code from fetched pages despite security countermeasures in place.
It actually looks for the weird stuff that trips up LLM agents — invisible Unicode, bidi overrides, embedded curl|bash one-liners, exfil links — and pairs a static skill scanner with a real-time interception flow that forces human approvals. The CLI-first approach (npx safeclaw start) plus Socket.IO alerts and per-command allow/deny decisions show practical thinking about developer workflows; I want to see model/false-positive metrics and enterprise integration docs next.
The two-layer approach — a code plugin for gates/hardening plus a tiny ~1,230-token LLM skill for behavioral rules — is smart and practical. I appreciate that detection runs in bash (no token bloat) and that they mapped concrete checks to OWASP ASI and MITRE frameworks; the tradeoff is obvious: this is highly valuable if you run OpenClaw, but mostly irrelevant outside that ecosystem.
Blocks prompt injection before execution when Anthropic's filters won't.
First automated red teaming for agentic AI at scale—enterprise gap now weaponized.
Isolated LLM with no tools or memory makes prompt injection hit a dead end.