AgentToolBench-Code – security benchmark for AI coding agents
Expands corpus to 16 CVE-anchored scenarios to break model ties.
First systematic attack framework proving 7/9 exploits work on AI agents with shell access.
AI/ML engineers, security researchers, developers building autonomous agents with code execution
OWASP LLM Top 10 · Prompt Injection benchmarks (HuggingFace, Anthropic's red-teaming) · Container escape test suites
Expands corpus to 16 CVE-anchored scenarios to break model ties.
Agent red-teaming via UI, but attack catalog is shallow and comparison unclear vs. manual testing.
The author walks the reader through a full exploit chain that starts with a UX/trust-boundary trick and ends in RCE by causing a client to connect to an attacker gateway, leak a token, and reconfigure the agent’s execution environment. It's a sharp systems narrative that will change how you think about agents crossing chat, browser, and local tooling — excellent reading for defenders and attacker-minded engineers, but it's an investigative article rather than a ship-or-tool.
Maps cross-agent injection attacks to real Copilot CVEs with live measurements.
Benchmarked dead code finder across FastAPI, Pydantic, Flask—but Vulture, Bandit already solve this.
Scanner benchmarking for DAST tools. DVWA and Juice Shop dominate security training.