Back to browse
GitHub Repository
2 starsPython

BreakMyAgent – Open-source red-teaming sandbox for LLM system prompts

by breakmyagent·Feb 26, 2026·2 points·0 comments

AI Analysis

●●SolidSolve My ProblemShip It

LLM-as-Judge red-teaming for system prompts, but Anthropic/OpenAI already ship this internally.

Strengths
  • Concurrent multi-model attack surface in one UI; supports OpenAI, Anthropic, OpenRouter models.
  • LLM-as-Judge detection uses strict alignment rules and prompt caching for repeatable evals.
  • Zero-friction local setup with uv; API key detection auto-populates model list dynamically.
Weaknesses
  • Only 12 baseline exploits—serious threats use sophisticated multi-step agentic payloads not covered.
  • Pro tier upsell (50+ vectors, CI/CD API) suggests open-source version is deliberately limited.
Category
Target Audience

AI engineers and teams building LLM-based agents and chatbots

Similar To

Anthropic's internal red-teaming suite · OpenAI's moderation API · Pydantic's guardrails tooling

Post Description

As a developer, I got tired of manually testing my AI agents and chatbots against the same prompt injections and jailbreaks every time I tweaked a system prompt. Our QA team was struggling with the exact same bottleneck, so I built BreakMyAgent.

It’s an open-source sandbox that runs an automated barrage of standard exploits against your target LLM to see if it leaks data or ignores core instructions.

How it works under the hood: - The UI is built with Streamlit, backend is FastAPI, and dependency management is handled by `uv`. - You paste your system prompt and hit run. It fires 12 baseline attack vectors (Direct leaks, XSS payloads, Context overflows, etc.) concurrently. - The core mechanic is "LLM-as-a-Judge". It uses a hardcoded `gpt-4.1-mini` with strict alignment rules to systematically evaluate the target's responses. - It supports OpenAI, Anthropic, and a solid list of open-weight models via OpenRouter (including DeepSeek V3/R1, Qwen 2.5, and Llama 3.3).

There is a hosted free version if you want to play with it immediately (I capped it at 15 requests/IP to survive the launch), but the entire tool is open-source and takes 30 seconds to spin up locally with Docker or `uv`.

Repo: https://github.com/BreakMyAgent/breakmyagent-os Live demo: https://breakmyagent.dev

Next on the roadmap: I'm building a dedicated CLI/GitHub Action so teams can drop this into their own CI/CD pipelines to block prompt regressions. I'm also developing a PoC for multi-turn agentic fuzzing and expanding the payload database for complex tool-spoofing.

I’d love to hear your feedback! What other test configurations (besides temperature and response format) do you think are essential for a tool like this? Also open to any feedback on the architecture, the judge prompt, or specific zero-day vectors you'd like to see included in the public database.

Similar Projects

Security●●Solid

Z3r0 – Multi-agent red team collaboration platform

Docker-sandboxed agent orchestration for red teams joins a crowded automated pentesting space.

Niche GemShip ItBold Bet
yv1ing
2010d ago
Security●●Solid

OdinForge – Breach simulation that chains vulns into attack paths

The UI turns complex attack chains into an immediately scannable graph with per-path metrics (risk score, time-to-compromise, assets/credentials impacted) — great for threat modeling and tabletop drills. Feels more like a very polished BAS visualization than a novel research tool; what I want to know next is where the simulation inputs come from (real telemetry, vulnerability feeds, or canned scenarios).

SlickNiche Gem
Doc_Dre
214mo ago