GitHub Repository

2 starsPython

BreakMyAgent – Open-source red-teaming sandbox for LLM system prompts

Name: BreakMyAgent – Open-source red-teaming sandbox for LLM system prompts
Availability: InStock
Author: breakmyagent

by breakmyagent·Feb 26, 2026·2 points·0 comments

Visit Project View on HN

AI Analysis

●●SolidSolve My ProblemShip It

LLM-as-Judge red-teaming for system prompts, but Anthropic/OpenAI already ship this internally.

Strengths

•Concurrent multi-model attack surface in one UI; supports OpenAI, Anthropic, OpenRouter models.
•LLM-as-Judge detection uses strict alignment rules and prompt caching for repeatable evals.
•Zero-friction local setup with uv; API key detection auto-populates model list dynamically.

Weaknesses

•Only 12 baseline exploits—serious threats use sophisticated multi-step agentic payloads not covered.
•Pro tier upsell (50+ vectors, CI/CD API) suggests open-source version is deliberately limited.

Post Description

As a developer, I got tired of manually testing my AI agents and chatbots against the same prompt injections and jailbreaks every time I tweaked a system prompt. Our QA team was struggling with the exact same bottleneck, so I built BreakMyAgent.

It’s an open-source sandbox that runs an automated barrage of standard exploits against your target LLM to see if it leaks data or ignores core instructions.

How it works under the hood: - The UI is built with Streamlit, backend is FastAPI, and dependency management is handled by `uv`. - You paste your system prompt and hit run. It fires 12 baseline attack vectors (Direct leaks, XSS payloads, Context overflows, etc.) concurrently. - The core mechanic is "LLM-as-a-Judge". It uses a hardcoded `gpt-4.1-mini` with strict alignment rules to systematically evaluate the target's responses. - It supports OpenAI, Anthropic, and a solid list of open-weight models via OpenRouter (including DeepSeek V3/R1, Qwen 2.5, and Llama 3.3).

There is a hosted free version if you want to play with it immediately (I capped it at 15 requests/IP to survive the launch), but the entire tool is open-source and takes 30 seconds to spin up locally with Docker or `uv`.

Repo: https://github.com/BreakMyAgent/breakmyagent-os Live demo: https://breakmyagent.dev

Next on the roadmap: I'm building a dedicated CLI/GitHub Action so teams can drop this into their own CI/CD pipelines to block prompt regressions. I'm also developing a PoC for multi-turn agentic fuzzing and expanding the payload database for complex tool-spoofing.

I’d love to hear your feedback! What other test configurations (besides temperature and response format) do you think are essential for a tool like this? Also open to any feedback on the architecture, the judge prompt, or specific zero-day vectors you'd like to see included in the public database.

Similar Projects

Security●●Solid

Z3r0 – Multi-agent red team collaboration platform

Docker-sandboxed agent orchestration for red teams joins a crowded automated pentesting space.

Niche GemShip ItBold Bet

yv1ing

2010d ago

AI/ML●●Solid

We built an AI judge for a live hackathon, then red-teamed it

Multi-model ensemble scoring with Python-side arithmetic prevents LLM manipulation during live demos.

Niche GemShip ItBold Bet

theoradical

103mo ago

Security●●Solid

LLMSecure – prompt injection detection, no signup

Dynamic LLM sandbox analysis detects injections that static pattern matching tools miss.

Solve My ProblemSlick

eliadmualem

211mo ago

Security●●●Banger

We built free adversarial security testing for agents (OpenClaw too)

First automated red teaming for agentic AI at scale—enterprise gap now weaponized.

Big BrainWizardryBold Bet

aaronziosec

113mo ago

Security●●Solid

OdinForge – Breach simulation that chains vulns into attack paths

The UI turns complex attack chains into an immediately scannable graph with per-path metrics (risk score, time-to-compromise, assets/credentials impacted) — great for threat modeling and tabletop drills. Feels more like a very polished BAS visualization than a novel research tool; what I want to know next is where the simulation inputs come from (real telemetry, vulnerability feeds, or canned scenarios).

SlickNiche Gem

Doc_Dre

214mo ago

Security●●Solid

Only firewall for AI prompts with a security grade on every PR

One-line SDK swap + PR red-teaming with A-F grades—but 'detect-and-block' gateway category is well-funded.

SlickBig Brain

acebot712

213mo ago