Meta-agent: self-improving agent harnesses from live traces
Iteratively improves agent harnesses from 67% to 87% on tau-bench using production traces.
Diagnose your AI agents in production. Extract policies from prompts, evaluate traces, generate diagnostic reports.
Replays agent traces step-by-step to pinpoint exact failure turns automatically.
AI engineers, backend developers building agent systems
LangSmith · Langfuse · Arize
I was spending way too much time staring at traces, logs and dashboards trying to figure out why my multi-agent setups kept failing.
You just point it at your traces (LangSmith, Langfuse, OpenTelemetry, or a JSON file). It pulls the system prompts directly from the logs, extracts the behavioral rules, and uses an LLM-as-a-judge to replay each conversation step-by-step.
It flags exactly which turn broke things, which agent caused it, and traces cascading failures across routing, handoffs, and retrieval.
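As a rough illustration of the step-by-step replay idea (not agent-triage's actual code), the sketch below walks a conversation turn by turn and stops at the first turn that violates a rule. Every name here (`Turn`, `Verdict`, `first_failure`, `keyword_judge`) is hypothetical, and the judge is a simple keyword check standing in for a real LLM-as-a-judge call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Turn:
    index: int  # position in the conversation
    agent: str  # which agent produced this turn
    text: str   # what the agent said


@dataclass
class Verdict:
    turn_index: int
    agent: str
    rule: str   # the rule that was violated


# A judge receives (rule, history so far, current turn) and returns True
# when the turn complies with the rule. Hypothetical interface.
Judge = Callable[[str, list[Turn], Turn], bool]


def first_failure(turns: list[Turn], rules: list[str], judge: Judge) -> Optional[Verdict]:
    """Replay turns in order and return the first rule violation, if any."""
    history: list[Turn] = []
    for turn in turns:
        for rule in rules:
            if not judge(rule, history, turn):
                return Verdict(turn.index, turn.agent, rule)
        history.append(turn)
    return None


def keyword_judge(rule: str, history: list[Turn], turn: Turn) -> bool:
    """Stand-in judge for illustration only; a real one would call an LLM."""
    if rule == "never promise refunds":
        return "refund" not in turn.text.lower()
    return True


conversation = [
    Turn(0, "router", "Routing you to billing."),
    Turn(1, "billing", "I can guarantee a refund today."),
    Turn(2, "billing", "Anything else?"),
]
verdict = first_failure(conversation, ["never promise refunds"], keyword_judge)
print(verdict)
```

Passing the judge in as a function keeps the replay loop independent of any particular model or provider.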
It then aggregates root causes across all the failed conversations: "24 out of 51 failures are missing escalations." You know exactly what to fix first.
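The aggregation step can be pictured as a frequency count over per-conversation root-cause labels. This is a minimal sketch under that assumption; the function name `summarize_root_causes` and the sample labels are made up for illustration and are not taken from the tool.

```python
from collections import Counter


def summarize_root_causes(root_causes: list[str]) -> list[str]:
    """Rank root causes by frequency, most common first."""
    total = len(root_causes)
    counts = Counter(root_causes)
    return [
        f"{n} out of {total} failures: {cause}"
        for cause, n in counts.most_common()
    ]


# Made-up labels, one per failed conversation.
failures = [
    "missing escalation",
    "wrong routing",
    "missing escalation",
    "stale retrieval",
    "missing escalation",
]
for line in summarize_root_causes(failures):
    print(line)
```

Sorting by count puts the most common root cause at the top of the list, which is the "what to fix first" signal.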
Runs locally. Only LLM API calls leave your machine.
npx agent-triage demo — runs on sample data, uses your own API key (~$0.002/conversation with gpt-4o-mini).
Repo: https://github.com/converra/agent-triage
Demo report: https://demo-report-sigma.vercel.app/
One binary with embedded dashboard beats LangSmith's closed-source SaaS for Go teams.
Multi-cloud diagnosis in <30s, but infra observability (Datadog, New Relic) already solves this better.
Heuristic signals triage agent traces 1.52x more efficiently than random sampling.
Git for agent cognition—clever framework, but no working implementation yet.
Free audit funnel for AI observability when LangSmith and Helicone already do this.