Cobalt – Unit tests for AI agents, like Jest but for LLMs
Jest for LLMs: CI-native evals that fail the build when quality drops, instead of reporting to a dashboard.
Testing framework for AI agents with LLM judges and SQLite result tracking.
AI/LLM engineers, agent developers, teams building AI-powered applications
Braintrust · LangSmith · Humanloop
pytest-native testing for AI agents with 101 built-in safety attack probes.
Qualitative eval workflow for PMs, whereas LangSmith and Arize target ML engineers.
Unsupervised bug benchmark that uses agents as both attackers and defenders, with a novel scoring methodology.
Replays agent traces step-by-step to pinpoint exact failure turns automatically.
Iteratively improves agent harnesses from 67% to 87% on tau-bench using production traces.