97% on SWE-bench Verified with subscription-token agents
97% on SWE-bench Verified with full artifact transparency, not just a score claim.

They split responsibilities across isolated agents (engineer, reviewer, manager) that get real shell access and independent filesystems, which makes failures traceable and lets you tune model capacity per role. Hitting 72.2% on SWE-bench Verified with no benchmark-specific tuning is an impressive empirical result — interesting architecture and strong evidence — though the security and long-term reliability of autonomous shell-executing agents remain the big open questions.
Backend/frontend engineers, engineering managers, developer-tool builders, AI researchers
97% on SWE-bench Verified with full artifact transparency, not just a score claim.
LLM judge on outgoing requests achieves 0% cheat rate while preserving 58% fair-solve ceiling.
AI support agents syncing GitHub, email, and forums into one ticket workspace.
Transparent proxy cuts Codex context tokens by 87% via working memory.
Twitter thread with a chart; not a product or tool.
Another AI issue tracker when Intercom and Zendesk already do this.