GitHub Repository

Reproducible recon/craft/audit agent pipeline for SWE-bench Verified. Official-graded, codex-attested, GPL-3.0. Run it yourself.

8 stars · Shell

97% on SWE-bench Verified with subscription-token agents

by kimjune01·May 24, 2026·2 points·0 comments

AI Analysis

●●● Banger · Big Brain · Zero to One

97% on SWE-bench Verified with full artifact transparency, not just a score claim.

Strengths
  • Sankey diagram traces all 500 instances to visible outcomes, no silent drops.
  • External-fault re-runs logged separately from reasoning losses for honest auditing.
  • GPL-3.0 repo lets anyone clone and verify the exact procedure end-to-end.
Weaknesses
  • 44 sphinx-doc instances excluded as offline-infeasible, limiting full-set coverage.
  • Relies on Claude Sonnet and GPT-5.5, both potentially contaminated by benchmark data.
Category
Target Audience

AI researchers and SWE-bench skeptics

Similar To

SWE-agent · OpenDevin · Aider

Similar Projects

Developer Tools · ●● Solid

We achieved 72.2% issue resolution on SWE-bench Verified using AI teams

They split responsibilities across isolated agents (engineer, reviewer, manager), each with real shell access and an independent filesystem, which makes failures traceable and lets you tune model capacity per role. Reaching 72.2% on SWE-bench Verified with no benchmark-specific tuning is a strong empirical result for this architecture, though the security and long-term reliability of autonomous shell-executing agents remain open questions.

Wizardry · Big Brain
NBenkovich
204mo ago