GitHub Repository

Reproducible recon/craft/audit agent pipeline for SWE-bench Verified. Official-graded, codex-attested, GPL-3.0. Run it yourself.

8 stars · Shell

97% on SWE-bench Verified with subscription-token agents

by kimjune01·May 24, 2026·2 points·0 comments


AI Analysis

●●● Banger · Big Brain · Zero to One

97% on SWE-bench Verified with full artifact transparency, not just a score claim.

Strengths

• Sankey diagram traces all 500 instances to visible outcomes, with no silent drops.
• External-fault re-runs are logged separately from reasoning losses for honest auditing.
• GPL-3.0 repo lets anyone clone and verify the exact procedure end-to-end.

Weaknesses

• 44 sphinx-doc instances are excluded as offline-infeasible, limiting full-set coverage.
• Relies on Claude Sonnet and GPT-5.5, both of which may be contaminated by benchmark data.

Similar Projects

Developer Tools · ● Mid

Salacia – The First Runtime OS for Agentic Coding

Fault-localization scaffolding for AI agents; claims 93% top-5 recall, but Cursor and Cline already integrate similar functionality.

Big Brain · Bold Bet

alfredhua

203mo ago

Developer Tools · ●● Solid

Codex context bloat? 87% avg reduction on SWE-bench Verified traces

Transparent proxy cuts Codex context tokens by 87% via working memory.

Big Brain · Niche Gem

george_ciobanu

1021mo ago

Security · ●● Solid

Agent Auditor: verify signed agent, API, and MCP records offline

Verifies AI agent receipts offline before the audit compliance headache actually starts.

Big Brain · Niche Gem

jithinraj

112mo ago

Developer Tools · ●● Solid

We achieved 72.2% issue resolution on SWE-bench Verified using AI teams

They split responsibilities across isolated agents (engineer, reviewer, manager) that get real shell access and independent filesystems, which makes failures traceable and lets you tune model capacity per role. Hitting 72.2% on SWE-bench Verified with no benchmark-specific tuning is an impressive empirical result for an interesting architecture, though the security and long-term reliability of autonomous shell-executing agents remain the big open questions.

Wizardry · Big Brain

NBenkovich

204mo ago

AI/ML · ●●●● Gem

New Benchmark from SWE-bench team is 0% solved

Agents fail completely at rebuilding binaries from scratch without source code.

Big Brain · Bold Bet · Zero to One

lieret

2431mo ago

Developer Tools · ●●● Banger

Aether – Background agents that fix bugs in isolated VMs, opens PRs

AI agent fixes bugs inside real VMs rather than only through prompting, using Firecracker isolation and verified PRs.

Wizardry · Big Brain · Solve My Problem

pranav9

873mo ago