We achieved 72.2% issue resolution on SWE-bench Verified using AI teams

by NBenkovich·Feb 12, 2026·2 points·0 comments

AI Analysis

●● Solid·Wizardry·Big Brain

The Take

They split responsibilities across isolated agents (engineer, reviewer, manager), each with real shell access and its own filesystem, which makes failures traceable to a specific role and lets you tune model capacity per role. Hitting 72.2% on SWE-bench Verified with no benchmark-specific tuning is an impressive empirical result and strong evidence for the architecture, though the security and long-term reliability of autonomous shell-executing agents remain the big open questions.
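To make the role split concrete, here is a minimal sketch of the idea, not the project's actual implementation. It assumes each role gets its own working directory and its own model setting; the `Agent` class, the `make_team` helper, and the model names are all hypothetical. Separate directories only approximate the isolation described above, which would need containers or VMs in practice.

```python
import subprocess
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Agent:
    """One role in the team (hypothetical structure, for illustration only)."""
    role: str
    model: str          # placeholder model name, chosen per role
    workdir: Path       # this role's private working directory
    trace: list = field(default_factory=list)

    def run(self, command: str) -> str:
        # Run a shell command inside this agent's own directory and
        # record (role, command, exit code) so failures map back to a role.
        result = subprocess.run(
            command, shell=True, cwd=self.workdir,
            capture_output=True, text=True,
        )
        self.trace.append((self.role, command, result.returncode))
        return result.stdout.strip()


def make_team(root: Path) -> dict:
    # Assumed mapping of role to model capacity; the names are made up.
    models = {
        "engineer": "large-model",
        "reviewer": "medium-model",
        "manager": "small-model",
    }
    team = {}
    for role, model in models.items():
        workdir = root / role
        workdir.mkdir()
        team[role] = Agent(role, model, workdir)
    return team


if __name__ == "__main__":
    team = make_team(Path(tempfile.mkdtemp()))
    team["engineer"].run("echo fix > patch.txt")
    for agent in team.values():
        print(agent.role, agent.model, agent.trace)
```

Because every command is logged against the role that ran it, a failed step can be attributed to the engineer, reviewer, or manager rather than to the team as a whole.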

Category

Developer Tools

Target Audience

Backend/frontend engineers, engineering managers, developer-tool builders, AI researchers

Similar Projects

AI/ML·●●● Banger

97% on SWE-bench Verified with subscription-token agents

97% on SWE-bench Verified with full artifact transparency, not just a score claim.

Big Brain·Zero to One

kimjune01

2028d ago

AI/ML·●●● Banger

RewardHackBench: Using sandboxes to stop agents from cheating

LLM judge on outgoing requests achieves 0% cheat rate while preserving 58% fair-solve ceiling.

Big Brain·Dark Horse

rotemtam

934d ago

SaaS·● Mid

One AI agent for all your support issues

AI support agents syncing GitHub, email, and forums into one ticket workspace.

Slick

Daniel-Pan

3316d ago

Developer Tools·●● Solid

Codex context bloat? 87% avg reduction on SWE-bench Verified traces

Transparent proxy cuts Codex context tokens by 87% via working memory.

Big Brain·Niche Gem

george_ciobanu

1021mo ago

AI/ML·○ Pass

All the LM solutions on SWE-bench are bloated compared to humans

Twitter thread with a chart; not a product or tool.

lieret

103mo ago

SaaS·● Mid

SeaTicket – AI agent that resolve issues from GitHub, email and forums

Another AI issue tracker when Intercom and Zendesk already do this.

Slick

Daniel-Pan

2418d ago