New Benchmark from SWE-bench team is 0% solved

Author: lieret

by lieret·May 5, 2026·24 points·3 comments

AI Analysis

●●●● Gem · Big Brain · Bold Bet · Zero to One

Agents fail completely at rebuilding binaries from scratch without source code.

Unsupervised bug benchmark using agents as both attackers and defenders, with a novel scoring methodology.

Big Brain · Wizardry · Ship It

przadka

904mo ago

AI/ML · ●●● Banger

LLM judge on outgoing requests achieves 0% cheat rate while preserving 58% fair-solve ceiling.

Big Brain · Dark Horse

rotemtam

935d ago

AI/ML · ●●● Banger

62k puzzle benchmark reveals reasoning depth, cost variance, and stark US vs China model gaps.

Big Brain · Crowd Pleaser · Solve My Problem

bluecoconut

503mo ago

AI/ML · ●●● Banger

97% on SWE-bench Verified with full artifact transparency, not just a score claim.

Big Brain · Zero to One

kimjune01

2028d ago

AI/ML · ●●● Banger

Using 1980s Rogue as an LLM benchmark is genuinely novel and technically clever.

Wizardry · Zero to One

iwhalen

1027d ago

AI/ML · ●●● Banger

Tests agents on 700 policy docs and noisy voice calls where AgentBench stops.

Big Brain · Niche Gem

victorbarres

1212mo ago