New Benchmark from SWE-bench team is 0% solved

by lieret·May 5, 2026·24 points·3 comments

AI Analysis

●●●● Gem · Big Brain · Bold Bet · Zero to One

Agents fail completely at rebuilding binaries from scratch without source code.

Strengths
  • 248,000 behavioral tests generated via fuzzing ensure rigorous evaluation.
  • Tasks range from simple utilities like jq to massive projects like FFmpeg.
  • Zero percent solve rate exposes the gap between coding and engineering.
Weaknesses
  • Extremely niche audience limited to agent infrastructure researchers.
  • Binary-only constraint may not reflect real-world engineering scenarios.
Category
Target Audience

AI researchers, LLM developers, Benchmark enthusiasts

Similar To

SWE-bench · HumanEval · LiveCodeBench

Similar Projects

AI/ML · ●●● Banger

97% on SWE-bench Verified with subscription-token agents

97% on SWE-bench Verified with full artifact transparency, not just a score claim.

Big Brain · Zero to One
kimjune01
2028d ago