ErrataBench - A Proofreading Benchmark for LLMs

Author: artursapek

by artursapek · Apr 7, 2026 · 3 points · 0 comments

AI Analysis

●● Solid · Niche Gem · Big Brain

51 models, 1,613 runs, $558 spent: finally, a proofreading benchmark with real numbers.

Strengths

• Distinguishes omissions from bad fixes, giving actionable failure-mode breakdowns.
• Five days of runtime with transparent methodology and cost tracking throughout.
• Rankings include efficiency metrics, not just raw accuracy percentages.

Weaknesses

AI/ML · ●● Solid

Agent-loop proofreading evals for cases where HELM and LMSys are too generic.

Solve My Problem · Ship It

artursapek

321mo ago

AI/ML · ●● Solid

Thinking time becomes game time lost: finally measures latency alongside accuracy.

Big Brain · Niche Gem

bombastic311

302d ago

First linter + benchmark for MCP servers; catches vague schemas before LLMs pick wrong tools.

Solve My Problem · Niche Gem · Big Brain

yamarldfst

104mo ago

AI/ML · ●●● Banger

Opposite-narrator test catches models agreeing with both sides of the same dispute.

Big Brain · Dark Horse

zone411

303mo ago

AI/ML · ●●● Banger

Cuts token costs 70% with receipts proving no accuracy drop on hard evals.

Zero to One · Solve My Problem

Jbunga

56331mo ago

AI/ML · ●● Solid

One-click LLM benchmarking with real tok/s metrics when llama.cpp requires manual setup.

Ship It · Solve My Problem

JoniMartin

2013d ago