I benchmarked how good LLMs are at proofreading English
Agent-loop proofreading evals, for when HELM and LMSys are too generic.

51 models, 1613 runs, $558 spent — finally proofreading benchmarks with real numbers.
AI researchers, developers selecting LLMs for text tasks
LMSys Chatbot Arena · HELM · LiveBench
Thinking time becomes game time lost — finally measures latency alongside accuracy.
First linter + benchmark for MCP servers; catches vague schemas before LLMs pick wrong tools.
Opposite-narrator test catches models agreeing with both sides of the same dispute.
Cuts token costs 70% with receipts proving no accuracy drop on hard evals.
One-click LLM benchmarking with real tok/s metrics when llama.cpp requires manual setup.