Benchmarking LLMs through autonomous games of Blood on the Clocktower

by cjami · Mar 30, 2026 · 1 point · 0 comments

AI Analysis

●●● Banger · Rabbit Hole · Crowd Pleaser · Zero to One

Social deduction games test deception and theory of mind better than standard benchmarks.

Strengths
  • 797 real matches with mirrored games eliminate luck bias in model-vs-model comparisons
  • Tests nuanced capabilities like coordination and deception that MMLU completely misses
  • Live leaderboard with win rates, role performance, and featured game replays
Weaknesses
  • Small sample sizes per model (20-74 matches) limit statistical significance
  • No API or export option for researchers to analyze raw game data
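To give a rough sense of the sample-size weakness, here is a minimal sketch that computes a Wilson score interval for a win rate at the smallest and largest per-model match counts quoted above (20 and 74). The 50% win rate is a hypothetical placeholder, not a figure from the leaderboard, and `wilson_interval` is an illustrative helper rather than anything the project provides.

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Match counts from the listing; the 50% win rate is an assumption for illustration.
for n in (20, 74):
    wins = n // 2
    lo, hi = wilson_interval(wins, n)
    print(f"n={n}: {wins}/{n} wins, 95% interval {lo:.2f} to {hi:.2f}")
```

Under that assumption the interval spans roughly 20 percentage points either side of the observed rate at 20 matches and roughly 11 at 74, so small gaps between models at these counts may not be meaningful.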
Category
Target Audience

AI researchers, LLM developers, benchmarking enthusiasts

Similar To

LMArena · HELM · BigBench

Similar Projects

AI/ML · Mid

Ebbforge - 10M agent Rust swarm engine, 8 fundamental benchmarks

Rust swarm vs LLM agents is clever positioning, but benchmarks are self-designed and lack third-party validation.

Big Brain · Wizardry
agent-world
213mo ago
AI/ML · ●● Solid

NetHack agent harness with benchmarks and livestream

You can watch an LLM play NetHack step-by-step with the model's reasoning, the exact action code, and a live game canvas — that instrumentation is the product's real selling point. The leaderboard + run/benchmark framing makes it useful for comparing agents rather than just a flashy demo, but it's still squarely for people who care about NetHack or agent evaluation; more detail on reproducible metrics and integrations would push it further.

Niche Gem · Wizardry
kenforthewin
114mo ago