Benchmarking LLMs through autonomous games of Blood on the Clocktower

by cjami·Mar 30, 2026·1 point·0 comments

AI Analysis

●●● Banger · Rabbit Hole · Crowd Pleaser · Zero to One

Social deduction games test deception and theory of mind better than standard benchmarks.

Strengths

• 797 real matches with mirrored games eliminate luck bias in model-vs-model comparisons
• Tests nuanced capabilities like coordination and deception that MMLU completely misses
• Live leaderboard with win rates, role performance, and featured game replays

Weaknesses

• Small sample sizes per model (20-74 matches) limit statistical significance
• No API or export option for researchers to analyze raw game data
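
To make the sample-size concern concrete, here is a minimal sketch of how wide a 95% confidence interval on a win rate is at the low and high ends of the 20-74 match range. It uses the Wilson score interval, a standard choice for binomial proportions. The 50% win rate is a hypothetical illustration, not a figure from the leaderboard, and the function name `wilson_interval` is our own.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical model that wins exactly half of its matches,
# evaluated at the smallest and largest per-model sample sizes cited above.
for n in (20, 74):
    low, high = wilson_interval(n // 2, n)
    print(f"n={n}: 95% CI for win rate = [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval spans roughly 20 percentage points either side of 50% at 20 matches and roughly 11 points either side at 74, so modest gaps between models are hard to separate from noise.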

Similar Projects

AI/ML · ●● Solid

Buyout Game Benchmark: Multi-Agent Bargaining, Transfers, and Takeovers

Wealth-based scoring reveals strategic failures that survival-only benchmarks miss.

Big Brain · Niche Gem

zone411

602mo ago

AI/ML · ● Mid

CivBench a long-horizon AI benchmark for multi-agent games

Civilization matches expose model divergence that static benchmarks miss, but it's a spectacle, not a measurement.

Rabbit Hole · Big Brain

mbh159

12243mo ago

AI/ML · ● Mid

Ebbforge - 10M agent Rust swarm engine, 8 fundamental benchmarks

Rust swarm vs LLM agents is clever positioning, but benchmarks are self-designed and lack third-party validation.

Big Brain · Wizardry

agent-world

213mo ago

AI/ML · ●● Solid

NetHack agent harness with benchmarks and livestream

You can watch an LLM play NetHack step-by-step with the model's reasoning, the exact action code, and a live game canvas. That instrumentation is the product's real selling point. The leaderboard and run/benchmark framing makes it useful for comparing agents, not just a flashy demo, but it's still squarely for people who care about NetHack or agent evaluation; more detail on reproducible metrics and integrations would push it further.

Niche Gem · Wizardry

kenforthewin

114mo ago

AI/ML · ●●● Banger

Republic of Agents: Benchmark for Social Reasoning in LLMs

Mafia-as-benchmark with learning-between-batches mechanism; public, inspectable sessions.

Zero to One · Big Brain · Wizardry

kkonstantin

103mo ago

AI/ML · ●●● Banger

LLM Sycophancy Benchmark: Opposite-Narrator Contradictions

Opposite-narrator test catches models agreeing with both sides of same dispute.

Big Brain · Dark Horse

zone411

303mo ago