LLM Colosseum – A daily battle royale between frontier LLMs

by sanifhimani · Feb 25, 2026 · 2 points · 0 comments

AI Analysis

●●● Banger · Rabbit Hole · Wizardry · Crowd Pleaser

Live LLM showdown with emergent strategies; more revealing than a static leaderboard.

Strengths
  • Genuinely novel format for comparing LLM behavior—battle dynamics reveal strategic differences between models that benchmarks miss.
  • Full API integration (Anthropic, OpenAI, Google, xAI) with zero scripted outcomes—emergent gameplay is real, not choreographed.
  • Daily automated battles with git-backed JSON logs create a living artifact of model personalities over time.
Weaknesses
  • Niche entertainment value—cool to watch, but limited practical insight into which model is 'better' for real work.
  • No statistical rigor on sample size or decision-making depth; single battle outcomes are anecdotal.
Category
Target Audience

AI researchers, LLM enthusiasts, curious developers

Similar To

LMSYS Chatbot Arena · Hugging Face Open LLM Leaderboard

Post Description

I put Claude, GPT, Gemini, and Grok in an arena and let them fight it out. Each model gets the full game state and decides how to survive: move, attack, form alliances, betray. Every decision comes from the model's API, nothing is scripted.
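To make the "full game state in, one decision out" loop concrete, here is a minimal sketch of how a model's reply might be validated before it is applied. The post does not publish its schema, so every name here (`GameState`, `Fighter`, `Action`, `parseAction`), the grid-based movement, and the fall-back-to-`wait` rule are assumptions for illustration, not the project's actual code.

```typescript
// Hypothetical types: the real project may represent state very differently.
type ModelId = "claude" | "gpt" | "gemini" | "grok";

interface Fighter {
  id: ModelId;
  x: number;
  y: number;
  hp: number;
  alive: boolean;
  allies: ModelId[];
}

interface GameState {
  turn: number;
  gridSize: number; // assumed square grid, coordinates 0..gridSize-1
  fighters: Fighter[];
}

type Action =
  | { kind: "move"; dx: number; dy: number }
  | { kind: "attack"; target: ModelId }
  | { kind: "ally"; target: ModelId }
  | { kind: "betray"; target: ModelId }
  | { kind: "wait" };

const WAIT: Action = { kind: "wait" };

// Turn a raw model reply (expected to be JSON) into a legal action.
// Anything malformed or illegal degrades to "wait" (an assumed policy).
function parseAction(raw: string, state: GameState, self: ModelId): Action {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return WAIT;
  }
  if (typeof data !== "object" || data === null) return WAIT;
  const { kind, dx, dy, target } = data as {
    kind?: unknown;
    dx?: unknown;
    dy?: unknown;
    target?: unknown;
  };

  const me = state.fighters.find((f) => f.id === self && f.alive);
  if (!me) return WAIT;

  if (kind === "move") {
    if (typeof dx !== "number" || typeof dy !== "number") return WAIT;
    if (!Number.isInteger(dx) || !Number.isInteger(dy)) return WAIT;
    if (Math.abs(dx) > 1 || Math.abs(dy) > 1) return WAIT; // one step per turn
    const nx = me.x + dx;
    const ny = me.y + dy;
    if (nx < 0 || ny < 0 || nx >= state.gridSize || ny >= state.gridSize) return WAIT;
    return { kind: "move", dx, dy };
  }

  if (kind === "attack" || kind === "ally" || kind === "betray") {
    const other = state.fighters.find((f) => f.id === target && f.alive);
    if (!other || other.id === self) return WAIT;
    const allied = me.allies.includes(other.id);
    if (kind === "betray" && !allied) return WAIT; // can only betray an ally
    if (kind === "ally" && allied) return WAIT; // already allied
    return { kind, target: other.id };
  }

  return WAIT;
}
```

Keeping validation in one pure function means the same check can run for every provider's reply, whichever SDK produced the text.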

First battle ran today. Gemini won by allying with GPT early, then backstabbing at the perfect moment. Claude tried to play it safe and got eliminated. They play very differently and it's fun to watch.

Stack is React + Canvas, Bun + Hono on the backend. No database — battle data is JSON committed to git. Each model talks through its native SDK (Anthropic, OpenAI, Google, xAI). A new battle runs automatically every day.
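As a rough illustration of the "no database, JSON in git" approach, the sketch below builds a per-day file path and serializes a battle log. The `BattleLog` and `BattleEvent` shapes, the `battles/` directory, and the one-file-per-UTC-day naming are all assumptions; the repository's real layout may differ. Writing the file and committing it are left out on purpose.

```typescript
// Hypothetical log shape: field names are illustrative, not taken from the repo.
interface BattleEvent {
  turn: number;
  actor: string;
  action: string;
  target?: string;
}

interface BattleLog {
  date: string; // YYYY-MM-DD (UTC)
  winner: string;
  events: BattleEvent[];
}

// One file per UTC day, e.g. battles/2026-02-25.json (assumed naming scheme).
function battleLogPath(date: Date, dir: string = "battles"): string {
  const day = date.toISOString().slice(0, 10);
  return `${dir}/${day}.json`;
}

// Pretty-print with a trailing newline so each day's commit diffs cleanly.
function serializeBattle(log: BattleLog): string {
  return JSON.stringify(log, null, 2) + "\n";
}

// Read a log back, rejecting files that lack the expected top-level fields.
function parseBattle(text: string): BattleLog {
  const data = JSON.parse(text) as Partial<BattleLog>;
  if (
    typeof data.date !== "string" ||
    typeof data.winner !== "string" ||
    !Array.isArray(data.events)
  ) {
    throw new Error("not a battle log");
  }
  return { date: data.date, winner: data.winner, events: data.events };
}
```

A daily job could then write `serializeBattle(log)` to `battleLogPath(new Date())` and commit it, and the frontend could replay any day by fetching that file. Plain files in git give history and diffs for free, at the cost of having no query layer.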

Source: https://github.com/sanifhimani/llm-colosseum

Similar Projects

Gaming · ●●● Banger

ASCII Royale

Battle royale over P2P with no server — pure ASCII, playable over SSH.

Wizardry · Zero to One · Cozy
chadfowler
709d ago
AI/ML · Mid

Claude Code skills for building LLM evals

Structured eval workflow for Claude Code when LangSmith and Braintrust already exist.

Niche Gem · Ship It
paulaq
201mo ago