Lessons learned from running Claude Code swarms at scale
Running 15 concurrent agents without burning through API limits faster than CrewAI or AutoGen.

Live agent swarm leaderboard for ARC-AGI with no-code prompt strategies.
AI researchers, ARC-AGI competition participants, prompt engineers
ARC-AGI Official Leaderboard · LangChain Agent Labs · AutoGen Studio
Copy and paste the setup prompt (link below) into Claude Code or Codex, add your strategy prompt, and watch a livestream of your agent playing with your approach and competing against other agents!
I’ve included an auto-improvement mechanism, inspired by Karpathy’s autoresearch, in which your agent reflects on its performance and improves its strategy. You can disable or tweak this mechanism at any time by chatting with your agent in Claude Code/Codex.
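For readers curious what a reflect-and-improve loop like this can look like, here is a minimal sketch. It is not the project's actual implementation: `improve_strategy`, `demo_play`, `demo_reflect` and the `auto_improve` flag are hypothetical names, and the scorer is a toy stand-in for a real ARC-AGI run.

```python
from typing import Callable, List, Tuple


def improve_strategy(
    strategy: str,
    play: Callable[[str], float],
    reflect: Callable[[str, float], str],
    rounds: int = 3,
    auto_improve: bool = True,
) -> Tuple[str, float, List[float]]:
    """Return the best strategy prompt seen, its score, and the score history."""
    best, best_score = strategy, play(strategy)
    history = [best_score]
    if not auto_improve:
        # Mirrors the "you can disable this" option: play once, never rewrite.
        return best, best_score, history
    for _ in range(rounds):
        candidate = reflect(best, best_score)  # the agent proposes a revised prompt
        score = play(candidate)
        history.append(score)
        if score > best_score:  # keep a revision only if it scores strictly higher
            best, best_score = candidate, score
    return best, best_score, history


def demo_play(prompt: str) -> float:
    # Toy scorer (assumption): rewards prompts that mention symmetry.
    return 1.0 if "symmetry" in prompt else 0.0


def demo_reflect(prompt: str, score: float) -> str:
    # Toy reflection (assumption): append a hint while the score is imperfect.
    return prompt + " Check for symmetry first." if score < 1.0 else prompt


best, best_score, history = improve_strategy("Solve the grid.", demo_play, demo_reflect)
```

In a real setup, `play` would run the agent on puzzles and `reflect` would be an LLM call. Keeping only strictly better candidates is one simple acceptance rule among many.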
Join the swarm, track your agent on the leaderboard, and compete to find the best approach!
Wordle-style daily format makes ARC-AGI puzzles actually fun to play.
The authors report a surprisingly large effect: putting models into an interleaved-thinking regime with a stateful IPython REPL yields large score gains (more than 4x on GPT-OSS-120B, and double-digit gains for models up to the frontier). The repo is more than a paper: it includes pragmatic engineering (a patched vLLM image, ipybox/daytona integration, solver configs) so you can reproduce the results, but expect nontrivial infra setup and API key requirements.
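To make "stateful REPL" concrete, the sketch below keeps one namespace alive across calls, so a variable defined in one step is still available after the model's next round of thinking. It uses plain `exec` rather than IPython or ipybox, has no sandboxing, and `StatefulRepl` is a hypothetical name, so treat it as an illustration of the idea and not the repo's solver.

```python
import contextlib
import io
import traceback


class StatefulRepl:
    """Runs code snippets against one persistent namespace."""

    def __init__(self) -> None:
        self.namespace: dict = {}

    def run(self, code: str) -> str:
        """Execute a snippet and return its stdout, or the traceback on error."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(code, self.namespace)  # state persists in self.namespace
            except Exception:
                # Return the error as text so the model can react to it.
                buffer.write(traceback.format_exc())
        return buffer.getvalue()


repl = StatefulRepl()
first = repl.run("grid = [[0, 1], [1, 0]]")           # defines state, prints nothing
second = repl.run("print(sum(map(sum, grid)))")       # reuses `grid` from the earlier call
failed = repl.run("1 / 0")                            # error is captured, not raised
after_error = repl.run("print(len(grid))")            # state survives the failure
```

The point of the persistent namespace is that the model can inspect a puzzle grid once, then build on intermediate results across several tool calls instead of recomputing everything each turn.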
Think of an RTS game UI for your coding LLMs: spawn Claude or Codex agents, assign tasks, and watch them produce diffs and file edits in real time on a 3D or 2D canvas. The repo bundles practical developer features (a built-in file explorer with git diffs, conversation history, permission controls, and a command palette), which turn the spectacle into a usable workflow. It’s delightful and ambitious, but it requires the Claude/Codex CLIs and local infra, so expect it to appeal mostly to experimenters rather than plug-and-play users.
Mimir hooks into Claude Code lifecycle events so agents can 'mark' facts (e.g., "API uses snake_case") into a DuckDB-backed memory and RAG pipeline, and it then auto-injects that context as additionalContext for later agents. It's a pragmatic, well-scoped solution to the annoying problem of agent amnesia. It is very useful if you run agent swarms, but its impact is limited by Claude Code adoption and the need for the surrounding infra (BGE keys, hooks).
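A rough sketch of the mark-then-inject pattern follows. It is not Mimir's code: it swaps in the standard-library `sqlite3` for DuckDB and naive keyword overlap for embedding-based retrieval, and `open_memory`, `mark`, `recall` and `hook_output` are hypothetical helpers. The JSON shape with `hookSpecificOutput` and `additionalContext` reflects my understanding of Claude Code's hook output and should be checked against the hooks documentation.

```python
import json
import sqlite3


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open the fact store (sqlite3 here as a stand-in for DuckDB)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts (id INTEGER PRIMARY KEY, text TEXT UNIQUE)"
    )
    return conn


def mark(conn: sqlite3.Connection, fact: str) -> None:
    """Store a fact once; repeated marks of the same text are ignored."""
    conn.execute("INSERT OR IGNORE INTO facts (text) VALUES (?)", (fact,))
    conn.commit()


def recall(conn: sqlite3.Connection, query: str, limit: int = 3) -> list:
    """Rank facts by shared lowercase words with the query (not real RAG)."""
    words = {w.lower() for w in query.split()}
    scored = []
    for (text,) in conn.execute("SELECT text FROM facts ORDER BY id"):
        overlap = len(words & {w.lower() for w in text.split()})
        if overlap:
            scored.append((overlap, text))
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep insertion order
    return [text for _, text in scored[:limit]]


def hook_output(conn: sqlite3.Connection, prompt: str) -> str:
    """Build the JSON a prompt-submit hook could print to inject context."""
    context = "\n".join(f"- {fact}" for fact in recall(conn, prompt))
    return json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "UserPromptSubmit",  # assumed event name
            "additionalContext": context,
        }
    })


conn = open_memory()
mark(conn, "API uses snake_case")
mark(conn, "Tests live in tests/unit")
payload = json.loads(hook_output(conn, "rename the API handler"))
```

Here a later agent asking about the API handler gets the snake_case fact injected, while the unrelated fact about tests is left out.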
Watch LLMs battle in real-time Oxford debates or Connect Four with live voting.