Run an Agent Council of LLMs that debate and synthesize answers
Agent council debate architecture with GSM8K benchmarks showing accuracy gains.
Async multi-LLM synthesis with critique—cuts latency via parallel calls, not sequential voting.
ML engineers building eval pipelines, RAG systems, and dataset labeling workflows
LangChain · LlamaIndex · DSPy
We open-sourced LLM Council: https://github.com/abhishekgandhi-neo/llm_council
It’s a small framework we built internally with Neo to run multiple LLMs on the same task, have them critique each other’s answers, and produce a structured final answer.
Useful for tasks like:
• Comparing local vs API models on your own dataset
• Validating RAG outputs
• Prompt regression testing
• Dataset labeling with model-as-judge
• Catching hallucinations in code or research summaries
A few practical details:
• Async parallel calls so latency stays close to one model
• Structured outputs with each model’s answer and critiques
• Provider-agnostic configs for local + hosted models
• Built to plug into evaluation pipelines, not just demos
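To make the async and structured-output points concrete, here is a minimal sketch of the general pattern in Python. It is not the repo's actual API: `call_model`, `CouncilResult`, and `run_council` are hypothetical names, and the model call is simulated with a sleep instead of a real provider request.

```python
import asyncio
from dataclasses import dataclass


async def call_model(name: str, prompt: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a provider call (local or hosted).

    A real implementation would send `prompt` to the model called `name`;
    here we only sleep to simulate network latency.
    """
    await asyncio.sleep(delay)
    return f"[{name}] response to: {prompt[:40]}"


@dataclass
class CouncilResult:
    """Structured output: every model's answer plus its critique of the others."""
    question: str
    answers: dict[str, str]
    critiques: dict[str, str]


async def run_council(question: str, models: list[str]) -> CouncilResult:
    # Round 1: all models answer concurrently, so the round takes roughly
    # as long as the slowest single call rather than the sum of all calls.
    answer_list = await asyncio.gather(*(call_model(m, question) for m in models))
    answers = dict(zip(models, answer_list))

    # Round 2: each model critiques the other models' answers, again in parallel.
    async def critique(model: str) -> str:
        others = "\n".join(a for name, a in answers.items() if name != model)
        return await call_model(model, f"Critique these answers:\n{others}")

    critique_list = await asyncio.gather(*(critique(m) for m in models))
    return CouncilResult(question, answers, dict(zip(models, critique_list)))


if __name__ == "__main__":
    result = asyncio.run(run_council("What is 2 + 2?", ["model-a", "model-b", "model-c"]))
    for name, answer in result.answers.items():
        print(name, "->", answer)
```

With three models, the two rounds cost about two call latencies in total instead of six, which is where the "close to one model" latency comes from. A final synthesis step would take `answers` and `critiques` as input; it is left out here because the post does not describe how the repo does it.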
We’ve been experimenting with similar council setups to catch silent failures in ML workflows, and this repo is a cleaned-up version of that idea.
If you’ve built multi-LLM evaluation pipelines, would love to hear what aggregation or critique strategies worked well for you.