GitHub Repository
5 stars · Python

LLM Council – Run multiple LLMs with critique and consensus eval

by gauravvij137·Feb 25, 2026·4 points·0 comments

AI Analysis

●● Solid · Big Brain · Solve My Problem

Async multi-LLM synthesis with critique—cuts latency via parallel calls, not sequential voting.

Strengths
  • ~200 LOC core built on async/await, so model calls run concurrently and total latency tracks the slowest model, not the sum of all calls
  • Provider-agnostic: 200+ models via OpenRouter in one config
  • Built for evals, not demos—structured output and retries for production pipelines
Weaknesses
  • 'Critique and consensus' is an orchestration pattern, not a novel insight; it resembles LangChain agents
  • No benchmarks showing that consensus actually reduces hallucinations compared with a single GPT-4
Target Audience

ML engineers building eval pipelines, RAG systems, and dataset labeling workflows

Similar To

LangChain · LlamaIndex · DSPy

Post Description

Building reliable LLM systems often means not trusting a single model.

We open-sourced LLM Council: https://github.com/abhishekgandhi-neo/llm_council

It’s a small framework we built internally with Neo to run multiple LLMs on the same task, let them critique each other, and produce a structured final answer.
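The answer, critique, and synthesis flow described above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual API: `Model`, `CouncilResult`, `run_council`, `make_stub`, and the prompt wording are all hypothetical names introduced here, and the "models" are stubs that return canned text instead of calling a provider.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict

# Hypothetical stand-in: a "model" is any async callable mapping a prompt to text.
Model = Callable[[str], Awaitable[str]]


@dataclass
class CouncilResult:
    """Structured output: every answer, every critique, and one final answer."""
    answers: Dict[str, str]
    critiques: Dict[str, Dict[str, str]]  # critic name -> {author name -> critique}
    final: str


async def run_council(task: str, members: Dict[str, Model], chair: Model) -> CouncilResult:
    names = list(members)

    # Round 1: every member answers the same task concurrently.
    raw_answers = await asyncio.gather(*(members[n](task) for n in names))
    answers = dict(zip(names, raw_answers))

    # Round 2: each member critiques every other member's answer.
    pairs = [(critic, author) for critic in names for author in names if critic != author]
    raw_critiques = await asyncio.gather(*(
        members[critic](f"Task: {task}\nAnswer by {author}: {answers[author]}\nCritique this answer.")
        for critic, author in pairs
    ))
    critiques: Dict[str, Dict[str, str]] = {n: {} for n in names}
    for (critic, author), text in zip(pairs, raw_critiques):
        critiques[critic][author] = text

    # Round 3: a chair model (assumed here) writes the final answer from all answers.
    summary = "\n".join(f"{n}: {answers[n]}" for n in names)
    final = await chair(f"Task: {task}\nAnswers:\n{summary}\nWrite a final answer.")
    return CouncilResult(answers=answers, critiques=critiques, final=final)


def make_stub(name: str, reply: str) -> Model:
    """Build a fake model that always returns the same tagged reply."""
    async def model(prompt: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"[{name}] {reply}"
    return model


async def demo() -> CouncilResult:
    members = {"model_a": make_stub("model_a", "4"), "model_b": make_stub("model_b", "four")}
    return await run_council("What is 2 + 2?", members, chair=make_stub("chair", "4"))


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Using a separate chair model for the final step is one design choice among several; a vote or a score-weighted merge over the critiques would fit the same structure.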

Useful for tasks like:
  • Comparing local vs API models on your own dataset
  • Validating RAG outputs
  • Prompt regression testing
  • Dataset labeling with model-as-judge
  • Catching hallucinations in code or research summaries

A few practical details:
  • Async parallel calls so latency stays close to one model
  • Structured outputs with each model’s answer and critiques
  • Provider-agnostic configs for local + hosted models
  • Built to plug into evaluation pipelines, not just demos
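To illustrate the latency point, here is a small sketch comparing sequential and concurrent calls with `asyncio.gather`. The model names and delays are made up, and `fake_model` simulates network time with `asyncio.sleep`, so this shows only the general technique, not how the repository implements it.

```python
import asyncio
import time

# Simulated per-model response times in seconds (illustrative values only).
DELAYS = {"fast": 0.05, "medium": 0.10, "slow": 0.20}


async def fake_model(name: str, delay: float) -> str:
    """Stand-in for a network call to a model provider."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def sequential(delays):
    # Each call waits for the previous one: total time is about the sum of delays.
    return [await fake_model(name, delay) for name, delay in delays.items()]


async def parallel(delays):
    # All calls are in flight at once: total time is about the slowest delay.
    return await asyncio.gather(*(fake_model(name, delay) for name, delay in delays.items()))


async def timed(coro):
    """Await a coroutine and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


async def main():
    seq_out, seq_t = await timed(sequential(DELAYS))
    par_out, par_t = await timed(parallel(DELAYS))
    return seq_out, seq_t, par_out, par_t


if __name__ == "__main__":
    seq_out, seq_t, par_out, par_t = asyncio.run(main())
    print(f"sequential: {seq_t:.2f}s, parallel: {par_t:.2f}s")
```

`asyncio.gather` returns results in the order the awaitables were passed in, so each output can still be matched to its model even though the calls finish at different times.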

We built this using Neo. We’ve been experimenting with similar council setups to catch silent failures in ML workflows, and this repo is a cleaned-up version of that idea.

If you’ve built multi-LLM evaluation pipelines, would love to hear what aggregation or critique strategies worked well for you.

Similar Projects

AI/ML · Mid

Claude Code skills for building LLM evals

Structured eval workflow for Claude Code when LangSmith and Braintrust already exist.

Niche Gem · Ship It
paulaq
20 · 1mo ago
AI/ML · ●● Solid

Quoracle: Self-replicating multi-LLM-consensus agents (Elixir)

Quoracle queries a pool of models and only executes actions they agree on, while letting agents spawn children and persist full state to Postgres, all visible in a LiveView dashboard. The per-model conversation history, recursive hierarchy, and explicit consensus pipeline are clever touches; it’s clearly aimed at experimentation rather than drop-in production use (the README even flags security and deployment caveats).

Wizardry · Niche Gem
shelvick
214mo ago