GitHub Repository

5 stars · Python

LLM Council – Run multiple LLMs with critique and consensus eval

by gauravvij137·Feb 25, 2026·4 points·0 comments

AI Analysis

●● Solid · Big Brain · Solve My Problem

Async multi-LLM synthesis with critique—cuts latency via parallel calls, not sequential voting.

Strengths

• ~200 LOC core built on async/await, so total latency tracks the slowest model rather than the sum of all calls
• Provider-agnostic: 200+ models via OpenRouter in one config
• Built for evals, not demos: structured output and retries for production pipelines
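
The latency and retry points above can be illustrated with a minimal sketch. It is not taken from the LLM Council repository: `fake_model`, `with_retries`, and `run_council` are hypothetical names, and the model call is simulated with `asyncio.sleep` so the example runs without any provider or API key.

```python
import asyncio
import time


async def fake_model(name: str, delay: float) -> str:
    """Hypothetical stand-in for a provider call; sleeps instead of hitting an API."""
    await asyncio.sleep(delay)
    return f"{name}: answer"


async def with_retries(make_call, attempts: int = 3, base_delay: float = 0.01):
    """Retry an async call with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)


async def run_council(delays: dict[str, float]) -> tuple[list[str], float]:
    """Query all (simulated) models concurrently and report wall-clock time."""
    start = time.perf_counter()
    answers = await asyncio.gather(
        *(with_retries(lambda n=n, d=d: fake_model(n, d)) for n, d in delays.items())
    )
    return list(answers), time.perf_counter() - start


delays = {"model_a": 0.1, "model_b": 0.2, "model_c": 0.3}
answers, elapsed = asyncio.run(run_council(delays))
print(answers)
print(f"elapsed {elapsed:.2f}s vs sequential {sum(delays.values()):.2f}s")
```

Because `asyncio.gather` awaits all calls at once, the elapsed time should land near the slowest simulated delay rather than the sum of the three.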

Weaknesses

• 'Critique and consensus' is an orchestration pattern, not a novel insight; similar to LangChain agents
• No benchmarks showing that consensus actually reduces hallucinations compared with a single GPT-4 call

Post Description

Building reliable LLM systems often means not trusting a single model.

We open-sourced LLM Council: https://github.com/abhishekgandhi-neo/llm_council

It’s a small framework we built internally with Neo to run multiple LLMs on the same task, let them critique each other, and produce a structured final answer.

Useful for tasks like:
• Comparing local vs API models on your own dataset
• Validating RAG outputs
• Prompt regression testing
• Dataset labeling with model-as-judge
• Catching hallucinations in code or research summaries

A few practical details:
• Async parallel calls so latency stays close to one model
• Structured outputs with each model’s answer and critiques
• Provider-agnostic configs for local + hosted models
• Built to plug into evaluation pipelines, not just demos
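
As a rough illustration of the answer-then-critique flow and the structured output described above, here is a minimal sketch. It does not reproduce the repository's API: `CouncilResult`, `ask`, `critique`, and `council` are hypothetical names, the model responses are canned, and majority voting is an assumed aggregation step (the post does not say how the final answer is produced).

```python
import asyncio
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CouncilResult:
    answers: dict[str, str]                                   # model -> its answer
    critiques: dict[str, dict[str, str]] = field(default_factory=dict)  # author -> critic -> note
    consensus: str = ""


async def ask(model: str, prompt: str) -> str:
    """Hypothetical model call: canned answers stand in for real provider responses."""
    await asyncio.sleep(0)
    canned = {"model_a": "Paris", "model_b": "Paris", "model_c": "Lyon"}
    return canned[model]


async def critique(critic: str, author: str, answer: str, own: str) -> str:
    """Hypothetical critique: the critic agrees only if it gave the same answer."""
    await asyncio.sleep(0)
    return "agree" if answer == own else f"disagree: expected {own}"


async def council(models: list[str], prompt: str) -> CouncilResult:
    # Round 1: every model answers the prompt in parallel.
    raw = await asyncio.gather(*(ask(m, prompt) for m in models))
    answers = dict(zip(models, raw))
    # Round 2: every model critiques every other model's answer in parallel.
    pairs = [(c, a) for c in models for a in models if c != a]
    notes = await asyncio.gather(
        *(critique(c, a, answers[a], answers[c]) for c, a in pairs)
    )
    critiques: dict[str, dict[str, str]] = {m: {} for m in models}
    for (critic, author), note in zip(pairs, notes):
        critiques[author][critic] = note
    # Aggregation (assumed): simple majority vote over the raw answers.
    consensus = Counter(answers.values()).most_common(1)[0][0]
    return CouncilResult(answers, critiques, consensus)


result = asyncio.run(council(["model_a", "model_b", "model_c"], "Capital of France?"))
print(result.consensus)
print(result.critiques["model_c"])
```

Keeping every answer and critique in the result, rather than only the consensus, is what makes this shape usable in an evaluation pipeline: disagreements can be logged and inspected per model.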

We built this using Neo. We’ve been experimenting with similar council setups to catch silent failures in ML workflows, and this repo is a cleaned-up version of that idea.

If you’ve built multi-LLM evaluation pipelines, would love to hear what aggregation or critique strategies worked well for you.

Similar Projects

AI/ML · ●● Solid

Run an Agent Council of LLMs that debate and synthesize answers

Agent council debate architecture with GSM8K benchmarks showing accuracy gains.

Big Brain · Niche Gem

JitseLambrichts

823mo ago

AI/ML · ● Mid

Claude Code skills for building LLM evals

Structured eval workflow for Claude Code when LangSmith and Braintrust already exist.

Niche Gem · Ship It

paulaq

20 · 1mo ago

Security · ●● Solid

MCPSafe – Free security scanner for MCP servers using 5-LLM consensus

Five-LLM consensus catches prompt injection patterns static analysis misses.

Solve My Problem · Ship It

nhattruongadm

30 · 1mo ago

Design · ● Mid

I built an AI that roasts your Figma UI

LLM design roasting in Figma, but remove.bg/Figma AI already own this.

Eye Candy · Crowd Pleaser

junetic

10 · 3mo ago

AI/ML · ●● Solid

Quoracle: Self-replicating multi-LLM-consensus agents (Elixir)

Quoracle actually does something interesting: it queries a pool of models and only executes actions they agree on, while letting agents spawn children and persist full state to Postgres — all visible in a LiveView dashboard. The per-model conversation history, recursive hierarchy, and explicit consensus pipeline are clever touches; it’s clearly aimed at experimentation rather than drop-in production use (the README even flags security and deployment caveats).

Wizardry · Niche Gem

shelvick

214mo ago

Developer Tools · ●● Solid

An agent skill for eval-driven development of LLM-powered app

Agent-native eval workflow beats LangSmith's manual dashboard setup.

Big Brain · Ship It

yol

10 · 3mo ago