OpenCastor Agent Harness Evaluator Leaderboard

Author: craigm26

by craigm26 · Mar 23, 2026 · 3 points · 1 comment

AI Analysis

●●● Banger · Zero to One · Big Brain · Niche Gem

263k config search space benchmarked across robot fleets—nothing like this exists for robotics AI.

Strengths

• Distributed evaluation model lets robots contribute idle compute instead of relying on centralized benchmarking infrastructure.
• Eight-dimensional config search (routing, context budget, retry logic) proves pipeline matters as much as model choice.
• Safety-first design strips P66 parameters on apply, so no harness config can override motor limits or ESTOP.

Weaknesses

• Currently demo data only; real fleet runs are needed to validate the benchmark methodology.
• Robotics audience is inherently narrow compared to general AI tooling.

Post Description

I've been building OpenCastor, a runtime layer that sits between a robot's hardware and its AI agent. One thing that surprised me: the order you arrange the skill pipeline (context builder → model router → error handler, etc.) and parameters like thinking_budget and context_budget affect task success rates as much as model choice does.
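To make the idea concrete, here is a minimal Python sketch of a harness config in which the pipeline order is itself a parameter. It is not OpenCastor's code: the stage names and the thinking_budget and context_budget parameters come from the post, but HarnessConfig, the stage functions, their toy logic, and the default values are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class HarnessConfig:
    # Hypothetical structure; only the parameter names come from the post.
    pipeline: List[str]
    context_budget: int = 2048   # assumed default
    thinking_budget: int = 512   # assumed default


def context_builder(state: State, cfg: HarnessConfig) -> State:
    # Keep only the most recent context_budget characters (stand-in for tokens).
    history = str(state.get("history", ""))
    start = max(0, len(history) - cfg.context_budget)
    return {**state, "context": history[start:]}


def model_router(state: State, cfg: HarnessConfig) -> State:
    # Toy routing rule: a larger thinking budget selects a larger model.
    model = "large" if cfg.thinking_budget >= 1024 else "small"
    return {**state, "model": model}


def error_handler(state: State, cfg: HarnessConfig) -> State:
    # Toy check: the run counts as ok when no earlier stage recorded an error.
    return {**state, "ok": "error" not in state}


STAGES: Dict[str, Callable[[State, HarnessConfig], State]] = {
    "context_builder": context_builder,
    "model_router": model_router,
    "error_handler": error_handler,
}


def run_pipeline(task: str, cfg: HarnessConfig) -> State:
    # Run the stages in exactly the order the config lists them.
    state: State = {"task": task, "trace": []}
    for name in cfg.pipeline:
        state = STAGES[name](state, cfg)
        state["trace"] = list(state["trace"]) + [name]
    return state
```

Because the order lives in the config rather than in code, two configs that differ only in their pipeline list can be benchmarked against the same tasks.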

So I built a distributed evaluator. Robots contribute idle compute to benchmark harness configurations against OHB-1, a small benchmark of 30 real-world robot tasks (grip, navigate, respond, etc.) using local LLM calls via Ollama. The search space is 263,424 configs (8 dimensions: model routing, context budget, retry logic, drift detection, etc.). The demo leaderboard shows results so far, broken down by hardware tier (Pi5+Hailo, Jetson, server, budget boards).
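The 263,424 figure is consistent with a Cartesian product over eight discrete dimensions. The sketch below uses hypothetical dimension names and value counts (the post gives only the total and the number of dimensions), chosen so that the product matches: 7 × 7 × 7 × 4 × 4 × 4 × 4 × 3 = 263,424. It also shows one possible way to map a config to a single integer and back, which is an assumption about how work could be split, not a description of how OpenCastor does it.

```python
import math
import random

# Hypothetical dimensions and value counts; only the total (263,424) and
# the dimension count (8) come from the post.
DIMENSIONS = {
    "model_routing": 7,
    "context_budget": 7,
    "thinking_budget": 7,
    "retry_logic": 4,
    "drift_detection": 4,
    "pipeline_order": 4,
    "error_handling": 4,
    "tool_timeout": 3,
}


def search_space_size(dims):
    # The number of configs is the product of the per-dimension value counts.
    return math.prod(dims.values())


def sample_config(dims, rng):
    # Pick one value index per dimension, uniformly at random.
    return {name: rng.randrange(n) for name, n in dims.items()}


def config_index(config, dims):
    # Mixed-radix encoding: each config maps to a unique integer in [0, size).
    index = 0
    for name, n in dims.items():
        index = index * n + config[name]
    return index


def index_to_config(index, dims):
    # Inverse of config_index: peel off the last dimension first.
    config = {}
    for name, n in reversed(list(dims.items())):
        index, config[name] = divmod(index, n)
    return config
```

With an encoding like this, a coordinator could hand each robot a range of integers instead of a list of full configs.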

The current champion config is free to download as a YAML and apply to any robot. P66 safety parameters are stripped on apply — no harness config can touch motor limits or ESTOP logic.
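A strip-on-apply step of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not OpenCastor's implementation: the protected key names, the dict-based config shape, and all function names are hypothetical, and a real harness would load the YAML with a parser first.

```python
from copy import deepcopy

# Hypothetical key names; the post only says that P66 parameters cover
# motor limits and ESTOP logic.
PROTECTED_KEYS = frozenset({"motor_limits", "estop"})


def strip_protected(config, protected=PROTECTED_KEYS):
    """Return a copy of config with protected keys removed at every depth."""
    if isinstance(config, dict):
        return {
            key: strip_protected(value, protected)
            for key, value in config.items()
            if key not in protected
        }
    if isinstance(config, list):
        return [strip_protected(value, protected) for value in config]
    return config


def deep_merge(base, overlay):
    """Merge overlay into a copy of base without replacing a dict by a non-dict."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(merged.get(key), dict):
            if isinstance(value, dict):
                merged[key] = deep_merge(merged[key], value)
            # Otherwise keep the robot's nested section, which may hold
            # protected keys, instead of overwriting it wholesale.
        else:
            merged[key] = deepcopy(value)
    return merged


def apply_harness(robot_config, harness_config):
    """Apply a downloaded harness config after stripping protected keys."""
    return deep_merge(robot_config, strip_protected(harness_config))
```

Stripping before merging means the robot's own safety values survive even when the downloaded config tries to set them.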

Looking for feedback on: (1) whether the benchmark tasks are representative, (2) whether the hardware tier breakdown is useful, and (3) anyone who's run fleet-wide distributed evals of agent configs for robotics or otherwise.

Similar Projects

Developer Tools · ●●● Banger

Tracecore: Benchmark AI Agents on Deterministic Coding Tasks

Deterministic agent benchmarking with strict validation—unlike SWE-Bench, measures whether agents actually operate.

Solve My Problem · Wizardry · Niche Gem

extra_cookin

103mo ago

AI/ML · ●● Solid

Jeju – a local-first agent harness with inspectable runs

Manifest-driven agents with eval feedback loops when most harnesses are prompt-only.

Big Brain · Niche Gem

cosmtrek

1016d ago

Developer Tools · ●●● Banger

Cheddar-bench – unsupervised benchmark for coding agents

Unsupervised bug benchmark using agents as both attackers and defenders—novel scoring methodology.

Big Brain · Wizardry · Ship It

przadka

903mo ago

AI/ML · ●● Solid

An Interactive Text to SQL Agent Benchmark

Interactive DuckDB-WASM benchmark beats static leaderboards for agentic SQL eval.

Big Brain · Niche Gem

102mo ago

AI/ML · ●● Solid

Apodex-1.0 – Deep research with independent verifier (90.3 BrowseComp)

90.3 BrowseComp score with verification-centric model architecture.

Niche Gem

wuqiaocauc

1012d ago

AI/ML · ●● Solid

NetHack agent harness with benchmarks and livestream

You can watch an LLM play NetHack step-by-step with the model's reasoning, the exact action code, and a live game canvas — that instrumentation is the product's real selling point. The leaderboard + run/benchmark framing makes it useful for comparing agents rather than just a flashy demo, but it's still squarely for people who care about NetHack or agent evaluation; more detail on reproducible metrics and integrations would push it further.

Niche Gem · Wizardry

kenforthewin

114mo ago