OpenCastor Agent Harness Evaluator Leaderboard

by craigm26·Mar 23, 2026·3 points·1 comment

AI Analysis

●●● Banger · Zero to One · Big Brain · Niche Gem

A 263k-config search space set up for benchmarking across robot fleets, with no obvious equivalent in robotics AI.

Strengths
  • Distributed evaluation model lets robots contribute idle compute instead of centralized benchmarking infrastructure.
  • Eight-dimensional config search (routing, context budget, retry logic) makes the case that pipeline configuration matters as much as model choice.
  • Safety-first design strips P66 parameters on apply—no harness config can override motor limits or ESTOP.
Weaknesses
  • Currently demo data only—real fleet runs needed to validate the benchmark methodology.
  • Robotics audience is inherently narrow compared to general AI tooling.
Category
Target Audience

Robotics engineers, AI agent developers, edge computing teams

Similar To

MLPerf · Robotics Open Benchmarks · Hugging Face Open LLM Leaderboard

Post Description

I've been building OpenCastor, a runtime layer that sits between a robot's hardware and its AI agent. One thing that surprised me: the order in which you arrange the skill pipeline (context builder → model router → error handler, etc.) and parameters like thinking_budget and context_budget affect task success rates as much as model choice does.

So I built a distributed evaluator. Robots contribute idle compute to benchmark harness configurations against OHB-1, a small benchmark of 30 real-world robot tasks (grip, navigate, respond, etc.) using local LLM calls via Ollama. The search space is 263,424 configs (8 dimensions: model routing, context budget, retry logic, drift detection, etc.). The demo leaderboard shows results so far, broken down by hardware tier (Pi5+Hailo, Jetson, server, budget boards).
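To make the search-space idea concrete, here is a minimal Python sketch of how a config grid like this could be counted and enumerated. It is not OpenCastor's code: the key names and candidate values are invented for illustration, and only four of the eight dimensions (the ones named in the post) are shown, so the count below is for this toy grid only.

```python
from itertools import product
from math import prod

# Hypothetical dimensions and values. Only the dimension names
# (model routing, context budget, retry logic, drift detection)
# come from the post; everything else is an assumption.
SEARCH_SPACE = {
    "model_routing": ["local_only", "local_then_fallback"],
    "context_budget": [2048, 4096, 8192],
    "max_retries": [0, 1, 3],
    "drift_detection": [False, True],
}


def count_configs(space):
    """Size of the grid: the product of the number of options per dimension."""
    return prod(len(values) for values in space.values())


def iter_configs(space):
    """Yield every combination as a dict mapping dimension name to value."""
    keys = list(space)
    for combo in product(*(space[key] for key in keys)):
        yield dict(zip(keys, combo))


configs = list(iter_configs(SEARCH_SPACE))
print(count_configs(SEARCH_SPACE), "configs in this toy grid")
```

The grid size grows multiplicatively with each added dimension, which is why an eight-dimensional search reaches hundreds of thousands of configs and is a natural fit for splitting across many machines.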

The current champion config is free to download as a YAML and apply to any robot. P66 safety parameters are stripped on apply — no harness config can touch motor limits or ESTOP logic.
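The strip-on-apply rule can be sketched roughly as follows. This is an illustration under assumptions, not the actual OpenCastor implementation: the flat dotted key names and the prefixes are hypothetical, since the post only says that P66 safety parameters (motor limits, ESTOP logic) are removed before a downloaded config is applied.

```python
# Hypothetical key prefixes for safety-critical settings. The post does not
# say how P66 parameters are named; these are placeholders.
PROTECTED_PREFIXES = ("p66.", "motor_limits.", "estop.")


def strip_protected(config):
    """Return (safe_config, dropped_keys) for a flat, dotted-key config dict.

    Any key starting with a protected prefix is dropped, so a downloaded
    harness config cannot overwrite the robot's own safety settings.
    """
    kept = {}
    dropped = []
    for key, value in config.items():
        if key.startswith(PROTECTED_PREFIXES):
            dropped.append(key)
        else:
            kept[key] = value
    return kept, sorted(dropped)


downloaded = {
    "context_budget": 4096,
    "max_retries": 1,
    "motor_limits.max_torque": 99.0,  # would be ignored on apply
    "estop.enabled": False,           # would be ignored on apply
}
safe, removed = strip_protected(downloaded)
print(safe)
print(removed)
```

Filtering on the receiving side, rather than trusting the published YAML to omit those keys, means a malformed or malicious config still cannot reach the safety layer.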

Looking for feedback on: (1) whether the benchmark tasks are representative and (2) whether the hardware tier breakdown is useful. I'd also like to hear from (3) anyone who's run fleet-wide distributed evals of agent configs, for robotics or otherwise.

Similar Projects

AI/ML · ●● Solid

Jeju – a local-first agent harness with inspectable runs

Manifest-driven agents with eval feedback loops when most harnesses are prompt-only.

Big Brain · Niche Gem
cosmtrek
1016d ago
AI/ML · ●● Solid

NetHack agent harness with benchmarks and livestream

You can watch an LLM play NetHack step-by-step with the model's reasoning, the exact action code, and a live game canvas — that instrumentation is the product's real selling point. The leaderboard + run/benchmark framing makes it useful for comparing agents rather than just a flashy demo, but it's still squarely for people who care about NetHack or agent evaluation; more detail on reproducible metrics and integrations would push it further.

Niche Gem · Wizardry
kenforthewin
114mo ago