A new benchmark for testing LLMs for deterministic outputs

by khurdula·Apr 29, 2026·60 points·30 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem

Finally separates JSON validity from actual value hallucination in LLM outputs.

Strengths

•Tests three modalities (text, image, audio) with a single scoring harness.
•Ground-truth answers verified against source context to catch silent breaks.
•Exposes that structural metrics saturate while value accuracy separates models.

Weaknesses

•Relies on human-authored ground truth which limits scale and freshness.
•No integration with CI pipelines or model training loops yet.

Post Description

When building workflows that rely on LLMs, we commonly use structured output for programmatic use cases such as converting invoices into rows, meeting transcripts into tickets, or complex PDFs into database entries.

The model may return the schema you asked for but fill it with hallucinated values, such as an `invoice_date` that is off by 2 months or a transcript array in the wrong order. The JSON is valid, but the values are not.

Structured output today is a big part of using LLMs, especially when building deterministic workflows.

Current structured output benchmarks (e.g., JSONSchemaBench) only measure the pass rate for JSON Schema and type conformance, not the actual values within the produced JSON.

So we designed the Structured Output Benchmark (SOB), which fixes this by measuring the JSON Schema pass rate, type correctness, and value accuracy across three modalities: text, image, and audio.
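To make the distinction concrete, here is a minimal sketch of scoring one response on three separate axes: does it parse, do the types match, and are the values right. This is not SOB's actual scoring code. The `score_record` helper, the simplified `SCHEMA` mapping (field name to Python type, standing in for real JSON Schema validation), and the invoice data are all assumptions made up for illustration.

```python
import json

# Hypothetical, simplified schema: field name -> expected Python type.
# A real harness would run full JSON Schema validation instead.
SCHEMA = {"invoice_number": str, "invoice_date": str, "total": float}


def score_record(raw_output, schema, truth):
    """Return (parses, types_ok, value_accuracy) for one model response."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, False, 0.0
    if not isinstance(data, dict):
        return True, False, 0.0

    # Structural check: every field is present and has the expected type.
    types_ok = all(
        key in data and isinstance(data[key], expected)
        for key, expected in schema.items()
    )

    # Value check: exact match against ground truth, field by field.
    # A missing field counts as wrong.
    correct = sum(1 for key in truth if key in data and data[key] == truth[key])
    return True, types_ok, correct / len(truth)


# Made-up example: the date is schema-valid but two months off.
truth = {"invoice_number": "INV-001", "invoice_date": "2026-03-14", "total": 120.5}
raw = '{"invoice_number": "INV-001", "invoice_date": "2026-01-14", "total": 120.5}'

result = score_record(raw, SCHEMA, truth)
print(result)
```

In this sketch the response passes both structural checks, yet only two of its three values match the ground truth, which is the gap a schema-only benchmark would not see.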

For our test set, every record is paired with a JSON Schema and a ground-truth answer that was verified against the source context manually by a human and cross-checked by an LLM, so a missing or hallucinated value is counted as wrong.

Open source is doing pretty well, with GLM-4.7 coming in at number 2, right after GPT-5.4.

We noticed the rankings shift across modalities: GLM-4.7 leads text, Gemma-4-31B leads images, Gemini-2.5-Flash leads audio.

For example, GPT-5.4 ranks 3rd on text but 9th on images.

Model size is not a predictor, either: Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on Value Accuracy. Phi-4 (14B) beats GPT-5 and GPT-5-mini on text.

Structured hallucinations are the hardest bugs to catch. Such values are type-correct, schema-valid, and plausible, so they slip through most guardrails. For example, in one audio record, the ground truth is `"target_market_age": "15 to 35 years"`, and a model returns `"25 to 35"`. An error like this is invisible without field-level checks.
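One way to picture a field-level check is a leaf-by-leaf diff between the ground truth and the model output. The sketch below is an illustration under stated assumptions, not the benchmark's implementation: `flatten`, `field_diff`, and the `MISSING` marker are hypothetical helpers, and the `channels` array is invented to show an ordering error. Only the `target_market_age` values come from the post.

```python
MISSING = "<missing>"  # hypothetical marker for a leaf absent on one side


def flatten(value, path=""):
    """Map every leaf value to a path such as 'items[0].name'."""
    if isinstance(value, dict):
        leaves = {}
        for key, child in value.items():
            leaves.update(flatten(child, f"{path}.{key}" if path else key))
        return leaves
    if isinstance(value, list):
        leaves = {}
        for index, child in enumerate(value):
            leaves.update(flatten(child, f"{path}[{index}]"))
        return leaves
    return {path: value}


def field_diff(truth, output):
    """Return {path: (expected, actual)} for every leaf that disagrees."""
    expected, actual = flatten(truth), flatten(output)
    return {
        path: (expected.get(path, MISSING), actual.get(path, MISSING))
        for path in sorted(expected.keys() | actual.keys())
        if expected.get(path, MISSING) != actual.get(path, MISSING)
    }


truth = {
    "target_market_age": "15 to 35 years",
    "channels": ["radio", "podcast"],  # invented field for illustration
}
output = {
    "target_market_age": "25 to 35",
    "channels": ["podcast", "radio"],  # same items, wrong order
}

diff = field_diff(truth, output)
for path, (want, got) in diff.items():
    print(f"{path}: expected {want!r}, got {got!r}")
```

Because list elements are compared by index, a reordered array shows up as mismatches at each position. Exact string equality is a deliberate simplification here; a real harness may normalize values before comparing them.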

Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable and consistent output structure. The first step to making structured output better is to measure it and to hold ourselves to the standard of the best.

Similar Projects

AI/ML●●Solid

EvalLens – Open-source tool to evaluate structured LLM outputs

Schema conformance checks beat generic text evals for JSON-heavy LLM pipelines.

Niche Gem · Ship It

simonrendon

102mo ago

AI/ML●●Solid

ÆTHERYA Core – deterministic action-governance kernel for LLM agents

Fail-closed policy layer blocks LLM tool calls before execution, no LLM in decision path.

Big Brain · Niche Gem

RobertMihai

103mo ago

AI/ML●●Solid

VigIA – A deterministic FSM in .NET 10 to stop LLM hallucinations

Deterministic FSM guards LLM outputs to prevent hallucination spirals in .NET.

Big Brain · Bold Bet

JordanCT

302mo ago

Developer Tools●●Solid

LLMTest – The pytest for LLMs with 22 built-in assertions

Pytest syntax for LLM testing avoids LLM-judge cost, but feature parity vs. LangSmith and Braintrust unproven.

Solve My Problem · Slick

furtims

403mo ago

AI/ML●Mid

Hallx – Hallucination risk scoring for LLM outputs

Yet another hallucination checker when Guardrails and LMQL already cover this.

Ship It

akadhanu

222mo ago

SaaS●●Solid

A "content compiler" that turns LLM output into validated artifacts

Treating LLM output like compiler input — with typed style guides, required-section enforcement, and explicit Confidence/LostElements on transformations — is a clever, non-obvious approach that could actually raise the signal-to-noise on generated content. The product shows useful practical features (export to PDF/HTML/JSON, jurisdiction-aware legal drafting, slide generation), but the real test will be how maintainable and authorable those rule sets are in messy, real-world workflows.

Big Brain · Bold Bet

hardikparikh29

104mo ago