Beval – Simple evaluations for your AI product

Author: raviisoccupied

by raviisoccupied · Mar 29, 2026 · 2 points · 1 comment


AI Analysis

●● Solid · Solve My Problem

CSV-based evals beat LangSmith for quick PM checks without the infra headache.

Strengths

• Five-minute setup for non-engineers removes the biggest barrier to entry.
• Per-trace reasoning output helps users understand specific failure modes immediately.

Weaknesses

• CSV workflow doesn't scale for continuous integration or production monitoring.
• Lacks deep integration with existing observability stacks like Datadog.

Post Description

I have been working on a web app called Beval - Simple evaluations for your AI product.

In my day-to-day as a Product Manager on a team that ships AI products, I often found myself wanting to run 'quick and dirty' LLM-based evaluations on conversation transcripts and traces. I didn't need anything fancy, just checks like 'did the agent answer the question' or 'did the agent cover the 5 things it needed to'.

I found myself blocked by 'Gemini in Google Sheets': it was too slow and cumbersome, and it didn't handle eval changes well, particularly when trying to associate evals with ground truth. And because I was exploring new and experimental features, it wasn't helpful to try to set up something more robust with the team.

To fix the problem I eventually learned to call the OpenAI API in Python, but I really felt that I wanted a 'product' to help me, and potentially others, who need answers fast without building infrastructure and pipelines.
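As an illustration of the kind of script described here, the following is a minimal sketch of one boolean LLM-as-judge check run over CSV rows. It is not the code behind the post or behind Beval: `judge_transcript`, the prompt wording, the sample data, and the `ask_model` callable are all hypothetical. The model call is injected as a plain function so the example runs offline with a stub; in practice that function would wrap a real chat-completion request.

```python
import csv
import io
from typing import Callable

# Hypothetical prompt; the wording is an assumption, not taken from Beval.
PROMPT = (
    "You are grading a conversation transcript.\n"
    "Check: {question}\n"
    "Transcript: {transcript}\n"
    "Reply with YES or NO on the first line, then one sentence of reasoning."
)


def judge_transcript(transcript: str, question: str,
                     ask_model: Callable[[str], str]) -> dict:
    """Run one yes/no check and return the verdict plus the judge's reasoning."""
    reply = ask_model(PROMPT.format(question=question, transcript=transcript))
    first_line, _, rest = reply.strip().partition("\n")
    return {
        "passed": first_line.strip().upper().startswith("YES"),
        "reasoning": rest.strip(),
    }


def stub_model(prompt: str) -> str:
    """Offline stand-in for a real LLM call, so the sketch runs without an API key."""
    if "30 days" in prompt:
        return "YES\nThe agent stated the refund window."
    return "NO\nThe agent never answered."


# Made-up sample traces, inlined so the example is self-contained.
CSV_DATA = """trace_id,transcript
t1,"User: Can I get a refund? Agent: Yes, within 30 days."
t2,"User: Can I get a refund? Agent: Please hold."
"""

results = {
    row["trace_id"]: judge_transcript(
        row["transcript"], "Did the agent answer the user's question?", stub_model
    )
    for row in csv.DictReader(io.StringIO(CSV_DATA))
}

for trace_id, result in results.items():
    print(trace_id, result["passed"], result["reasoning"])
```

Asking for the verdict on its own first line keeps parsing trivial and leaves the rest of the reply free for the reasoning.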

So over the last few weeks I built: https://beval.space

It has:
- LLM-as-judge evals: boolean checks (yes/no), scores (1-5), categories, and freeform comments
- Reusable eval definitions you can run across different datasets
- Ground truth labelling so you can compare eval versions against human judgments
- Per-trace reasoning so you can see why the judge scored something the way it did
- An example dataset so you can try it without having your own traces ready
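One way to picture the ground-truth comparison is as an agreement rate between the judge's labels and human labels, computed per eval version. The sketch below assumes boolean labels keyed by trace ID. The `agreement` function, the version names, and the data are hypothetical and are not a description of how Beval computes anything.

```python
def agreement(judge: dict[str, bool], human: dict[str, bool]) -> float:
    """Fraction of human-labelled traces where the judge gave the same label."""
    shared = [trace_id for trace_id in human if trace_id in judge]
    if not shared:
        raise ValueError("no traces in common between judge and human labels")
    matches = sum(judge[trace_id] == human[trace_id] for trace_id in shared)
    return matches / len(shared)


# Made-up labels: human judgments and two versions of the same eval definition.
human = {"t1": True, "t2": False, "t3": True, "t4": False}
eval_v1 = {"t1": True, "t2": True, "t3": True, "t4": True}
eval_v2 = {"t1": True, "t2": False, "t3": True, "t4": True}

scores = {"v1": agreement(eval_v1, human), "v2": agreement(eval_v2, human)}
print(scores)  # {'v1': 0.5, 'v2': 0.75}
```

Here the second version agrees with the human labels more often, which is the kind of signal that would suggest an eval rewording was an improvement.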

One of our early users described it as 'quick n dirty evals when you don't want to touch a shit load of infra.' I'm trying to figure out if that's a common need or just a niche thing.

Free during beta. Would love HN's take: what's missing, and would you actually use something like this?