Beval – Simple evaluations for your AI product

by raviisoccupied · Mar 29, 2026 · 2 points · 1 comment

AI Analysis

●● Solid · Solve My Problem

CSV-based evals beat LangSmith for quick PM checks without the infra headache.

Strengths
  • Five-minute setup for non-engineers removes the biggest barrier to entry.
  • Per-trace reasoning output helps users understand specific failure modes immediately.
Weaknesses
  • CSV workflow doesn't scale for continuous integration or production monitoring.
  • Lacks deep integration with existing observability stacks like Datadog.
Category
Target Audience

AI Product Managers

Similar To

LangSmith · Arize Phoenix

Post Description

I have been working on a web app called Beval - Simple evaluations for your AI product.

In my day to day as a Product Manager working in a team that ships AI products, I often found myself wanting to do 'quick and dirty' LLM-based evaluation on conversation transcripts and traces. I didn't need anything fancy, just 'did the agent answer the question', 'did the agent cover the 5 things it needed to' - that type of thing.

I found myself blocked by 'Gemini in Google Sheets': it was too slow and cumbersome, and it didn't handle eval changes well, particularly when trying to associate evals with ground truth. And because I was exploring or working on new and experimental features, it wasn't helpful to try to set up something more robust with the team.

To fix the problem I eventually learned to call the OpenAI API in Python, but I really felt that I wanted a 'product' to help me and potentially help others who need answers fast - outside of building infrastructure and pipelines.
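For readers who have not written this kind of script, here is a minimal sketch of a yes/no LLM-as-judge check in Python. It is not Beval's implementation and not the script described above: the prompt wording, the JSON reply format, and the names `build_judge_prompt`, `parse_verdict`, and `run_boolean_eval` are all assumptions made for illustration. The model call is passed in as a plain function, so `fake_model` stands in for a real API client and the example runs offline.

```python
import json
from typing import Callable


def build_judge_prompt(question: str, transcript: str) -> str:
    """Assemble a yes/no judging prompt for one transcript (hypothetical wording)."""
    return (
        "You are grading a conversation transcript.\n"
        f"Check: {question}\n"
        f"Transcript:\n{transcript}\n"
        'Reply with JSON only: {"verdict": "yes" or "no", "reasoning": "<one sentence>"}'
    )


def parse_verdict(raw: str) -> dict:
    """Turn the judge's JSON reply into a boolean verdict plus its reasoning."""
    data = json.loads(raw)
    verdict = str(data["verdict"]).strip().lower()
    if verdict not in ("yes", "no"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return {"passed": verdict == "yes", "reasoning": str(data.get("reasoning", ""))}


def run_boolean_eval(
    question: str, transcripts: list[str], call_model: Callable[[str], str]
) -> list[dict]:
    """Run one boolean check over every transcript using the supplied model call."""
    return [
        parse_verdict(call_model(build_judge_prompt(question, t))) for t in transcripts
    ]


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call: says yes only if the transcript states a window."""
    if "30 days" in prompt:
        return '{"verdict": "yes", "reasoning": "The agent stated the refund window."}'
    return '{"verdict": "no", "reasoning": "The agent never gave the refund window."}'


transcripts = [
    "User: What is the refund window?\nAgent: 30 days from delivery.",
    "User: What is the refund window?\nAgent: Please hold.",
]
results = run_boolean_eval(
    "Did the agent answer the user's question?", transcripts, fake_model
)
for r in results:
    print(r["passed"], "-", r["reasoning"])
```

Swapping `fake_model` for a function that sends the prompt to a real model is the only change needed to run this against live traces. Keeping the reasoning next to each verdict is what makes a failed check explainable afterwards.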

So over the last few weeks I built: https://beval.space

It has:
- LLM-as-judge evals: boolean checks (yes/no), scores (1-5), categories, and freeform comments
- Reusable eval definitions you can run across different datasets
- Ground truth labelling so you can compare eval versions against human judgments
- Per-trace reasoning so you can see why the judge scored something the way it did
- An example dataset so you can try it without having your own traces ready
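To make the ground truth comparison concrete, the sketch below scores two versions of a boolean eval against human labels using a simple agreement rate. This is an assumed metric chosen for illustration, not necessarily how Beval compares versions, and the function names and label data are hypothetical.

```python
def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of traces where the judge's verdict matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labelled trace")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def disagreements(judge_labels: list[bool], human_labels: list[bool]) -> list[int]:
    """Indices of traces worth reading by hand because judge and human differ."""
    return [i for i, (j, h) in enumerate(zip(judge_labels, human_labels)) if j != h]


# Hypothetical labels for four traces: human ground truth and two eval versions.
human = [True, False, True, True]
eval_v1 = [True, True, True, False]
eval_v2 = [True, False, True, False]

print(agreement_rate(eval_v1, human))  # 0.5
print(agreement_rate(eval_v2, human))  # 0.75
print(disagreements(eval_v2, human))   # [3]
```

The disagreement indices are often more useful than the rate itself, since they point at the specific traces where the per-trace reasoning should be read.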

One of our early users described it as 'quick n dirty evals when you don't want to touch a shit load of infra.' I'm trying to figure out if that's a common need or just a niche thing.

Free during beta. Would love HN's take — what's missing, and would you actually use something like this?
