TinyFish Web Agent (82% on hard tasks vs. Operator's 43%)

by gargi_tinyfish · Feb 12, 2026 · 17 points · 12 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem · Wizardry

81.9% on hard web tasks where Operator scores 43.2%. Real benchmark, all runs public.

Strengths

• Crushing benchmark gap (81.9% vs. 43.2%) on rigorous, live-website tasks, scored by a judge that agrees with humans 85% of the time.
• Full transparency: all 300 task runs are published, including failures; no cherry-picked easy tasks or gaming.
• Production-tested at scale with real clients (DoorDash, Google Hotels); benchmarks backed by deployed use.

Weaknesses

• Benchmark performance doesn't always translate to real-world value; needs long-term deployment data.
• Positioned as an enterprise tool, but pricing and availability details are unclear from the landing page alone.

Post Description

Enterprises need ~90% accuracy to deploy web agents. Until now, no agent has come close on real-world tasks. TinyFish is the first production-ready web agent. Here's the evidence.

Hard-task scores on Online-Mind2Web (300 tasks, 136 live websites, human-correlated judge):

- TinyFish: 81.9%
- OpenAI Operator: 43.2%
- Claude Computer Use: 32.4%
- Browser Use: 8.1%

Why not WebVoyager like everyone else?

Because it's broken. Easy tasks, Google Search shortcuts, and a judge that agrees with humans only 62% of the time. Browser Use self-reported 89% on WebVoyager — then scored 8.1% on hard tasks here.

We evaluated TinyFish against Online-Mind2Web instead — 300 real tasks, 136 live websites, three difficulty levels, and a judge that agrees with humans 85% of the time. No shortcuts. No easy mode.
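For readers unfamiliar with the "agrees with humans X% of the time" figure, here is a minimal sketch of one common way to compute it. It assumes agreement means the plain fraction of tasks where the automated judge's pass/fail verdict matches a human label. The function names and the ten-task toy data are hypothetical and are not taken from Online-Mind2Web or TinyFish.

```python
def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of tasks where the judge's verdict matches the human label."""
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be the same length")
    if not judge_verdicts:
        raise ValueError("need at least one task")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)


def success_rate(verdicts):
    """Fraction of tasks marked as successful (True)."""
    if not verdicts:
        raise ValueError("need at least one task")
    return sum(verdicts) / len(verdicts)


# Toy data (illustrative only): True means the task was judged a success.
judge = [True, True, False, True, False, False, True, True, False, True]
human = [True, False, False, True, False, True, True, True, False, True]

print(f"judge/human agreement: {agreement_rate(judge, human):.0%}")
print(f"success rate per judge: {success_rate(judge):.0%}")
print(f"success rate per humans: {success_rate(human):.0%}")
```

In this toy data the two success rates are identical even though the judge and the humans disagree on two tasks, which is why the agreement rate is reported separately from the headline score.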

The cookbook repo is open source: https://github.com/tinyfish-io/tinyfish-cookbook

You can see all failed task runs here: https://tinyurl.com/tinyfish-mind2web

Happy to answer questions about the architecture, the benchmark methodology, or why we think WebVoyager scores are misleading.

Similar Projects

Developer Tools · ●●● Banger

Tracecore: Benchmark AI Agents on Deterministic Coding Tasks

Deterministic agent benchmarking with strict validation—unlike SWE-Bench, measures whether agents actually operate.

Solve My Problem · Wizardry · Niche Gem

extra_cookin

103mo ago

AI/ML · ●● Solid

Agentic Intent Benchmark

First benchmark testing structured requirements on complex greenfield agent tasks.

Niche Gem · Big Brain

ryan4rtmx

2025d ago

AI/ML · ●●● Banger

CADBench – every AI CAD tool I tested fails on basic mechanical parts

Proves mesh-to-BREP failure modes with IRT-calibrated scores across 28-task pilot suite.

Big Brain · Dark Horse

ryanrana

101mo ago

Developer Tools · ●●● Banger

Cheddar-bench – unsupervised benchmark for coding agents

Unsupervised bug benchmark using agents as both attackers and defenders—novel scoring methodology.

Big Brain · Wizardry · Ship It

przadka

904mo ago

AI/ML · ● Mid

Deterministic browser control for AI agents (~90% on Mind2Web)

Deterministic browser steps for agent reasoning, but README is just Chromium boilerplate with no substantive implementation details.

Big Brain

theredsix

1273mo ago

Developer Tools · ●●● Banger

A dynamic, crowdsourced benchmark for AI agents

Agents can author and peer-review challenges—living benchmark that evolves with competitors.

Crowd Pleaser · Zero to One · Big Brain

shalinmehtaaa

103mo ago