TinyFish Web Agent (82% on hard tasks vs. Operator's 43%)

by gargi_tinyfish·Feb 12, 2026·17 points·12 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem · Wizardry

81.9% on hard web tasks, where Operator scores 43.2%. Real benchmark, all runs public.

Strengths
  • Large benchmark gap (81.9% vs. 43.2%) on rigorous live-website tasks, scored by a judge that agrees with humans 85% of the time.
  • Full transparency: published all 300 task runs including failures; no cherry-picked easy tasks or gaming.
  • Production-tested at scale with real clients (DoorDash, Google Hotels); benchmarks backed by deployed use.
Weaknesses
  • Benchmark performance doesn't always translate to real-world value; needs long-term deployment data.
  • Positioned as enterprise tool but pricing/availability details unclear from landing page alone.
Category
Target Audience

Enterprise teams automating complex web tasks; companies like DoorDash, Google Hotels needing reliable agent deployment.

Similar To

OpenAI Operator · Claude Computer Use · Browser Use

Post Description

Enterprises need ~90% accuracy to deploy web agents. Until now, no agent has come close on real-world tasks. TinyFish is the first production-ready web agent. Here's the evidence.

Hard-task scores on Online-Mind2Web (300 tasks, 136 live websites, human-correlated judge):

- TinyFish: 81.9%
- OpenAI Operator: 43.2%
- Claude Computer Use: 32.4%
- Browser Use: 8.1%
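
To put these scores next to the ~90% deployment bar mentioned above, here is a minimal Python sketch that computes how many percentage points each agent is short of that bar. The scores are the ones quoted in this post; treating "~90%" as exactly 90 and the helper name `gap_to_target` are assumptions made for illustration.

```python
# Hard-task scores quoted in the post (percent).
scores = {
    "TinyFish": 81.9,
    "OpenAI Operator": 43.2,
    "Claude Computer Use": 32.4,
    "Browser Use": 8.1,
}

# Assumption: the post's "~90%" deployment bar, treated as exactly 90.
TARGET = 90.0


def gap_to_target(score, target=TARGET):
    """Return the percentage points still missing to reach the target."""
    return round(target - score, 1)


# Gap for every agent, keyed by agent name.
gaps = {name: gap_to_target(score) for name, score in scores.items()}

for name, score in scores.items():
    print(f"{name:<20} {score:5.1f}%  gap to target: {gaps[name]:4.1f} pts")
```

This only restates the published numbers in a different form; it says nothing about how the scores were produced.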

Why not WebVoyager like everyone else?

Because it's broken. Easy tasks, Google Search shortcuts, and a judge that agrees with humans only 62% of the time. Browser Use self-reported 89% on WebVoyager — then scored 8.1% on hard tasks here.

We evaluated TinyFish against Online-Mind2Web instead — 300 real tasks, 136 live websites, three difficulty levels, and a judge that agrees with humans 85% of the time. No shortcuts. No easy mode.
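
For readers unfamiliar with what "agrees with humans 85% of the time" means, a rough sketch follows. It assumes agreement is the share of tasks where the automated judge and a human annotator give the same pass/fail verdict; the post does not spell out the benchmark's exact metric, and the labels below are toy data, not benchmark results.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of tasks where judge and human give the same verdict."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labelled task")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical pass/fail verdicts for ten tasks (True = task judged successful).
judge = [True, True, False, True, False, True, True, False, True, True]
human = [True, False, False, True, False, True, True, True, True, True]

rate = agreement_rate(judge, human)
print(f"judge-human agreement: {rate:.0%}")
```

A judge with lower agreement mislabels more runs, which is why the same agent can look very different under two different judges.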

The cookbook repo is open source: https://github.com/tinyfish-io/tinyfish-cookbook

You can see all the failed task runs here: https://tinyurl.com/tinyfish-mind2web

Happy to answer questions about the architecture, the benchmark methodology, or why we think WebVoyager scores are misleading.
