Tracecore: Benchmark AI Agents on Deterministic Coding Tasks
Deterministic agent benchmarking with strict validation. Unlike SWE-Bench, it measures whether agents actually operate.

81% on hard web tasks when Operator scores 43%. Real benchmark, all runs public.
Enterprise teams automating complex web tasks; companies like DoorDash and Google Hotels that need reliable agent deployment.
OpenAI Operator · Claude Computer Use · Browser Use
Hard-task scores on Online-Mind2Web (300 tasks, 136 live websites, human-correlated judge):
- TinyFish: 81.9%
- OpenAI Operator: 43.2%
- Claude Computer Use: 32.4%
- Browser Use: 8.1%
Why not WebVoyager like everyone else?
Because it's broken. Easy tasks, Google Search shortcuts, and a judge that agrees with humans only 62% of the time. Browser Use self-reported 89% on WebVoyager — then scored 8.1% on hard tasks here.
We evaluated TinyFish against Online-Mind2Web instead — 300 real tasks, 136 live websites, three difficulty levels, and a judge that agrees with humans 85% of the time. No shortcuts. No easy mode.
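To make the "agrees with humans" figure concrete, here is a minimal sketch of how a judge-human agreement rate can be computed: the share of tasks where the automated judge and a human annotator gave the same pass/fail verdict. The function name `agreement_rate` and the sample labels are hypothetical and purely illustrative; they are not taken from the Online-Mind2Web or WebVoyager evaluations.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of tasks where the judge verdict matches the human verdict."""
    if len(judge) != len(human):
        raise ValueError("judge and human label lists must be the same length")
    if not judge:
        raise ValueError("need at least one labeled task")
    # Count tasks where both verdicts are identical (both pass or both fail).
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Hypothetical verdicts for five tasks (True = task judged successful).
judge_labels = [True, False, True, True, False]
human_labels = [True, False, False, True, False]

print(f"{agreement_rate(judge_labels, human_labels):.0%}")  # prints "80%"
```

On this reading, a judge at 62% agreement disagrees with human annotators on more than one task in three, so any success rate it reports carries that much uncertainty.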
The cookbook repo is open source: https://github.com/tinyfish-io/tinyfish-cookbook
You can see all failed task runs here: https://tinyurl.com/tinyfish-mind2web
Happy to answer questions about the architecture, the benchmark methodology, or why we think WebVoyager scores are misleading.
First benchmark testing structured requirements on complex greenfield agent tasks.
Proves mesh-to-BREP failure modes with IRT-calibrated scores across a 28-task pilot suite.
Unsupervised bug benchmark using agents as both attackers and defenders—novel scoring methodology.
Deterministic browser steps for agent reasoning, but README is just Chromium boilerplate with no substantive implementation details.
Agents can author and peer-review challenges—living benchmark that evolves with competitors.