GitHub Repository

🧠 Make your agents learn from experience. Now available as a hosted solution at kayba.ai

2,352 stars·Python

LLM agents that write Python to analyze execution traces at scale

by kayba·Mar 7, 2026·5 points·0 comments

AI Analysis

●● Solid·Big Brain·Niche Gem

Agents write Python to analyze execution traces; reports a 2x consistency improvement on τ2-bench, but the evaluation scope is narrow.

Strengths

• Programmatic trace exploration via LLM-written Python catches cross-trace patterns that single-pass analysis misses.
• Demonstrated 2x consistency improvement on τ2-bench; 49% token reduction in browser automation.
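
The listing does not say how consistency was measured. One plausible reading, and it is only an assumption here, is the pass^k metric used in τ-bench-style evaluations: for each task, the estimated chance that k independent attempts all succeed, averaged over tasks. The function names and the sample numbers below are hypothetical and are not results from the project.

```python
from math import comb


def pass_hat_k(successes, trials, k):
    """Estimate the chance that k attempts at one task all succeed.

    Uses the unbiased estimator C(successes, k) / C(trials, k).
    Assumption: this is how "consistency" is scored; the listing does not say.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(results, k):
    """Average pass^k over tasks; `results` holds (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in results) / len(results)


# Hypothetical outcomes for three tasks, four trials each (illustrative only).
SAMPLE_RESULTS = [(4, 4), (2, 4), (3, 4)]
```

With data like this, `mean_pass_hat_k(SAMPLE_RESULTS, 1)` is the ordinary success rate, while larger `k` penalizes tasks that succeed only some of the time, which is why a metric of this shape is a natural way to express a consistency gain.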

Weaknesses

• τ2-bench evaluation is niche; no results on standard agent benchmarks (SWE-bench, WebArena, etc.).
• Requires agents that can execute code, which adds adoption friction when integrating into existing agent frameworks.

Post Description

We combined Stanford's ACE (agents learning from execution feedback) with the Reflective Language Model pattern. Instead of reading traces in a single pass, an LLM writes and runs Python in a sandbox to explore them programmatically, finding cross-trace patterns that single-pass analysis misses. The framework achieved a 2x consistency improvement on τ2-bench.
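
To make the idea of cross-trace analysis concrete, here is a minimal sketch of the kind of Python an agent might write inside the sandbox. It is not the project's code: the trace schema (`trace_id`, `steps`, `tool`, `error`), the sample data, and the function `recurring_failures` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical trace schema and data, invented for this sketch.
TRACES = [
    {"trace_id": "t1", "steps": [
        {"tool": "search_flights", "error": None},
        {"tool": "book_flight", "error": "missing_passenger_id"},
    ]},
    {"trace_id": "t2", "steps": [
        {"tool": "search_flights", "error": None},
        {"tool": "book_flight", "error": None},
    ]},
    {"trace_id": "t3", "steps": [
        {"tool": "book_flight", "error": "missing_passenger_id"},
        {"tool": "cancel_booking", "error": "timeout"},
    ]},
]


def recurring_failures(traces, min_traces=2):
    """Map each (tool, error) pair to the traces it occurs in.

    Only pairs seen in at least `min_traces` distinct traces are kept,
    so one-off failures are filtered out and shared patterns remain.
    """
    seen_in = defaultdict(set)
    for trace in traces:
        for step in trace["steps"]:
            if step["error"] is not None:
                seen_in[(step["tool"], step["error"])].add(trace["trace_id"])
    return {
        pattern: sorted(ids)
        for pattern, ids in seen_in.items()
        if len(ids) >= min_traces
    }
```

On the sample data, the `book_flight` failure appears in two traces and is reported, while the single `cancel_booking` timeout is dropped. A single-pass reader looking at one trace at a time would have no way to tell those two cases apart, which is the gap the post describes.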