GitHub Repository

A vision-first browser agent with self-healing deterministic replay.

49 stars · TypeScript

Lumen – vision-first browser agent (state of the art, open source)

by fearlessboi · Mar 10, 2026 · 2 points · 1 comment

AI Analysis

●● Solid · Big Brain · Solve My Problem

Vision-only coordinates beat DOM selectors where Stagehand and browser-use still stumble on UI changes.

Strengths

• Pure vision loop avoids DOM dependency entirely, reducing breakage from CSS class changes.
• Benchmarks show 100% success rate on WebVoyager subset, faster than browser-use.
• Session resumption and history compression handle 20+ step workflows without context bloat.

Weaknesses

• Requires Chrome/Chromium locally; no hosted cloud option yet for heavy scaling.
• Vision models cost more per token than simple selector scripts for high-volume tasks.

Post Description

We're open-sourcing Lumen, a state-of-the-art, vision-first browser agent.

Problem

Browser automation is fragile. Scripts break constantly and agents waste tokens getting stuck in loops.

Today there are two options:

Selector-based scripting like Playwright and Puppeteer: these require you to target specific DOM elements.

First-generation browser agents (Stagehand, browser-use): these accept natural language instructions but still resolve them into selectors under the hood.

Selector-based scripts can break every time the UI changes. You end up maintaining selectors instead of building features. First-gen agents inherit the same brittleness, especially when they misidentify the right element.

Solution

Lumen is vision-first. It sees the screen and acts like a human. Every natural language instruction resolves into an x,y coordinate on the screen. Three layers of stuck detection keep it on track, and a dual-history system with context compaction lets it handle 20+ step workflows without blowing up the context window.
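To make the idea concrete, here is a minimal sketch of what stuck detection and history compaction could look like. The post does not say what the three detection layers are or how the dual history is stored, so everything below is an assumption: the `Action` and `Step` types, the two checks, and `compactHistory` are hypothetical names for illustration and are not taken from Lumen's code.

```typescript
// Illustrative only: these types and functions are hypothetical, not Lumen's API.
type Action = { kind: "click" | "type" | "scroll"; x: number; y: number; text?: string };
type Step = { action: Action; screenHash: string; note: string };

// Check 1 (assumed): the agent issued the identical action `span` times in a row.
function isRepeatingAction(history: Step[], span = 3): boolean {
  if (history.length < span) return false;
  const recent = history.slice(-span);
  const first = recent[0];
  if (first === undefined) return false;
  const key = (s: Step) => `${s.action.kind}@${s.action.x},${s.action.y}:${s.action.text ?? ""}`;
  return recent.every((s) => key(s) === key(first));
}

// Check 2 (assumed): the screen fingerprint did not change across the last `span` steps.
function isScreenFrozen(history: Step[], span = 3): boolean {
  if (history.length < span) return false;
  const recent = history.slice(-span);
  const first = recent[0];
  if (first === undefined) return false;
  return recent.every((s) => s.screenHash === first.screenHash);
}

// Compaction (assumed): keep the newest `keep` steps in full, fold older ones into short notes.
function compactHistory(history: Step[], keep = 5): { digest: string[]; recent: Step[] } {
  const cut = Math.max(0, history.length - keep);
  return {
    digest: history.slice(0, cut).map((s, i) => `${i + 1}. ${s.note}`),
    recent: history.slice(cut),
  };
}
```

In a loop like this, a positive check would trigger some recovery step (for example, re-prompting with a hint), and only `digest` plus `recent` would be sent to the model, which is one way a long workflow could stay within the context window.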

We ran a WebVoyager eval (25 tasks across 15 sites, scored by an LLM judge, 3 trials per task, all frameworks on Claude Sonnet 4.6):

Lumen: 100% success rate, 77.8s avg time, ~104K tokens.

browser-use: 100% success rate, 109.8s avg time.

Stagehand: 76% success rate, 207.8s avg time, ~200K tokens.

Lumen matches browser-use on accuracy while completing tasks ~30% faster, and beats Stagehand on every metric.
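The "~30% faster" figure can be checked from the averages above. This small sketch assumes "faster" means the reduction in average time relative to the slower framework; the helper names `timeSaved` and `pct` are made up for illustration.

```typescript
// Average task times in seconds, copied from the eval results above.
const avgSeconds = { lumen: 77.8, browserUse: 109.8, stagehand: 207.8 };

// Fraction of the baseline's time that the candidate saves.
function timeSaved(candidate: number, baseline: number): number {
  return (baseline - candidate) / baseline;
}

// Format a fraction as a percentage with one decimal place.
function pct(fraction: number): string {
  return `${(fraction * 100).toFixed(1)}%`;
}

const vsBrowserUse = timeSaved(avgSeconds.lumen, avgSeconds.browserUse);
const vsStagehand = timeSaved(avgSeconds.lumen, avgSeconds.stagehand);

console.log(`vs browser-use: ${pct(vsBrowserUse)} less time`); // 29.1%
console.log(`vs Stagehand:   ${pct(vsStagehand)} less time`); // 62.6%
```

Under that reading, Lumen takes about 29% less time than browser-use, which rounds to the ~30% quoted, and about 63% less time than Stagehand.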

Get Started

Start using Lumen today:

Docs: https://lumen.omlabs.xyz/

Support us with a GitHub star: https://github.com/omxyz/lumen

Similar Projects

AI/ML · ●●● Banger

SoMatic – Vision-based OS automation framework for AI agents

Brings Set-of-Marks prompting to native OS apps where DOM trees don't exist.

Big Brain · Wizardry · Zero to One

smyansondur

2029d ago

Developer Tools · ●● Solid

ScreenCommander – Let LLM Agents control your desktop via CLI

Retina-aware screenshot + deterministic coordinate mapping for agent desktop control.

Wizardry · Niche Gem

aethelyon

113mo ago

Developer Tools · ● Mid

Automate Mac with Codex: macOS Control MCP Demo

Lets agents actually see the screen and act on it by returning OCR text with pixel coordinates and offering commands like click_at, type_text, and press_key. You can run it instantly with npx (it auto-creates a Python venv and hooks into Apple Vision/Quartz), and there are ready-made integration snippets for Claude, VS Code, and Cursor — a pragmatic, technically neat tool for closed-loop agent UI work. It’s limited to macOS 13+ and Apple APIs, but within that niche it removes a lot of friction.

Wizardry · Niche Gem

peterhddcode

104mo ago

AI/ML · ●●● Banger

Smriti: Shared Reasoning State for Claude Code and Codex

Git for agent reasoning state solves the multi-agent coordination collision problem.

Big Brain · Zero to One

himanshudongre

1025d ago

Productivity · ●● Solid

SkillForge – Turn Screen Recordings into Agent-Ready Skills

SkillForge turns the old 'show, don't tell' trick into code: record a task, and their AI teases clicks, keystrokes and navigation out of pixels into a stepwise skill file you can edit and export. The ability to trim video, rewrite steps via AI, and output a SKILL.md for agent frameworks is a practical, opinionated workflow that could shortcut lots of brittle RPA scripting — my main questions are reliability across dynamic UIs and privacy/recording controls, but the product direction is smart and tangible.

Wizardry · Solve My Problem

YaraDori

124mo ago

Developer Tools · ●●● Banger

Smriti, version control for reasoning state

Git for agent reasoning state beats orchestrators like LangGraph on parallel work.

Zero to One · Big Brain · Ship It

himanshudongre

202mo ago