I built a new software primitive. It replaces AI screenshot agents

Author: Directshell

by Directshell · Feb 17, 2026 · 5 points · 0 comments


AI Analysis

●●● Banger · Big Brain · Wizardry · Zero to One

Replaces screenshots with structured UI data; AI agents control Windows apps via SQL queries.

Strengths

•Core insight is genuinely clever: the accessibility tree (around since 1997) already contains the structured text that vision models spend tokens reconstructing from screenshots, at 50–200 tokens per snapshot vs 1,200–5,000 per screenshot.
•Tiny footprint (~700KB, two Rust dependencies), and it ships with an MCP server so Claude/GPT can use it directly with minimal setup friction.
•5 action primitives (click by name, set text, type, key combos, scroll) cover typical UI automation workflows without screenshot loop overhead.
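
To put the token figures in the first bullet in perspective, here is a quick back-of-the-envelope calculation. It uses only the two ranges quoted in the post; the 100-step session length is an illustrative assumption, not a measurement.

```python
# Token ranges quoted in the post (per observation of the UI).
SNAPSHOT_TOKENS = (50, 200)        # structured accessibility snapshot
SCREENSHOT_TOKENS = (1200, 5000)   # screenshot fed to a vision model


def reduction_range(small, large):
    """Return the (lowest, highest) reduction factor between two token ranges."""
    worst = large[0] / small[1]   # cheapest screenshot vs. largest snapshot
    best = large[1] / small[0]    # priciest screenshot vs. smallest snapshot
    return worst, best


worst, best = reduction_range(SNAPSHOT_TOKENS, SCREENSHOT_TOKENS)
print(f"{worst:.0f}x to {best:.0f}x fewer tokens per observation")

# Hypothetical 100-step agent session (assumed length, for illustration only).
STEPS = 100
print("snapshot total:  ", tuple(t * STEPS for t in SNAPSHOT_TOKENS))
print("screenshot total:", tuple(t * STEPS for t in SCREENSHOT_TOKENS))
```

Taken at face value, the quoted ranges imply a reduction of somewhere between 6x and 100x per observation, depending on which ends of the ranges apply.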

Weaknesses

•Windows-only: UI Automation (UIA) is a Windows API; macOS and Linux use different accessibility frameworks that would each require a separate implementation.
•Early-stage (2 stars, day-1 release): reliability and coverage across diverse Windows apps are unproven, and the profile repository of app configs doesn't exist yet.

Post Description

I built a Rust binary (~700KB, two dependencies) that snaps onto any Windows application, continuously reads its UI Automation accessibility tree into a SQLite database, and lets external processes control the app via SQL INSERT.
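
A minimal sketch of what a SQL-driven control channel like this could look like, using Python's built-in `sqlite3` and an in-memory database. The post does not document DirectShell's actual schema, so the table names (`elements`, `actions`), the column names, and the action-kind strings are all hypothetical; only the idea of queuing actions with a plain `INSERT` comes from the post.

```python
import sqlite3

# Hypothetical schema: the post does not describe DirectShell's real tables.
SCHEMA = """
CREATE TABLE elements (
    id    INTEGER PRIMARY KEY,
    role  TEXT NOT NULL,      -- e.g. 'Button', 'Edit', 'MenuItem'
    name  TEXT NOT NULL,
    value TEXT
);
CREATE TABLE actions (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    kind    TEXT NOT NULL
            CHECK (kind IN ('click', 'set_text', 'type', 'keys', 'scroll')),
    target  TEXT,
    payload TEXT,
    done    INTEGER NOT NULL DEFAULT 0
);
"""


def open_db():
    """Create an in-memory database with the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def request_action(conn, kind, target=None, payload=None):
    """What an external process would do: queue an action with a plain INSERT."""
    cur = conn.execute(
        "INSERT INTO actions (kind, target, payload) VALUES (?, ?, ?)",
        (kind, target, payload),
    )
    conn.commit()
    return cur.lastrowid


def drain_actions(conn):
    """What the host binary might do each cycle: fetch pending rows, mark them done."""
    rows = conn.execute(
        "SELECT id, kind, target, payload FROM actions WHERE done = 0 ORDER BY id"
    ).fetchall()
    conn.executemany("UPDATE actions SET done = 1 WHERE id = ?", [(r[0],) for r in rows])
    conn.commit()
    return rows


conn = open_db()
conn.execute("INSERT INTO elements (role, name, value) VALUES ('Edit', 'Text Editor', '')")
request_action(conn, "set_text", target="Text Editor", payload="hello")
request_action(conn, "click", target="Save")
pending = drain_actions(conn)
print(pending)
```

The appeal of this design is that any language with a SQLite driver can act as a controller: reading the UI is a `SELECT`, and acting on it is an `INSERT`.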

The core insight: every GUI application already describes itself as structured text through the accessibility layer (built for screen readers since 1997). Every major AI lab is taking screenshots of this and running vision models on it. DirectShell just reads the text directly.

What it does:
- Reads every button, field, menu item into a queryable SQLite DB (refreshed every 500ms)
- Generates multiple output formats: full DB, interactive elements list, LLM-optimized snapshots (50-200 tokens vs 1,200-5,000 for a screenshot)
- Controls apps via 5 action types: click by element name, set text via UIA ValuePattern, type character-by-character, send key combos, scroll
- Includes an MCP server so Claude/GPT can use it directly
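
To illustrate why an "LLM-optimized snapshot" can be so much smaller than a screenshot, here is a rough sketch that flattens a list of UI elements into one short line per interactive control. The `Element` type, the set of role names, and the output format are assumptions made for this example and are not DirectShell's real format; the token estimate is a crude characters-divided-by-four heuristic, not a real tokenizer.

```python
from dataclasses import dataclass

# Assumed role names for controls an agent can act on.
INTERACTIVE_ROLES = {"Button", "Edit", "MenuItem", "CheckBox"}


@dataclass
class Element:
    role: str
    name: str
    value: str = ""


def render_snapshot(elements):
    """Keep only named interactive elements and emit one compact line each."""
    lines = []
    for el in elements:
        if el.role not in INTERACTIVE_ROLES or not el.name:
            continue  # skip containers and unnamed nodes
        line = f"[{el.role}] {el.name}"
        if el.value:
            line += f" = {el.value!r}"
        lines.append(line)
    return "\n".join(lines)


def rough_token_estimate(text):
    """Crude heuristic (~4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


# Made-up tree for a Notepad-like window.
tree = [
    Element("Window", "Untitled - Notepad"),
    Element("MenuItem", "File"),
    Element("Edit", "Text Editor", "hello"),
    Element("Pane", ""),
    Element("Button", "Close"),
]
snapshot = render_snapshot(tree)
print(snapshot)
print(rough_token_estimate(snapshot), "tokens (rough estimate)")
```

Because the snapshot names each control, an agent can refer to elements by name (as in the "click by element name" action) instead of guessing pixel coordinates.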

Day 1 demo: filled 360 Google Sheets cells in 90 seconds, read and replied to a Claude.ai conversation cross-app, wrote to Notepad instantly. No screenshots, no vision model, no coordinate guessing.

Limitations (honest): built in 8.5 hours, single-app scope, Chromium apps need a 4-phase activation hack to expose their tree, accessibility quality varies by app. AGPL-3.0.

Demo: https://youtu.be/nvZobyt0KBg
Full technical paper: in the repo under Dokumentation/ and on https://dev.to/tlrag/i-built-a-new-software-primitive-in-85-...