Plaidify – Give AI agents access to any login-protected website
Plaid comparison oversells it — this is Browserless with CSS selector blueprints.
Use LLMs to robustly extract web data
JSON recovery fixes malformed LLM output when Firecrawl and JinaAI fail on nested schemas.
Backend developers building web scraping pipelines
Firecrawl · JinaAI · Apify
LLMs seemed like the obvious fix — just throw the HTML at GPT and ask for JSON. Except in practice, it's more painful than that:
- Raw HTML is full of nav bars, footers, and tracking junk that eats your token budget. A typical product page is 80% noise. - LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes. - Relative URLs, markdown-escaped links, tracking parameters — the "small" URL issues compound fast when you're processing thousands of pages. - You end up writing the same boilerplate: HTML cleanup → markdown conversion → LLM call → JSON parsing → error recovery → schema validation. Over and over.
We got tired of rebuilding this stack for every project, so we extracted it into a library.
Lightfeed Extractor is a TypeScript library that handles the full pipeline from raw HTML to validated, structured data:
- Converts HTML to LLM-ready markdown with main content extraction (strips nav, headers, footers), optional image inclusion, and URL cleaning - Works with any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama, etc.) - Uses Zod schemas for type-safe extraction with real validation - Recovers partial data from malformed LLM output instead of failing entirely — if 19 out of 20 products parsed correctly, you get those 19 - Built-in browser automation via Playwright (local, serverless, or remote) with anti-bot patches - Pairs with our browser agent (@lightfeed/browser-agent) for AI-driven page navigation before extraction
We use this ourselves in production at Lightfeed, and it's been solid enough that we decided to open-source it.
GitHub: https://github.com/lightfeed/extractor npm: npm install @lightfeed/extractor Apache 2.0 licensed.
Happy to answer questions or hear feedback.
Plaid comparison oversells it — this is Browserless with CSS selector blueprints.
Token efficiency beats Stagehand — 2-5k vs 29-51k per action with cached selectors.
Free llms.txt generator when several already exist for this emerging standard.
LLM-powered RSS replacement for sites that abandoned feeds — clever workaround.
Massive LLM benchmark testing layout reconstruction on millions of real pricing pages.
Missing primitive: tolerant JSON parser for streaming LLM output, typed and <1KB.