GitHub Repository

Stop indexing noise. Turn messy websites and PDFs into clean, structured data for RAG pipelines with semantic importance scoring and token optimization.

3 stars

RAG-Ready Extractor – Structure-aware ingestion with semantic scoring

by cddIT · Mar 8, 2026 · 3 points · 1 comment

AI Analysis

● Mid · Solve My Problem

Noise-filtered PDF and web extraction for RAG, a problem that Jina and Firecrawl already address.

Strengths

• Importance scoring per chunk is a thoughtful angle compared with naive text splitting; the metadata context (heading_context) aids retrieval quality
• Cost benchmarks (30-70% token reduction) are concrete and motivate the problem
• FastAPI + RapidAPI distribution removes friction for quick integration
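
To make the importance-scoring idea concrete, here is a minimal sketch of filtering chunks by score before indexing. It does not show the project's actual API or extraction method: the `Chunk` class, the `importance` field, the 0.5 threshold, and the word-count token proxy are all assumptions for illustration; only the `heading_context` name comes from the review above.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    heading_context: str  # name taken from the review; its exact semantics are assumed
    importance: float     # hypothetical score in [0, 1]


def estimate_tokens(text: str) -> int:
    # Crude proxy: whitespace-separated words. Real tokenizers count differently.
    return len(text.split())


def filter_chunks(chunks: list[Chunk], threshold: float = 0.5) -> list[Chunk]:
    # Keep only chunks whose (assumed) importance score meets the threshold.
    return [c for c in chunks if c.importance >= threshold]


def token_reduction(chunks: list[Chunk], kept: list[Chunk]) -> float:
    # Fraction of estimated tokens dropped by filtering.
    before = sum(estimate_tokens(c.text) for c in chunks)
    after = sum(estimate_tokens(c.text) for c in kept)
    return 0.0 if before == 0 else 1 - after / before


# Made-up sample data, not output from the project.
chunks = [
    Chunk("Accept cookies to continue browsing this site", "Footer", 0.05),
    Chunk("Install the package with pip and set the API key", "Getting Started > Install", 0.9),
    Chunk("Subscribe to our newsletter for weekly updates today", "Sidebar", 0.1),
    Chunk("Each chunk carries a score and its heading path", "Concepts > Scoring", 0.8),
]

kept = filter_chunks(chunks)
reduction = token_reduction(chunks, kept)
print(f"kept {len(kept)} of {len(chunks)} chunks, ~{reduction:.0%} fewer tokens")
```

The reduction on this toy input says nothing about the 30-70% figure quoted above; it only shows how such a number could be measured.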

Weaknesses

• Semantic extraction is already handled by Jina Reader, Firecrawl, and Document AI; the technical differentiation is unclear
• Early-stage: only 2 GitHub stars, no live API demo, and vague on the extraction method (heuristic vs. learned)

Similar Projects

AI/ML · ●●● Banger

Dewey – Ingest docs, search semantically, get cited AI answers

Structure-aware chunking beats flat embeddings for accurate multi-hop research and citations.

Big Brain · Slick · Solve My Problem

lambdabaa

30 · 2mo ago

AI/ML · ○ Pass

Krira Augment – Production-ready RAG in minutes

Waitlist for RAG platform launching in 2 months with no demo.

Bold Bet

kriralabs

10 · 2mo ago

Infrastructure · ●● Solid

Rivestack – Managed PostgreSQL with pgvector, $29/mo

It spins up dedicated Postgres instances with pgvector pre-installed, uses Patroni for HA and pgBackRest for snapshots, and publishes concrete vector benchmarks (2k QPS @ <4ms for 10k vectors; 252 QPS at 1M). The stack choices (Hetzner NVMe, read replicas, HNSW) feel pragmatic for teams who don't want serverless/shared trade-offs, though I'd want clearer SLA/multi-region details and independent benchmarks at larger scales before moving critical workloads.

Niche Gem · Solve My Problem

stranger90

10 · 4mo ago

AI/ML · ●● Solid