Stop indexing noise. Turn messy websites and PDFs into clean, structured data for RAG pipelines with semantic importance scoring and token optimization.

3 stars

RAG-Ready Extractor – Structure-aware ingestion with semantic scoring

by cddIT·Mar 8, 2026·3 points·1 comment

AI Analysis

Mid · Solve My Problem

Noise-filtered PDF/web extraction for RAG, but Jina and Firecrawl already solve this problem.

Strengths
  • Importance scoring per chunk is a thoughtful angle vs. naive text splitting; metadata context (heading_context) aids retrieval quality
  • Cost benchmarks (30-70% token reduction) are concrete and motivate the problem
  • FastAPI + RapidAPI distribution removes friction for quick integration
Weaknesses
  • Semantic extraction already handled by Jina Reader, Firecrawl, and Document AI; unclear technical differentiation
  • Early-stage: only 2 GitHub stars, no live API demo, vague on extraction method (heuristic vs. learned)
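To make the "importance scoring per chunk" and `heading_context` ideas concrete, here is a minimal heuristic sketch. It is not the project's method (the listing notes that the extraction approach is unspecified); the `Chunk` fields other than `heading_context`, the `BOILERPLATE` phrase list, the scoring formula, and the 0.2 threshold are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical boilerplate markers; a real extractor would need a richer signal.
BOILERPLATE = ("cookie", "subscribe", "all rights reserved", "privacy policy")


@dataclass
class Chunk:
    text: str
    heading_context: str  # nearest preceding heading, kept as retrieval metadata
    importance: float     # heuristic score in [0, 1]


def score(text: str) -> float:
    """Toy importance score: longer, more varied, non-boilerplate text scores higher."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    if not words:
        return 0.0
    length_score = min(len(words) / 40.0, 1.0)   # saturates at 40 words
    diversity = len(set(words)) / len(words)     # fraction of unique words
    penalty = 0.5 if any(p in text.lower() for p in BOILERPLATE) else 1.0
    return length_score * diversity * penalty


def chunk_markdown(doc: str) -> list[Chunk]:
    """Split on blank lines; assumes each '#' heading sits in its own block."""
    chunks: list[Chunk] = []
    heading = ""
    for block in re.split(r"\n\s*\n", doc.strip()):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            heading = block.lstrip("#").strip()
            continue
        chunks.append(Chunk(block, heading, score(block)))
    return chunks


def filter_chunks(chunks: list[Chunk], threshold: float = 0.2) -> list[Chunk]:
    """Drop low-importance chunks before embedding, which is where token savings would come from."""
    return [c for c in chunks if c.importance >= threshold]


doc = (
    "# Pricing\n\n"
    "The managed plan costs 29 dollars per month and includes daily backups, "
    "read replicas, and a dedicated instance with vector search enabled by default.\n\n"
    "Subscribe to our newsletter\n\n"
    "# Setup\n\n"
    "ok"
)
chunks = chunk_markdown(doc)
kept = filter_chunks(chunks)
```

In this sketch the newsletter line and the one-word block fall below the threshold, so only the pricing paragraph would be embedded, with "Pricing" attached as its `heading_context`. A learned scorer could replace `score` without changing the rest of the pipeline.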
Category
Target Audience

RAG builders, LLM engineers indexing documents, LangChain/LlamaIndex users

Similar To

Jina Reader · Firecrawl · AWS Textract

Similar Projects

Infrastructure · ●● Solid

Rivestack – Managed PostgreSQL with pgvector, $29/mo

It spins up dedicated Postgres instances with pgvector pre-installed, uses Patroni for HA and pgBackRest for snapshots, and publishes concrete vector benchmarks (2k QPS at <4 ms for 10k vectors; 252 QPS at 1M vectors). The stack choices (Hetzner NVMe, read replicas, HNSW) feel pragmatic for teams that don't want the trade-offs of serverless or shared hosting, though I'd want clearer SLA and multi-region details, plus independent benchmarks at larger scales, before moving critical workloads.

Niche Gem · Solve My Problem
stranger90
104mo ago