ocrbase – PDF/IMG –>.MD/JSON Model-Agnostic OCR API
Yet another OCR API wrapper when JinaAI and Firecrawl already exist.

94.5% accuracy, self-hostable, open source—beats Textract on cost and accuracy.
Backend developers, document processing teams, businesses replacing AWS Textract or Azure Document AI
AWS Textract · Azure Document AI · Anthropic's document parsing
Yet another OCR API wrapper when JinaAI and Firecrawl already exist.
Offline Ollama + OCR keeps your documents private when cloud APIs won't.
LLM infers schema once, Go does 10k-row extraction—avoids token waste.
ProofPudding returns extraction results with explicit links back to the exact page and source text, supports native and scanned PDFs plus DOCX/images, and ships Python/TypeScript SDKs — handy for agents that need auditable facts. It’s a pragmatic product (per-extraction pricing and confidence scores are nice), but the market is crowded; I want clarity on underlying models, real-world accuracy numbers, and how it compares to Document AI/Textract in edge cases.
93% accuracy document extraction, but remove.bg-style competition already exists.
Useful dataset for UK researchers but it's a Kaggle upload, not a reusable tool.