udoc. Dependency-free document extraction in Rust

by newelh·May 20, 2026·5 points·1 comment

AI Analysis

●●● Banger · Wizardry · Big Brain

Pure Rust parsers for legacy Office formats with zero external dependencies.

Strengths
  • Native parsers for binary .doc and .xls eliminate the need for headless LibreOffice.
  • Streaming JSONL output allows processing multi-gigabyte PDFs without OOM errors.
  • Unified document model abstracts away format-specific quirks for all inputs.
Weaknesses
  • Cargo crate not yet published; for now it requires a manual build or uvx.
  • OCR and layout detection require piping output to external hooks manually.
Target Audience

Backend engineers building document processing pipelines

Similar To

Apache Tika · Pandoc · Poppler

Similar Projects

AI/ML · ●● Solid

DocMason – AI Agent Knowledge Base for local complex office files

Preserves document structure instead of flattening to text like most RAG tools.

Solve My Problem · Bold Bet
Jet_Xu
232mo ago
AI/ML · ●● Solid

ProofPudding – Document Extraction API with Citations (PDF/Docx)

ProofPudding returns extraction results with explicit links back to the exact page and source text, supports native and scanned PDFs plus DOCX/images, and ships Python/TypeScript SDKs — handy for agents that need auditable facts. It’s a pragmatic product (per-extraction pricing and confidence scores are nice), but the market is crowded; I want clarity on underlying models, real-world accuracy numbers, and how it compares to Document AI/Textract in edge cases.

Solve My Problem · Slick
garai
104mo ago
AI/ML · ●● Solid

DocMason – Agent Knowledge Base for local complex office files

Provenance-first RAG beats anonymous text chunks, but Cursor and Continue already own this space.

Big Brain · Niche Gem
Jet_Xu
1102mo ago