We built an OCR server that processes 270 dense images/s on an RTX 5090.
50x faster than PaddleOCR Python with real TensorRT benchmarks.
Fast GPU OCR server. 270 img/s on FUNSD. TensorRT FP16, PP-OCRv5, HTTP + gRPC.
50x faster than PaddleOCR Python with real TensorRT benchmarks on RTX 5090.
Engineers processing documents at scale
PaddleOCR · EasyOCR · Tesseract
92× faster than CPU Super Bloom with minimizer-based shard selection.
This reads like a GPU engineer's field notes: one ~3,400-line CUDA file implements a full per-thread crypto pipeline (key gen → EC multiply → SHA-256 → RIPEMD-160) and a two-stage bloom + binary-search matcher that checks ~100M keys per batch against ~3,100 targets. The article digs into concrete low-level choices (LUT layout, memory hierarchy, __ldg reads, atomicCAS reporting, and per-mode keygen strategies), which is rare in public writeups. The downsides are that it is closed-source and that the dual-use and ethical implications should be called out more explicitly.
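The two-stage bloom + binary-search idea can be illustrated with a small CPU-side sketch. This is not the project's code (which is closed-source): the class name TwoStageMatcher, the filter size, the hash count, and the SHA-256-based bit positions are all assumptions chosen for illustration. The point is only the structure: a cheap probabilistic filter rejects almost every candidate, and the rare survivors are confirmed with an exact binary search over a sorted target list.

```python
import hashlib
from bisect import bisect_left


class TwoStageMatcher:
    """Hypothetical sketch: Bloom-filter prefilter, then exact binary search."""

    def __init__(self, targets, num_bits=1 << 16, num_hashes=4):
        # Stage 2 data: sorted, de-duplicated targets for binary search.
        self.sorted_targets = sorted(set(targets))
        self.num_bits = num_bits          # assumed size; must be a multiple of 8
        self.num_hashes = num_hashes      # assumed number of probes per item
        self.bits = bytearray(num_bits // 8)
        # Stage 1 data: set the Bloom bits for every target.
        for target in self.sorted_targets:
            for pos in self._positions(target):
                self.bits[pos >> 3] |= 1 << (pos & 7)

    def _positions(self, item):
        # Derive the probe positions from one digest via double hashing.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def maybe_contains(self, item):
        # Stage 1: may return false positives, never false negatives.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

    def contains(self, item):
        # Stage 2 runs only for candidates that survive the Bloom filter.
        if not self.maybe_contains(item):
            return False
        i = bisect_left(self.sorted_targets, item)
        return i < len(self.sorted_targets) and self.sorted_targets[i] == item


# Stand-in data: 3,100 arbitrary 20-byte values (the length of a RIPEMD-160 digest).
targets = [hashlib.sha256(str(i).encode()).digest()[:20] for i in range(3100)]
matcher = TwoStageMatcher(targets)
print(matcher.contains(targets[42]))   # True: a real member passes both stages
print(matcher.contains(b"\x00" * 20))  # False: not in the target list
```

On a GPU the same split presumably keeps the common case (no match) down to a few memory reads per key, with the exact search paid only on Bloom hits; how the original implements either stage is not stated in the summary.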
CUDA pipeline hits 60 FPS on 45MP RAW files, competing with Darktable.
Direct2D GPU PDF renderer with CPU fallback, but alpha-stage and Windows-only.
Infinite-canvas terminal with a minimap: like tmux tabs, but actually visible.