GoldenMatch – 100M-row dedupe on Ray in 213s, no Spark, Arrow-native
Ray-based dedupe at 100M rows without Spark — that's a real architectural choice.
Zero-config entity resolution. The zero-tuning Fellegi-Sunter path beats hand-rolled Splink head-to-head; scales from a CSV to a verified 100M-row dedupe in 9.2 min on Ray. Fuzzy/exact/probabilistic + PPRL + LLM, identity graph. Python + edge-safe TypeScript (optional WASM), SQL-native in Postgres & DuckDB, MCP/REST + dbt/Airflow.
Fellegi-Sunter matching with active learning beats Dedupe.io on complex datasets.
Data engineers and analysts working with messy duplicate records
Dedupe.io · OpenRefine · Tamr
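For readers unfamiliar with the Fellegi-Sunter matching mentioned above, here is a minimal sketch of how a pair of records is scored under that model. It is not GoldenMatch's API. The field names, the m/u probabilities, and the helper names `compare` and `match_weight` are all hypothetical, chosen only for illustration.

```python
import math

# Hypothetical per-field probabilities (illustrative values, not learned):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
M_U = {
    "name": (0.95, 0.01),
    "email": (0.90, 0.001),
    "city": (0.80, 0.10),
}


def compare(a, b):
    """Return a per-field agreement vector using exact equality."""
    return {field: a.get(field) == b.get(field) for field in M_U}


def match_weight(agreements):
    """Sum the log2 likelihood ratios over all compared fields.

    Agreement on a field adds log2(m / u); disagreement adds
    log2((1 - m) / (1 - u)), which is negative when m > u.
    """
    total = 0.0
    for field, agrees in agreements.items():
        m, u = M_U[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


rec_a = {"name": "ana lopez", "email": "ana@example.com", "city": "austin"}
rec_b = {"name": "ana lopez", "email": "ana@example.com", "city": "dallas"}
rec_c = {"name": "bo chen", "email": "bo@example.com", "city": "dallas"}

score_ab = match_weight(compare(rec_a, rec_b))  # agrees on name and email
score_ac = match_weight(compare(rec_a, rec_c))  # agrees on nothing
```

A pair is then classified as a match, a non-match, or a candidate for review by comparing its weight against two thresholds. In practice the m and u values are estimated from data (commonly with expectation-maximization) rather than set by hand, and exact equality is replaced with fuzzy comparison levels.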
YAML-driven record linkage beats hand-rolled SQL, but Splink already solved this.
Deduping PRs and scoring them with 20 heuristic signals is a concrete, useful idea, especially the scope-coherence signal and the embedding auto-fallback for providers without embeddings. The repo supports a CLI, a persistent server, GitHub App integration, and an explicit --model flag for provider flexibility. It is still early, though, and adoption/UX examples (ranked output, workflows) are thin: promising engineering scaffolding that needs real-world validation.
Entropy-based context compression beats naive token stuffing, but the category is crowded.
Entity-centric memory cuts context 90% while matching full-text performance on NovelQA.
Yet another hallucination checker when Guardrails and LMQL already cover this.