GitHub Repository

Zero-config entity resolution. The zero-tuning Fellegi-Sunter path beats hand-rolled Splink head-to-head; scales from a CSV to a verified 100M-row dedupe in 9.2 min on Ray. Fuzzy/exact/probabilistic + PPRL + LLM, identity graph. Python + edge-safe TypeScript (optional WASM), SQL-native in Postgres & DuckDB, MCP/REST + dbt/Airflow.

110 stars · Python

GoldenMatch – Entity resolution with LLM scoring, 97% F1, no Spark

by benzsevern·Mar 21, 2026·3 points·0 comments

AI Analysis

●● Solid · Big Brain · Solve My Problem

Fellegi-Sunter matching with active learning beats Dedupe.io on complex datasets.

Strengths

• Fellegi-Sunter EM-trained probabilities with automatic threshold estimation built in.
• Active learning TUI: label 10 borderline pairs, instantly retrain classifier.
• Privacy-preserving Bloom filter transforms for fuzzy matching on encrypted PII.
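
For readers unfamiliar with the Fellegi-Sunter model named above, the following is a minimal sketch of how per-field agreement turns into a match score. It is not GoldenMatch's API: the function names, the field list, and the m/u values are all hypothetical, and in practice the m/u probabilities would be estimated with EM rather than hard-coded.

```python
import math

# Hypothetical per-field parameters (illustrative values only, not from GoldenMatch).
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
PARAMS = {
    "name": (0.95, 0.01),
    "email": (0.90, 0.001),
    "city": (0.85, 0.10),
}


def match_weight(agreements, params=PARAMS):
    """Sum the log2 likelihood ratios over all compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = params[field]
        if agrees:
            total += math.log2(m / u)  # agreement is evidence for a match
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement is evidence against
    return total


def match_probability(weight, prior=0.01):
    """Combine the weight with a prior match rate: posterior odds = prior odds * 2**weight."""
    odds = (prior / (1 - prior)) * 2 ** weight
    return odds / (1 + odds)


if __name__ == "__main__":
    pair = {"name": True, "email": True, "city": False}
    w = match_weight(pair)
    print(f"weight={w:.2f} bits, probability={match_probability(w):.4f}")
```

A pair is then classified by comparing the weight (or probability) against a threshold; the "automatic threshold estimation" mentioned above would choose that cutoff, which this sketch leaves to the caller.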

Weaknesses

• Entity resolution is crowded: Dedupe.io, OpenRefine, and commercial tools already exist.
• LLM scoring and Vertex AI embeddings require paid API keys for best accuracy.

Similar Projects

Data · ●● Solid

GoldenMatch – 100M-row dedupe on Ray in 213s, no Spark, Arrow-native

Ray-based dedupe at 100M rows without Spark — that's a real architectural choice.

Big Brain · Ship It

benzsevern

3013d ago

Data · ●● Solid

Kanoniv – Identity resolution in 170 lines of YAML, built in Rust

YAML-driven record linkage beats hand-rolled SQL, but Splink already solved this.

Niche Gem · Ship It · Solve My Problem

dreynow

104mo ago

Developer Tools · ●● Solid

Treliq – PR triage CLI with 20 signals and optional LLM scoring

Deduping PRs and scoring them with 20 heuristic signals is a concrete, useful idea, especially the scope-coherence signal and the embedding auto-fallback for providers without embeddings. The repo supports a CLI, a persistent server, GitHub App integration, and an explicit --model flag for provider flexibility. It is still early, though, and adoption/UX examples (ranked output, workflows) are thin: promising engineering scaffolding that needs real-world validation.

Niche Gem · Solve My Problem

chrismagno

103mo ago

AI/ML · ●● Solid