GitHub Repository

Zero-config entity resolution. The zero-tuning Fellegi-Sunter path beats hand-rolled Splink head-to-head; scales from a CSV to a verified 100M-row dedupe in 9.2 min on Ray. Fuzzy/exact/probabilistic + PPRL + LLM, identity graph. Python + edge-safe TypeScript (optional WASM), SQL-native in Postgres & DuckDB, MCP/REST + dbt/Airflow.
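For readers unfamiliar with the Fellegi-Sunter model named in the description, here is a minimal sketch of how such a scorer weighs a candidate pair. It is not GoldenMatch's implementation or API: the field names, the m/u probabilities, and the `match_weight` helper are all hypothetical and chosen only for illustration. Each field contributes a log-likelihood ratio, and the sum is compared against a threshold to decide whether two records refer to the same entity.

```python
import math

# Hypothetical per-field probabilities (illustrative values only, not taken
# from GoldenMatch or any benchmark):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "email": (0.90, 0.001),
    "city": (0.85, 0.10),
}


def match_weight(record_a, record_b, params=FIELD_PARAMS):
    """Sum the log2 likelihood ratios over all fields.

    Uses exact equality as the comparison, the simplest possible case.
    A missing field on both sides compares equal here; real systems
    usually treat missing values as a separate comparison level.
    """
    total = 0.0
    for field, (m, u) in params.items():
        if record_a.get(field) == record_b.get(field):
            # Agreement is evidence for a match when m > u.
            total += math.log2(m / u)
        else:
            # Disagreement is evidence against a match when m > u.
            total += math.log2((1 - m) / (1 - u))
    return total


a = {"name": "Ada Lovelace", "email": "ada@example.com", "city": "London"}
b = {"name": "Ada Lovelace", "email": "ada@example.com", "city": "Leeds"}
c = {"name": "Alan Turing", "email": "alan@example.com", "city": "Wilmslow"}

print("a vs b:", match_weight(a, b))  # two fields agree, one disagrees
print("a vs c:", match_weight(a, c))  # no fields agree
```

In this sketch the rare-by-chance field (email, with a very small u) dominates the score when it agrees, which is the main intuition behind the model. Production tools estimate m and u from the data rather than hard-coding them.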

112 stars · Python

GoldenMatch – 100M-row dedupe on Ray in 213s, no Spark, Arrow-native

by benzsevern·Jun 4, 2026·3 points·0 comments

AI Analysis

●● Solid · Big Brain · Ship It

Ray-based dedupe at 100M rows without Spark — that's a real architectural choice.

Strengths
  • 0.30 GB driver footprint while processing 100M records is genuinely impressive.
  • Polyglot support across Python, TypeScript, PostgreSQL, and DuckDB.
  • MCP servers and dbt recipes show production integration thinking.
Weaknesses
  • Entity resolution already has established players like Splink and Dedupe.io.
  • 74 stars suggests early adoption — real-world scale validation still pending.
Category
Target Audience

Data engineers, data scientists

Similar To

Splink · Dedupe.io · OpenRefine

Similar Projects

Data · ●● Solid

GoldenMatch – Entity resolution with LLM scoring, 97% F1, no Spark

Fellegi-Sunter matching with active learning beats Dedupe.io on complex datasets.

Big Brain · Solve My Problem
benzsevern
3 · 0 · 2mo ago
AI/ML · Mid

Small Model Marketplace W 100M Tokens Free

100M free tokens is generous, but Hugging Face and Replicate already host models.

Niche Gem
robmay
1 · 0 · 2mo ago