Data Projects

Data tools and analytics from Show HN — visualization, ETL pipelines, databases, and data science utilities.

Distributed DataFrames for Elixir powered by DuckDB

77 Elixir
●●●Banger

Distributed DuckDB on the BEAM

Distributes DuckDB queries across BEAM nodes; built after Polars maintenance became unsustainable.

Big Brain, Niche Gem
cigrainger
70 · 2mo ago

Open-source streaming SQL engine written in Rust using Apache Arrow and DataFusion. Supports continuous queries, temporal stream joins, tumbling/session windows, and CDC/Kafka connectors. Lightweight and embeddable, with sub-microsecond latency.

37 Rust
●●●Banger

LaminarDB – Streaming SQL database in Rust, zero-alloc hot path

Sub-microsecond streaming SQL via zero-alloc hot path; genuine advancement over SQLite+DataFusion.

Wizardry, Niche Gem, Solve My Problem
sujitn
61 · 3mo ago
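
The tumbling and session windows mentioned in the LaminarDB description can be illustrated with a minimal sketch. This is plain Python, not LaminarDB's API or SQL syntax; the function names `tumbling_window_counts` and `session_windows` and the sample events are hypothetical, chosen only to show how the two window types group events.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per fixed, non-overlapping window.

    events: iterable of (timestamp, key) pairs (hypothetical sample shape).
    Returns a dict mapping (window_start, key) to a count.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Each timestamp belongs to exactly one window.
        window_start = (ts // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


def session_windows(timestamps, gap):
    """Group timestamps into sessions, starting a new session
    whenever two neighbors are more than `gap` apart."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


tumbling = tumbling_window_counts(
    [(1, "a"), (4, "a"), (11, "a"), (12, "b")], window_size=10
)
sessions = session_windows([1, 2, 3, 10, 11, 30], gap=5)
```

A tumbling window assigns each event by integer division of its timestamp, while a session window has no fixed length and closes only after a quiet period longer than `gap`.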

Column-oriented analytics for SQLite, designed for fast local analytical queries with benchmarked speedups of up to 130,000x over standard SQLite scans on selected workloads.

4 C
●●●Banger

SQLite-Columnar – Columnar Storage for SQLite

Columnar storage inside SQLite delivers up to 130,000x speedup on selected aggregation scans.

Wizardry, Big Brain
marcobambini
40 · 1mo ago
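
As a rough illustration of why a column-oriented layout helps aggregation scans, the sketch below stores the same records row-wise and column-wise in plain Python. It is not SQLite-Columnar's implementation or API; the field names and the helpers `sum_amount_rows` and `sum_amount_columns` are hypothetical, and the toy makes no claim about the size of any speedup.

```python
from array import array

# Row-oriented: one tuple per record (id, bucket, amount).
rows = [(i, i % 7, float(i)) for i in range(1000)]

# Column-oriented: one contiguous typed array per field.
columns = {
    "id": array("q", (r[0] for r in rows)),
    "bucket": array("q", (r[1] for r in rows)),
    "amount": array("d", (r[2] for r in rows)),
}


def sum_amount_rows(rows):
    # Must visit every whole record even though only one field is needed.
    return sum(record[2] for record in rows)


def sum_amount_columns(columns):
    # Reads a single contiguous array and never touches the other fields.
    return sum(columns["amount"])


row_total = sum_amount_rows(rows)
column_total = sum_amount_columns(columns)
```

Both functions return the same total; the difference is that the columnar version reads only the data the aggregate needs.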

Rasteret is a library for reading GeoTIFFs 20x+ faster than Rasterio/GDAL. It interoperates with TorchGeo, Xarray, DuckDB, and Polars.

132 Python
●●●Banger

Rasteret – library for fast table-like access to satellite imagery

Parquet-cached COG metadata cuts satellite image cold starts from minutes to seconds.

Niche Gem, Solve My Problem, Wizardry
sid_tf
30 · 3mo ago

Bankstatemently Open Benchmark for bank statement parsing: 15 synthetic statements, 40 parsing challenges, and automated scoring via API.

1 Shell
●●●Banger

Open-source synthetic bank statements for testing parsers

First shared benchmark for bank statement parsers when real data is legally off-limits.

Solve My Problem, Zero to One, Dark Horse
Maesh
30 · 2mo ago

A new search paradigm where documents have gravity, queries converge into basins, and multi-signal scoring uses interference instead of linear fusion.

36 Rust
●●●Banger

I rebuilt search using physics instead of statistics. +18.5% NDCG@10

Gravity-based search with interference scoring beats linear fusion by 18.5% on NDCG@10.

Big Brain, Wizardry, Zero to One
KendallCBooker
20 · 2mo ago

Query the full Hacker News archive from Postgres via duckdb_fdw, with zero copies. Row groups stream straight from the Hugging Face Parquet dataset on demand.

4 Python
●●●Banger

HN-fdw – All of Hacker News, queryable from Postgres, with zero copies

Zero-copy Postgres queries against 47M rows using DuckDB FDW and HTTP range requests.

Wizardry, Big Brain, Dark Horse
tamnd
20 · 2mo ago

An experimental embedded SQL engine in C++20. Query Parquet, CSV, JSON, Arrow, Avro, SQLite, and Excel files directly with SQL, in-process. Early-stage.

493 C++
●●●Banger

SlothDB is a super fast embedded SQL database

1.3MB WASM database beats DuckDB 5x on 10M row analytics, runs in Cloudflare Workers.

Wizardry, Big Brain, Zero to One
souravroy78
22 · 1mo ago