
GitHub Repository

Mamba SSM and Mamba-3 SISO in Rust with optional CUDA GPU acceleration. Inference and training (BPTT through SSM state, AdamW), CPU + GPU paths, custom CUDA kernels, CUDA Graph capture, f32 / bf16 / f16. Batch-invariant bf16 inference — per-row output is bit-identical across batch sizes.

10 stars · Rust

Mamba SSM in Rust – training and inference with custom CUDA kernels

by silvermpx·Mar 23, 2026·1 point·0 comments


AI Analysis

●● Solid · Wizardry · Niche Gem

Custom CUDA kernels for SSM recurrence with zero framework dependencies.

Strengths

•Full BPTT through recurrent SSM state enables actual training, not just inference.
•Zero-allocation single-step inference runs in ~200 μs on CPU, with no GPU required.
•Standalone design means no PyTorch, Burn, or Candle dependency chain.
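
The "zero-allocation single-step inference" point can be illustrated with a minimal sketch. This is not the project's code or API: the `SsmState` type, its `step` method, and the per-channel diagonal recurrence `h <- a * h + b * x`, `y = sum(c * h)` are all assumptions chosen for illustration, and the real Mamba selective scan computes its parameters from the input. The only idea shown is that the state buffer is allocated once and updated in place for each token.

```rust
/// Hypothetical recurrent state for one channel of a diagonal SSM.
/// Illustrative only; not a type from the project.
pub struct SsmState {
    /// Hidden state, allocated once when the state is created.
    h: Vec<f32>,
}

impl SsmState {
    /// Allocate a zeroed state of dimension `n`.
    pub fn new(n: usize) -> Self {
        Self { h: vec![0.0; n] }
    }

    /// One recurrent step: h <- a * h + b * x, then y = sum(c * h).
    /// The update happens in place on `self.h`, so no heap allocation
    /// occurs per token.
    pub fn step(&mut self, a: &[f32], b: &[f32], c: &[f32], x: f32) -> f32 {
        assert_eq!(a.len(), self.h.len());
        assert_eq!(b.len(), self.h.len());
        assert_eq!(c.len(), self.h.len());

        let mut y = 0.0f32;
        for (((h, &ai), &bi), &ci) in self.h.iter_mut().zip(a).zip(b).zip(c) {
            *h = ai * *h + bi * x; // decay the old state, inject the new input
            y += ci * *h;          // read out through the output projection
        }
        y
    }
}
```

Calling `step` once per token on a long-lived `SsmState` keeps per-token work to a single pass over the state vector, which is the property the bullet describes.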

Weaknesses

•Mamba implementations already exist in multiple languages, so a Rust port is not unique.
•No benchmark comparisons against official Mamba or other ports.

Category

Target Audience

ML engineers wanting Rust-based SSM implementations

Similar To

mamba-minimal · Candle · Burn

Similar Projects

Developer Tools · ●●● Banger

Trained a 12M transformer on an ML framework we built from scratch

Custom CUDA kernels and Rust backend with a TypeScript API built by students in four months.

Wizardry · Big Brain · Ship It

caliandbust

222mo ago

Developer Tools · ●●● Banger

cuTile Rust: Safe, data-race-free GPU kernels in Rust

Extends Rust's ownership model across GPU boundary with tile-based partitioning for data-race-free kernels.

Wizardry · Big Brain · Niche Gem

melihelibol

106184d ago

Infrastructure · ●● Solid

ZSE – Single-file LLM engine with dual INT4 kernels

INT4 inference engine beats llama.cpp on VRAM usage but competes against established tools.

Wizardry · Ship It

zyoralabs

103mo ago

AI/ML · ●●● Banger

We built an LLM inference engine in pure Python – no PyTorch, no Triton

30x faster cold start than vLLM with zero PyTorch dependencies.

Wizardry · Big Brain · Zero to One

zyoraclub

2019d ago

AI/ML · ●●● Banger

Glq LLM quantization using E8 lattice

E8 lattice codebooks beat GPTQ at 2-4 bpw with fused CUDA kernel skipping weight materialization.

Wizardry · Big Brain

acd

2019d ago

AI/ML · ● Mid

FlashQwen – A from-scratch CUDA inference engine for Qwen3

Another inference engine when vLLM and llama.cpp already dominate.

Bold Bet · Niche Gem

langtang1996

505d ago