Trained a 12M-parameter transformer on an ML framework we built from scratch
Custom CUDA kernels and a Rust backend with a TypeScript API, built by students in four months.
Mamba SSM and Mamba-3 SISO in Rust with optional CUDA GPU acceleration. Supports inference and training (BPTT through the SSM state, AdamW) on CPU and GPU paths, with custom CUDA kernels, CUDA Graph capture, and f32 / bf16 / f16 precision. bf16 inference is batch-invariant: per-row output is bit-identical across batch sizes.
Custom CUDA kernels for SSM recurrence with zero framework dependencies.
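For readers unfamiliar with the recurrence mentioned above, here is a minimal CPU sketch of a linear state-space scan, h_t = a * h_(t-1) + b * x_t with y_t = c * h_t. It is a simplification under stated assumptions: a single scalar channel with fixed parameters, whereas Mamba's selective scan makes the parameters input-dependent. The names `SsmParams`, `ssm_scan`, and `ssm_scan_batch` are hypothetical and not taken from the project.

```rust
/// Hypothetical parameters for a one-channel, already-discretized SSM.
/// Illustrative only; not the project's actual types.
#[derive(Clone, Copy)]
pub struct SsmParams {
    pub a: f32, // state decay
    pub b: f32, // input gain
    pub c: f32, // output gain
}

/// Runs h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over one sequence,
/// starting from h = 0.
pub fn ssm_scan(p: SsmParams, xs: &[f32]) -> Vec<f32> {
    let mut h = 0.0f32;
    xs.iter()
        .map(|&x| {
            h = p.a * h + p.b * x;
            p.c * h
        })
        .collect()
}

/// Processes a batch one row at a time. Each row sees the same sequence of
/// floating-point operations no matter how many other rows are present.
pub fn ssm_scan_batch(p: SsmParams, batch: &[Vec<f32>]) -> Vec<Vec<f32>> {
    batch.iter().map(|row| ssm_scan(p, row)).collect()
}
```

Because each row here is scanned independently and in a fixed order, a row's output bits do not depend on batch size. That is one simple way to think about the batch-invariance property; how a GPU kernel achieves it is presumably more involved and is not shown here.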
ML engineers wanting Rust-based SSM implementations
mamba-minimal · Candle · Burn
Extends Rust's ownership model across the GPU boundary with tile-based partitioning for data-race-free kernels.
INT4 inference engine beats llama.cpp on VRAM usage, but competes against established tools.
30x faster cold start than vLLM with zero PyTorch dependencies.
E8 lattice codebooks beat GPTQ at 2-4 bpw, with a fused CUDA kernel that skips weight materialization.
Another inference engine when vLLM and llama.cpp already dominate.