MaximusLLM – Train 262k-vocab LLMs on a single 16GB GPU
Ghost Logit math bypasses 262k vocab OOM without materializing full matrices.
SCAO is a sparse, second-order PyTorch optimizer designed as a high-throughput, drop-in replacement for AdamW.
Runs Shampoo-quality second-order optimization on a 16GB T4 where others OOM immediately.
ML engineers fine-tuning LLMs on consumer hardware
bitsandbytes · DeepSpeed
I'm Danilo. I've been struggling with the limitations of AdamW when fine-tuning LLMs locally. Second-order optimizers (like Shampoo or SOAP) offer significantly better step-convergence by exploiting Kronecker-factored curvature. The problem? They require O(d^2) memory and O(d^3) compute per layer, which immediately OOMs consumer hardware like a 16GB T4 or RTX 3090.
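To put the O(d^2) memory cost in concrete terms, here is a rough back-of-the-envelope sketch. It assumes the standard Shampoo layout, where an m x n weight matrix gets one m x m and one n x n curvature factor, stored in fp32. The function name and the 4096 x 4096 layer shape are illustrative choices, not taken from SCAO.

```python
def kronecker_factor_bytes(m: int, n: int, bytes_per_elem: int = 4) -> int:
    """Bytes to store full-rank left (m x m) and right (n x n) factors.

    Counts only the two curvature statistics; inverse roots and other
    optimizer state would add to this.
    """
    return (m * m + n * n) * bytes_per_elem


# Hypothetical 4096 x 4096 projection matrix with fp32 factors.
per_layer = kronecker_factor_bytes(4096, 4096)
print(f"{per_layer / 2**20:.0f} MiB per layer")  # prints "128 MiB per layer"
```

Multiplied across every linear layer in a multi-billion-parameter model, that overhead alone can exceed a 16GB card before weights and activations are counted.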
I wanted Shampoo-quality preconditioning on my home setup, so I built SCAO (Sparse Curvature-Aware Optimizer).
It's a PyTorch optimizer that acts as a drop-in replacement for AdamW, but it implements a few strict architectural changes to survive on consumer cards:
1. Adaptive Rank Selection: Instead of full-rank Kronecker factors, it truncates the eigenspace to retain >=95% of spectral mass.
2. Int8 EMA Quantization: The curvature accumulators are stored in symmetric int8, which yields a 4x memory reduction with zero degradation in perplexity.
3. Quantization Stability: Standard Shampoo usually crashes at step 1 during 4-bit QLoRA fine-tuning due to SVD ill-conditioning in quantized spaces. SCAO exploits sparse approximations to bypass this.
4. Fused CUDA kernels: I wrote custom kernels to fix an O(k * m^2 * n) complexity bottleneck in the naive projection implementation.
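The first two items can be illustrated with a minimal pure-Python sketch. This is not SCAO's actual code: the function names, the sample eigenvalues, and the choice of a single per-tensor scale are all assumptions made for illustration, and a real implementation would operate on PyTorch tensors. The 4x figure is consistent with moving from 4-byte fp32 values to 1-byte int8 values, assuming an fp32 baseline.

```python
def select_rank(eigenvalues, mass=0.95):
    """Smallest k such that the top-k eigenvalues hold >= `mass` of the total."""
    # Curvature factors are PSD in theory; clamp tiny negatives from round-off.
    vals = sorted((max(v, 0.0) for v in eigenvalues), reverse=True)
    total = sum(vals)
    if total == 0.0:
        return 0
    running = 0.0
    for k, v in enumerate(vals, start=1):
        running += v
        if running >= mass * total:
            return k
    return len(vals)


def quantize_int8(values):
    """Symmetric int8 quantization: one scale per tensor, zero-point fixed at 0."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Clamp to [-127, 127] so the range stays symmetric around zero.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Map int8 codes back to approximate float values."""
    return [x * scale for x in q]


# Made-up spectrum: the top 3 of 5 eigenvalues carry 96% of the mass.
print(select_rank([8.0, 1.0, 0.6, 0.3, 0.1]))  # prints 3

codes, scale = quantize_int8([0.5, -1.0, 0.25, 0.0])
approx = dequantize_int8(codes, scale)  # each entry is within scale / 2 of the input
```

With a symmetric scheme, the round-trip error per element is bounded by half the scale, so the accuracy of the stored accumulator depends on how large its biggest entry is relative to the rest.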
The Benchmark: I recently ran a head-to-head benchmark on a single T4 (16GB VRAM) fine-tuning Qwen2.5-3B (4-bit QLoRA, rank 16):
- Shampoo: Failed at Step 1 (SVD mathematical collapse).
- SCAO: 100% stability, peaked at exactly 7.14 GB VRAM, with a smooth loss descent.
It is pip-installable (pip install scao).
I've written a technical report detailing the regret bounds, ablation studies, and scaling laws (published on Zenodo), but I really wanted to get this community's eyes on the CUDA kernels and the PyTorch implementation.
GitHub: https://github.com/whispering3/scao
Technical Report (DOI): https://doi.org/10.5281/zenodo.19870556
I'd love any feedback, code roasts, or questions about the math behind it!