GitHub Repository

SCAO is a sparse, second-order PyTorch optimizer designed as a high-throughput, drop-in replacement for AdamW.

9 stars · Python

I built a 2nd-order PyTorch optimizer for LLMs that runs on 16GB GPUs

by dnosoz·Apr 29, 2026·2 points·4 comments

AI Analysis

●●● Banger · Wizardry · Big Brain

Runs Shampoo-quality second-order optimization on a 16GB T4 where others OOM immediately.

Strengths
  • Adaptive rank selection truncates eigenspace to retain spectral mass while fitting VRAM.
  • Int8 EMA quantization reduces curvature accumulator memory by 4x with zero perplexity loss.
  • Drop-in AdamW replacement that stabilizes training for 3B+ parameter models on single cards.
Weaknesses
  • Throughput overhead from preconditioner calculation may still lag behind pure first-order methods.
  • Benchmarks currently limited to specific model architectures like Qwen and TinyStories.
Target Audience

ML engineers fine-tuning LLMs on consumer hardware

Similar To

bitsandbytes · DeepSpeed

Post Description

Hi HN,

I'm Danilo. I've been struggling with the limitations of AdamW when fine-tuning LLMs locally. Second-order optimizers (like Shampoo or SOAP) converge in significantly fewer steps by exploiting Kronecker-factored curvature. The problem? They require O(d^2) memory and O(d^3) compute per layer, and the memory cost alone immediately OOMs consumer hardware like a 16GB T4 or a 24GB RTX 3090.
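To put a number on the O(d^2) memory cost, here is a back-of-the-envelope sketch. It assumes the standard Shampoo layout, in which an m x n weight matrix keeps a left factor of size m x m and a right factor of size n x n, both stored in fp32. The function name and the example layer shape are illustrative assumptions and are not taken from SCAO.

```python
def kronecker_factor_bytes(m: int, n: int, bytes_per_element: int = 4) -> int:
    """Bytes needed for full-rank left (m x m) and right (n x n) factors."""
    return (m * m + n * n) * bytes_per_element


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    m, n = 4096, 11008
    factors = kronecker_factor_bytes(m, n)
    weights = m * n * 4  # the fp32 weight matrix itself, for comparison
    print(f"factors: {factors / 2**20:.1f} MiB, weights: {weights / 2**20:.1f} MiB")
```

The factor storage grows with the square of each dimension, so for wide layers it can exceed the size of the weights it preconditions, and that cost is paid for every layer.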

I wanted Shampoo-quality preconditioning on my home setup, so I built SCAO (Sparse Curvature-Aware Optimizer).

It's a PyTorch optimizer that acts as a drop-in replacement for AdamW, but it implements a few strict architectural changes to survive on consumer cards:

1. Adaptive Rank Selection: Instead of full-rank Kronecker factors, it truncates the eigenspace to retain >=95% of spectral mass.
2. Int8 EMA Quantization: The curvature accumulators are stored in symmetric int8, which yields a 4x memory reduction with zero degradation in perplexity.
3. Quantization Stability: Standard Shampoo usually crashes at step 1 during 4-bit QLoRA fine-tuning due to SVD ill-conditioning in quantized spaces. SCAO exploits sparse approximations to bypass this.
4. Fused CUDA kernels: I wrote custom kernels to fix an O(k * m^2 * n) complexity bottleneck in the naive projection implementation.
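Items 1 and 2 above can be illustrated with a minimal pure-Python sketch. This is not SCAO's actual code: the function names are hypothetical, the eigenvalues are passed in as a plain list (a real optimizer would get them from an eigendecomposition of each Kronecker factor), and per-tensor scaling is an assumption, since the post only says "symmetric int8".

```python
def select_rank(eigenvalues, mass=0.95):
    """Smallest k such that the top-k eigenvalues hold >= `mass` of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        return 0
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= mass * total:
            return k
    return len(ordered)


def quantize_int8(values):
    """Symmetric per-tensor int8: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from int8 codes and their scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    # Toy spectrum: most of the mass sits in the first two directions.
    print("kept rank:", select_rank([90.0, 6.0, 3.0, 1.0]))

    accumulator = [0.82, -0.31, 0.05, -1.20]
    codes, scale = quantize_int8(accumulator)
    print("codes:", codes)
    print("restored:", dequantize_int8(codes, scale))
```

Each code fits in one byte instead of the four bytes of an fp32 value, which is where the 4x reduction comes from. The single fp32 scale per tensor is negligible overhead.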

The Benchmark: I recently ran a head-to-head benchmark on a single T4 (16GB VRAM) fine-tuning Qwen2.5-3B (4-bit QLoRA, rank 16):

- Shampoo: Failed at Step 1 (SVD mathematical collapse).
- SCAO: 100% stability, peaked at exactly 7.14 GB VRAM, with a smooth loss descent.

It is pip-installable (pip install scao).

I've written a technical report detailing the regret bounds, ablation studies, and scaling laws (published on Zenodo), but I really wanted to get this community's eyes on the CUDA kernels and the PyTorch implementation.

GitHub: https://github.com/whispering3/scao
Technical Report (DOI): https://doi.org/10.5281/zenodo.19870556

I'd love any feedback, code roasts, or questions about the math behind it!
