GitHub Repository

SCAO is a sparse, second-order PyTorch optimizer designed as a high-throughput, drop-in replacement for AdamW.

9 starsPython

I built a 2nd-order PyTorch optimizer for LLMs that runs on 16GB GPUs

Name: I built a 2nd-order PyTorch optimizer for LLMs that runs on 16GB GPUs
Availability: InStock
Author: dnosoz

by dnosoz·Apr 29, 2026·2 points·4 comments

Visit Project View on HN

AI Analysis

●●●BangerWizardryBig Brain

Runs Shampoo-quality second-order optimization on a 16GB T4 where others OOM immediately.

Strengths

•Adaptive rank selection truncates eigenspace to retain spectral mass while fitting VRAM.
•Int8 EMA quantization reduces curvature accumulator memory by 4x with zero perplexity loss.
•Drop-in AdamW replacement that stabilizes training for 3B+ parameter models on single cards.

Weaknesses

•Throughput overhead from preconditioner calculation may still lag behind pure first-order methods.
•Benchmarks currently limited to specific model architectures like Qwen and TinyStories.

Post Description

Hi HN,

I'm Danilo. I've been struggling with the limitations of AdamW when fine-tuning LLMs locally. Second-order optimizers (like Shampoo or SOAP) offer significantly better step-convergence by exploiting Kronecker-factored curvature. The problem? They require O(d^2) memory and O(d^3) compute per layer, which immediately OOMs consumer hardware like a 16GB T4 or RTX 3090.

I wanted Shampoo-quality preconditioning on my home setup, so I built SCAO (Sparse Curvature-Aware Optimizer).

It's a PyTorch optimizer that acts as a drop-in replacement for AdamW, but it implements a few strict architectural changes to survive on consumer cards:

1. Adaptive Rank Selection: Instead of full-rank Kronecker factors, it truncates the eigenspace to retain >=95% of spectral mass. 2. Int8 EMA Quantization: The curvature accumulators are stored in symmetric int8, which yields a 4x memory reduction with zero degradation in perplexity. 3. Quantization Stability: Standard Shampoo usually crashes at step 1 during 4-bit QLoRA fine-tuning due to SVD ill-conditioning in quantized spaces. SCAO exploits sparse approximations to bypass this. 4. Fused CUDA kernels: I wrote custom kernels to fix an O(k * m^2 * n) complexity bottleneck in the naive projection implementation.

The Benchmark: I recently ran a head-to-head benchmark on a single T4 (16GB VRAM) fine-tuning Qwen2.5-3B (4-bit QLoRA, rank 16): - Shampoo: Failed at Step 1 (SVD mathematical collapse). - SCAO: 100% stability, peaked at exactly 7.14 GB VRAM, with a smooth loss descent.

It is pip-installable (pip install scao).

I've written a technical report detailing the regret bounds, ablation studies, and scaling laws (published on Zenodo), but I really wanted to get this community's eyes on the CUDA kernels and the PyTorch implementation.

GitHub: https://github.com/whispering3/scao Technical Report (DOI): https://doi.org/10.5281/zenodo.19870556

I'd love any feedback, code roasts, or questions about the math behind it!