ROLV – 20x faster MoE FFN inference on Llama 4 Maverick vs. cuBLAS

Author: heggenhougen

by heggenhougen·Mar 9, 2026·1 point·1 comment


AI Analysis

●●●● Gem · Wizardry · Zero to One

20x faster MoE inference on existing hardware with hash-verified output correctness.

Strengths

•Hash-verified deterministic output ensures mathematical equivalence to dense compute.
•Validated by University of Miami with specific B200 benchmarks.

Weaknesses

•Patents pending may limit open-source adoption or broader community contribution.
•Claims are extraordinary and require independent reproduction to fully trust.

Post Description

I benchmarked ROLV against dense cuBLAS on the actual Llama 4 Maverick MoE expert FFN layer (up_proj, 16384×5120, bfloat16), pulled directly from HuggingFace (model-00001-of-00084.safetensors).

Numbers (Batch=512, 1000 iters, NVIDIA B200):

•Tokens/s: 369K (cuBLAS) → 7.66M (ROLV), 20.7x faster
•TFLOPS (effective): 62 → 1,285, 20.7x
•Time to First Token: 64.8ms → 0.37ms, 177x faster
•Energy: 232J → 43J, 81.5% savings

ROLV exploits structured sparsity in MoE expert weights to skip large blocks of computation entirely, while producing canonically equivalent output (hash-verified). The TFLOPS figure is "effective", meaning it is computed as if doing the full dense multiply, so the 1,285 TFLOPS isn't violating hardware peak; it's reflecting how much work was avoided.

The TTFT speedup (177x) is especially relevant for interactive inference: MoE models spend a huge fraction of first-token latency in these expert projections, and collapsing that from 65ms to 0.4ms per layer changes what's possible for real-time applications.

Verified with norm hashes at both ends (baseline and ROLV output) and a canonical check. Weights are real, not synthetic.

Setup: PyTorch 2.8.0+cu128, CUDA 12.8, Python 3.12, NVIDIA B200.
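The "effective" TFLOPS figures can be cross-checked from the quoted tokens/s alone. The sketch below is not from the post. It assumes the usual convention of 2 FLOPs per multiply-add for a dense projection, and the function and variable names are illustrative only.

```python
# Cross-check of the quoted figures, assuming 2 FLOPs per multiply-add
# (a common convention; the post does not state which one it uses).

OUT_FEATURES, IN_FEATURES = 16384, 5120  # up_proj shape quoted in the post


def dense_flops_per_token(rows: int, cols: int) -> int:
    """FLOPs to push one token through a dense rows x cols projection."""
    return 2 * rows * cols


def effective_tflops(tokens_per_s: float, rows: int, cols: int) -> float:
    """Throughput expressed as if the full dense multiply had been performed."""
    return tokens_per_s * dense_flops_per_token(rows, cols) / 1e12


# Rounded figures as quoted in the post.
cublas_tflops = effective_tflops(369e3, OUT_FEATURES, IN_FEATURES)
rolv_tflops = effective_tflops(7.66e6, OUT_FEATURES, IN_FEATURES)

speedup = 7.66e6 / 369e3          # throughput ratio
energy_savings = 1 - 43 / 232     # fraction of energy saved
ttft_ratio = 64.8 / 0.37          # time-to-first-token ratio

print(f"cuBLAS effective TFLOPS: {cublas_tflops:.1f}")
print(f"ROLV effective TFLOPS:   {rolv_tflops:.1f}")
print(f"Throughput speedup:      {speedup:.2f}x")
print(f"Energy savings:          {energy_savings:.1%}")
print(f"TTFT ratio:              {ttft_ratio:.1f}x")
```

Under that assumption the rounded inputs reproduce the quoted 62 and 1,285 effective TFLOPS and the 81.5% energy savings. The TTFT ratio from the rounded timings comes out near 175x, slightly below the quoted 177x, which would be consistent with the post's ratio having been computed from unrounded measurements.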