ROLV – 20x faster MoE FFN inference on Llama 4 Maverick vs. cuBLAS

by heggenhougen·Mar 9, 2026·1 point·1 comment

AI Analysis

●●●● Gem · Wizardry · Zero to One

20x faster MoE inference on existing hardware with hash-verified output correctness.

Strengths
  • Hash-verified deterministic output ensures mathematical equivalence to dense compute.
  • Validated by University of Miami with specific B200 benchmarks.
Weaknesses
  • Patents pending may limit open-source adoption or broader community contribution.
  • Claims are extraordinary and require independent reproduction to fully trust.
Category
Target Audience

AI infrastructure engineers, ML researchers, hyperscalers

Similar To

cuBLAS · vLLM · TensorRT-LLM

Post Description

I benchmarked ROLV against dense cuBLAS on the actual Llama 4 Maverick MoE expert FFN layer (up_proj, 16384×5120, bfloat16) pulled directly from HuggingFace (model-00001-of-00084.safetensors).

Numbers (Batch=512, 1000 iters, NVIDIA B200):
  • Tokens/s: 369K (cuBLAS) → 7.66M (ROLV), 20.7x faster
  • TFLOPS (effective): 62 → 1,285, 20.7x
  • Time to First Token: 64.8ms → 0.37ms, 177x faster
  • Energy: 232J → 43J, 81.5% savings

ROLV exploits structured sparsity in MoE expert weights to skip large blocks of computation entirely, while producing canonically equivalent output (hash-verified). The TFLOPS figure is "effective", meaning it is computed as if doing the full dense multiply, so the 1,285 TFLOPS isn't violating hardware peak; it's reflecting how much work was avoided.

The TTFT speedup (177x) is especially relevant for interactive inference: MoE models spend a huge fraction of first-token latency in these expert projections, and collapsing that from 65ms to 0.4ms per layer changes what's possible for real-time applications.

Verified with norm hashes at both ends (baseline and ROLV output) and a canonical check. Weights are real, not synthetic.

Setup: PyTorch 2.8.0+cu128, CUDA 12.8, Python 3.12, NVIDIA B200.
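The "effective" TFLOPS figures can be sanity-checked from the quoted tokens/s with a few lines of arithmetic. This is a minimal sketch, not the author's benchmark code: it assumes the usual convention of 2 FLOPs (one multiply, one add) per weight per token for a dense matrix multiply, and the function names are made up for illustration.

```python
# Sanity-check the post's "effective" TFLOPS and energy figures.
# Assumption (not stated in the post): a dense matmul costs 2 FLOPs
# per weight per token (one multiply plus one add).

OUT_FEATURES = 16384  # up_proj rows, as quoted in the post
IN_FEATURES = 5120    # up_proj columns, as quoted in the post


def dense_flops_per_token(out_features: int, in_features: int) -> int:
    """FLOPs to push one token through a dense out_features x in_features layer."""
    return 2 * out_features * in_features


def effective_tflops(tokens_per_second: float) -> float:
    """Throughput expressed as if every token paid the full dense cost."""
    flops = tokens_per_second * dense_flops_per_token(OUT_FEATURES, IN_FEATURES)
    return flops / 1e12


def energy_savings_percent(baseline_joules: float, new_joules: float) -> float:
    """Percentage of baseline energy that was not spent."""
    return 100.0 * (1.0 - new_joules / baseline_joules)


cublas_tflops = effective_tflops(369e3)   # 369K tokens/s (cuBLAS)
rolv_tflops = effective_tflops(7.66e6)    # 7.66M tokens/s (ROLV)
speedup = 7.66e6 / 369e3
savings = energy_savings_percent(232.0, 43.0)

print(f"cuBLAS effective TFLOPS: {cublas_tflops:.1f}")
print(f"ROLV effective TFLOPS:   {rolv_tflops:.1f}")
print(f"Throughput speedup:      {speedup:.2f}x")
print(f"Energy savings:          {savings:.1f}%")
```

Under that assumption the results round to the post's 62 and 1,285 effective TFLOPS and 81.5% energy savings, so the reported numbers are at least internally consistent with each other. The check says nothing about whether the underlying measurements are reproducible.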

Similar Projects

Developer Tools · ●● Solid

Rev-dep – 20x faster knip.dev alternative built in Go

20x faster knip—performance leap is real, but dependency linters are crowded and knip already solved this.

Ship It · Solve My Problem
jayu_dev
46133mo ago
AI/ML · ●●● Banger

SwiftLM – Qwen Chat on iPhone, 100B+ Moe on M5 Pro 64GB (Native Swift)

Native Swift inference with SSD streaming runs 100B MoE models without kernel panics.

Wizardry · Niche Gem
aegis_camera
122mo ago