Efficient LLM Architectures for 32GB RAM (Ternary and Sparse Inference)

by fatihturker·Mar 9, 2026·2 points·1 comment

AI Analysis

●● Solid · Big Brain · Bold Bet

Native ternary training beats post-training quantization for memory efficiency.

Strengths

• BitNet b1.58 architecture with {-1, 0, +1} weights eliminates float multiplication
• Memory-mapped layer streaming from NVMe shifts the bottleneck from RAM to storage bandwidth
• MoE support with 500B total params but only 10-20B active per token
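
As a rough illustration of the first point (a minimal sketch, not the project's actual kernel): when every weight is -1, 0, or +1, a dot product needs only additions and subtractions, and zero weights can be skipped. The function name `ternary_dot` and the sample values are made up for this example.

```python
def ternary_dot(weights, activations):
    """Dot product for weights restricted to -1, 0, +1.

    Uses only additions and subtractions; zero weights are skipped.
    """
    acc = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            acc += x          # +1: add the activation
        elif w == -1:
            acc -= x          # -1: subtract the activation
        elif w != 0:
            raise ValueError(f"not a ternary weight: {w}")
        # w == 0 contributes nothing, so no work is done
    return acc


weights = [1, 0, -1, 1]
activations = [2.0, 5.0, 3.0, 0.5]
result = ternary_dot(weights, activations)  # 2.0 - 3.0 + 0.5
```

A real runtime would do this over packed, vectorized data rather than Python lists; the loop only shows why no floating-point multiply is needed.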

Weaknesses

• Only 7 commits and 0 stars suggest a very early experimental stage
• No benchmarks comparing quality loss against standard quantization methods

Post Description

Hi HN,

I’ve been exploring how far large language models can be pushed on machines with limited memory.

I built an experimental runtime and an accompanying architectural approach aimed at making very large models more feasible on systems with around 32GB of RAM.

The core idea is combining several efficiency techniques:

• ternary weight representation {-1, 0, +1} (~1.58 bits per weight)
• sparse execution that skips zero weights
• memory-mapped layer streaming from NVMe storage
• lightweight tensor unpacking optimized for Apple Silicon
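
To make the packing idea concrete, here is a minimal sketch of one common way to store ternary weights: five weights per byte in base 3 (3^5 = 243 fits in one byte), which works out to 1.6 bits per weight. This is an assumption for illustration only; the post does not say which packing layout the runtime uses, and `pack_trits` / `unpack_trits` are hypothetical names.

```python
def pack_trits(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, five per byte, base 3."""
    out = bytearray()
    for i in range(0, len(weights), 5):
        group = weights[i:i + 5]
        value = 0
        # Build the base-3 number so the first weight is the lowest digit.
        for w in reversed(group):
            value = value * 3 + (w + 1)  # map -1, 0, +1 to digits 0, 1, 2
        out.append(value)                # at most 242, so it fits in a byte
    return bytes(out)


def unpack_trits(packed, count):
    """Inverse of pack_trits; `count` trims padding in the last byte."""
    weights = []
    for byte in packed:
        for _ in range(5):
            weights.append(byte % 3 - 1)  # lowest digit back to -1, 0, +1
            byte //= 3
    return weights[:count]


original = [1, -1, 0, 0, 1, -1, 1]
packed = pack_trits(original)
restored = unpack_trits(packed, len(original))
```

Seven weights occupy two bytes here. A simpler 2-bits-per-weight layout would be faster to unpack at the cost of about 25% more space.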

Instead of keeping the entire model in RAM, weights can be streamed from fast SSD storage and unpacked during execution. This shifts the bottleneck from memory capacity toward storage bandwidth and compute efficiency.
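
The streaming idea can be sketched with Python's standard `mmap` module. This is a toy under stated assumptions, not the runtime's file format: the fixed `LAYER_BYTES` size, the flat file layout, and the helper names are all hypothetical.

```python
import mmap
import os
import tempfile

LAYER_BYTES = 4096  # hypothetical fixed size of one packed layer


def write_demo_model(path, num_layers):
    """Write a fake model file: layer i is LAYER_BYTES copies of byte i."""
    with open(path, "wb") as f:
        for i in range(num_layers):
            f.write(bytes([i % 256]) * LAYER_BYTES)


def stream_layers(path, num_layers):
    """Yield one layer's bytes at a time from a memory-mapped file.

    Only the pages touched by the current slice need to be resident,
    so peak memory depends on layer size rather than model size.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for i in range(num_layers):
                start = i * LAYER_BYTES
                yield mm[start:start + LAYER_BYTES]  # copies just this layer


demo_path = os.path.join(tempfile.mkdtemp(), "model.bin")
write_demo_model(demo_path, 3)
layer_sizes = [len(layer) for layer in stream_layers(demo_path, 3)]
```

In this pattern the operating system pages data in from disk on demand, which is why throughput ends up bounded by storage bandwidth rather than RAM capacity.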

Early experiments show significant compression compared to FP16 weights (for example TinyLlama-1.1B shrinking from ~2.05GB to ~0.24GB with ternary packing).
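
A quick back-of-the-envelope check of those figures, assuming exactly 1.1 billion parameters and sizes measured in GiB (both are assumptions; the post does not state the exact parameter count, the unit, or which tensors stay at higher precision):

```python
GIB = 1024 ** 3


def weight_size_gib(num_params, bits_per_weight):
    """Storage for the weights alone, ignoring metadata and scales."""
    return num_params * bits_per_weight / 8 / GIB


params = 1.1e9  # assumed from the model name "TinyLlama-1.1B"

fp16_gib = weight_size_gib(params, 16)      # about 2.05
two_bit_gib = weight_size_gib(params, 2)    # about 0.26
base3_gib = weight_size_gib(params, 1.6)    # about 0.20 (five weights per byte)

reported_ratio = 2.05 / 0.24                # roughly 8.5x, from the post's numbers
```

Under these assumptions the FP16 figure matches the reported ~2.05GB, and the reported ~0.24GB falls between a 1.6-bit and a 2-bit packing.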

The project is still experimental, but the goal is to explore whether extreme compression + sparsity + SSD streaming can make much larger models practical on consumer machines.

Paper: https://opengraviton.github.io/paper.html

Runtime: https://github.com/opengraviton/graviton-native

I’d really appreciate feedback from people working on inference engines, quantization, or efficient model architectures.