GitHub Repository

Efficient LLM architectures for 32GB RAM — BitNet & MoE

4 stars · Python

Efficient LLM Architectures for 32GB RAM (Ternary and Sparse Inference)

by fatihturker·Mar 9, 2026·2 points·1 comment

AI Analysis

●● Solid · Big Brain · Bold Bet

Native ternary training beats post-training quantization for memory efficiency.

Strengths
  • BitNet b1.58 architecture with {-1, 0, +1} weights eliminates float multiplication
  • Memory-mapped layer streaming from NVMe shifts bottleneck from RAM to storage bandwidth
  • MoE support with 500B total params but only 10-20B active per token
Weaknesses
  • Only 7 commits and 0 stars suggest a very early experimental stage
  • No benchmarks comparing quality loss against standard quantization methods
Target Audience

ML researchers and engineers working on model efficiency

Similar To

BitNet · llama.cpp · MLX

Post Description

Hi HN,

I’ve been exploring how far large language models can be pushed on machines with limited memory.

I built an experimental runtime and architecture approach focused on making extremely large models more feasible on systems with around 32GB of RAM.

The core idea is combining several efficiency techniques:

  • ternary weight representation {-1, 0, +1} (~1.58 bits per weight)
  • sparse execution that skips zero weights
  • memory-mapped layer streaming from NVMe storage
  • lightweight tensor unpacking optimized for Apple Silicon
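To make the ternary representation and zero-skipping concrete, here is a minimal pure-Python sketch. It is not taken from graviton-native: the function names and the five-weights-per-byte base-3 layout (1.6 bits per weight, slightly above the ~1.58-bit ideal) are assumptions chosen for illustration, and the project's actual packing format may differ.

```python
# Hypothetical helpers for illustration only; not the graviton-native API.

def pack_ternary(weights):
    """Pack weights in {-1, 0, +1} five to a byte as base-3 digits (3**5 = 243 <= 256)."""
    out = bytearray()
    for i in range(0, len(weights), 5):
        value = 0
        # Map -1/0/+1 to digits 0/1/2; the first weight is the least significant digit.
        for j, w in enumerate(weights[i:i + 5]):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            value += (w + 1) * 3 ** j
        out.append(value)
    return bytes(out)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary weights from the packed bytes."""
    weights = []
    for byte in packed:
        for _ in range(5):
            byte, digit = divmod(byte, 3)
            weights.append(digit - 1)
    # The last byte may hold padding digits, so trim to the real length.
    return weights[:count]


def ternary_dot(weights, activations):
    """Dot product without multiplications: add for +1, subtract for -1, skip 0."""
    total = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            total += x
        elif w == -1:
            total -= x
    return total


weights = [1, 0, -1, 1, 0, -1, 1]
packed = pack_ternary(weights)          # 7 weights fit in 2 bytes
restored = unpack_ternary(packed, len(weights))
result = ternary_dot(restored, [0.5, 2.0, 1.5, 1.0, 3.0, 0.25, 2.0])
```

A real runtime would vectorize the unpacking and accumulate over whole rows at once; the loop form is only meant to show that zero weights cost nothing and nonzero weights need only an add or a subtract.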

Instead of keeping the entire model in RAM, weights can be streamed from fast SSD storage and unpacked during execution. This shifts the bottleneck from memory capacity toward storage bandwidth and compute efficiency.
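As a rough sketch of what layer streaming can look like, the example below memory-maps a weight file and pulls one layer into memory at a time using Python's standard `mmap` module. The fixed `LAYER_BYTES` size, the flat file layout, and the `iter_layers` helper are all assumptions for illustration; the post does not describe the runtime's actual file format.

```python
import mmap
import os
import tempfile

LAYER_BYTES = 4096  # hypothetical fixed size per packed layer


def iter_layers(path, layer_bytes):
    """Yield one layer's raw bytes at a time from a memory-mapped file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), layer_bytes):
                # Slicing copies only this layer into a bytes object;
                # the OS pages the underlying file in on demand.
                yield mm[offset:offset + layer_bytes]


def demo():
    """Write three dummy layers to a temp file, then stream them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "weights.bin")
        with open(path, "wb") as f:
            for layer_id in range(3):
                f.write(bytes([layer_id]) * LAYER_BYTES)
        # Only one layer is held by this loop at any moment.
        return [(len(layer), layer[0]) for layer in iter_layers(path, LAYER_BYTES)]


summary = demo()
```

With this pattern, peak resident memory is bounded by the largest layer rather than the whole model, which is why throughput ends up limited by how fast the SSD can deliver the next layer.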

Early experiments show significant compression compared to FP16 weights (for example TinyLlama-1.1B shrinking from ~2.05GB to ~0.24GB with ternary packing).
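As a sanity check on those figures, the back-of-the-envelope calculation below estimates weight storage at different bit widths. It assumes a round 1.1e9 parameter count, treats "GB" as GiB (2^30 bytes), and ignores any tensors kept at higher precision; none of these assumptions are confirmed by the post.

```python
GIB = 2 ** 30
PARAMS = 1.1e9  # assumed round parameter count for TinyLlama-1.1B


def size_gib(params, bits_per_weight):
    """Storage needed for `params` weights at the given bit width, in GiB."""
    return params * bits_per_weight / 8 / GIB


fp16 = size_gib(PARAMS, 16)             # 16-bit baseline
ternary_ideal = size_gib(PARAMS, 1.58)  # information-theoretic ternary limit
two_bit = size_gib(PARAMS, 2)           # simple 2-bits-per-weight packing
reported_ratio = 2.05 / 0.24            # compression ratio quoted in the post

print(f"FP16: {fp16:.2f} GiB")
print(f"Ternary (1.58 bits): {ternary_ideal:.2f} GiB")
print(f"2-bit packing: {two_bit:.2f} GiB")
print(f"Reported ratio: {reported_ratio:.1f}x")
```

Under these assumptions the FP16 estimate lands close to the quoted ~2.05GB, and the quoted ~0.24GB falls between the 1.58-bit and 2-bit estimates, which is consistent with a practical packing scheme plus some overhead.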

The project is still experimental, but the goal is to explore whether extreme compression + sparsity + SSD streaming can make much larger models practical on consumer machines.

Paper: https://opengraviton.github.io/paper.html

Runtime: https://github.com/opengraviton/graviton-native

I’d really appreciate feedback from people working on inference engines, quantization, or efficient model architectures.

Similar Projects

AI/ML · ●●● Banger

Turbo1Bit – Run Bonsai-8B at 65K context in 3.9 GB RAM

Runs 65K context on 8GB RAM by fixing KV cache quantization for Bonsai.

Wizardry · Niche Gem
tetsuto
102mo ago
AI/ML · ●● Solid

WayInfer – Native GGUF engine that runs models larger than your RAM

Custom GGUF parser with mmap beats llama.cpp load times, but zero stars means unproven claims.

Wizardry · Bold Bet
ahmedm24
102mo ago