Turbo1Bit – Run Bonsai-8B at 65K context in 3.9 GB RAM
Runs 65K context on 8GB RAM by fixing KV cache quantization for Bonsai.
Efficient LLM architectures for 32GB RAM — BitNet & MoE
Native ternary training beats post-training quantization for memory efficiency.
ML researchers and engineers working on model efficiency
BitNet · llama.cpp · MLX
I’ve been exploring how far large language models can be pushed on machines with limited memory.
I built an experimental runtime and architecture approach focused on making extremely large models more feasible on systems with around 32GB of RAM.
The core idea is combining several efficiency techniques:
ternary weight representation {-1, 0, +1} (~1.58 bits per weight), sparse execution that skips zero weights, memory-mapped layer streaming from NVMe storage, and lightweight tensor unpacking optimized for Apple Silicon.
Instead of keeping the entire model in RAM, weights can be streamed from fast SSD storage and unpacked during execution. This shifts the bottleneck from memory capacity toward storage bandwidth and compute efficiency.
Early experiments show significant compression compared to FP16 weights (for example TinyLlama-1.1B shrinking from ~2.05GB to ~0.24GB with ternary packing).
The project is still experimental, but the goal is to explore whether extreme compression + sparsity + SSD streaming can make much larger models practical on consumer machines.
Paper: https://opengraviton.github.io/paper.html
Runtime: https://github.com/opengraviton/graviton-native
I’d really appreciate feedback from people working on inference engines, quantization, or efficient model architectures.
Runs 65K context on 8GB RAM by fixing KV cache quantization for Bonsai.
1.58-bit quantization + layer streaming shrinks 144GB models to 36GB, runs on Mac Mini.
Custom GGUF parser with mmap beats llama.cpp load times, but zero stars means unproven claims.
INT4 inference engine beats llama.cpp on VRAM, but competing against established tools.
Interactive Karpathy lecture visualizer when static blog posts already explain this.
Pure Go LLM inference, zero dependencies, 48 tok/s—genuinely novel for Go ecosystem.