Best local LLM setup found for a 5090 (llama.cpp fork + turboquant)
450k context on 32GB VRAM using turboquant KV cache compression.
LLM inference in C/C++
Found llama.cpp holding the model in RAM twice; fixed with host_ptr, for a 74% reduction in peak RAM.
Edge ML developers, llama.cpp contributors
llama.cpp · MLX · MNN
Fix: added host_ptr to llama_model_params. CPU tensors point directly at the mmap region. Only Vulkan tensors get copied.
Result on real hardware:
Peak RAM: 524MB → 142MB (74% reduction)
First boot: 19s → 11s
Second boot: ~2.5s (mmap + KV cache)
Code: https://github.com/Perinban/llama.cpp/tree/axon-dev
Write-up with VmRSS proof: https://www.linkedin.com/posts/perinban-parameshwaran_machin...
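To illustrate the idea behind the fix, here is a minimal sketch in Python. It is not the llama.cpp code and does not use the fork's API. The file layout, the tensor names, and the helpers write_toy_model, load_copy, and load_zero_copy are all hypothetical. It only shows the general pattern the post describes: CPU-side tensors become views into the mmap region, and only tensors bound for a device buffer (standing in for Vulkan here) get copied.

```python
import mmap
import os
import tempfile


def write_toy_model(path):
    """Write a toy 'model' file and return {name: (offset, size)} (hypothetical layout)."""
    tensors = {"tok_embd": bytes(range(16)), "output": bytes(range(16, 48))}
    layout = {}
    with open(path, "wb") as f:
        for name, data in tensors.items():
            layout[name] = (f.tell(), len(data))
            f.write(data)
    return layout


def load_copy(path, layout):
    """Baseline: read every tensor into its own buffer, so the data lives in RAM twice
    (once in the page cache, once in the private copy)."""
    out = {}
    with open(path, "rb") as f:
        for name, (off, size) in layout.items():
            f.seek(off)
            out[name] = bytearray(f.read(size))
    return out


def load_zero_copy(mapped, layout, device_tensors=()):
    """CPU tensors become views into the mapping (no copy).
    Tensors listed in device_tensors are copied, standing in for a GPU upload."""
    base = memoryview(mapped)
    out = {}
    for name, (off, size) in layout.items():
        view = base[off:off + size]
        out[name] = bytes(view) if name in device_tensors else view
    return out


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "toy.bin")
    layout = write_toy_model(path)
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    tensors = load_zero_copy(mapped, layout, device_tensors={"output"})
    for name, t in tensors.items():
        kind = "view into mmap" if isinstance(t, memoryview) else "copied"
        print(name, len(t), kind)
    # Views must be released before the mapping can be closed.
    for t in tensors.values():
        if isinstance(t, memoryview):
            t.release()
    mapped.close()
```

The views keep the mapping alive, so whoever owns the mapped region has to outlive every tensor that points into it. Any real host-pointer scheme has to respect the same lifetime constraint.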
In-process LLM inference in PHP beats the usual Python sidecar pattern.
Finally one CLI for Ollama, llama.cpp, and vLLM instead of three separate tools.
Useful tutorial, but llama.cpp docs and Ollama already cover most of this.
Custom GGUF parser with mmap beats llama.cpp load times, but zero stars means unproven claims.
One YAML config for three backends when Ollama already handles llama.cpp alone.