GitHub Repository

LLM inference in C/C++

0 stars

Running LLM on smartwatch – found llama.cpp loading model twice in RAM

by perinban·Apr 2, 2026·1 point·0 comments

AI Analysis

●●● Banger · Wizardry · Big Brain

Found llama.cpp loading models twice in RAM — fixed with host_ptr, 74% reduction.

Strengths

• Discovered that the APK's mmap page cache plus llama.cpp's own tensor allocations caused the double loading
• Peak RAM dropped from 524MB to 142MB on real Galaxy Watch 4 Classic hardware
• First boot improved from 19s to 11s; second boot takes ~2.5s with mmap + KV cache

Weaknesses

• Fork is awaiting upstream merge; the fix is not yet in main llama.cpp releases
• Optimization is specific to Android APK deployment, so broader applicability is limited

Post Description

Running SmolLM2 360M on a Samsung Galaxy Watch 4 Classic (380MB free RAM). Found that llama.cpp was loading the model twice simultaneously — APK mmap page cache + its own tensor allocations — peaking at 524MB for a 270MB model.

Fix: added host_ptr to llama_model_params. CPU tensors point directly at the mmap region. Only Vulkan tensors get copied.
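The idea behind the fix can be illustrated outside llama.cpp. What follows is a minimal Python sketch, not the fork's code: it assumes a made-up flat weights file and uses the standard `mmap` module to contrast a tensor that is a view into the mapping (no second copy in RAM) with a tensor that is copied out, as the post describes for Vulkan tensors. `TensorSpec`, `load_tensors`, `demo`, and the file layout are hypothetical names introduced only for this illustration; the actual `host_ptr` field and its semantics live in the linked branch.

```python
import mmap
import os
import tempfile
from typing import Dict, List, NamedTuple, Tuple, Union


class TensorSpec(NamedTuple):
    """Hypothetical tensor descriptor (not a llama.cpp type)."""
    name: str
    offset: int   # byte offset inside the mapped file
    size: int     # byte length
    backend: str  # "cpu" or "gpu" ("gpu" stands in for Vulkan)


def load_tensors(
    mapped: mmap.mmap, specs: List[TensorSpec]
) -> Dict[str, Union[memoryview, bytes]]:
    """Return CPU tensors as zero-copy views and GPU tensors as copies."""
    tensors: Dict[str, Union[memoryview, bytes]] = {}
    view = memoryview(mapped)
    try:
        for spec in specs:
            window = view[spec.offset:spec.offset + spec.size]
            if spec.backend == "cpu":
                # Zero-copy: the slice shares pages with the mapping.
                tensors[spec.name] = window
            else:
                # Copy: a separate buffer, as a device upload would need.
                tensors[spec.name] = bytes(window)
                window.release()
    finally:
        view.release()
    return tensors


def demo() -> Tuple[bool, bool]:
    """Write through the mapping and report which tensors observe the change."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(bytes(range(16)))  # 16-byte stand-in for a model file
        path = f.name
    try:
        with open(path, "r+b") as f:
            mapped = mmap.mmap(f.fileno(), 0)
            tensors = load_tensors(mapped, [
                TensorSpec("cpu_weights", 0, 8, "cpu"),
                TensorSpec("gpu_weights", 8, 8, "gpu"),
            ])
            mapped[0] = 255
            mapped[8] = 255
            cpu_sees_write = tensors["cpu_weights"][0] == 255
            gpu_sees_write = tensors["gpu_weights"][0] == 255
            tensors["cpu_weights"].release()  # views must be released before close()
            mapped.close()
    finally:
        os.remove(path)
    return cpu_sees_write, gpu_sees_write
```

Calling `demo()` returns `(True, False)`: the CPU tensor aliases the mapped pages, while the copied tensor holds its own bytes. In this toy model, only the copied tensor adds memory beyond the mapping itself.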

Result on real hardware:
Peak RAM: 524MB → 142MB (74% reduction)
First boot: 19s → 11s
Second boot: ~2.5s (mmap + KV cache)

Code: https://github.com/Perinban/llama.cpp/tree/axon-dev

Write-up with VmRSS proof: https://www.linkedin.com/posts/perinban-parameshwaran_machin...
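For readers who want to take a similar measurement, here is a small Python sketch of reading resident memory from `/proc/self/status` on Linux or Android. It is an assumption about how such numbers can be collected, not the author's measurement script. `VmRSS` is the current resident set size and `VmHWM` is its peak ("high water mark"). The helper names and the `SAMPLE` text are illustrative, and the sample values are arbitrary rather than taken from the write-up.

```python
import re
from typing import Dict, Optional


def parse_vm_kb(status_text: str, key: str) -> Optional[int]:
    """Return the kB value of a field such as VmRSS or VmHWM, or None if absent."""
    pattern = rf"^{re.escape(key)}:\s+(\d+)\s+kB"
    match = re.search(pattern, status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_self_memory_kb() -> Dict[str, Optional[int]]:
    """Read current and peak resident memory of this process (Linux only)."""
    with open("/proc/self/status", encoding="utf-8", errors="replace") as f:
        text = f.read()
    return {key: parse_vm_kb(text, key) for key in ("VmRSS", "VmHWM")}


# Arbitrary sample in the /proc/<pid>/status layout, for illustration only.
SAMPLE = "Name:\tdemo\nVmHWM:\t    2048 kB\nVmRSS:\t    1024 kB\n"
```

Sampling `VmRSS` during model load and comparing it with `VmHWM` afterwards is one way to tell a transient peak from steady-state usage.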

Similar Projects

AI/ML · ●● Solid

Best setup local LLM found for a 5090 (llama.cpp fork + turboquant)

450k context on 32GB VRAM using turboquant KV cache compression.

Big Brain · Niche Gem

utopman

227d ago

Developer Tools · ●● Solid

Ext-Infer – Native LLM Inference and Embeddings for PHP

In-process LLM inference in PHP beats the usual Python sidecar pattern.

Big Brain · Niche Gem

eamann

208d ago

Developer Tools · ●●● Banger

A single CLI to manage llama.cpp/vLLM/Ollama models

Finally one CLI for Ollama, llama.cpp, and vLLM instead of three separate tools.

Solve My Problem · Slick

everlier

213mo ago

Education · ● Mid

Llama.cpp Tutorial 2026: Run GGUF Models Locally on CPU and GPU

Useful tutorial, but llama.cpp docs and Ollama already cover most of this.

Niche Gem

anju-kushwaha

1341mo ago

AI/ML · ●● Solid

WayInfer – Native GGUF engine that runs models larger than your RAM

Custom GGUF parser with mmap beats llama.cpp load times, but zero stars means unproven claims.

Wizardry · Bold Bet

ahmedm24

102mo ago

Developer Tools · ●● Solid

Llmconfig – configfile and CLI for local LLM

One YAML config for three backends when Ollama already handles llama.cpp alone.

Niche Gem · Ship It

kilic

301mo ago