GitHub Repository

LLM inference in C/C++

0 stars

Running LLM on smartwatch – found llama.cpp loading model twice in RAM

by perinban · Apr 2, 2026 · 1 point · 0 comments

AI Analysis

●●● Banger · Wizardry · Big Brain

Found llama.cpp loading models twice in RAM — fixed with host_ptr, 74% reduction.

Strengths
  • Discovered APK mmap page cache + tensor allocations causing double-loading
  • Peak RAM dropped 524MB to 142MB on real Galaxy Watch 4 hardware
  • First boot time improved from 19s to 11s; second boot takes about 2.5s with mmap + KV cache
Weaknesses
  • Fork awaiting upstream merge — not yet in main llama.cpp releases
  • Optimization specific to Android APK deployment, limited broader applicability
Category
Target Audience

Edge ML developers, llama.cpp contributors

Similar To

llama.cpp · MLX · MNN

Post Description

Running SmolLM2 360M on a Samsung Galaxy Watch 4 Classic (380MB free RAM). Found that llama.cpp was loading the model twice simultaneously — APK mmap page cache + its own tensor allocations — peaking at 524MB for a 270MB model.

Fix: added host_ptr to llama_model_params. CPU tensors point directly at the mmap region. Only Vulkan tensors get copied.
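The idea behind the fix can be illustrated outside llama.cpp. The sketch below is not the fork's code and does not use any llama.cpp API; the file layout, sizes, and function names are hypothetical. It only contrasts reading tensor bytes into a fresh buffer (a second copy in RAM) with taking a zero-copy view into an existing read-only mapping, which is roughly what pointing CPU tensors at the mmap region amounts to.

```python
import mmap
import os
import tempfile


def load_by_copy(path, offset, size):
    """Read tensor bytes into a fresh heap buffer (a second copy of the data)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return bytearray(f.read(size))


def load_by_view(mapping, offset, size):
    """Return a zero-copy view into an existing read-only mapping."""
    return memoryview(mapping)[offset:offset + size]


# Hypothetical stand-in for a model file: 4 KiB header + 64 KiB of "weights".
HEADER, WEIGHTS = 4096, 65536
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * HEADER + bytes(range(256)) * (WEIGHTS // 256))
    model_path = tmp.name

with open(model_path, "rb") as f:
    # The mapping stays valid after the file object is closed.
    mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

copied = load_by_copy(model_path, HEADER, WEIGHTS)   # duplicates the bytes
viewed = load_by_view(mapping, HEADER, WEIGHTS)      # shares the mapped pages

same_bytes = bytes(viewed) == bytes(copied)
view_is_backed_by_mapping = viewed.obj is mapping

# Release the view before closing the mapping, then clean up the temp file.
viewed.release()
mapping.close()
os.remove(model_path)
```

Both paths yield identical bytes, but only the copy path allocates memory proportional to the tensor size on top of the pages already held by the mapping.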

Result on real hardware:
  • Peak RAM: 524MB → 142MB (74% reduction)
  • First boot: 19s → 11s
  • Second boot: ~2.5s (mmap + KV cache)
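As a quick check of the arithmetic, the relative savings can be recomputed from the rounded figures quoted in the post. The helper name is hypothetical. Note that the rounded peak-RAM numbers give roughly 73%; the 74% quoted in the post presumably comes from unrounded measurements, which is an assumption here.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Figures as quoted in the post (rounded MB and seconds).
peak_ram_reduction = reduction_percent(524, 142)
first_boot_reduction = reduction_percent(19, 11)

print(f"Peak RAM reduction:   {peak_ram_reduction:.1f}%")
print(f"First boot reduction: {first_boot_reduction:.1f}%")
```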

Code: https://github.com/Perinban/llama.cpp/tree/axon-dev

Write-up with VmRSS proof: https://www.linkedin.com/posts/perinban-parameshwaran_machin...
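For readers unfamiliar with VmRSS: on Linux (and Android) it is the resident set size reported in /proc/<pid>/status. The following is a minimal sketch of how such a figure could be sampled. It is not the author's measurement script; the function names are hypothetical and the sample text is made up for illustration, not taken from the write-up.

```python
import re


def parse_vm_rss_kb(status_text):
    """Extract the VmRSS value (in kB) from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_vm_rss_kb(pid="self"):
    """Read the resident set size of a process on Linux; None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as f:
            return parse_vm_rss_kb(f.read())
    except OSError:
        return None


# Made-up sample in the /proc status layout (illustrative values only).
sample = "Name:\tdemo\nVmPeak:\t  204800 kB\nVmRSS:\t  102400 kB\nThreads:\t4\n"
sample_rss_kb = parse_vm_rss_kb(sample)

# On a Linux host this returns the current process's RSS; elsewhere, None.
current_rss_kb = read_vm_rss_kb()
```

Sampling this value before and after model load, and at peak during inference, is one way to produce the kind of before/after comparison the post describes.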

Similar Projects

AI/ML · ●● Solid

WayInfer – Native GGUF engine that runs models larger than your RAM

Custom GGUF parser with mmap beats llama.cpp load times, but zero stars means unproven claims.

Wizardry · Bold Bet
ahmedm24
102mo ago