GitHub Repository

AI-native OS kernel written from scratch in C and x86_64/aarch64 assembly — kernel-level tensor compute, capability-based security, SMP, TCP/IP, and 95 userland programs

0 stars · C

WayInfer – Native GGUF engine that runs models larger than your RAM

Author: ahmedm24

by ahmedm24·Apr 2, 2026·1 point·0 comments


AI Analysis

●● Solid · Wizardry · Bold Bet

Custom GGUF parser with mmap reportedly beats llama.cpp load times, but with zero stars the claims remain unverified.

Strengths

• Memory-mapped I/O lets the OS page weights in on demand instead of loading the full model into RAM
• Quantized-domain dot products avoid the overhead of dequantizing to float32
• Zero dependencies on ggml/llama.cpp, with a custom tensor engine

Weaknesses

• Confusing dual presentation as both an inference engine and a full OS kernel
• Zero stars and forks mean the claims need independent verification

Post Description

We built a native inference engine that runs quantized LLMs directly from SSD using memory-mapped I/O. The model never fully loads into RAM — the OS pages weights on demand as each layer executes.
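As a rough illustration of the on-demand paging idea (not WayInfer's actual code), the sketch below maps a small stand-in file and reads one slice from the middle of it. The file layout, the `read_floats` helper, and the file name are assumptions made up for this example; a real GGUF file has a header, metadata, and quantized tensor data.

```python
import mmap
import os
import struct
import tempfile


def read_floats(mm, offset, count):
    """Read `count` little-endian float32 values starting at byte `offset`."""
    return list(struct.unpack_from(f"<{count}f", mm, offset))


# Build a toy "weights" file: 1024 float32 values, 0.0 .. 1023.0 (4096 bytes).
# This flat layout is hypothetical and is NOT the GGUF format.
path = os.path.join(tempfile.mkdtemp(), "toy_weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1024f", *[float(i) for i in range(1024)]))

with open(path, "rb") as f:
    # Length 0 maps the whole file; nothing is copied into Python at this point.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Only the page(s) backing bytes 2000..2015 must be resident for this read.
    layer_slice = read_floats(mm, 500 * 4, 4)
    mm.close()

print(layer_slice)  # [500.0, 501.0, 502.0, 503.0]
```

Mapping is cheap because it only sets up address space; the cost of reading from SSD is paid later, page by page, when the bytes are first touched.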

*What it does:*
- Mixtral 8x22B (80GB, 141B params) runs on a machine with 48GB RAM
- Model loads in 0.3 seconds (vs 190s with llama.cpp)
- Produces correct output: "What is 2+2?" → "The sum of 2 and 2 is 4."
- Zero dependencies: custom tensor engine, custom GGUF parser, no ggml/llama.cpp

*How it works:*
- `mmap()` the GGUF file. The OS handles SSD→RAM paging transparently
- Quantize the input to Q8_K, compute dot products directly against Q4_K/Q5_K/Q6_K weights in the quantized domain, with no dequantization to float32
- AVX2 SIMD + 8-thread parallel matvec
- For MoE models: only 2 of 8 experts are active per token, so most weights stay cold on disk
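The MoE point above can be sketched with a toy top-2 router. The `top_k_experts` helper and the router logits are hypothetical values chosen for illustration; in a real model the router runs per layer and per token, and over a long generation more experts will eventually be touched.

```python
def top_k_experts(router_logits, k=2):
    """Return the indices of the k largest router logits, highest first."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return ranked[:k]


# Hypothetical router outputs for three tokens over 8 experts.
tokens = [
    [0.1, 2.3, -0.5, 0.0, 1.9, -1.2, 0.4, 0.3],
    [1.5, 0.2, 0.1, -0.3, 1.7, 0.0, -0.8, 0.6],
    [0.0, 2.0, 0.3, 0.1, 2.2, -0.4, 0.2, 0.5],
]

touched = set()
for logits in tokens:
    # Only the selected experts' weight pages would be read for this token.
    touched.update(top_k_experts(logits))

cold = set(range(8)) - touched
print(sorted(touched))  # [0, 1, 4]
print(sorted(cold))     # [2, 3, 5, 6, 7]
```

With a memory-mapped file, the weights of experts in `cold` are never read, so they take up no RAM until a later token selects them.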

*The hard part we solved:* GGUF models are calibrated for a specific dot product computation path (ggml's "quantize input → integer multiply-accumulate → late float conversion"). If you naively dequantize weights to float32 and do a standard dot product, the per-operation error is tiny (~0.001%) but compounds across 56 transformer layers into completely wrong output. We had to reverse-engineer and match ggml's exact scalar computation — block-level integer accumulation with 8-lane parallel reduction — to get correct results.
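To make the two computation paths concrete, here is a simplified sketch. It assumes a toy symmetric block quantizer with 32-element blocks, which is not the real Q4_K/Q8_K layout (those use 256-element super-blocks with packed scales), and all function names are invented for this example. It only shows that "quantize input, integer accumulate, late float conversion" and "dequantize weights, float dot product" produce slightly different numbers for the same weights.

```python
import math

BLOCK = 32  # toy block size (assumption); real K-quants use 256-element super-blocks


def quantize_block(xs, qmax):
    """Symmetric quantization: return (scale, ints) with ints in [-qmax, qmax]."""
    amax = max(abs(x) for x in xs)
    scale = amax / qmax if amax > 0 else 1.0
    return scale, [round(x / scale) for x in xs]


def quantize(xs, qmax):
    """Quantize a vector block by block."""
    return [quantize_block(xs[i:i + BLOCK], qmax) for i in range(0, len(xs), BLOCK)]


def dot_quantized(wq, xq):
    """Integer multiply-accumulate per block, then one float multiply per block."""
    total = 0.0
    for (wd, wi), (xd, xi) in zip(wq, xq):
        acc = sum(a * b for a, b in zip(wi, xi))  # exact integer accumulation
        total += wd * xd * acc                    # late float conversion
    return total


def dot_dequantized(wq, x):
    """Dequantize the weights to float first, then do a plain float dot product."""
    w = [wd * q for wd, wi in wq for q in wi]
    return sum(a * b for a, b in zip(w, x))


# Deterministic toy data: 64 weights and 64 activations.
w = [math.cos(0.5 * i) for i in range(64)]
x = [math.sin(i) for i in range(64)]

wq = quantize(w, 7)    # 4-bit-style weights (toy)
xq = quantize(x, 127)  # 8-bit-style activations (toy)

q_path = dot_quantized(wq, xq)   # input quantized, integer accumulation
f_path = dot_dequantized(wq, x)  # input left in float
print(q_path, f_path, abs(q_path - f_path))
```

The gap between `q_path` and `f_path` comes from quantizing the activations in one path and not the other. Whether such a gap grows into wrong output across many layers depends on the model; this sketch does not try to reproduce that.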

*What it doesn't do (yet):*
- Speed: ~0.08 tok/s on the 80GB model (CPU-only, no GPU offload)
- No interactive chat UI
- Only K-quant GGUF formats (Q4_K_M, Q5_K_M, Q6_K; covers ~90% of models on HuggingFace)
- Windows only (Linux stubs exist but untested)

The architecture comes from my "work in progress" WayOS (https://github.com/cloudlinqed/WayOS), an AI-first OS that treats SSD/RAM/VRAM as a unified memory hierarchy.

GitHub: https://github.com/cloudlinqed/WayInfer