
GitHub Repository

Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM

808 stars · C++

Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

by yu3zhou4·May 29, 2026·205 points·18 comments


AI Analysis

●● Solid · Big Brain · Niche Gem

Build a vLLM-style engine from scratch, PagedAttention kernels included, even though llama.cpp already exists.

Strengths

•Complete implementation with PagedAttention, FlashAttention-like online softmax, and continuous batching
•Course structure teaches CUDA from embeddings through full inference pipeline
•Working code loads real Llama 3.2 1B models from Safetensors
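
The "FlashAttention-like online softmax" named above can be illustrated on the CPU. The following is a minimal sketch and not code from the Tiny-vLLM repository: the function name `online_softmax` is hypothetical, and it assumes finite input scores. It shows the core idea of keeping a running maximum and a running sum of exponentials in a single pass, rescaling the sum whenever a larger maximum appears.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper, not taken from the project: numerically stable softmax
// whose normalizer is accumulated in one pass over the scores.
// Assumes every score is finite.
std::vector<float> online_softmax(const std::vector<float>& scores) {
    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;

    // Pass 1: update the maximum and the sum of exponentials together.
    for (float s : scores) {
        const float new_max = std::max(running_max, s);
        // Rescale the old sum so it is expressed relative to the new maximum,
        // then add the contribution of the current score.
        running_sum = running_sum * std::exp(running_max - new_max)
                    + std::exp(s - new_max);
        running_max = new_max;
    }

    // Pass 2: normalize each score with the final maximum and sum.
    std::vector<float> probs(scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i) {
        probs[i] = std::exp(scores[i] - running_max) / running_sum;
    }
    return probs;
}
```

Calling `online_softmax({1.0f, 2.0f, 3.0f})` yields probabilities that sum to 1 and increase with the score. An attention kernel would apply the same rescaling to its partial output accumulator as each block of keys is processed, which is what lets it avoid storing the full score row.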

Weaknesses

•Educational focus means it is not production-ready
•Supports only Llama 3.2 1B, a far narrower model range than vLLM

Category

Target Audience

Developers learning CUDA kernel engineering and LLM inference internals

Similar To

vLLM · llama.cpp · mlc-llm

Similar Projects

Developer Tools · ●●● Banger

Llmtop – Htop for LLM Inference Clusters (vLLM, SGLang, Nim, Ollama,)

htop for vLLM clusters without the Prometheus overhead.

Niche Gem · Solve My Problem · Ship It

rpotluri

503mo ago

Education · ●● Solid

Hands-on course for building RL environments for LLMs

Teaches LLM RL training with working Tic Tac Toe demo that beats gpt-5-mini.

Niche Gem · Rabbit Hole

anakin87

112mo ago

Infrastructure · ●● Solid

Go LLM inference with a Vulkan GPU back end that beats Ollama's CUDA

Vulkan back end runs 28% faster than Ollama's CUDA on Qwen, but llm.c and llama.cpp already own inference.

Wizardry · Big Brain · Niche Gem

computerex

103mo ago

AI/ML · ●●● Banger

We built an LLM inference engine in pure Python – no PyTorch, no Triton

30x faster cold start than vLLM with zero PyTorch dependencies.

Wizardry · Big Brain · Zero to One

zyoraclub

2018d ago

AI/ML · ●●● Banger

Glq LLM quantization using E8 lattice

E8 lattice codebooks beat GPTQ at 2-4 bpw with fused CUDA kernel skipping weight materialization.

Wizardry · Big Brain

acd

2019d ago

Infrastructure · ●● Solid

ZSE – Single-file LLM engine with dual INT4 kernels

INT4 inference engine beats llama.cpp on VRAM, but competing against established tools.

Wizardry · Ship It

zyoralabs

103mo ago