GitHub Repository

Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM

808 stars · C++

Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

by yu3zhou4·May 29, 2026·205 points·18 comments

AI Analysis

●● Solid · Big Brain · Niche Gem

Builds a small vLLM from scratch with PagedAttention kernels, even though llama.cpp already exists.

Strengths
  • Complete implementation with PagedAttention, FlashAttention-like online softmax, and continuous batching
  • Course structure teaches CUDA from embeddings through the full inference pipeline
  • Working code loads real Llama 3.2 1B models from Safetensors
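
The "FlashAttention-like online softmax" named in the strengths above can be illustrated with a minimal CPU sketch in plain C++. This is not code from the Tiny-vLLM repository: the function name `online_softmax` is hypothetical, and the sketch assumes finite input scores. It shows only the core idea, which is to track a running maximum and a running sum of exponentials in one pass, rescaling the sum whenever the maximum grows.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Online softmax sketch (hypothetical helper, not taken from Tiny-vLLM).
// Assumes all scores are finite.
std::vector<float> online_softmax(const std::vector<float>& scores) {
    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;

    // One pass collects both statistics. When a larger maximum appears,
    // the sum accumulated so far is rescaled to the new maximum.
    for (float s : scores) {
        const float new_max = std::max(running_max, s);
        running_sum = running_sum * std::exp(running_max - new_max)
                    + std::exp(s - new_max);
        running_max = new_max;
    }

    // Normalize each score against the final maximum and sum.
    std::vector<float> probs(scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i) {
        probs[i] = std::exp(scores[i] - running_max) / running_sum;
    }
    return probs;
}
```

Because the maximum is subtracted before every exponential, large scores such as 1000 do not overflow. A GPU attention kernel would typically apply the same rescaling per block of keys and also rescale a partial output accumulator, which this sketch leaves out.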
Weaknesses
  • Educational focus means it is not production-ready
  • Supports only Llama 3.2 1B, a much narrower model range than vLLM
Category
Target Audience

Developers learning CUDA kernel engineering and LLM inference internals

Similar To

vLLM · llama.cpp · mlc-llm

Similar Projects

AI/ML · ●●● Banger

Glq LLM quantization using E8 lattice

E8 lattice codebooks beat GPTQ at 2-4 bpw with fused CUDA kernel skipping weight materialization.

Wizardry · Big Brain