Valkyr LM Inference with Realtime Guarantees
Pure Vulkan compute enables LLMs inside game loops without CUDA lock-in.
Golang inference engine and deep learning primitives
28% faster on Qwen3.5 with Vulkan than Ollama with CUDA, but llm.c and llama.cpp already own inference.
ML engineers, systems developers, edge/offline inference users, Go ecosystem
Ollama · llama.cpp · vLLM
I benchmarked it against Ollama using the exact same GGUF files on an RTX 4070 Ti SUPER:
GPU (dlgo Vulkan vs Ollama CUDA):
Qwen3.5 0.8B: 239 tok/s vs 187 tok/s (28% faster)
Gemma 3 270M: 456 tok/s vs 503 tok/s (−9%)
SmolLM2 360M: 420 tok/s vs 451 tok/s (−7%)
10 models tested, within 7–25% of CUDA on standard architectures
CPU (dlgo vs Ollama, same GGUF):
6 of 10 models within 9% of Ollama
2 models faster (Gemma 270M +3%, SmolLM2 360M +7%)
The Qwen3.5 result surprised me. Qwen3.5 uses a hybrid Gated Delta Net + attention architecture (SSM layers with a recurrent delta rule). I wrote 6 custom Vulkan compute shaders for it (conv1d, delta rule recurrence, L2 normalization, sigmoid gating), and the fused Vulkan pipeline ended up outperforming llama.cpp's CUDA kernels.
Vulkan means this runs on AMD, Intel, and mobile GPUs too — not just NVIDIA. Ollama's own Vulkan backend is 66–126% slower than dlgo on the models I tested.
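The relative figures in the GPU comparison are plain throughput ratios. As a quick check, this small Go sketch recomputes them from the tok/s numbers quoted above; the helper name relativeSpeed is made up for illustration.

```go
package main

import "fmt"

// relativeSpeed returns how much faster (positive) or slower (negative)
// throughput a is than baseline b, in percent.
func relativeSpeed(a, b float64) float64 {
	return (a/b - 1) * 100
}

func main() {
	// tok/s figures from the GPU results: dlgo (Vulkan) vs Ollama (CUDA).
	results := []struct {
		model        string
		dlgo, ollama float64
	}{
		{"Qwen3.5 0.8B", 239, 187},
		{"Gemma 3 270M", 456, 503},
		{"SmolLM2 360M", 420, 451},
	}
	for _, r := range results {
		fmt.Printf("%-14s %+.0f%%\n", r.model, relativeSpeed(r.dlgo, r.ollama))
	}
}
```

Rounded to whole percent, this gives +28%, -9%, and -7%, matching the numbers in the post.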
Supports LLaMA, Qwen2/3/3.5, Gemma, Phi, SmolLM2, Mistral, plus Whisper speech-to-text. 25+ quantization formats (Q4_0 through Q8_0, all K-quants).
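To give a sense of what one of these formats involves, here is a hedged Go sketch of dequantizing a single Q4_0 block as ggml lays it out: 32 weights share one scale, and each byte packs two 4-bit values with an implicit offset of 8. This is not dlgo's code. The type and function names are invented, and the scale is held as float32 for simplicity, whereas GGUF files store it as a 16-bit float.

```go
package main

import "fmt"

const blockSize = 32 // weights per Q4_0 block

// q4Block is a simplified Q4_0 block: one shared scale plus 16 bytes of
// packed 4-bit quants. (Assumption: scale kept as float32, not fp16.)
type q4Block struct {
	scale float32
	qs    [blockSize / 2]byte
}

// dequantizeBlock expands one block to 32 float32 weights.
// The low nibble of byte j is weight j; the high nibble is weight j+16.
func dequantizeBlock(b q4Block) [blockSize]float32 {
	var out [blockSize]float32
	for j := 0; j < blockSize/2; j++ {
		lo := int(b.qs[j]&0x0F) - 8 // stored 0..15 maps to -8..7
		hi := int(b.qs[j]>>4) - 8
		out[j] = float32(lo) * b.scale
		out[j+blockSize/2] = float32(hi) * b.scale
	}
	return out
}

func main() {
	var b q4Block
	b.scale = 0.5
	b.qs[0] = 0xF0 // low nibble 0 -> -8, high nibble 15 -> +7

	w := dequantizeBlock(b)
	fmt.Println(w[0], w[16]) // prints "-4 3.5"
}
```

A GPU backend would typically fuse this unpacking into the matrix-vector kernel rather than materializing the float weights, so each format needs its own shader variant.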
Three lines to run:
model, _ := dlgo.LoadLLM("model.gguf")
response, _ := model.Chat("", "What is the capital of France?")
fmt.Println(response)