Go LLM inference with a Vulkan GPU back end that beats Ollama's CUDA

Author: computerex

by computerex·Mar 8, 2026·1 point·0 comments

AI Analysis

●● Solid · Wizardry · Big Brain · Niche Gem

dlgo's Vulkan back end runs 28% faster than Ollama's CUDA on Qwen3.5 0.8B, but llm.c and llama.cpp already dominate inference.

Strengths

•Custom Vulkan compute shaders for Qwen3.5's SSM/GDN architecture (conv1d, delta rule, L2 norm, sigmoid gating). Real architectural insight, not just porting.
•Cross-GPU: Vulkan runs on NVIDIA, AMD, and Intel without CUDA lock-in. Pure Go with optional CGo; the CPU path has zero dependencies beyond the standard library.
•Rigorous benchmarks: same GGUF files, head-to-head vs Ollama CPU and GPU. MoE support with 256-expert dispatch.

Weaknesses

•Ollama already solves this for most users. Performance parity on 7/10 models, slower on some. Vulkan adoption friction vs CUDA maturity.
•Niche: matters if you need Go+Vulkan or cross-GPU portability. Most inference workloads use Python (vLLM, SGLang).

Post Description

dlgo is an LLM inference engine written in Go. The CPU path has zero dependencies beyond the standard library. The GPU path uses Vulkan compute, so no CUDA is required.

I benchmarked it against Ollama using the exact same GGUF files on an RTX 4070 Ti SUPER:

GPU (dlgo Vulkan vs Ollama CUDA):

•Qwen3.5 0.8B: 239 tok/s vs 187 tok/s (+28%)
•Gemma 3 270M: 456 tok/s vs 503 tok/s (−9%)
•SmolLM2 360M: 420 tok/s vs 451 tok/s (−7%)
•10 models tested, within 7–25% of CUDA on standard architectures

CPU (dlgo vs Ollama, same GGUF):

•6 of 10 models within 9% of Ollama
•2 models faster (Gemma 3 270M +3%, SmolLM2 360M +7%)

The Qwen3.5 result surprised me. Qwen3.5 uses a hybrid Gated Delta Net + attention architecture (SSM layers with a recurrent delta rule). I wrote 6 custom Vulkan compute shaders for it, covering conv1d, delta rule recurrence, L2 normalization, and sigmoid gating, and the fused Vulkan pipeline ended up outperforming llama.cpp's CUDA kernels.
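For readers unfamiliar with the delta rule, here is a minimal CPU sketch in Go of one recurrent step in its commonly published gated form, S ← αS + β(v − αS·k)kᵀ with output o = S·q. This is not dlgo's shader code: the function names, the row-major state layout, and the scalar gates `alpha` and `beta` are assumptions for illustration, and the real layer also involves a conv1d and per-head parameters that are omitted here.

```go
package main

import (
	"fmt"
	"math"
)

// l2Normalize scales v in place to unit Euclidean length.
// eps guards against division by zero for an all-zero vector.
func l2Normalize(v []float32, eps float64) {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	inv := float32(1 / math.Sqrt(sum+eps))
	for i := range v {
		v[i] *= inv
	}
}

// sigmoid maps a raw gate value into (0, 1).
func sigmoid(x float32) float32 {
	return float32(1 / (1 + math.Exp(-float64(x))))
}

// deltaRuleStep applies one gated delta rule update to the state S
// (one row per value dimension, one column per key dimension) and
// returns the readout o = S*q. S is modified in place.
// Hypothetical helper: alpha is the decay gate, beta the write strength.
func deltaRuleStep(S [][]float32, q, k, v []float32, alpha, beta float32) []float32 {
	out := make([]float32, len(v))
	for i, row := range S {
		// Decay the old state and compute what it currently predicts for key k.
		var pred float32
		for j := range row {
			row[j] *= alpha
			pred += row[j] * k[j]
		}
		// Write only the prediction error back along the key direction.
		delta := beta * (v[i] - pred)
		var o float32
		for j := range row {
			row[j] += delta * k[j]
			o += row[j] * q[j]
		}
		out[i] = o
	}
	return out
}

func main() {
	S := [][]float32{{0, 0}, {0, 0}}
	k := []float32{3, 4}
	l2Normalize(k, 1e-6)
	v := []float32{2, 4}
	beta := sigmoid(2) // write strength from a raw gate value
	fmt.Println(deltaRuleStep(S, k, k, v, 1, beta))
}
```

With a unit-length key and beta equal to 1, a single step stores v exactly under k, and repeating the same write changes nothing because the prediction error is already zero. That error-correcting behavior is what distinguishes the delta rule from a plain additive state update.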

Vulkan means this runs on AMD, Intel, and mobile GPUs too, not just NVIDIA. On the models I tested, dlgo is 66–126% faster than Ollama's own Vulkan backend.
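The relative figures quoted in this post can be reproduced from the raw tok/s numbers. A small Go sketch follows; the helper name `relativeChange` is mine, and it assumes the percentages are computed against the Ollama figure as the baseline.

```go
package main

import "fmt"

// relativeChange returns how much faster (positive) or slower (negative)
// a throughput is compared with a baseline, in percent.
func relativeChange(tokPerSec, baseline float64) float64 {
	return (tokPerSec/baseline - 1) * 100
}

func main() {
	// GPU throughput pairs from the post: dlgo Vulkan vs Ollama CUDA.
	rows := []struct {
		model        string
		dlgo, ollama float64
	}{
		{"Qwen3.5 0.8B", 239, 187},
		{"Gemma 3 270M", 456, 503},
		{"SmolLM2 360M", 420, 451},
	}
	for _, r := range rows {
		fmt.Printf("%-14s %+.0f%%\n", r.model, relativeChange(r.dlgo, r.ollama))
	}
}
```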

Supports LLaMA, Qwen2/3/3.5, Gemma, Phi, SmolLM2, Mistral, plus Whisper speech-to-text. 25+ quantization formats (Q4_0 through Q8_0, all K-quants).
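As an illustration of what one of these formats involves, here is a hedged Go sketch of dequantizing a single Q4_0 block as the format is laid out in ggml: 32 weights share one scale, each weight is a 4-bit value offset by 8, low nibbles hold elements 0–15 and high nibbles hold elements 16–31. The type and function names are hypothetical and not taken from dlgo, and the scale is assumed to be already converted from its on-disk half-precision form to float32.

```go
package main

import "fmt"

// q40Block is a hypothetical in-memory view of one Q4_0 block.
// On disk the scale is a 16-bit float; here it is already widened.
type q40Block struct {
	Scale  float32
	Quants [16]byte // two 4-bit weights per byte
}

// dequantQ40 expands one block into its 32 float32 weights.
func dequantQ40(b q40Block) [32]float32 {
	var out [32]float32
	for j, packed := range b.Quants {
		lo := int(packed&0x0F) - 8 // element j
		hi := int(packed>>4) - 8   // element j+16
		out[j] = float32(lo) * b.Scale
		out[j+16] = float32(hi) * b.Scale
	}
	return out
}

func main() {
	b := q40Block{Scale: 0.5}
	b.Quants[0] = 0xF0 // low nibble 0, high nibble 15
	w := dequantQ40(b)
	fmt.Println(w[0], w[16])
}
```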

Three lines to run:

    model, _ := dlgo.LoadLLM("model.gguf")
    response, _ := model.Chat("", "What is the capital of France?")
    fmt.Println(response)