Golang inference engine and deep learning primitives

28 stars · C

Go LLM inference with a Vulkan GPU back end that beats Ollama's CUDA

by computerex·Mar 8, 2026·1 point·0 comments

AI Analysis

●● Solid · Wizardry · Big Brain · Niche Gem

Vulkan runs 28% faster than Ollama's CUDA on Qwen3.5 0.8B, but llm.c and llama.cpp already own inference.

Strengths
  • Custom Vulkan compute shaders for Qwen3.5's SSM/GDN architecture (conv1d, delta rule, L2 norm, sigmoid gating). Real architectural insight, not just porting.
  • Cross-GPU: Vulkan runs on NVIDIA, AMD, and Intel without CUDA lock-in. Pure Go with optional CGo; the CPU path has zero dependencies beyond the standard library.
  • Rigorous benchmarks: same GGUF files, head-to-head vs Ollama CPU and GPU. MoE support with 256-expert dispatch.
Weaknesses
  • Ollama already solves this for most users. Performance parity on 7/10 models, slower on some. Vulkan adoption friction vs CUDA maturity.
  • Niche: matters if you need Go+Vulkan or cross-GPU portability. Most inference workloads use Python (vLLM, SGLang).
Target Audience

ML engineers, systems developers, edge/offline inference users, Go ecosystem

Similar To

Ollama · llama.cpp · vLLM

Post Description

dlgo is an LLM inference engine written in Go. CPU path has zero dependencies beyond the standard library. GPU path uses Vulkan compute — no CUDA required.

I benchmarked it against Ollama using the exact same GGUF files on an RTX 4070 Ti SUPER:

GPU (dlgo Vulkan vs Ollama CUDA):

  • Qwen3.5 0.8B: 239 tok/s vs 187 tok/s (28% faster)
  • Gemma 3 270M: 456 tok/s vs 503 tok/s (−9%)
  • SmolLM2 360M: 420 tok/s vs 451 tok/s (−7%)
  • 10 models tested, within 7–25% of CUDA on standard architectures

CPU (dlgo vs Ollama, same GGUF):

  • 6 of 10 models within 9% of Ollama
  • 2 models faster (Gemma 270M +3%, SmolLM2 360M +7%)

The Qwen3.5 result surprised me. Qwen3.5 uses a hybrid Gated Delta Net + attention architecture (SSM layers with a recurrent delta rule). I wrote 6 custom Vulkan compute shaders for it (conv1d, delta rule recurrence, L2 normalization, sigmoid gating), and the fused Vulkan pipeline ended up outperforming llama.cpp's CUDA kernels.
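For readers unfamiliar with the delta rule, here is a minimal single-head CPU sketch in Go of the operations named above. It uses the textbook gated delta rule (decay the state, predict the value for the current key, write back the error) and is not taken from dlgo's shaders; the function names, the state layout, and the use of a plain sigmoid for the write gate are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// l2Normalize returns x scaled to unit Euclidean length.
// The small epsilon avoids division by zero for an all-zero input.
func l2Normalize(x []float32) []float32 {
	var sum float64
	for _, v := range x {
		sum += float64(v) * float64(v)
	}
	inv := float32(1 / math.Sqrt(sum+1e-12))
	out := make([]float32, len(x))
	for i, v := range x {
		out[i] = v * inv
	}
	return out
}

// sigmoid maps a raw gate logit into (0, 1).
func sigmoid(x float32) float32 {
	return float32(1 / (1 + math.Exp(-float64(x))))
}

// deltaRuleStep advances the recurrent state S (rows: value dims, cols: key
// dims) by one token, in place, and returns the readout S*q.
// alpha is the decay gate and beta the write strength (hypothetical names).
func deltaRuleStep(S [][]float32, q, k, v []float32, alpha, beta float32) []float32 {
	out := make([]float32, len(v))
	for i := range S {
		// Decay the old state and predict the value stored under key k.
		var pred float32
		for j := range k {
			S[i][j] *= alpha
			pred += S[i][j] * k[j]
		}
		// Write back the prediction error, then read out with the query.
		delta := beta * (v[i] - pred)
		for j := range k {
			S[i][j] += delta * k[j]
			out[i] += S[i][j] * q[j]
		}
	}
	return out
}

func main() {
	S := [][]float32{{0, 0}, {0, 0}}
	k := l2Normalize([]float32{2, 0})
	beta := sigmoid(20) // a large logit gives a write strength near 1

	// Writing twice under the same key replaces the stored value
	// instead of accumulating it, which is the point of the delta rule.
	fmt.Println(deltaRuleStep(S, k, k, []float32{1, 2}, 1, beta))
	fmt.Println(deltaRuleStep(S, k, k, []float32{3, 4}, 1, beta))
}
```

A GPU version would presumably parallelize the outer loop over rows of the state, since each row is updated independently of the others.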

Vulkan means this runs on AMD, Intel, and mobile GPUs too — not just NVIDIA. Ollama's own Vulkan backend is 66–126% slower than dlgo on the models I tested.
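The percentages in the GPU comparison can be reproduced from the quoted tok/s figures. The sketch below assumes each percentage is the throughput difference relative to the Ollama baseline; the helper name `relativeDiff` is made up for this example.

```go
package main

import "fmt"

// relativeDiff returns how much faster (positive) or slower (negative)
// throughput a is than baseline b, as a percentage of b.
func relativeDiff(a, b float64) float64 {
	return (a - b) / b * 100
}

func main() {
	// tok/s figures quoted in the post: dlgo Vulkan vs Ollama CUDA.
	runs := []struct {
		model        string
		dlgo, ollama float64
	}{
		{"Qwen3.5 0.8B", 239, 187},
		{"Gemma 3 270M", 456, 503},
		{"SmolLM2 360M", 420, 451},
	}
	for _, r := range runs {
		fmt.Printf("%-14s %+.0f%%\n", r.model, relativeDiff(r.dlgo, r.ollama))
	}
}
```

Note that the choice of baseline matters: "A is X% slower than B" and "B is X% faster than A" are different numbers unless the same denominator is used.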

Supports LLaMA, Qwen2/3/3.5, Gemma, Phi, SmolLM2, Mistral, plus Whisper speech-to-text. 25+ quantization formats (Q4_0 through Q8_0, all K-quants).
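As an illustration of what one of these formats involves, here is a sketch of dequantizing a single Q4_0 block as GGML defines it: 32 weights share one scale, and each weight is a 4-bit value stored with an offset of 8. This is not dlgo's code. The type and method names are hypothetical, and the scale is held as a float32 for simplicity, whereas the on-disk format stores it as a 16-bit float.

```go
package main

import "fmt"

const blockSize = 32 // weights per Q4_0 block

// q40Block mirrors the Q4_0 layout, except that the scale is a float32 here
// rather than the 16-bit float used on disk (a simplification for this sketch).
type q40Block struct {
	scale  float32
	quants [blockSize / 2]byte // two 4-bit values per byte
}

// dequantize expands one block to 32 float32 weights.
// Low nibbles hold elements 0..15 and high nibbles hold elements 16..31;
// each nibble is stored with an offset of 8.
func (b q40Block) dequantize() [blockSize]float32 {
	var out [blockSize]float32
	for j, q := range b.quants {
		lo := int(q&0x0F) - 8
		hi := int(q>>4) - 8
		out[j] = float32(lo) * b.scale
		out[j+blockSize/2] = float32(hi) * b.scale
	}
	return out
}

func main() {
	b := q40Block{scale: 0.5}
	b.quants[0] = 0xF0 // low nibble 0 maps to -8, high nibble 15 maps to +7
	vals := b.dequantize()
	fmt.Println(vals[0], vals[16])
}
```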

Three lines to run:

    model, _ := dlgo.LoadLLM("model.gguf")
    response, _ := model.Chat("", "What is the capital of France?")
    fmt.Println(response)
