Go LLM inference with a Vulkan GPU back end that beats Ollama's CUDA
28% faster Vulkan-to-CUDA on Qwen, but llm.c and llama.cpp already own inference.
Cross-vendor LLM inference based on TRiP using Vulkan compute. Zig + TurboQuant - no CUDA lock-in.
Pure Vulkan compute enables LLMs inside game loops without CUDA lock-in.
Game developers and edge engineers needing cooperative GPU inference
llama.cpp · vLLM · MLC LLM
It's quite different from llama.cpp, vLLM, or ZINC, for example.
It's written in Zig and super clean, with no dependencies apart from Vulkan compute itself, of course.
While it supports multiple model families and features like an OpenAI-compatible server, the big thing is that you can run it under strict time budgets inside an existing Vulkan host: think video games, AR/VR apps, edge devices, or robots.
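To make the time-budget idea concrete, here is a minimal sketch in Python. It is not the project's Zig code or API: `token_steps` and `run_within_budget` are hypothetical names, and the decoder is a stand-in that emits placeholder tokens. It only illustrates the general pattern of doing as much inference work as fits in a per-frame budget and then handing control back to the host.

```python
import time
from typing import Callable, Iterator, List


def token_steps(prompt: str, max_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a decoder: one placeholder token per step."""
    for i in range(max_tokens):
        yield f"tok{i}"


def run_within_budget(
    steps: Iterator[str],
    budget_s: float,
    clock: Callable[[], float] = time.perf_counter,
) -> List[str]:
    """Pull decode steps until this frame's time budget is used up.

    Unfinished work stays in `steps` and resumes on the next call, so the
    host (render loop, control loop, ...) always gets control back.
    """
    deadline = clock() + budget_s
    produced: List[str] = []
    while clock() < deadline:
        token = next(steps, None)
        if token is None:  # generation finished
            break
        produced.append(token)
    return produced


if __name__ == "__main__":
    steps = token_steps("hello", max_tokens=8)
    for frame in range(3):
        # The host would do its own per-frame work here, then lend
        # inference a small slice of the frame (2 ms in this example).
        tokens = run_within_budget(steps, budget_s=0.002)
        print(f"frame {frame}: {len(tokens)} tokens")
```

The budget is only checked between steps, so a single step that runs long will overshoot the deadline. A real runtime would need steps small enough, or a cost estimate good enough, to keep that overshoot acceptable.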
In addition, it offers a rich probe interface for researching language model internals in close to real time.
I'm not trying to take on existing runtimes; those exist already. What I am trying to do is help inference work cooperatively with time-sensitive applications.
Would love it if you checked it out.
Please let me know if you have any questions or ideas. Thanks!
Explicit kernel control over TVM-style black boxes, but benchmarks show mixed wins vs Transformers.js.
Meshless Voronoi discretization running realtime on Vulkan is genuinely clever.
Genetic algorithm evolves x86 kernels; runs 80B MoE on single GPU with CPU offload.
Someone rebuilt an inference stack from the ground up in Zig and shipped it as a single binary, including Python bindings, built-in quantization (4/8-bit grouped affine schemes), embeddings, and a plugin-friendly web UI. It’s technically ambitious and immediately useful for anyone wanting local model routing and compact quantized workflows, though GPU support (CUDA) is still on the roadmap and the space is crowded with established alternatives.
Broker-enforced Protobuf schemas shrink payloads 4x, runs in 2.6 MB with service discovery baked in.