GitHub Repository

Cross-vendor LLM inference based on TRiP using Vulkan compute. Zig + TurboQuant - no CUDA lock-in.

7 stars · Zig

Valkyr LM Inference with Realtime Guarantees

by quatonion · May 3, 2026 · 3 points · 0 comments

AI Analysis

●●● Banger · Wizardry · Niche Gem

Pure Vulkan compute enables LLMs inside game loops without CUDA lock-in.

Strengths
  • Single SPIR-V binary runs on NVIDIA, AMD, Intel, and Apple Silicon without backend duplication.
  • Cooperative compute integration allows inference to share queues and command pools with render passes.
  • First public demonstration of TurboQuant KV-cache compression on a non-CUDA backend.
Weaknesses
  • Young project with months of development versus years for mature runtimes like llama.cpp.
  • Vulkan compute performance may lag behind vendor-optimized CUDA kernels for raw throughput.
Category
Target Audience

Game developers and edge engineers needing cooperative GPU inference

Similar To

llama.cpp · vLLM · MLC LLM

Post Description

Valkyr is a fresh take on LM Inference runtimes.

It's quite different from llama.cpp, vLLM, or ZINC, for example.

It's written in Zig and super clean, with no dependencies apart from Vulkan Compute itself, of course.

While it supports multiple model families and features like an OpenAI-compatible server, the big thing is that it can run under strict time budgets inside an existing Vulkan host: think video games, AR/VR apps, edge devices, or robots.
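To make the time-budget idea concrete, here is a rough sketch of a host loop that gives inference a fixed slice of each frame. This is not Valkyr's API: `run_with_budget`, `fake_decoder`, and `FakeClock` are hypothetical names made up for illustration, and a string stands in for each decoded token.

```python
import time
from typing import Callable, Iterator, List


def run_with_budget(
    steps: Iterator[str],
    budget: float,
    clock: Callable[[], float] = time.perf_counter,
) -> List[str]:
    """Pull items from `steps` until the per-frame budget is spent.

    The deadline is checked before each step, so a step that has started
    always finishes. A real runtime would need finer-grained yield points
    to bound how far a single step can overrun the budget.
    """
    deadline = clock() + budget
    produced: List[str] = []
    while clock() < deadline:
        try:
            produced.append(next(steps))
        except StopIteration:
            break  # the decoder has nothing left to emit
    return produced


def fake_decoder(tokens: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a model's token-by-token decode loop."""
    for tok in tokens:
        yield tok


class FakeClock:
    """Deterministic clock for the demo: each call advances time by `tick`."""

    def __init__(self, tick: float) -> None:
        self.now = 0.0
        self.tick = tick

    def __call__(self) -> float:
        current = self.now
        self.now += self.tick
        return current


if __name__ == "__main__":
    decoder = fake_decoder(["The", " quick", " brown", " fox", " jumps"])
    frame = 0
    while True:
        # Each "frame" gets a fresh budget of 2.5 fake time units.
        out = run_with_budget(decoder, budget=2.5, clock=FakeClock(1.0))
        if not out:
            break
        print(f"frame {frame}: {out}")
        frame += 1
```

The decoder state lives in the iterator, so generation simply resumes on the next frame where it left off. With the default `time.perf_counter` clock the budget is in seconds; the fake clock is only there to make the demo repeatable.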

In addition, it supports a rich probe interface for researching language model internals in close to real time.

I'm not trying to take on existing runtimes; those exist already. What I am trying to do is help inference work cooperatively with time-sensitive applications.

Would love it if you checked it out.

Please let me know if you have any questions or ideas. Thanks!

Similar Projects

AI/ML · ●● Solid

Talu, single-binary, local-first LLM runtime

Someone rebuilt an inference stack from the ground up in Zig and shipped it as a single binary — including Python bindings, built-in quantization (4/8-bit grouped affine schemes), embeddings, and a plugin-friendly web UI. It’s technically ambitious and immediately useful for anyone wanting local model routing and compact quantized workflows, though GPU support (CUDA) is still on the roadmap and the space is crowded with established alternatives.

Wizardry · Niche Gem
aprxi
20 · 4mo ago