GitHub Repository

Cross-vendor LLM inference based on TRiP using Vulkan compute. Zig + TurboQuant - no CUDA lock-in.

7 stars · Zig

Valkyr LM Inference with Realtime Guarantees

Author: quatonion

by quatonion · May 3, 2026 · 3 points · 0 comments


AI Analysis

●●● Banger · Wizardry · Niche Gem

Pure Vulkan compute enables LLMs inside game loops without CUDA lock-in.

Strengths

• Single SPIR-V binary runs on NVIDIA, AMD, Intel, and Apple Silicon without backend duplication.
• Cooperative compute integration allows inference to share queues and command pools with render passes.
• First public demonstration of TurboQuant KV-cache compression on a non-CUDA backend.

Weaknesses

• Young project with months of development versus years for mature runtimes like llama.cpp.
• Vulkan compute performance may lag behind vendor-optimized CUDA kernels for raw throughput.

Post Description

Valkyr is a fresh take on LM inference runtimes.

It's quite different from llama.cpp, vLLM, or ZINC, for example.

It's written in Zig and super clean, with no dependencies apart from Vulkan compute itself, of course.

While it supports multiple model families and features like an OpenAI-compatible server, the big thing is that you can run it under strict time budgets inside an existing Vulkan host: think video games, AR/VR apps, edge devices, or robots.
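
To make the time-budget idea concrete, here is a minimal sketch in Python. It is not Valkyr's API: `run_budgeted`, `step`, `budget`, and `clock` are hypothetical names chosen for illustration. It assumes the host grants the inference loop a slice of each frame, and that the loop runs whole units of work (say, one decoded token per step) until the next unit might not fit, then hands control back.

```python
import time
from typing import Callable


def run_budgeted(
    step: Callable[[], bool],
    budget: float,
    clock: Callable[[], float] = time.perf_counter,
) -> int:
    """Run `step` repeatedly until the budget is spent or the work is done.

    `step` is a hypothetical unit of inference work (for example, decoding
    one token). It returns False when nothing is left to do.
    `budget` is in the same units as `clock` (seconds by default).
    Returns the number of steps completed in this slice.
    """
    start = clock()
    done = 0
    worst = 0  # longest step observed so far, used as a conservative estimate
    while True:
        elapsed = clock() - start
        # Stop if the next step, at its worst observed cost, would overrun.
        if elapsed + worst > budget:
            break
        t0 = clock()
        more = step()
        worst = max(worst, clock() - t0)
        done += 1
        if not more:
            break
    return done
```

A host would call something like `run_budgeted(decode_one_token, budget=0.004)` once per frame and resume on the next frame. Note that the first step in a slice always runs, so a hard guarantee in this sketch still depends on each step having a bounded cost.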

In addition, it supports a rich probe interface for researching language model internals in close to real time.

I'm not trying to take on existing runtimes; those exist already. What I am trying to do is help inference work cooperatively with time-sensitive applications.

Would love it if you checked it out.

Please let me know if you have any questions or ideas. Thanks!

Similar Projects

Infrastructure · ●● Solid

Go LLM inference with a Vulkan GPU back end that beats Ollama's CUDA

Its Vulkan back end runs Qwen 28% faster than Ollama's CUDA path, but llm.c and llama.cpp already own inference.

Wizardry · Big Brain · Niche Gem

computerex

10 · 3mo ago

AI/ML · ●● Solid

Doppler.js – WebGPU inference, faster/simpler than transformer.js

Explicit kernel control over TVM-style black boxes, but benchmarks show mixed wins vs Transformers.js.

Big Brain · Wizardry

clocksmith

30 · 3mo ago

Developer Tools · ●● Solid

HeatSpectra: Realtime 3D Surface Heat Simulation

Meshless Voronoi discretization running realtime on Vulkan is genuinely clever.

Wizardry · Niche Gem

tsun3doku

60 · 1mo ago

Infrastructure · ●●● Banger

Evolved x86 AVX-512 kernels for NF4 LLM inference

Genetic algorithm evolves x86 kernels; runs 80B MoE on single GPU with CPU offload.

Wizardry · Big Brain · Niche Gem

Anuar81

20 · 3mo ago

AI/ML · ●● Solid

Talu, single-binary, local-first LLM runtime

Someone rebuilt an inference stack from the ground up in Zig and shipped it as a single binary — including Python bindings, built-in quantization (4/8-bit grouped affine schemes), embeddings, and a plugin-friendly web UI. It’s technically ambitious and immediately useful for anyone wanting local model routing and compact quantized workflows, though GPU support (CUDA) is still on the roadmap and the space is crowded with established alternatives.

Wizardry · Niche Gem

aprxi

20 · 4mo ago

Infrastructure · ●●● Banger

MQTT with Runtime Protobuf and Service Discovery (Zig)

Broker-enforced Protobuf schemas shrink payloads 4x, runs in 2.6 MB with service discovery baked in.

Wizardry · Big Brain · Niche Gem

gyokhan

20 · 3mo ago