GitHub Repository

The inference engine the open-source world built for itself.

151 stars · Python

ZSE – Single-file LLM engine with dual INT4 kernels

by zyoralabs · Mar 2, 2026 · 1 point · 0 comments

AI Analysis

●● Solid · Wizardry · Ship It

INT4 inference engine that beats bitsandbytes on VRAM in its own benchmarks, but competes against established tools.

Strengths
  • Native INT4 kernels with 12-14% VRAM savings over bitsandbytes
  • Single-file .zse format with embedded tokenizer eliminates setup complexity
  • Measured cold starts (5.7s for 7B) useful for serverless workloads
Weaknesses
  • Crowded inference space: vLLM, TGI, llama.cpp, Ollama already solve this
  • No evidence of architectural novelty over quantization + CUDA kernels
Target Audience

ML engineers, serverless deployments, edge inference

Similar To

llama.cpp · vLLM · Ollama

Post Description

ZSE is an LLM inference engine built for simplicity and efficiency (Phase 1).

Features:
- Single .zse file format (model + tokenizer + config embedded)
- Zero network calls on load, works completely offline
- Dual INT4 kernel backend (ZSE Kernel + ZSE bnb Kernel)
- Intelligent layer auto-selects the optimal kernel for your hardware
- Fast cold starts for serverless deployments

Benchmarks (H200, Qwen 2.5):

ZSE Kernel:
7B → 5.67 GB VRAM, 37 tok/s, 5.7s cold start
14B → 10.08 GB VRAM, 21 tok/s, 10.5s cold start
32B → 19.47 GB VRAM, 11 tok/s, 20.4s cold start
72B → 41.54 GB VRAM, 6 tok/s, 51.8s cold start

ZSE bnb Kernel:
7B → 6.57 GB VRAM, 46 tok/s, 6.0s cold start
14B → 11.39 GB VRAM, 28 tok/s, 7.1s cold start
32B → 22.27 GB VRAM, 20 tok/s, 20.8s cold start
72B → 47.05 GB VRAM, 16 tok/s, 53.0s cold start
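The trade-off between the two kernels is easier to see as ratios. The sketch below is not part of ZSE; it only does arithmetic on the figures quoted above, and the names `BENCHMARKS`, `vram_saving_pct`, and `summarize` are made up for illustration. It computes how much VRAM the ZSE Kernel saves relative to the ZSE bnb Kernel and what fraction of the bnb throughput it keeps.

```python
# Figures copied from the post (H200, Qwen 2.5).
# Each tuple is (VRAM in GB, tokens per second, cold start in seconds).
BENCHMARKS = {
    "7B":  {"zse": (5.67, 37, 5.7),   "bnb": (6.57, 46, 6.0)},
    "14B": {"zse": (10.08, 21, 10.5), "bnb": (11.39, 28, 7.1)},
    "32B": {"zse": (19.47, 11, 20.4), "bnb": (22.27, 20, 20.8)},
    "72B": {"zse": (41.54, 6, 51.8),  "bnb": (47.05, 16, 53.0)},
}


def vram_saving_pct(zse_gb: float, bnb_gb: float) -> float:
    """Percentage of VRAM saved by the ZSE Kernel relative to the bnb Kernel."""
    return (bnb_gb - zse_gb) / bnb_gb * 100


def summarize(benchmarks: dict) -> dict:
    """Per model size: VRAM saving (%) and ZSE/bnb throughput ratio."""
    rows = {}
    for size, kernels in benchmarks.items():
        zse_vram, zse_tps, _ = kernels["zse"]
        bnb_vram, bnb_tps, _ = kernels["bnb"]
        rows[size] = {
            "vram_saving_pct": vram_saving_pct(zse_vram, bnb_vram),
            "throughput_ratio": zse_tps / bnb_tps,
        }
    return rows


if __name__ == "__main__":
    for size, row in summarize(BENCHMARKS).items():
        print(f"{size}: {row['vram_saving_pct']:.1f}% less VRAM, "
              f"{row['throughput_ratio']:.2f}x the bnb throughput")
```

On these numbers the VRAM saving stays between roughly 11.5% and 13.7% at every size, while the ZSE Kernel's share of the bnb throughput falls from about 0.80x at 7B to under 0.40x at 72B, so the memory advantage costs more speed as models grow.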

Usage:
pip install zllm-zse
zse convert Qwen/Qwen2.5-7B-Instruct -o model.zse
zse serve model.zse --port 8000

https://github.com/Zyora-Dev/zse

Similar Projects

AI/ML · ●●● Banger

GLQ – LLM quantization using the E8 lattice

E8 lattice codebooks beat GPTQ at 2-4 bpw, with a fused CUDA kernel that skips weight materialization.

Wizardry · Big Brain
acd
2018d ago
AI/ML · ●● Solid

Mamba SSM in Rust – training and inference with custom CUDA kernels

Custom CUDA kernels for SSM recurrence with zero framework dependencies.

Wizardry · Niche Gem
silvermpx
102mo ago