ZSE – Single-file LLM engine with dual INT4 kernels

Author: zyoralabs

by zyoralabs · Mar 2, 2026 · 1 point · 0 comments


AI Analysis

Solid · Wizardry · Ship It

An INT4 inference engine that beats llama.cpp on VRAM, but it competes against established tools.

Strengths

• Native INT4 kernels with 12-14% VRAM savings over bitsandbytes
• Single-file .zse format with embedded tokenizer eliminates setup complexity
• Measured cold starts (5.7s for 7B) are useful for serverless workloads

Weaknesses

• Crowded inference space: vLLM, TGI, llama.cpp, and Ollama already solve this
• No evidence of architectural novelty beyond quantization + CUDA kernels

Post Description

ZSE is an LLM inference engine built for simplicity and efficiency. (Phase 1)

Features:
- Single .zse file format (model + tokenizer + config embedded)
- Zero network calls on load; works completely offline
- Dual INT4 kernel backend (ZSE Kernel + ZSE bnb Kernel)
- Intelligent layer auto-selects optimal kernel for your hardware
- Fast cold starts for serverless deployments
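The single-file idea in the feature list (model, tokenizer, and config in one artifact, loadable with no network access) can be illustrated with a toy container. This is only a sketch: the `DEMO` magic bytes, the JSON index, and the `pack`/`unpack` helpers are all hypothetical and are not the real .zse layout, which the post does not describe.

```python
import io
import json
import struct

MAGIC = b"DEMO"  # hypothetical magic bytes, NOT the real .zse signature


def pack(sections: dict) -> bytes:
    """Pack named byte sections into one blob: magic, header length, JSON index, payloads."""
    index = {}
    offset = 0
    for name, data in sections.items():
        index[name] = {"offset": offset, "size": len(data)}
        offset += len(data)
    header = json.dumps(index).encode("utf-8")

    out = io.BytesIO()
    out.write(MAGIC)
    out.write(struct.pack("<Q", len(header)))  # little-endian 8-byte header length
    out.write(header)
    for data in sections.values():
        out.write(data)
    return out.getvalue()


def unpack(blob: bytes) -> dict:
    """Read every section back using only the embedded index (no external files)."""
    if blob[:4] != MAGIC:
        raise ValueError("not a demo container")
    (header_len,) = struct.unpack("<Q", blob[4:12])
    index = json.loads(blob[12:12 + header_len])
    base = 12 + header_len  # payloads start right after the header
    return {
        name: blob[base + entry["offset"]: base + entry["offset"] + entry["size"]]
        for name, entry in index.items()
    }


# Stand-in payloads; a real engine would store quantized tensors here.
sections = {
    "config": json.dumps({"hidden_size": 8, "bits": 4}).encode("utf-8"),
    "tokenizer": json.dumps({"vocab": ["<s>", "hello"]}).encode("utf-8"),
    "weights": bytes(range(16)),
}
blob = pack(sections)
restored = unpack(blob)
```

Because the index travels inside the file, a loader needs nothing else on disk and nothing from the network, which is the property the feature list emphasizes.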

Benchmarks (H200, Qwen 2.5):

ZSE Kernel:
7B → 5.67 GB VRAM, 37 tok/s, 5.7s cold start
14B → 10.08 GB VRAM, 21 tok/s, 10.5s cold start
32B → 19.47 GB VRAM, 11 tok/s, 20.4s cold start
72B → 41.54 GB VRAM, 6 tok/s, 51.8s cold start

ZSE bnb Kernel:
7B → 6.57 GB VRAM, 46 tok/s, 6.0s cold start
14B → 11.39 GB VRAM, 28 tok/s, 7.1s cold start
32B → 22.27 GB VRAM, 20 tok/s, 20.8s cold start
72B → 47.05 GB VRAM, 16 tok/s, 53.0s cold start
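The trade-off between the two kernels is easier to see as percentages. The sketch below copies the benchmark figures above and computes, per model size, how much VRAM the ZSE Kernel saves and how much throughput it gives up relative to the ZSE bnb Kernel. The `pick_kernel` helper is a hypothetical heuristic added for illustration (prefer the faster kernel when it fits in free VRAM); it is not ZSE's actual selection logic, which the post does not describe.

```python
# Figures copied from the benchmark lines above (H200, Qwen 2.5).
# Each entry: model size -> (VRAM in GB, tokens per second, cold start in seconds)
ZSE_KERNEL = {
    "7B": (5.67, 37, 5.7),
    "14B": (10.08, 21, 10.5),
    "32B": (19.47, 11, 20.4),
    "72B": (41.54, 6, 51.8),
}
BNB_KERNEL = {
    "7B": (6.57, 46, 6.0),
    "14B": (11.39, 28, 7.1),
    "32B": (22.27, 20, 20.8),
    "72B": (47.05, 16, 53.0),
}


def compare(size):
    """Return (VRAM saving %, throughput loss %) of ZSE Kernel relative to ZSE bnb Kernel."""
    zse_vram, zse_tps, _ = ZSE_KERNEL[size]
    bnb_vram, bnb_tps, _ = BNB_KERNEL[size]
    vram_saving = (bnb_vram - zse_vram) / bnb_vram * 100
    tps_loss = (bnb_tps - zse_tps) / bnb_tps * 100
    return round(vram_saving, 1), round(tps_loss, 1)


def pick_kernel(size, free_vram_gb):
    """Hypothetical heuristic: choose the fastest kernel whose VRAM figure fits."""
    candidates = {"zse": ZSE_KERNEL[size], "bnb": BNB_KERNEL[size]}
    fitting = {name: row for name, row in candidates.items() if row[0] <= free_vram_gb}
    if not fitting:
        return None  # neither kernel fits in the available memory
    return max(fitting, key=lambda name: fitting[name][1])  # highest tok/s wins


for size in ZSE_KERNEL:
    saving, loss = compare(size)
    print(f"{size}: {saving}% less VRAM, {loss}% fewer tok/s")
```

On these numbers the ZSE Kernel saves roughly 11.5% to 13.7% VRAM, while its throughput deficit grows from about 20% at 7B to 62.5% at 72B, so the memory-lean kernel mainly pays off when the faster one does not fit.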

Usage:
  pip install zllm-zse
  zse convert Qwen/Qwen2.5-7B-Instruct -o model.zse
  zse serve model.zse --port 8000

https://github.com/Zyora-Dev/zse