GitHub Repository

A GPU defined in software. Runs Llama 3.2 1B at 3.6 tok/sec. Zero dependencies.

24 stars · JavaScript

PureBee – A software-defined GPU running Llama 3.2 1B at 3.6 tok/sec

Author: benryanx

by benryanx · Feb 23, 2026 · 3 points · 6 comments

AI Analysis

●●● Banger · Wizardry · Big Brain · Zero to One

Pure math beats silicon: full LLM inference via auditable WASM+SIMD, zero compiler toolchain.

Strengths

• The 45× speedup progression from first principles (JS → typed arrays → WASM → Q4 → SIMD → threads) is transparent and reproducible.
• Genuinely auditable stack: every layer, including matmul, is readable source, with no black-box Emscripten output.
• Runs full Llama 3.2 1B inference, not a toy; outputs are verifiable on Node.js alone.

Weaknesses

• 3.6 tok/sec is ~100× slower than llama.cpp on GPU, so the use case is niche (education, verification, offline).
• No batching, streaming, or multi-model support yet; the single-core constraint limits production utility.

Post Description

This started as a question about simulation theory: if a GPU is just rules applied to a grid in parallel, do you actually need the silicon?

Turns out, no.

PureBee is a complete GPU defined as a software specification — Memory, Engine, Instruction Set, Runtime. It runs Llama 3.2 1B inference at 3.6 tok/sec on a single CPU core. The model answers questions correctly.

What makes it different from llama.cpp or WebLLM:

The WASM compute kernel is constructed byte-by-byte in JavaScript at runtime. No Emscripten. No Rust. No compiler. No build step. The binary that runs the Q4 SIMD matrix math is itself readable JavaScript. Every layer of the stack — including the thing executing the math — is auditable source.
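For readers who have not seen a WASM binary assembled by hand, here is a minimal sketch of the general technique. It is not PureBee's kernel: the helper names (`section`, `vec`, `buildAddModule`) are hypothetical, and the module only exports a scalar `add(i32, i32)` rather than any Q4 SIMD math. The byte values follow the WebAssembly binary format.

```javascript
// Hypothetical helpers for illustration; not PureBee's API.
// A section is: id byte, body size, body. The size is written as one byte,
// which is only valid LEB128 while the body is shorter than 128 bytes.
const section = (id, body) => [id, body.length, ...body];
// A vector is: element count, then the elements back to back.
const vec = (items) => [items.length, ...items.flat()];

const I32 = 0x7f;

function buildAddModule() {
  const header = [0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]; // "\0asm", version 1

  // Type section: one function type, (i32, i32) -> i32.
  const funcType = [0x60, ...vec([[I32], [I32]]), ...vec([[I32]])];
  const types = section(1, vec([funcType]));

  // Function section: function 0 uses type 0.
  const functions = section(3, vec([[0x00]]));

  // Export section: export function 0 under the name "add".
  const name = [...'add'].map((c) => c.charCodeAt(0));
  const exports = section(7, vec([[name.length, ...name, 0x00, 0x00]]));

  // Code section: no locals; local.get 0, local.get 1, i32.add, end.
  const body = [0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b];
  const code = section(10, vec([[body.length, ...body]]));

  return Uint8Array.from([...header, ...types, ...functions, ...exports, ...code]);
}

// Compile and run the bytes directly: no compiler, no build step.
const wasmModule = new WebAssembly.Module(buildAddModule());
const { add } = new WebAssembly.Instance(wasmModule).exports;
console.log(add(2, 3)); // prints 5
```

A real kernel would need multi-byte LEB128 sizes, a memory import, and SIMD opcodes, but the principle is the same: the binary is an array of numbers that ordinary JavaScript can produce and a reader can audit.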

The progression from first principles:

```
Baseline JS       0.08 tok/sec
Typed arrays      0.21 tok/sec
WASM kernels      0.70 tok/sec
Q4 quantization   1.30 tok/sec
SIMD              3.00 tok/sec
Worker threads    3.60 tok/sec
```

45× total. Single CPU core. Zero npm dependencies.
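The 45× figure is the last rate divided by the first (3.60 / 0.08). The sketch below recomputes the per-stage and cumulative factors from the table; the `stages` array and the `speedups` helper are illustrative names, and the rates are copied from the post rather than measured.

```javascript
// Throughput figures copied from the progression table (tok/sec).
const stages = [
  ['Baseline JS', 0.08],
  ['Typed arrays', 0.21],
  ['WASM kernels', 0.70],
  ['Q4 quantization', 1.30],
  ['SIMD', 3.00],
  ['Worker threads', 3.60],
];

// For each stage: speedup over the previous stage and over the baseline.
function speedups(rows) {
  const baseline = rows[0][1];
  return rows.map(([name, rate], i) => ({
    name,
    rate,
    step: i === 0 ? 1 : rate / rows[i - 1][1],
    total: rate / baseline,
  }));
}

for (const s of speedups(stages)) {
  console.log(
    `${s.name.padEnd(16)} ${s.rate.toFixed(2)} tok/sec  ` +
    `step x${s.step.toFixed(2)}  total x${s.total.toFixed(1)}`
  );
}
```

On these numbers the largest single steps are the move to WASM kernels and the typed-array change; worker threads contribute the smallest factor.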

The claim isn't that this is faster than a real GPU. The claim is that a GPU was never the hardware — it was always the math. The hardware is just one way to run the math fast. PureBee is another way. If that's true, it changes where inference can run.

To run:

```
git clone https://github.com/PureBee/purebee
cd purebee
node download.js llama3
node --max-old-space-size=4096 chat-llama3.js
```

Requires Node.js ≥ 20. The heap flag is not optional.
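One way to catch a missing flag early is a small preflight check. This is a sketch, not something the repository is known to ship: `preflight` is a hypothetical helper, and the 4096 MiB threshold is an assumption taken from the flag value above. `v8.getHeapStatistics().heap_size_limit` is the standard Node.js call that reports V8's current heap limit in bytes.

```javascript
const v8 = require('node:v8');

// Hypothetical preflight helper; not part of PureBee.
// Returns a list of human-readable problems (empty when everything looks fine).
function preflight(nodeVersion, heapLimitBytes, minMajor = 20, minHeapMiB = 4096) {
  const major = Number.parseInt(nodeVersion.split('.')[0], 10);
  const heapMiB = Math.floor(heapLimitBytes / (1024 * 1024));
  const problems = [];
  if (major < minMajor) {
    problems.push(`Node.js ${nodeVersion} is older than the required major version ${minMajor}`);
  }
  if (heapMiB < minHeapMiB) {
    problems.push(
      `heap limit is ${heapMiB} MiB; start node with --max-old-space-size=${minHeapMiB}`
    );
  }
  return problems;
}

// Check the process this script is running in.
const problems = preflight(process.versions.node, v8.getHeapStatistics().heap_size_limit);
for (const p of problems) console.warn(p);
```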

Licensed FSL-1.1 (converts to Apache 2.0 in 2 years). Free for personal and internal use.

Happy to go deep on the WASM binary construction, the Q4 nibble layout, or the SharedArrayBuffer weight cache that runs a 4.5GB model in 1.8GB of RAM.
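As background for the nibble-layout question, the sketch below shows one common way to pack 4-bit weights. The post does not describe PureBee's actual layout, so everything here is an assumption: blocks of 32 weights, one scale per block, two values per byte with the low nibble first, and the hypothetical names `quantizeBlock` and `dequantizeBlock`.

```javascript
// Assumed layout for illustration only (not confirmed by the post):
// 32 weights per block, one scale per block, two 4-bit values per byte,
// even-indexed weight in the low nibble, odd-indexed weight in the high nibble.
const BLOCK = 32;

function quantizeBlock(weights) {
  let maxAbs = 0;
  for (const w of weights) maxAbs = Math.max(maxAbs, Math.abs(w));
  const scale = maxAbs / 7 || 1; // map values to [-7, 7]; fall back to 1 for an all-zero block
  const packed = new Uint8Array(BLOCK / 2);
  for (let i = 0; i < BLOCK; i += 2) {
    // Shift by 8 so each value is stored as an unsigned nibble in 1..15.
    const lo = Math.round(weights[i] / scale) + 8;
    const hi = Math.round(weights[i + 1] / scale) + 8;
    packed[i >> 1] = (hi << 4) | lo;
  }
  return { scale, packed };
}

function dequantizeBlock({ scale, packed }) {
  const out = new Float32Array(BLOCK);
  for (let i = 0; i < packed.length; i++) {
    out[2 * i] = ((packed[i] & 0x0f) - 8) * scale;   // low nibble
    out[2 * i + 1] = ((packed[i] >> 4) - 8) * scale; // high nibble
  }
  return out;
}

// Round-trip a ramp of values in [-1, 1) and report the worst-case error.
const weights = Float32Array.from({ length: BLOCK }, (_, i) => (i - 16) / 16);
const block = quantizeBlock(weights);
const restored = dequantizeBlock(block);
const maxError = Math.max(...weights.map((w, i) => Math.abs(w - restored[i])));
console.log(`packed bytes: ${block.packed.length}, max error: ${maxError.toFixed(4)}`);
```

Under this assumed scheme a block of 32 weights takes 16 bytes of nibbles plus its scale, and the round-trip error is bounded by half the scale.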

Similar Projects

Developer Tools · ● Mid

Math-spec-driven LLM skill for complex system specs

Math-spec approach for LLM-generated code, but lacks working examples and doesn't solve the reasoning-accuracy problem.

Big Brain

ben8t

104mo ago

AI/ML · ●●● Banger

TurboPrefill – Multi-GPU prefill acceleration for llama.cpp

2x prefill speedup on 12k+ token contexts by treating GPUs like a production line.

Big Brain · Wizardry

trykhlieb

2018d ago

AI/ML · ●● Solid

Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU

33x speedup over mmap for 70B on RTX 3090, but still 0.2 tok/s vs vLLM's 30+ tok/s.

Wizardry · Dark Horse

xaskasdf

3951014mo ago

Finance · ●● Solid

I wrote an open source SEC filing compliance package

Open-source SEC XML filer replaces expensive proprietary software like Workiva for teams.

Niche Gem · Solve My Problem

jgfriedman1999

103mo ago

Infrastructure · ● Mid

ChangeSpec is an open specification for software change communication

Standardizing change logs is noble, but adoption requires solving the network effect.

Bold Bet

cdnsteve

601mo ago

Data · ●●●● Gem

Cuckoo-GPU – A 350x faster Bloom filter alternative for GPUs

350x faster GPU Bloom filter with academic paper backing the performance claims.

Wizardry · Big Brain · Dark Horse

tdortman

113mo ago