A GPU defined in software. Runs Llama 3.2 1B at 3.6 tok/sec. Zero dependencies.
Pure math beats silicon: full LLM inference via auditable WASM+SIMD, zero compiler toolchain.
Systems programmers, compiler researchers, AI inference developers interested in portable compute primitives
llama.cpp · WebLLM · ONNX Runtime
Turns out, no.
PureBee is a complete GPU defined as a software specification — Memory, Engine, Instruction Set, Runtime. It runs Llama 3.2 1B inference at 3.6 tok/sec on a single CPU core. The model answers questions correctly.
What makes it different from llama.cpp or WebLLM:
The WASM compute kernel is constructed byte-by-byte in JavaScript at runtime. No Emscripten. No Rust. No compiler. No build step. The binary that runs the Q4 SIMD matrix math is itself readable JavaScript. Every layer of the stack — including the thing executing the math — is auditable source.
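To make "constructed byte-by-byte" concrete, here is a minimal sketch of the general technique. It is not PureBee's kernel: it assembles a tiny module exporting a hypothetical `add` function, where the real thing would emit Q4 SIMD matrix code. The names `wasmBytes`, `wasmModule`, `wasmInstance`, and `add` are illustrative assumptions.

```javascript
// Minimal sketch: a WASM module assembled by hand, exporting add(i32, i32) -> i32.
// Illustrative only; a real compute kernel would be far larger.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, // magic "\0asm"
  0x01, 0x00, 0x00, 0x00, // binary format version 1
  // Type section (id 1, 7 bytes): one function type (i32, i32) -> i32
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,
  // Function section (id 3, 2 bytes): one function, using type 0
  0x03, 0x02, 0x01, 0x00,
  // Export section (id 7, 7 bytes): export function 0 under the name "add"
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00,
  // Code section (id 10, 9 bytes): local.get 0, local.get 1, i32.add, end
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,
]);

// Synchronous compile and instantiate; no toolchain or build step involved.
const wasmModule = new WebAssembly.Module(wasmBytes);
const wasmInstance = new WebAssembly.Instance(wasmModule);
const add = wasmInstance.exports.add;

console.log(add(2, 3)); // 5
```

Every byte in the array is commented source, which is the sense in which a hand-built binary stays auditable.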
The progression from first principles:
```
Stage             Throughput
Baseline JS       0.08 tok/sec
Typed arrays      0.21 tok/sec
WASM kernels      0.70 tok/sec
Q4 quantization   1.30 tok/sec
SIMD              3.00 tok/sec
Worker threads    3.60 tok/sec
```
45× total. Single CPU core. Zero npm dependencies.
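The 45× figure is the last stage divided by the first (3.60 / 0.08). A short sketch that recomputes the per-stage and cumulative gains from the numbers quoted above; the `stages` array and the `speedups` helper are illustrative names, not part of PureBee.

```javascript
// Throughput figures copied from the progression above, in tok/sec.
const stages = [
  ["Baseline JS", 0.08],
  ["Typed arrays", 0.21],
  ["WASM kernels", 0.70],
  ["Q4 quantization", 1.30],
  ["SIMD", 3.00],
  ["Worker threads", 3.60],
];

// For each stage: gain over the previous stage, and gain over the baseline.
function speedups(list) {
  const base = list[0][1];
  return list.map(([name, rate], i) => ({
    name,
    stepGain: i === 0 ? 1 : rate / list[i - 1][1],
    totalGain: rate / base,
  }));
}

const rows = speedups(stages);
for (const r of rows) {
  console.log(
    `${r.name.padEnd(16)} step ×${r.stepGain.toFixed(2)}  total ×${r.totalGain.toFixed(1)}`
  );
}
```

Read this way, the largest single steps are the move to WASM kernels (about 3.3×) and to typed arrays (about 2.6×), while the final worker-thread step adds 1.2×.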
The claim isn't that this is faster than a real GPU. The claim is that a GPU was never the hardware — it was always the math. The hardware is just one way to run the math fast. PureBee is another way. If that's true, it changes where inference can run.
To run:
```
git clone https://github.com/PureBee/purebee
cd purebee
node download.js llama3
node --max-old-space-size=4096 chat-llama3.js
```
Requires Node.js ≥ 20. The --max-old-space-size=4096 heap flag is required, not optional.
Licensed under FSL-1.1 (converts to Apache 2.0 after two years). Free for personal and internal use.
Happy to go deep on the WASM binary construction, the Q4 nibble layout, or the SharedArrayBuffer weight cache that runs a 4.5 GB model in 1.8 GB of RAM.
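For readers unfamiliar with the term, a "Q4 nibble layout" generally means storing two 4-bit quantized weights per byte alongside a per-block scale. The sketch below shows one common scheme under stated assumptions: a block of 32 values, a symmetric scale of max|x| / 7, the even-indexed weight in the low nibble and the odd-indexed weight in the high nibble. PureBee's actual layout is not described here and may differ; `BLOCK`, `quantizeBlock`, `toNibble`, and `dequantizeBlock` are hypothetical names.

```javascript
// Assumed block size for this sketch (32 weights share one scale).
const BLOCK = 32;

// Map a float to a 4-bit code: round to [-8, 7], then bias by 8 into [0, 15].
function toNibble(value, scale) {
  const q = Math.round(value / scale);
  return Math.min(7, Math.max(-8, q)) + 8;
}

// Pack one block: two weights per byte, low nibble first.
function quantizeBlock(values) {
  let maxAbs = 0;
  for (const v of values) maxAbs = Math.max(maxAbs, Math.abs(v));
  const scale = maxAbs / 7 || 1; // avoid a zero scale for an all-zero block
  const packed = new Uint8Array(values.length / 2);
  for (let i = 0; i < values.length; i += 2) {
    const lo = toNibble(values[i], scale);
    const hi = toNibble(values[i + 1], scale);
    packed[i >> 1] = lo | (hi << 4);
  }
  return { scale, packed };
}

// Unpack one block back to approximate floats.
function dequantizeBlock({ scale, packed }) {
  const out = new Float32Array(packed.length * 2);
  for (let i = 0; i < packed.length; i++) {
    out[2 * i] = ((packed[i] & 0x0f) - 8) * scale;
    out[2 * i + 1] = ((packed[i] >> 4) - 8) * scale;
  }
  return out;
}

// Round-trip a sample block of 32 weights in [-1, 1).
const weights = Float32Array.from({ length: BLOCK }, (_, i) => (i - 16) / 16);
const block = quantizeBlock(weights);
const restored = dequantizeBlock(block);
console.log(block.packed.length, "bytes for", weights.length, "weights");
```

In this scheme each block costs 16 bytes of nibbles plus one scale, versus 128 bytes for the same weights as 32-bit floats, at the price of a per-weight error of at most half the scale.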