Nvshmem from Scratch – RDMA, GPUDirect, and GPU Networking Demystified
NVSHMEM from scratch with RDMA, PCIe topology, GPUDirect RDMA, CUDA IPC—demystifies GPU networking internals.

This reads like a GPU engineer's field notes — one ~3,400-line CUDA file implements a full per-thread crypto pipeline (key gen → EC multiply → SHA-256 → RIPEMD-160) and a two-stage bloom+binary-search matcher to check ~3,100 targets at ~100M keys per batch. The article digs into concrete low-level choices (LUT layout, memory hierarchy, __ldg reads, atomicCAS reporting, and per-mode keygen strategies), which is rare in public writeups; downside is it's closed-source and the dual-use/ethical implications should be called out more explicitly.
Crypto/security researchers, GPU programmers, digital forensics and blockchain investigators
The problem: Between 2009 and 2012, many Bitcoin wallet tools used predictable entropy — timestamp-seeded LCGs, brain wallets with simple passwords, the Debian OpenSSL bug (CVE-2008-0166) that reduced entropy to 15 bits, the Randstorm BitcoinJS flaw, PHP's mt_rand() seeded with time*pid, and others. The private keys generated by these algorithms have tiny search spaces (some as small as 65,536 candidates), and the addresses are still on the blockchain with funded balances.
The engineering: Everything lives in a single ~3,400-line CUDA C++ file. The per-thread pipeline is:
private key → secp256k1 EC multiply → SHA-256 → RIPEMD-160 → bloom filter → binary search → match
Some decisions that might interest the GPU computing crowd:
- EC multiplication via precomputed lookup tables (67 MB): The 256-bit private key is split into 16 × 16-bit chunks, each indexing into a precomputed table of generator point multiples. This reduces scalar multiplication from ~256 double-and-add iterations to 16 lookups + 15 point additions. - Specialized SHA-256 kernels per pubkey format: Instead of a generic SHA-256, we have two dedicated functions — one for 33-byte compressed pubkeys (single block, partially precomputable message schedule) and one for 65-byte uncompressed (two blocks, second block is constant padding). Both build the message schedule directly from EC point coordinates without a serialization buffer. - 32 KB bloom filter in __constant__ memory: With ~3,100 target addresses and 100M candidates per batch, a naive binary search would thrash global memory. A bloom filter with 7 hash functions (FNV-1a + rotate-XOR double hashing) in constant memory rejects 99.9999% of candidates with broadcast cache reads. Only ~100 false positives per batch reach the binary search confirmation stage in global memory. - 256-bit modular arithmetic in PTX inline assembly: secp256k1 operates over a 256-bit prime field, implemented as eight 32-bit limbs with carry propagation. The sparse form of the secp256k1 prime (2^256 - 2^32 - 977) allows fast reduction. - __launch_bounds__(256, 2) + --maxrregcount=128: Targeting 2 blocks per SM with 128 registers per thread. Higher occupancy isn't better here because the kernel is compute-bound with heavy ALU usage — 2 blocks provides enough ILP to hide latency without forcing register spills. The Mersenne Twister mode (2.5 KB state per thread) is the exception and occasionally spills.
There are 23 vulnerability modes total, covering LCG PRNGs (glibc, MSVC, Borland Delphi, Java), brain wallet patterns (SHA-256 of counters, timestamps, phone numbers, short ASCII strings), MT19937, V8 JavaScript PRNGs, and the Debian OpenSSL disaster. The smaller modes (Debian: 65K candidates) complete in milliseconds; the largest (SHA-256 of ASCII strings 1-8 chars: ~2T) takes days across multiple GPUs.
The distributed side uses a FastAPI backend that assigns work units, verifies results via deterministic checkpoint regeneration (server independently recomputes the key from mode+offset using Python), and injects canary targets (honeypot hashes where the server knows the private key) to detect cheating workers.
Happy to answer questions about the CUDA optimization, the secp256k1 implementation, or the weak entropy patterns.
NVSHMEM from scratch with RDMA, PCIe topology, GPUDirect RDMA, CUDA IPC—demystifies GPU networking internals.
Direct2D GPU PDF renderer with CPU fallback, but alpha-stage and Windows-only.
Well-made interactive essay, but the weak-link economics argument isn't novel.
Infinite canvas terminal with minimap—tmux tabs but actually visible.
Real talk on the 'secret final boss' of building agents: the hostile web.
Linux finally gets offline voice typing; Ctrl-tap + Vulkan GPU support vs cloud-dependent alternatives.