GitHub Repository

A language model that forms persistent memories from conversation and maintains them through sleep. MEMIT weight editing + null-space-constrained maintenance.

64 stars · Python

Sleeping LLM – A language model that remembers by sleeping

Author: vbaranov87

by vbaranov87 · Feb 26, 2026 · 2 points · 0 comments


AI Analysis

●●● Banger · Wizardry · Big Brain · Zero to One

Direct weight editing for persistent memory—MEMIT meets LoRA consolidation with null-space math.

Strengths

• Rigorous neuroscience grounding (complementary learning systems theory, not just RAG heuristics)
• Tackles a hard constraint: identifies a sharp phase transition at 13-14 facts on the 8B model, not gradual decay
• Novel insight on the alignment tax (RLHF suppresses injected knowledge), with measured inverse scaling across model sizes

Weaknesses

• Severe capacity ceiling (30 facts on 70B requires 4 sleep cycles; real-world workflows often need 100+)
• Requires H100s for competitive models; MacBook M3 maxes out at 15 facts with 0.60 recall

Post Description

I built a system that gives LLMs persistent memory from conversations — not through RAG or databases, but by editing the model's actual weights. The knowledge lives in the parameters. The context window is empty.

During wake, facts from conversation are injected directly into MLP weights via MEMIT (a single forward pass, instant recall). During sleep, the system audits which memories degraded, refreshes them with null-space constraints (guaranteeing orthogonality to working memories), then progressively transfers knowledge into LoRA — like biological memory consolidation from hippocampus to neocortex.
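To make the null-space idea concrete, here is a minimal pure-Python sketch of the general technique: a weight update is projected so that it has no effect on a set of protected key vectors. It is an illustration under stated assumptions, not the project's code; the names `orthonormal_basis`, `project_update`, and `apply` are hypothetical, and the toy 3-dimensional vectors stand in for real MLP key activations.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(keys: List[Vector], tol: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis for the span of the protected keys."""
    basis: List[Vector] = []
    for key in keys:
        residual = list(key)
        for q in basis:
            coeff = dot(residual, q)
            residual = [r - coeff * qi for r, qi in zip(residual, q)]
        norm = dot(residual, residual) ** 0.5
        if norm > tol:  # skip keys already inside the span
            basis.append([r / norm for r in residual])
    return basis


def project_update(delta_w: Matrix, protected_keys: List[Vector]) -> Matrix:
    """Remove from each row of delta_w its component along the protected keys.

    The result satisfies projected @ k == 0 for every protected key k, so
    adding it to the weights leaves the outputs for those keys unchanged.
    """
    basis = orthonormal_basis(protected_keys)
    projected: Matrix = []
    for row in delta_w:
        residual = list(row)
        for q in basis:
            coeff = dot(residual, q)
            residual = [r - coeff * qi for r, qi in zip(residual, q)]
        projected.append(residual)
    return projected


def apply(matrix: Matrix, vec: Vector) -> Vector:
    """Matrix-vector product."""
    return [dot(row, vec) for row in matrix]


# Toy example (made-up numbers): two protected keys, one new key.
delta_w = [[1.0, 2.0, 3.0], [0.0, 1.0, -1.0]]
protected = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
safe_delta = project_update(delta_w, protected)
```

In this toy case the projected update maps both protected keys to zero while still acting on a key outside their span, which is the property the null-space constraint is meant to provide.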

The key problem was a hard capacity ceiling: the 8B model sustains 0.92 recall up to 13 facts, then crashes to 0.57 at fact 14 — a sharp phase transition, not gradual decay. And LoRA consolidation was blocked by what I call the "alignment tax": RLHF training fights back against injected knowledge (37% recall loss on 8B from a single LoRA pass).

The fix: per-fact graduated consolidation. Each fact independently tracks its own stage and advances only when LoRA proves it absorbed that specific fact. A dissolution schedule (1.0 → 0.5 → 0.1 → 0.0) gradually removes the MEMIT edit as LoRA takes over. And cumulative fusing — training each cycle on the already-fused model — reduces the alignment tax from catastrophic to negligible (starting loss drops 2.91 → 0.62 by cycle 2).
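The per-fact bookkeeping described above could look roughly like the following sketch. Only the schedule values (1.0, 0.5, 0.1, 0.0) come from the post; `FactState`, `sleep_cycle`, and the `lora_recalls` callback are hypothetical names, and the recall check is a stand-in for whatever test the real system runs against the LoRA-only model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Scale applied to a fact's MEMIT edit at each stage (values from the post).
DISSOLUTION_SCHEDULE = (1.0, 0.5, 0.1, 0.0)


@dataclass
class FactState:
    text: str
    stage: int = 0  # index into DISSOLUTION_SCHEDULE

    @property
    def memit_scale(self) -> float:
        return DISSOLUTION_SCHEDULE[self.stage]

    @property
    def consolidated(self) -> bool:
        # Final stage: the MEMIT edit is fully dissolved.
        return self.stage == len(DISSOLUTION_SCHEDULE) - 1


def sleep_cycle(facts: List[FactState],
                lora_recalls: Callable[[str], bool]) -> List[str]:
    """Advance each unconsolidated fact by one stage, but only if the
    (assumed) LoRA recall check passes for that specific fact."""
    advanced = []
    for fact in facts:
        if not fact.consolidated and lora_recalls(fact.text):
            fact.stage += 1
            advanced.append(fact.text)
    return advanced


# Illustrative run: the second fact is invented for this example, and the
# recall check is a stub in which one fact is never absorbed.
facts = [FactState("Aria lives in Portland"), FactState("Bo works in Oslo")]
stub_check = lambda text: text != "Aria lives in Portland"
for _ in range(3):
    sleep_cycle(facts, stub_check)
```

Because each fact carries its own stage, a fact that fails the check keeps its full-strength edit while others continue to dissolve.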

Results on Llama 3.1 8B (4-bit, 2×H100):
- 100% advancement rate at 5/10/15/20 facts
- 1.00 chat recall at all scales
- MEMIT edits dissolve on schedule, making the buffer renewable
- Effective lifetime capacity: unbounded
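The "renewable buffer" claim can be illustrated with a toy simulation. This is a back-of-the-envelope sketch, not a measurement: it assumes a fixed number of edit slots (13 is borrowed from the 8B ceiling mentioned earlier), that every active fact advances one stage per sleep cycle (the 100% advancement case), and that a fact frees its slot after three advances (the three transitions in the dissolution schedule). The function name `simulate_buffer` is hypothetical.

```python
from typing import Tuple


def simulate_buffer(total_facts: int, buffer_slots: int,
                    cycles_to_dissolve: int) -> Tuple[int, int]:
    """Return (sleep cycles needed, peak number of simultaneous edits)."""
    pending = total_facts
    active = []          # remaining cycles for each fact held as an edit
    consolidated = 0
    cycles = 0
    peak = 0
    while consolidated < total_facts:
        # Wake: inject new facts while free slots remain.
        while pending > 0 and len(active) < buffer_slots:
            active.append(cycles_to_dissolve)
            pending -= 1
        peak = max(peak, len(active))
        # Sleep: every active fact advances one stage (assumption).
        active = [remaining - 1 for remaining in active]
        consolidated += sum(1 for remaining in active if remaining == 0)
        active = [remaining for remaining in active if remaining > 0]
        cycles += 1
    return cycles, peak


cycles, peak = simulate_buffer(total_facts=100, buffer_slots=13,
                               cycles_to_dissolve=3)
```

Under these assumptions the number of simultaneous edits never exceeds the slot count, while the total number of facts processed grows with the number of sleep cycles.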

There's also a biological curiosity: individual facts consolidate at different rates. One synthetic fact ("Aria lives in Portland") is consistently the hardest across every run. Some memories are just harder to absorb, same as in biological systems.

6 papers documenting the full journey from initial LoRA prototype to this result: https://doi.org/10.5281/zenodo.18779159

Built with: Python, PyTorch, PEFT, BitsAndBytes, Llama 3.1. Runs on MacBook Air (3B) or H100 (8B/70B).