Sleep Consolidation: How Nightly Prompting Makes a Stateless Agent Get Better Over Time

Your AI agent wakes up every morning remembering nothing — same weights, same architecture, same context window — yet over weeks it can quietly get better at its job. The trick isn’t fine-tuning or a bigger model. It’s a nightly reflection loop that reviews the day’s mistakes and rewrites the agent’s own operating rules before it goes quiet. We call it Sleep Consolidation, and in 30 days of production it cut tracked incidents by 79% for about a dollar a night. Part of our open Building Jarvis series.

📄 Read the full paper (PDF) →

Abstract

Large language model agents are stateless by nature — every session begins at zero. Existing approaches to persistence usually rely on fine-tuning, retrieval-augmented generation, or larger context windows. We present a different mechanism observed over 30 days of production operation: structured nightly prompt cycles that produce compounding behavioral improvement without weight updates, fine-tuning, or architectural changes. A personal AI assistant running 12–13 autonomous cron jobs exhibited a decline in tracked incidents from 14 (weeks 1–2) to 3 (weeks 3–4) across six error categories — a 79% reduction overall, with five of the original categories reaching zero recurrence while a new, higher-order error class emerged. The model did not change. The files it read did.

The central insight is blunt: scattered memories are almost useless. Raw accumulation does not create intelligence. The value comes from sorting by use-case so retrieval is cheap, relevant, and actionable. Memories organized by purpose rather than chronology let the system pull the right rule at the right moment.

We identify three core mechanisms: failure-driven prompt mutation (operational errors trigger targeted prompt refinements — 14 documented, all traced to specific incidents); fractal depth calibration (allocating metacognitive effort in proportion to task significance); and cross-cron knowledge transfer (lessons in one task propagate to all others through shared memory files). Total reflection overhead: ~43,000 tokens per night ($1.17), against an estimated 8 avoided human-intervention incidents over 30 days.

The claims, in numbers

79%

fewer tracked incidents in 30 days (14 → 3; five error classes hit zero recurrence)

14

prompt mutations documented (each traced to a specific real incident)

$1.17

per night of reflection (~43K tokens) vs ~8 avoided human interventions

Claims from the paper, stated here without the proofs — they’re in the PDF.
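As a quick sanity check, the headline figures can be recomputed from the numbers quoted in this post. The sketch below is arithmetic only. It assumes the nightly cost stays at $1.17 for all 30 nights, which the post does not state explicitly, and the cost per avoided intervention is derived here rather than reported in the paper.

```python
# Figures quoted in this post.
incidents_before = 14       # tracked incidents, weeks 1-2
incidents_after = 3         # tracked incidents, weeks 3-4
cost_per_night_usd = 1.17   # ~43,000 reflection tokens per night
avoided_interventions = 8   # the paper's estimate over 30 days

# Assumption (not from the paper): the nightly cost is constant for 30 nights.
nights = 30

reduction_pct = (incidents_before - incidents_after) / incidents_before * 100
total_cost_usd = cost_per_night_usd * nights
cost_per_avoided_usd = total_cost_usd / avoided_interventions

print(f"incident reduction: {reduction_pct:.0f}%")
print(f"reflection cost over {nights} nights: ${total_cost_usd:.2f}")
print(f"cost per avoided intervention: ${cost_per_avoided_usd:.2f}")
```

Under that assumption, the 30-day reflection bill comes to about $35, or a little over $4 per avoided human intervention.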

How it works, in one minute

The improvement lives outside the model. The weights never change. A nightly cron reads the day’s work, finds the failures, and rewrites the prompt files the agent reads every morning — so a fresh instance behaves as if it always knew yesterday’s lessons.
Failures drive prompt mutations. Every operational error gets root-caused, and if it’s preventable through a prompt change, the relevant operating rule is tightened so that exact failure can’t recur. Each change is traced back to the incident that caused it.
Memory is sorted by use-case, not by time. Chronological logs are nearly useless in the moment of action. The system consolidates lessons into purpose-built files (messaging rules, ambiguity handling, redaction policy) so the right constraint is cheap to retrieve exactly when it’s needed.
Effort scales with stakes. A “fractal depth” heuristic lets the agent spend more metacognitive effort on significant tasks and less on trivial ones — and that heuristic is itself refined over time.
Lessons cross-pollinate. A fix learned in one autonomous task propagates to all the others through shared memory, so the whole fleet inherits each correction (5 documented cross-domain transfers).
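The points above can be sketched as a small nightly job. This is a minimal illustration, not the paper's algorithm (that is in the PDF and may differ). Every name here is hypothetical: the `Incident` record, the `consolidate`, `load_rules`, and `reflection_depth` functions, the one-file-per-use-case layout, and the `(source: ...)` trace format. In a real system a model call would produce the root cause and the proposed rule; here they are plain fields.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass(frozen=True)
class Incident:
    """One failure found in the day's work (hypothetical record shape)."""
    incident_id: str
    use_case: str               # e.g. "messaging", "ambiguity", "redaction"
    root_cause: str
    preventable_by_prompt: bool
    proposed_rule: str


def consolidate(incidents: list[Incident], memory_dir: Path) -> list[str]:
    """Append one traced rule per preventable incident to its use-case file."""
    mutations = []
    memory_dir.mkdir(parents=True, exist_ok=True)
    for inc in incidents:
        if not inc.preventable_by_prompt:
            continue  # a prompt change would not have helped; leave rules alone
        rule_file = memory_dir / f"{inc.use_case}.md"
        line = f"- {inc.proposed_rule} (source: {inc.incident_id})\n"
        existing = rule_file.read_text() if rule_file.exists() else ""
        if line not in existing:  # re-running the job adds nothing new
            rule_file.write_text(existing + line)
            mutations.append(line.strip())
    return mutations


def load_rules(use_case: str, memory_dir: Path) -> str:
    """What a fresh morning instance would read for one kind of task."""
    rule_file = memory_dir / f"{use_case}.md"
    return rule_file.read_text() if rule_file.exists() else ""


def reflection_depth(significance: float, max_passes: int = 3) -> int:
    """Toy stand-in for depth calibration: more passes for higher stakes."""
    clamped = min(max(significance, 0.0), 1.0)
    return 1 + round(clamped * (max_passes - 1))


if __name__ == "__main__":
    # Made-up incidents, for illustration only.
    with tempfile.TemporaryDirectory() as tmp:
        memory = Path(tmp)
        today = [
            Incident("inc-001", "messaging", "sent a draft without confirmation",
                     True, "Confirm the recipient before sending any message"),
            Incident("inc-002", "messaging", "upstream API outage", False, ""),
        ]
        print(consolidate(today, memory))
        print(load_rules("messaging", memory))
```

Because every job would read from the same `memory_dir`, a rule written after one task's failure is visible to all the others the next morning, which is the cross-task transfer described above in its simplest form. Keying files by use case rather than by date is what keeps the morning read cheap: a messaging task loads only `messaging.md`.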

In the ecosystem

Two open-source projects are building complementary pieces of the same puzzle — framed in the paper as convergent, not competing:

chopratejas/headroom

Implements Compress-Cache-Retrieve (CCR) — a reversible, lossless-under-recall eviction discipline for within-session tool output. Sleep Consolidation applies the same CCR contract across sessions to consolidated memory. Different timescale, same principle: compression is always reversible by construction, so no archived lesson ever becomes permanently unreachable.
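To make the "reversible by construction" contract concrete, here is a toy sketch. It is not headroom's API and not the paper's implementation; `ReversibleArchive` and the `[archived:<key>]` stub format are hypothetical names chosen for illustration. The idea shown is only that the short form kept in working memory always carries a key that recovers the exact original.

```python
import hashlib
import zlib


class ReversibleArchive:
    """Toy compress-cache-retrieve store: evicted text stays recoverable by key."""

    def __init__(self) -> None:
        self._cache: dict[str, bytes] = {}

    def compress(self, text: str, summary: str) -> str:
        """Archive the full text and return a short stub to keep in memory."""
        raw = text.encode("utf-8")
        key = hashlib.sha256(raw).hexdigest()[:12]
        self._cache[key] = zlib.compress(raw)
        return f"{summary} [archived:{key}]"

    def retrieve(self, stub: str) -> str:
        """Recover the exact original text from a stub made by compress()."""
        key = stub.rsplit("[archived:", 1)[1].rstrip("]")
        return zlib.decompress(self._cache[key]).decode("utf-8")


if __name__ == "__main__":
    archive = ReversibleArchive()
    lesson = "Full incident write-up with every detail of what went wrong."
    stub = archive.compress(lesson, summary="Confirm recipient before sending")
    print(stub)
    print(archive.retrieve(stub) == lesson)
```

The summary is lossy, but the archive is not, so a consolidated lesson can be shortened aggressively without the underlying detail becoming unreachable.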

addyosmani/agent-skills

Ships doubt-driven development — a CLAIM → EXTRACT → DOUBT → RECONCILE → STOP loop where a fresh-context adversarial reviewer challenges each artifact without the reasoning that produced it. The paper adapts this to prompt and recipe mutations across the nightly consolidation boundary: the model instance that wrote a mutation can’t be trusted to doubt it, so a reasoning-blind reviewer does. Two authoring conventions (“When NOT to use” + “Loading Constraints”) are imported as design Principles 8 and 9.
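A rough sketch of how a reasoning-blind review of a prompt mutation might be wired up follows. It is not the agent-skills implementation or the paper's, and the reading of each step in the comments is an assumption made here. `Mutation`, `review_mutation`, `toy_doubt`, and `toy_reconcile` are hypothetical names; in practice the reviewer and reviser would be separate model calls, which the toy functions only stand in for.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Mutation:
    claim: str       # CLAIM: what the rule change is supposed to achieve
    rule_text: str   # the artifact under review
    reasoning: str   # the author's reasoning; never shown to the reviewer


Reviewer = Callable[[str, str], list[str]]       # (claim, rule_text) -> objections
Reviser = Callable[[Mutation, list[str]], Mutation]


def review_mutation(mutation: Mutation, doubt: Reviewer, reconcile: Reviser,
                    max_rounds: int = 3) -> tuple[Mutation, bool]:
    """Return the (possibly revised) mutation and whether it was accepted."""
    for _ in range(max_rounds):
        # EXTRACT: pass only the claim and the artifact, not the reasoning.
        objections = doubt(mutation.claim, mutation.rule_text)  # DOUBT
        if not objections:
            return mutation, True   # STOP: nothing left to doubt
        mutation = reconcile(mutation, objections)              # RECONCILE
    return mutation, False          # STOP: round budget exhausted


def toy_doubt(claim: str, rule_text: str) -> list[str]:
    """Stand-in reviewer: objects to rules that state no exception."""
    return [] if "unless" in rule_text else ["rule has no stated exception"]


def toy_reconcile(m: Mutation, objections: list[str]) -> Mutation:
    """Stand-in reviser: adds an exception clause to the rule."""
    return Mutation(m.claim, m.rule_text + " unless the user opts out", m.reasoning)


if __name__ == "__main__":
    draft = Mutation(
        claim="Prevent unconfirmed sends",
        rule_text="Always confirm the recipient",
        reasoning="(author's private notes)",
    )
    final, accepted = review_mutation(draft, toy_doubt, toy_reconcile)
    print(accepted, final.rule_text)
```

The design point is in the `Reviewer` signature: it has no parameter through which the author's reasoning could reach the reviewer, so the blindness is structural rather than a matter of discipline.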

The Sleep Consolidation implementation is part of Building Jarvis, an open series on persistent agent memory. Follow the work and contribute at github.com/globalcaos/tinkerclaw.

Read the paper


📄 Read the full paper (PDF) →

26 pages · the full framework, the mutation taxonomy, 30 days of production data, the nightly reflection algorithm, and the procedural-memory extensions

Was this useful?

We’re building these in the open and we want your read on them. Did this land — 👍 or 👎? What would you want the next paper to dig into? Tell us in the comments below.
