Budget Prompting: Cutting the Cost of Always-On Memory Agents 2–3×

Leave your agent running overnight and the bill is brutal: every single turn re-bills the entire context window, so a silent heartbeat costs as much as a hard question. An always-on agent can burn $50–200/day in API calls — or blow through a flat-rate plan’s rate limits before lunch. Budget Prompting is a set of 20 interlocking techniques — cheap-model routing, cache-stable memory, output pruning, billing arbitrage — that cut per-turn cost 2–3× in production with no loss of intelligence or personality we could measure. Part of our open Building Jarvis series.

📄 Read the full paper (PDF) →

Abstract

Running a persistent memory agent around the clock is expensive. Every conversational turn reprocesses the entire accumulated context — system prompt, injected memories, conversation history, tool outputs — as fresh input tokens. For an agent with a 150K-token context window, each turn costs as much as a full document analysis, regardless of whether the user asked a complex question or the agent is responding to a silent heartbeat. At Anthropic’s current pricing, a single always-on agent can consume $50–200/day in API costs alone, or exhaust a flat-rate subscription’s rate limits within hours.
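
To make the arithmetic concrete, here is a rough sketch of how per-turn input cost turns into a daily bill. Only the 150K-token context comes from the text above; the price per million tokens and the 200 turns per day are illustrative assumptions, not quoted rates or figures from the paper.

```python
# Assumed, illustrative price in USD per million input tokens.
# Substitute your provider's actual rate; this is not a quoted price.
ASSUMED_INPUT_PRICE_PER_MTOK = 3.00


def turn_input_cost(context_tokens: int,
                    price_per_mtok: float = ASSUMED_INPUT_PRICE_PER_MTOK) -> float:
    """Input cost of one turn that reprocesses the whole context uncached."""
    return context_tokens / 1_000_000 * price_per_mtok


def daily_input_cost(context_tokens: int, turns_per_day: int,
                     price_per_mtok: float = ASSUMED_INPUT_PRICE_PER_MTOK) -> float:
    """Daily input bill when every turn, heartbeat or not, pays full price."""
    return turn_input_cost(context_tokens, price_per_mtok) * turns_per_day


if __name__ == "__main__":
    per_turn = turn_input_cost(150_000)
    per_day = daily_input_cost(150_000, turns_per_day=200)
    print(f"per turn: ${per_turn:.2f}, per day: ${per_day:.2f}")
```

Under these assumed numbers a turn costs about $0.45 in input tokens and 200 turns cost about $90, which falls inside the $50–200/day range quoted above. Output tokens are not counted here and would add to the total.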

This paper presents Budget Prompting — a family of 20 techniques that together reduce the per-turn cost of persistent memory agents by 2–3× in production today (3–5× projected once the remaining designed techniques ship), with no degradation in agent intelligence or personality coherence that we have been able to observe. We organize them into five categories: model routing (sending cheap work to cheap models), context economics (minimizing what enters the context window), output economics (minimizing what the model generates), temporal amortization (spreading expensive operations across many turns), and billing arbitrage (exploiting pricing-structure discontinuities).

Thirteen of these techniques are implemented and battle-tested in TinkerClaw, a fork of the OpenClaw agent framework running a personal assistant (Jarvis) 24/7 since January 2026. Seven are designed but not yet shipped, informed by analysis of the Hermes Agent architecture and our own continuous compaction research. We report real cost data, cache hit rates, and failure modes from three months of production operation.

We also draw on the March 2026 Claude Code source exposure — Anthropic shipped a 59.8 MB source map in @anthropic-ai/[email protected], exposing ~512,000 lines of TypeScript — examining the relevant compaction and budget code directly. The exposure is external validation: Anthropic’s production agent independently arrived at the same three-tier compaction architecture MYELIN proposes, and ships several techniques (cache-pinned micro-compaction, forked-agent cache sharing, diminishing-returns budget tracking) that we had theorized but not confirmed at scale.

The biological metaphor behind the MYELIN name is deliberate: myelin sheaths insulate nerve axons, enabling signals to propagate faster while consuming less energy. Budget Prompting insulates agent cognition from the cost of its own context, enabling persistent operation that would otherwise be financially unsustainable.

The claims, in numbers

2–3×
lower per-turn cost in production today (3–5× projected once the rest ships)
90%
cheaper repeated prefixes when the prompt cache stays warm (cached input costs one-tenth as much)
$50–200
per day an unoptimized always-on agent can burn in API cost alone

Claims from the paper, stated here without the proofs — they’re in the PDF.

How it works, in one minute

  • Route cheap work to cheap models. A heartbeat, a “good morning”, a weather check doesn’t need your flagship model or 150K tokens of context. Classify the turn first, then pick the smallest brain that can do the job.
  • Keep the prefix stable so the cache stays warm. Cached input tokens cost a tenth as much as uncached ones, but any change to the prompt prefix invalidates the cache, and a memory agent rewrites its own memory constantly. The fix is to treat injected memory as a frozen snapshot per session rather than a live file, so normal operation stops destroying its own savings.
  • Spend less on output. Output tokens cost 3–5× more than input tokens. Trimming runaway chain-of-thought and verbose replies is the cheapest win there is: a model that thinks for 5,000 tokens to answer in 50 has spent 99% of its output budget on invisible work.
  • Amortize the expensive stuff. Push compaction, summaries, and bulk analysis onto batch APIs (50% off, separate rate-limit pool) and spread them across many turns instead of paying for them all at once.
  • Degrade gracefully under pressure. A budget cascade — green → yellow → orange → red → critical — sheds background work first and protects interactive responsiveness, so the agent slows down instead of going dark.
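
Three of the ideas above (routing, the frozen memory snapshot, and the budget cascade) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not TinkerClaw's implementation: the model labels, turn kinds, class and function names, and cascade thresholds are all hypothetical, and the SHA-256 digest is only a stand-in for how a provider might key its prompt cache.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical labels; real model names and turn kinds will differ.
CHEAP_MODEL = "small-model"
FLAGSHIP_MODEL = "flagship-model"
TRIVIAL_KINDS = {"heartbeat", "greeting", "weather"}


def route(turn_kind: str) -> str:
    """Pick the smallest model assumed able to handle this kind of turn."""
    return CHEAP_MODEL if turn_kind in TRIVIAL_KINDS else FLAGSHIP_MODEL


@dataclass(frozen=True)
class Session:
    """Holds memory as it was at session start, so the prefix never changes."""
    system_prompt: str
    memory_snapshot: str  # copied once; later memory writes do not touch it

    def prefix(self) -> str:
        return self.system_prompt + "\n" + self.memory_snapshot

    def prefix_key(self) -> str:
        # Stand-in for a cache key: identical bytes give an identical key.
        return hashlib.sha256(self.prefix().encode("utf-8")).hexdigest()


def budget_level(fraction_used: float) -> str:
    """Map budget consumption to a cascade level (thresholds are assumptions)."""
    cascade = [(0.50, "green"), (0.70, "yellow"), (0.85, "orange"), (0.95, "red")]
    for limit, level in cascade:
        if fraction_used < limit:
            return level
    return "critical"


def allow_background_work(level: str) -> bool:
    """Shed background work first; interactive turns are not gated here."""
    return level in ("green", "yellow")


if __name__ == "__main__":
    live_memory = ["user prefers metric units"]
    session = Session("You are Jarvis.", "\n".join(live_memory))
    key_at_start = session.prefix_key()
    live_memory.append("user is travelling on Friday")  # memory changes mid-session
    print(session.prefix_key() == key_at_start)  # prefix is unaffected by the write
    print(route("heartbeat"), budget_level(0.9), allow_background_work("orange"))
```

The design point is that the live memory store can keep changing while the session's prefix stays byte-identical; new memories would only enter the prompt when the next session takes a fresh snapshot.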

Where this sits in the ecosystem

The paper positions Budget Prompting on a clear gradient with deployed open-source work that already addresses the layers around it. These tools are composable rungs, not competitors:

chopratejas/headroom ~24.7k ★

Reversible Compress‑Cache‑Retrieve: originals are cached out-of-band and retrieved on demand, so the live context window carries a compressed surrogate while full content stays one pointer away. Reports 60–95% token savings at near-zero accuracy delta on GSM8K, TruthfulQA, SQuAD, and BFCL in a runnable eval suite.

Convergent with Budget Prompting’s surgical tool-output pruning (B6/B6a). Where MYELIN adds the billing economics — prompt-cache stability, flat-rate pressure cascade, dual-account staggering — headroom adds benchmark-grade quality evidence on the compression slice. They address different walls; a typed CCR compressor slotted beneath MYELIN’s caching and billing machinery would be stronger than either alone.

addyosmani/agent‑skills ~56.8k ★

A discipline and skills plugin that ships a slim index pointing at heavier per-topic skill bodies loaded only when matched — rather than concatenating every skill into the prompt at once. The clearest external evidence that load-on-demand catalog packaging is converging as standard practice, independent of any cost argument.

Convergent with Budget Prompting’s on-demand catalog retrieval (B3): replacing an always-loaded skill catalog (~915 tokens) with a thin router (~130 tokens) is exactly the same load-on-demand discipline. The cost-accounting view and the architecture view land in the same place.
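
As a rough sketch of what load-on-demand looks like in practice, the snippet below keeps a thin keyword index in the prompt path and pulls a heavier skill body only when a message matches. The skill names, trigger words, and bodies are invented for illustration, and the keyword match is a deliberately naive stand-in for whatever router a real system uses; only the 915 and 130 token figures come from the text above.

```python
# Thin index: skill name -> trigger keywords. All entries are made up.
SKILL_INDEX: dict[str, tuple[str, ...]] = {
    "calendar": ("meeting", "schedule", "remind"),
    "email": ("inbox", "reply", "email"),
}

# Heavy bodies live outside the prompt; a dict stands in for files on disk.
SKILL_BODIES: dict[str, str] = {
    "calendar": "Full calendar skill instructions...",
    "email": "Full email skill instructions...",
}


def matched_skills(user_message: str) -> list[str]:
    """Return the skills whose trigger words appear in the message."""
    words = set(user_message.lower().split())
    return [name for name, triggers in SKILL_INDEX.items() if words & set(triggers)]


def build_skill_context(user_message: str) -> str:
    """Load only the matched skill bodies; empty string if nothing matches."""
    return "\n\n".join(SKILL_BODIES[name] for name in matched_skills(user_message))


# Token figures quoted in the text above.
ALWAYS_LOADED_TOKENS = 915
ROUTER_TOKENS = 130
saved_per_unmatched_turn = ALWAYS_LOADED_TOKENS - ROUTER_TOKENS

if __name__ == "__main__":
    print(matched_skills("Please schedule a meeting"))
    print(repr(build_skill_context("hello")))
    print(saved_per_unmatched_turn)
```

On a turn where no skill matches, the prompt carries 785 fewer tokens than the always-loaded catalog would. Turns that do match pay for the matched body on top of the router, so the net saving depends on how often skills fire.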

The code is open

The gateway, the budget cascade, the cache-boundary sentinel — it’s all here. Star it, fork it, break it.

⭐ globalcaos/tinkerclaw on GitHub →

Read the paper



📄 Read the full paper (PDF) →

33 pages · the full taxonomy of 20 techniques, real production cost data, cache mechanics, and composition analysis

Was this useful?

We’re building these in the open and we want your read on them. Did this land — 👍 or 👎? What would you want the next paper to dig into? Tell us in the comments below.
