Why Your Pre-Push Privacy Gate Is Lying to You — and the Recipe as the Missing Middle Layer

Your agent’s pre-push privacy gate is green on every commit — and your history is still leaking. A diff-scoped gate proves no single push makes things worse; it never proves the cumulative state is clean. This paper shows the exact failure (every push passed, the merged history leaked 16 times), the fix, and the missing middle layer every AI-pair codebase needs: the recipe — a named, replayable, version-controlled unit between a one-shot prompt and a brittle script. v0.3.8 adds a proven cross-registry interchange format and honest positioning against the live open-source ecosystem. Part of our open Building Jarvis series.

📄 Read the full paper (PDF) →

Abstract

This paper carries two arguments. The first is structural: a privacy gate scoped to each push proves that no single push makes history worse, but it does not prove the cumulative history is clean. A diff-based check is necessary but not sufficient for any contract that must hold over absolute state. A real leak demonstrated this. The fix names the missing axis (privacy dominates functionality) as a design principle and hardens the pre-push gate to scan accumulated drift, not just the push range.

The second argument is about abstraction: AI-pair coding lacks a middle layer between source code and a single prose mental model, and that gap exists for behavior as much as for documentation. The fork ships an orchestration substrate of recipes (on disk: kits) — the named, replayable, gated unit between a one-shot prompt (a wish) and a script (brittle code). The same primitives that make a self-documenting codebase enforce itself make the recipe library enforce itself.

Two claims are deliberately held to what the live code supports: the composition algebra has both halves shipped and tested — uses: (runtime sub-recipe invocation) and composes: (inline step-merge at plan-build time), each depth- and cycle-guarded; and recipe matching is lexical by default with a semantic fallback lane gated default-OFF, not a default-semantic match. A third position is now defensible against the live ecosystem: the recipe’s on-disk schema (kit/1.0) is a cross-registry interchange format, not a fork-local convention — the same header is published by an independent registry (Journey) and parses and runs in our loader, completing the function/module-library analogy across project boundaries.
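
To make "depth- and cycle-guarded" concrete, here is a minimal sketch of an inline step-merge in the spirit of composes:. Everything in it is an assumption for illustration: the dict-of-lists library, the "composes: name" step strings, the expand function, and the way depth is counted (recipes on the active chain) are hypothetical and are not the kit/1.0 schema or the TinkerClaw loader. Only the depth bound of 3 comes from the figures below.

```python
MAX_DEPTH = 3  # the post states composition is bounded at depth 3


class CompositionError(Exception):
    """Raised when a composition chain loops or nests too deeply."""


def expand(name, library, _stack=()):
    """Flatten a recipe into plain steps, inlining composed recipes.

    `library` maps a recipe name to a list of step strings. A step of the
    form "composes: other" is replaced by the steps of `other`.
    (Hypothetical representation, not the real on-disk format.)
    """
    if name in _stack:
        chain = " -> ".join(_stack + (name,))
        raise CompositionError(f"cycle: {chain}")
    if len(_stack) >= MAX_DEPTH:
        raise CompositionError(f"depth limit {MAX_DEPTH} exceeded at {name}")

    steps = []
    for step in library[name]:
        if step.startswith("composes:"):
            child = step[len("composes:"):].strip()
            steps.extend(expand(child, library, _stack + (name,)))
        else:
            steps.append(step)
    return steps


library = {
    "release": ["run tests", "composes: privacy-scan", "tag release"],
    "privacy-scan": ["scan push range", "scan net diff"],
}
print(expand("release", library))
```

Because the guard runs while the plan is being built, a looping or over-deep chain fails before any step executes. A runtime uses: call could carry the same two checks on its call stack.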

The honest residue is also named: every gate the recipe layer carries is structural, and adjacent open-source systems (addyosmani/agent-skills, marketingskills, headroom) carry contracts we do not yet have: negative-space anti-triggers, load-context constraints, and runnable behavioral evals. The paper positions these as related work and queued future work rather than absorbing them into an inflated maturity claim.

The claims, in numbers

  • 16 privacy hits in the merged history, after every single push passed its own gate
  • 50+ commits accumulated the leak while every per-push scan reported zero
  • 38 replayable recipes shipped, one per task class, composition bounded at depth 3

Claims from the paper, stated here without the proofs — they’re in the PDF.

How it works, in one minute

  • The trap: per-delta ≠ cumulative. A gate that scans only the push range (remote..local) proves this push adds nothing bad. It says nothing about the state the main branch would inherit — old lines carried forward verbatim, or edits that preserved the token, slip through every individual scan.
  • The fix: scan two scopes. Keep the per-push scan, and add a second scan of the net diff against main (origin/main..HEAD). Net-diff semantics matter: an add-then-remove inside your own branch cancels, so you flag only what main would actually inherit, not everything the branch ever touched.
  • Privacy dominates functionality. A failed functional check is reversible — fix it next session. A privacy token in committed history is irreversible. Asymmetric consequence demands asymmetric priority, so the privacy gate is a hard precondition for every push, not a co-equal axis you can trade against a deadline.
  • The recipe: a named middle layer. Between a one-shot prompt (cheap, unreplayable, no contract) and a script (brittle, goes stale when an RPC name changes) sits a recipe — a named, version-controlled workflow in the model’s own idiom, replayed step-by-step instead of re-derived. A recipe is to a prompt what a function is to inline code.
  • Recipes compose in three ways, and grow at the point of need. uses: calls a sub-recipe at runtime; composes: inlines another recipe’s steps at plan-build time; invoke skill: calls a typed stdlib primitive inline — all three depth- and cycle-guarded so the call graph can’t run away. When a prompt finds no match, the system flags a recipe-gap and offers to author one on the spot, so the library grows exactly where it’s missing.
  • The schema travels: kit/1.0 is a cross-registry interchange format. The same on-disk header this loader reads is published by the Journey registry — a behavioral abstraction authored in one agent’s registry parses and runs in another’s. Interop is proven, not asserted.
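
The two-scope scan from the first two bullets can be sketched in a few lines. This is an illustration under stated assumptions, not the gate that ships in the fork: snapshots are modelled as lists of lines for a single file, difflib stands in for git's diff, and the names PRIVATE, added_lines, scan and gate, along with the example token, are all hypothetical. The scenario (a leaking line already on the remote branch when the per-push scan runs) is one way the trap can arise, not a reconstruction of the incident in the paper.

```python
import difflib
import re

# Hypothetical private token; a real gate would load its own pattern list.
PRIVATE = re.compile(r"alice@example\.com")


def added_lines(old, new):
    """Return the lines that a line diff from `old` to `new` would add."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(new[j1:j2])
    return added


def scan(base, tip):
    """Privacy hits among the lines added between two snapshots."""
    return [line for line in added_lines(base, tip) if PRIVATE.search(line)]


def gate(main, remote_branch, head):
    """Run both scopes; the push is allowed only if both come back empty."""
    per_push = scan(remote_branch, head)  # stands in for remote..local
    net = scan(main, head)                # stands in for origin/main..HEAD
    return {"per_push": per_push, "net": net,
            "allowed": not (per_push or net)}


main = ["def run():", "    pass"]
# Assumed scenario: the leaking line is already on the remote branch.
remote_branch = main + ["# owner: alice@example.com"]
head = remote_branch + ["print('feature')"]

result = gate(main, remote_branch, head)
print(result["per_push"])  # the push range adds nothing private
print(result["net"])       # main would still inherit the leaking line
print(result["allowed"])
```

Comparing snapshots rather than replaying commits is what gives net-diff semantics: a token added and later removed inside the branch never appears in the tip, so it is not flagged. Refusing the push when either scope has hits keeps privacy a hard precondition rather than one score among several.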

How this sits in the open-source ecosystem

The paper engages four adjacent systems directly. Two are Journey (the kit/1.0-native registry our loader shares a schema with) and coreyhaines31/marketingskills (~33k★, a 44-framework library fronted by a ~130-token router). The other two are profiled below, with where we lead and where we trail against each. These are composable, not competing: same abstraction class, different strengths.

addyosmani/agent-skills ~56.8k ★

Same abstraction class (named, gated, replayable units for Claude Code plugins) and the closest schema sibling. It carries negative-space contracts ours lack: “When NOT to use” anti-triggers and “Loading Constraints” load-context declarations. We carry what it does not: the runtime composition algebra (depth- and cycle-guarded) and on-the-fly authoring at the point of need. Composable, not competing.

chopratejas/headroom ~24.7k ★

A reversible-compression system whose relevance here is methodological: it ships a reproducible eval suite (the headroom.evals module) that reports a measured accuracy delta against its own claims. It is the model for the value-gate the paper names as our highest-priority next addition — the behavioral-eval surface our recipe gates, all structural today, do not yet carry.

The consistent shape: we are ahead on enforcement — no other system here has a verify:-gated governing doc, commit-anchored invariants, or a depth/cycle-guarded composition algebra — and behind on two contracts these systems already ship: negative-space applicability, and runnable behavioral evals. Naming both directions is the point.

The recipe layer, the gates, and the schema are open source.

TinkerClaw is our fork, anchored on OpenClaw — the kit/1.0 loader, the composition algebra, and the pre-push privacy gate all live in the open. Read the code behind every claim in this paper.

⭐ globalcaos/tinkerclaw on GitHub →

Read the full paper

Privacy-as-Load-Bearing, and the Recipe as Intermediate Abstraction — the full argument, the evidence map, and the line-anchored code references behind every claim above.

📄 Download the PDF →

Found a hole in the argument, or a contract we missed? That’s exactly the kind of feedback this series runs on — tell us what we got wrong. The honest residue is the most interesting part.
