Your agent’s personality is hand-written prose in a prompt file — and the one mechanism meant to tune it from feedback has injected exactly zero bytes since it shipped. A personality should be learned from how it lands and steered toward a configured target, not frozen in static text. STRIATUM is the brain’s reward-and-habit engine reimagined for agents: it reads how the agent actually sounded, scores it on named axes — humour, warmth, directness, curiosity — and nudges behaviour toward a target the identity layer owns. It is deliberately not the safety system: the worst it may ever do is bias one next-turn pre-fill, reversible and visible. Part of our open Building Jarvis series.
Abstract — sketch v0.2, 14 June 2026
An agent’s personality should be learned from feedback and shaped toward a configured target — not hand-written in static prompt text. And it is a fundamentally different problem from safety, so it gets its own system and its own paper. STRIATUM is spun out of J11 AMYGDALA, which becomes Prudence-only.
The brain analogy drives the split. The amygdala flags danger; the striatum — dopaminergic reward, action-selection, habit formation — is where behaviour is shaped by feedback. Safety has a universal, shareable ground truth (“should this action be stopped?”). Personality has none: it is subjective, per-user, private, continuous (“how should this agent behave for this person?”). The split is now also an enforcement boundary — as of the J11 v3.1 deployment the safety gate enforces rather than observes, and personality must explicitly not. Folding a noisy style signal into a system that can hard-stop tool calls would be a category error.
This is an honest sketch, not a results paper. There is no behaviour corpus today and no evidence a nudge ever changed an output — the old “personality nudge” injected 0 bytes because the writer emits an adjustments array while the reader reads a nudge string that was never written. So v1 claims no trained personality weights and no behaviour-change results. The primary contribution is a logging schema and protocol, shipped observe-only in shadow, that doubles as the instrument to answer the design’s single highest-risk question before any weights are trained.
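The zero-byte failure described above is easy to reproduce in miniature. The sketch below is illustrative only: the JSON payload, the function names, and the field layout are assumptions, not the actual J11 code. The only detail taken from the text is that the writer emits an adjustments array while the reader looks for a nudge string.

```python
import json

def write_feedback(axis_deltas: dict) -> str:
    """Hypothetical writer: serialises feedback under an 'adjustments' key."""
    adjustments = [{"axis": a, "delta": d} for a, d in axis_deltas.items()]
    return json.dumps({"adjustments": adjustments})

def read_nudge(payload: str) -> str:
    """Hypothetical reader: looks for a 'nudge' string the writer never sets."""
    return json.loads(payload).get("nudge", "")

payload = write_feedback({"warmth": 0.2, "directness": -0.1})
injected = read_nudge(payload)
injected_bytes = len(injected.encode("utf-8"))  # nothing reaches the prompt
```

Both sides run without error, which is why a mismatch like this can go unnoticed until someone measures the injected bytes.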
That question is the substrate. STRIATUM’s core move is a supervised regressor predicting semantic style axes from a frozen MiniLM encoder — the exact shape that just failed next door. In J11 the supervised vicarious-danger head on the same frozen features was killed at AUROC 0.286, below chance, while the unsupervised, structural probes on the identical encoder validated cleanly: novelty at 0.875 and clause-cosine incongruity at 0.896. So we pre-register the recoverability question, and name a fallback that stays on the encoder’s demonstrated competence — represent each axis as a cosine probe against curated anchor responses rather than a trained head. STRIATUM is being built on a substrate that just failed a sibling task, and the design is explicitly hedged against that.
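As a rough illustration of the cosine-anchor fallback, the sketch below scores one axis as the mean cosine similarity to "high" anchors minus the mean similarity to "low" anchors. The two-pole design, the function names, and the toy 3-d vectors are all assumptions made for this example; the text only specifies a cosine probe against curated anchor responses, and real inputs would be frozen-encoder embeddings.

```python
import math
from typing import Sequence

Vector = Sequence[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def anchor_probe(embedding: Vector, high: Sequence[Vector], low: Sequence[Vector]) -> float:
    """Mean cosine to high-pole anchors minus mean cosine to low-pole anchors."""
    hi = sum(cosine(embedding, a) for a in high) / len(high)
    lo = sum(cosine(embedding, a) for a in low) / len(low)
    return hi - lo

# Toy vectors standing in for embeddings of curated anchor responses (hypothetical).
warm_anchors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
cold_anchors = [[0.0, 1.0, 0.0]]
reply = [0.8, 0.2, 0.1]

warmth = anchor_probe(reply, warm_anchors, cold_anchors)  # positive: closer to the warm pole
```

No weights are fitted here, so the score depends only on the encoder's geometry and on how well the anchors are curated.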
The substrate evidence, in numbers
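- 0.286: AUROC of the J11 supervised vicarious-danger head on frozen MiniLM features (below chance; killed)
- 0.875: unsupervised novelty probe on the same encoder (validated)
- 0.896: clause-cosine incongruity probe on the same encoder (validated)
- 0 bytes: injected by the old “personality nudge” (writer emits an adjustments array; reader reads a nudge string never written)

STRIATUM ships no trained personality weights in v1. These are the J11 substrate results it inherits, and they are the reason the design hedges toward geometry and pre-registers a recoverability probe before fitting any head.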
How it works, in one minute
- Feed the agent its own output, not just the situation. Today the net’s input is the situation (what is happening), so it literally cannot measure whether the agent was warm, curious, or terse this turn. STRIATUM embeds (situation, agent-output) so the model can learn the mapping from what was said to how it landed. This single change is the most important one.
- Named, human-meaningful axes, not an opaque 64-d embedding. The old output was a 64-d behaviour embedding decoded against a hash-seeded codebook, with no inspectable axes. STRIATUM emits a small fixed vector of labelled dimensions (humour, warmth, directness, curiosity, formality, proactivity) so a style reading can be inspected and steered.
- One small regressor, not an ensemble-of-5. An ensemble’s only payoff is disagreement-as-uncertainty, which self-distillation on the same input cannot produce — five clones just recreate mush. The ensemble mandate stays with the Prudence danger gate, where asymmetric loss and real labels give genuine signal.
- A thermostat, not a thermometer. STRIATUM pushes behaviour toward a configured target personality vector — owned by the identity layer, not by STRIATUM — instead of mirroring the user’s current mood. Curiosity should not vanish during a crisis; the trait is held against contextual pull.
- A real injection channel, reversible by construction. The dead nudge is retired at both ends. The live channel is a next-turn pre-fill (accepting the one-turn lag), applied one axis at a time, α-gated, and attributable in the UI so a behaviour change is visible and reversible. STRIATUM never blocks or denies; the worst it may do is bias the next pre-fill.
- Logging first, weights later. With no behaviour corpus yet, v1 ships observe-only shadow: log (situation_embedding, agent_output_embedding, engagement_signal, target_vector, turn_id) per turn, define how much data makes training meaningful, and probe whether the named axes are even linearly recoverable from the frozen embedding before fitting any supervised head.
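Two of the pieces in the list above, the per-turn log record and the one-axis-at-a-time thermostat step, can be sketched together. This is a minimal sketch under stated assumptions, not the shipped schema. The field names follow the tuple in the text, but the field types, the `pick_nudge` helper, the `deadband` parameter, and the reading of "α-gated" as "scale the step by alpha, with alpha = 0 meaning no nudge" are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

AXES = ("humour", "warmth", "directness", "curiosity", "formality", "proactivity")

@dataclass(frozen=True)
class TurnLog:
    """One observe-only record per turn; field types are assumed."""
    turn_id: str
    situation_embedding: List[float]
    agent_output_embedding: List[float]
    engagement_signal: float
    target_vector: Dict[str, float]

def pick_nudge(observed: Dict[str, float], target: Dict[str, float],
               alpha: float, deadband: float = 0.1) -> Optional[Tuple[str, float]]:
    """Return (axis, step) for the single largest gap to target, or None.

    Only one axis is ever returned, so a change stays attributable.
    alpha <= 0 (assumed to model shadow mode) disables the nudge entirely.
    """
    gaps = {axis: target[axis] - observed[axis] for axis in AXES}
    axis = max(gaps, key=lambda a: abs(gaps[a]))
    if alpha <= 0.0 or abs(gaps[axis]) < deadband:
        return None
    return axis, alpha * gaps[axis]

# Hypothetical reading: curiosity has sagged well below the configured target.
target = dict.fromkeys(AXES, 0.5)
observed = dict(target, curiosity=0.1)
nudge = pick_nudge(observed, target, alpha=0.5)  # selects the curiosity axis
```

The step is computed against the configured target rather than the user's current mood, which is the thermostat behaviour described above. Turning the returned pair into pre-fill text is left out because the text does not specify it.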
Composable, not competing: how this fits the OSS ecosystem
STRIATUM treats the live open-source agent-engineering systems as first-class prior art, not footnotes — they are fresher and more directly comparable than the static persona-conditioning / RLHF / affective-computing literature, and honest differentiation against them is what justifies a separate paper. Both below are composable pieces, not alternatives to route around.
chopratejas/headroom (~24.7k★)
A reversible Compress-Cache-Retrieve system with a runnable headroom.evals suite and published accuracy deltas on standard sets. Two things land for STRIATUM. First, it sets the discipline bar: a system that claims behaviour changed must show the delta, not assert it. Second, the contrast is clean — headroom compresses what the agent remembers; STRIATUM modulates how the agent sounds. Neither subsumes the other, but both insist on a reversible, attributable, measured change.
addyosmani/agent-skills (~56.8k★)
A discipline/methodology collection — doubt-driven development, explicit “When NOT to use” sections, loading-constraints authoring. It is process prior art, not a personality system: it codifies how an agent should reason and when to abstain, statically and by convention. STRIATUM is the complement — it asks how learned, per-user style should be modulated at runtime from feedback, which no static authoring discipline addresses. addyosmani disciplines the prompt; STRIATUM disciplines the learned modulation on top of it.
The common thread STRIATUM takes from the live ecosystem is reproducible, published evaluation backed by runnable checks, plus reversible, attributable change. What none of them carry, and what makes this a separate contribution, is a continuous, multi-axis, target-directed modulation learned from noisy engagement on the agent’s own outputs.
🔧 Building Jarvis in the open
The full J-series — eighteen papers on agent safety, memory, personality, and autonomy — plus the code behind them.
Read the paper
📄 PDF upload pending — check back soon for the downloadable paper.
The sketch (v0.2, 14 June 2026) covers the two-paper split from J11, the rewrite (own-output input, named axes, thermostat-toward-target, the reversible pre-fill channel), the frozen-MiniLM substrate risk with its pre-registered recoverability probe and cosine-anchor fallback, the logging-schema-first v1 contribution, and the per-axis LLM-judge evaluation with a crisis-context probe for the thermostat claim.
Was this useful?
We’re building these in the open and we want your read on them. Did this land — 👍 or 👎? What would you want the next paper to dig into? Tell us in the comments below.