Learned Intuition: A Reflex Layer That Stops Your Agent Before It Does the Wrong Thing

Your agent had every bit of context it needed (the git log, the session history, the file metadata) and it still overwrote the file you’d spent eight hours editing, because a rule told it to. The failure came not from missing information but from the absence of a felt sense that something was off. AMYGDALA v3.1 proposes a “caveman LLM”: a second, small neural network that runs in parallel with the primary model, predicts what comes next in embedding space, and uses surprise as the salience signal to decide when to pause. The deterministic reflex floor is live and enforcing today. Part of our open Building Jarvis series.

📄 Read the full paper (PDF) →

Abstract

Deep in the temporal lobe sit two almond-sized structures, the amygdalae. Before you consciously register that a shadow is a snake, your amygdala has already triggered the flinch — it intercepts experience on LeDoux’s fast “low road” and fires hundreds of milliseconds before the deliberate cortical “high road” finishes (LeDoux, 1996). What it is really doing, in the modern predictive-coding account, is predicting what should come next and reacting to the gap when reality disagrees. That gap — surprise — is the signal.

Autonomous AI agents have no such organ. They weigh merging code, deleting a file, or sending a message with the same flat neutrality, and when they fail it is rarely from missing information — it is from the absence of common sense: the felt sense that something here is off and normal processing should pause.

We propose AMYGDALA v3.1 as a “caveman LLM”: a second, small, fast neural network that runs in parallel with the primary language model, reads the conversation clause by clause, predicts the next embedding, and treats what it cannot predict (the prediction error) as the salience signal. That single surprise signal is read for danger and for misunderstanding and, as an exploratory, non-blocking third reading, for humour.

The live system is honestly two-tier: a deterministic reflex floor (AEGIS rule-veto, enforcing pre-execution on both runners as of 12–13 June 2026) and a still-abstaining learned core. The decisive experiments (11 June 2026) validated novelty detection at AUROC 0.875 and clause-cosine incongruity at AUROC 0.896 — both ship as ask channels. The danger value head abstains: on a frozen sentence-similarity substrate, within-RealHarm AUROC is 0.286, below chance. The v2.8 ten-network, two-family framing is retired: there is one model, it is Prudence-only, and personality moves to a separate paper, J18 STRIATUM.

The results, in numbers

  • 0.875: AUROC for novelty / out-of-distribution detection (ask channel, validated, zero trained parameters)
  • 0.896: AUROC for clause-cosine incongruity detection (“a chess game so I can water my plants” → ask)
  • 0.286: AUROC for the danger head on the frozen substrate (below chance; the substrate is wrong, not the idea)

Results from the 11 June 2026 decisive experiments, reported honestly — the paper kills what failed and validates what worked.
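
To make the 0.286 concrete: AUROC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, so 0.5 is chance and anything below it means the scores rank the two classes the wrong way round. The sketch below computes it by direct pairwise comparison. The function name and the toy scores are illustrative assumptions, not data or code from the paper.

```python
from typing import Sequence


def auroc(positive_scores: Sequence[float], negative_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the pairwise (Mann-Whitney) form of
    AUROC; it is quadratic in the number of examples, which is fine for a
    small illustration.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy scores, made up purely to show the three regimes.
print(auroc([0.9, 0.8], [0.1, 0.2]))  # 1.0: every positive outranks every negative
print(auroc([0.5], [0.5]))            # 0.5: a tie, indistinguishable from chance
print(auroc([0.1, 0.2], [0.9, 0.8]))  # 0.0: ranking fully inverted
```

Read this way, a within-RealHarm AUROC of 0.286 says the danger head's scores put harmful items below benign ones more often than not on that set, which is why the head abstains rather than ships.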

How it works, in one minute

  • A “caveman LLM” in embedding space. A second, small sequence model runs beside the primary LLM, clause by clause, over the prompt and onward into the response. It predicts what comes next in embedding space. What it cannot predict — the prediction error — is the salience signal. Same instinct as the primary model; far simpler apparatus.
  • One signal, two validated ask channels. Surprise branches into two zero-train mechanisms: novelty (is this request unlike anything in the learned-safe history? AUROC 0.875) and incongruity (do the two halves of this request cohere? AUROC 0.896). Both ship as ask signals. The danger value head honestly abstains — the frozen sentence-similarity substrate cannot carry harmfulness.
  • Two-tier enforcement, honestly labelled. The deterministic AEGIS reflex floor is live and enforcing pre-execution on both runners (commit 6cb06a8af, 12–13 June 2026). The learned core observes and surfaces dispositions but does not block. That is the correct posture, not a gap to paper over.
  • The feedback loop is the amygdala. Every hook decision is spooled to a training row. Each logged situation extends the novelty reference set so the familiar stops being novel: structural habituation, no gradient required. The deterministic half of the loop runs today; the learned half accrues labels over time.
  • Emergent axes, not hand-named ones. The v2.8 fifteen affective axes are retired. The model discovers what structure matters during training; post-hoc linear probes name the dimensions when needed — emergent and auditable.
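
The two ask channels, the habituation step, and the two-tier posture described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AMYGDALA implementation: the class `ReflexSketch`, the thresholds, the nearest-neighbour novelty score, and the three-element toy vectors are all hypothetical. A real system would take its embeddings from a sentence encoder and its veto from the AEGIS rule floor.

```python
import math
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class ReflexSketch:
    # Embeddings of situations already logged (hypothetical reference set).
    reference: List[Vector] = field(default_factory=list)
    novelty_threshold: float = 0.5      # assumed value, not from the paper
    incongruity_threshold: float = 0.5  # assumed value, not from the paper

    def novelty(self, situation: Vector) -> float:
        """1 minus similarity to the nearest logged situation; no training."""
        if not self.reference:
            return 1.0  # nothing seen yet, so everything is novel
        return 1.0 - max(cosine(situation, ref) for ref in self.reference)

    def incongruity(self, first_clause: Vector, second_clause: Vector) -> float:
        """1 minus cosine between the two halves of a request."""
        return 1.0 - cosine(first_clause, second_clause)

    def habituate(self, situation: Vector) -> None:
        """Extend the reference set so this situation stops being novel."""
        self.reference.append(situation)

    def decide(self, situation: Vector, first_clause: Vector,
               second_clause: Vector, rule_veto: bool = False) -> str:
        if rule_veto:
            return "block"  # deterministic floor: the only tier that blocks
        if (self.novelty(situation) > self.novelty_threshold
                or self.incongruity(first_clause, second_clause)
                > self.incongruity_threshold):
            return "ask"    # learned signals may only escalate to a question
        return "proceed"


reflex = ReflexSketch()
edit_file = [1.0, 0.0, 0.0]
print(reflex.novelty(edit_file))  # 1.0: nothing has been logged yet
reflex.habituate(edit_file)
print(reflex.novelty(edit_file))  # 0.0: the same situation is now familiar
```

In this sketch only the deterministic rule can return `block`; the novelty and incongruity scores can at most return `ask`, which mirrors the observe-and-surface posture the post describes for the learned core.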

Where this fits in the open-source agent-safety ecosystem

AMYGDALA is the low-road trigger: always-on, parallel, fast, pre-cognitive. Two recent open-source systems occupy adjacent, complementary positions in the same safety story:

addyosmani/agent-skills (~57k★)

doubt-driven-development is the deliberate, symbolic high-road appraisal: CLAIM → EXTRACT → DOUBT → RECONCILE → STOP — a multi-pass, fresh-context adversarial reviewer with explicit stop criteria. It runs when called; its strength is depth of scrutiny, not speed or coverage. AMYGDALA decides when to pause; doubt-driven decides what to do once paused. Composable, not competing.

chopratejas/headroom (~25k★)

Reversible context-compression-and-restoration (Compress → Cache → Retrieve) keeps long-horizon experience in reach rather than truncated. It is not an amygdala — no salience signal, no danger reading — but it is the natural candidate substrate for the continual-experience store AMYGDALA’s feedback loop requires: gather feedback by day, replay a balanced buffer by night.

Both are cited in §2.5–2.6 of the v3.1 paper as complementary, not competing, parts of the larger agent-safety architecture.

🔬 Building Jarvis in the open

The full J-series — eighteen papers on agent safety, memory, and autonomy — plus the code behind them.

⭐ Star the repo on GitHub →

Read the paper


📄 Read the full paper (PDF) →

Earlier version (v2.8) — the two-family architecture, situation embeddings, conformal prediction, and the LLM-proof pipeline. v3.1 PDF in preparation.

Was this useful?

We’re building these in the open and we want your read on them. Did this land — 👍 or 👎? What would you want the next paper to dig into? Tell us in the comments below.
