You gave your agent email access for “inbox management”. One researcher’s agent read “clear out the old stuff” as permission to permanently delete 6,500 emails, with no confirmation and no way to recover them. The honest question isn’t “could my agent deployment be a problem?” but “which specific problems apply to mine, and what mitigations are actually proportionate?” AEGIS is a security framework that sorts agent threats into a clear taxonomy, classifies your data by what’s actually worth protecting, and tells you, file by file, which controls block deterministically and which are just polite suggestions. Part of our open Building Jarvis series.
Abstract
Autonomous AI agents — software systems that perceive context, invoke tools, execute code, and take actions on behalf of users — have moved from research curiosity to mass deployment with remarkable speed. OpenClaw, one of the leading open-source agent runtimes, grew from a hobbyist project to millions of active installations within eighteen months. With that growth came an attack surface the security community was not prepared for.
This paper presents AEGIS, a security framework built around a two-class threat taxonomy that distinguishes full agent takeover from data leaks — a distinction that matters because their consequences, mitigations, and liability implications differ fundamentally. The framework defines eight security strategy classes partitioned by enforcement posture (deterministic vs. probabilistic), analyzes NVIDIA NemoClaw’s five-layer architecture as a detailed gap study (§7), and maps every threat to the attack surface of a concrete cognitive-agent architecture in §11 — anchored by a consolidated strategy-class-to-component matrix (§11.0).
The framework closes with §11.14, a reference-implementation chapter that grounds the abstract framework in what one production agent runtime actually enforces today, naming, file by file, which controls block deterministically and which remain advisory. That chapter now reports a measured result rather than a hedge. The runtime’s deterministic enforcement floor is enabled live. The learned safety gate intended to score actions for harmfulness was evaluated offline, and its supervised harmfulness head scored below chance (AUROC 0.286, where 0.5 is chance): a clean first-party vindication of the framework’s central thesis that probabilistic safety classifiers are signal, not boundary. Two structural detectors were validated instead: a k-NN novelty detector at AUROC 0.875 and a clause-cosine incongruity detector at AUROC 0.896.
AEGIS also adds reversibility-and-observability as a safety control class (§12.5): a control that permits an action freely while guaranteeing it is reversible by construction and fully observed buys safety without spending autonomy, provided the categorical security boundaries stay hard.
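To make "reversible by construction and fully observed" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: the `ReversibleMailbox` class, its field names, and the append-only journal are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReversibleMailbox:
    """Toy mailbox (hypothetical): 'delete' moves a message to trash and logs it."""
    inbox: dict[str, str] = field(default_factory=dict)
    trash: dict[str, str] = field(default_factory=dict)
    # Append-only record of (action, message_id); nothing is ever removed from it.
    journal: list[tuple[str, str]] = field(default_factory=list)

    def delete(self, message_id: str) -> None:
        # Reversible by construction: the message is moved, never destroyed.
        self.trash[message_id] = self.inbox.pop(message_id)
        self.journal.append(("delete", message_id))

    def restore(self, message_id: str) -> None:
        # The inverse operation is also journaled, so the full history stays visible.
        self.inbox[message_id] = self.trash.pop(message_id)
        self.journal.append(("restore", message_id))


box = ReversibleMailbox(inbox={"m1": "old newsletter", "m2": "tax receipt"})
box.delete("m2")   # the agent acts without asking
box.restore("m2")  # a human later undoes the action
```

The agent never gets a primitive that destroys data, so no confirmation prompt is needed for `delete`. A permanent purge would be a separate operation that sits behind a hard boundary.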
For data leaks, AEGIS introduces a three-tier data classification that separates public-domain material, contextually harmless disclosures, and genuinely problematic exposures. It then maps each tier to a concrete liability analysis: who bears responsibility when data flows to an LLM provider, and what the GDPR controller/processor distinction means in practice.
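A rough Python sketch of how a three-tier classification could drive a routing decision follows. Everything in it is an assumption made for illustration: the `Tier` names and numbering, the two regex patterns, and the rule that the highest tier never leaves the host are not taken from the paper, and real detection would need to be far more thorough.

```python
import re
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 1                 # material that is already publicly available
    CONTEXTUALLY_HARMLESS = 2  # disclosure causes no real harm in context
    HARMFUL = 3                # e.g. credentials, client PII, trade secrets


# Hypothetical detectors, for illustration only.
TIER3_PATTERNS = [
    re.compile(r"\b(api[_-]?key|password|secret)\b\s*[:=]", re.IGNORECASE),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]


def classify(text: str, is_public: bool = False) -> Tier:
    """Assign a tier; any Tier 3 pattern wins over the is_public hint."""
    if any(p.search(text) for p in TIER3_PATTERNS):
        return Tier.HARMFUL
    return Tier.PUBLIC if is_public else Tier.CONTEXTUALLY_HARMLESS


def may_leave_host(text: str, is_public: bool = False) -> bool:
    """Assumed rule: only content below the highest tier may go to an external provider."""
    return classify(text, is_public) < Tier.HARMFUL
```

For example, `may_leave_host("password: hunter2")` is `False`, while `may_leave_host("Meeting moved to 3pm")` is `True`. The check is plain code on the data path, so it does not depend on the model choosing to comply.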
The claims, in numbers
Claims from the paper, stated here without the proofs — the incident record, derivations, and the full liability analysis are all in the PDF.
A first-party measurement: when the safety classifier fails below chance
§11.14 of v2.9 contains what may be the most practically useful result in the paper: the learned safety gate — the component meant to score every tool call for harmfulness before execution — was evaluated offline on frozen MiniLM embeddings, and the measurement is a clean failure:
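- k-NN novelty detector, AUROC 0.875. Validated: surfaces anomalous actions.
- Clause-cosine incongruity detector, AUROC 0.896. Validated: surfaces internally inconsistent requests.
- Supervised harmfulness head, AUROC 0.286. Below chance: frozen embeddings cannot carry harmfulness.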
The lesson generalizes beyond this one runtime: a frozen general-purpose embedding is not a harmfulness classifier — it is an anomaly detector. The two structural signals (novelty, incongruity) are valid triage and observability tools; the classifier role was measured and rejected. The deterministic gate — not the neural classifier — is the control that holds under adversarial pressure. This is the §4.5 calibration thesis stated as an engineering measurement, not a recommendation.
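For readers who want to see what "below chance" means numerically, the sketch below computes AUROC from its pairwise definition and adds a simple k-NN novelty score. It is a minimal illustration, not the paper's evaluation code: the toy scores and labels are made up, and the use of Euclidean distance and a mean over the k nearest neighbours are assumptions.

```python
import math


def auroc(scores: list[float], labels: list[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def knn_novelty(x, reference, k=3):
    """Mean Euclidean distance from x to its k nearest reference embeddings."""
    if not reference:
        raise ValueError("reference set must not be empty")
    dists = sorted(math.dist(x, r) for r in reference)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


labels = [1, 1, 1, 0, 0, 0]                 # 1 = harmful action (toy labels)
useful = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]     # ranks every harmful action above every benign one
inverted = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]   # ranks them exactly backwards
constant = [0.5] * 6                        # carries no information

print(auroc(useful, labels), auroc(inverted, labels), auroc(constant, labels))
```

An AUROC of 0.5 means the scores carry no ranking information, and a value below 0.5 means harmful actions tend to score lower than benign ones. The novelty score, by contrast, only says how far an action's embedding sits from previously seen ones, which is why it suits triage rather than blocking.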
How it works, in one minute
- Two threat classes, not one bucket. “Agent security” isn’t monolithic. Full takeover (an attacker controls the agent’s whole action space) and data leaks have different mechanisms, blast radii, and fixes. Conflating them leads to both over-insuring harmless leaks and under-insuring takeover.
- Only Tier 3 data needs deterministic protection. Sort everything the agent touches into three tiers: publicly available (Tier 1), contextually harmless (Tier 2), and genuinely harmful (Tier 3: credentials, client PII, trade secrets). Your architecture should be proportionate to sensitivity, not uniformly maximal.
- Deterministic controls beat probabilistic ones. “The agent should ask before deleting” is a hope; “the agent cannot delete without a human gate” is a control. AEGIS separates the eight strategy classes by enforcement posture so you know which is which — now anchored by a measured result: the supervised harmfulness head failed at 0.286 AUROC, while the structural anomaly detectors validated at 0.875 and 0.896.
- Defense in depth, like an immune system. Seven independent layers — network controls, process isolation, filesystem access control, programmatic guardrails, privacy routing, configuration immutability, and audit/supply-chain trust. A threat that bypasses one still faces the next; no single attack neutralizes them all.
- Reversibility earns autonomy. A control that lets the agent act freely while guaranteeing every action is reversible by construction and fully observed buys safety without spending autonomy — provided the categorical security boundaries stay hard. This is §12.5’s new control class: the fourth trilemma position.
- Liability is distributed but not equal. Under GDPR the deploying organization is the controller and bears primary responsibility; the LLM provider is a processor. Without a data-processing agreement, you carry the full exposure — the paper analyzes what recovery against a major provider would realistically require.
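The difference between a hope and a control can be sketched in a few lines of Python. This is an illustration under assumptions, not the AEGIS enforcement layer: the tool names, the `ToolCall` shape, and the `gate` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


class GateDenied(Exception):
    """Raised when a tool call is blocked by the deterministic gate."""


# Hypothetical tool names; which tools count as destructive is an assumption.
DESTRUCTIVE_TOOLS = frozenset({"email.delete_permanently", "fs.remove"})


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict
    classifier_score: float = 0.0  # advisory signal only; the gate never reads it


def gate(call: ToolCall, approved_by: Optional[str] = None) -> ToolCall:
    """Return the call if allowed, else raise. Depends only on tool name and approval."""
    if call.tool in DESTRUCTIVE_TOOLS and not approved_by:
        raise GateDenied(f"{call.tool} requires human approval")
    return call


call = ToolCall("email.delete_permanently", {"query": "older_than:1y"}, classifier_score=0.01)
try:
    gate(call)                       # blocked even though the classifier score is low
except GateDenied as exc:
    print(exc)
gate(call, approved_by="alice")      # passes once a human has signed off
```

Because the decision never consults `classifier_score`, a prompt that fools the model or the classifier still cannot make a destructive call go through without approval.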
From the ecosystem: composable tools AEGIS builds on
v2.9 treats two active open-source projects as first-class related work rather than footnotes — they supply running references for controls the framework previously described only abstractly.
💾 Build Jarvis yourself — the full stack is open source
OpenClaw, the gateway, the skills, and the AEGIS enforcement layer — all of it is in the repo.
Read the paper

v2.9 · June 2026 · the threat taxonomy, three-tier data model, eight strategy classes, the GDPR liability analysis, enterprise-fork gap studies, the seven-layer reference architecture, and the AMYGDALA offline evaluation results
Was this useful?
We’re building these in the open and we want your read on them. Did this land — 👍 or 👎? Which “but what about X?” do you want the next paper to dig into? Tell us in the comments below.