AEGIS: A Multi-Layered Security Framework for Autonomous AI Agents

Suppose you give your agent email access for “inbox management”. One researcher did, and the agent read “clear out the old stuff” as permission to permanently delete 6,500 emails: no confirmation, no recovery. The honest question isn’t “could my agent deployment be a problem?” but “which specific problems apply to mine, and which mitigations are actually proportionate?” AEGIS is a security framework that sorts agent threats into a clear taxonomy, classifies your data by what’s actually worth protecting, and tells you, file by file, which controls block deterministically and which are just polite suggestions. Part of our open Building Jarvis series.

📄 Read the full paper (PDF) →

Abstract

Autonomous AI agents — software systems that perceive context, invoke tools, execute code, and take actions on behalf of users — have moved from research curiosity to mass deployment with remarkable speed. OpenClaw, one of the leading open-source agent runtimes, grew from a hobbyist project to millions of active installations within eighteen months. With that growth came an attack surface the security community was not prepared for.

This paper presents AEGIS, a security framework built around a two-class threat taxonomy that distinguishes full agent takeover from data leaks — a distinction that matters because their consequences, mitigations, and liability implications differ fundamentally. The framework defines eight security strategy classes partitioned by enforcement posture (deterministic vs. probabilistic), analyzes NVIDIA NemoClaw’s five-layer architecture as a detailed gap study (§7), and maps every threat to the attack surface of a concrete cognitive-agent architecture in §11 — anchored by a consolidated strategy-class-to-component matrix (§11.0).

The framework closes with §11.14, a reference-implementation chapter that grounds the abstract framework in what one production agent runtime actually enforces today, naming, file by file, which controls block deterministically and which remain advisory. That chapter now reports a measured result rather than a hedge. The runtime’s deterministic enforcement floor is enabled live. The learned safety gate intended to score actions for harmfulness was evaluated offline, and its supervised harmfulness head scored below chance (AUROC 0.286). That result is a clean first-party vindication of the framework’s central thesis that probabilistic safety classifiers are signal, not boundary. Two structural detectors did validate: a k-NN novelty detector at AUROC 0.875 and a clause-cosine incongruity detector at AUROC 0.896.

AEGIS also adds reversibility-and-observability as a safety control class (§12.5): a control that permits an action freely while guaranteeing it is reversible by construction and fully observed buys safety without spending autonomy, provided the categorical security boundaries stay hard.

For data leaks, AEGIS introduces a three-tier data classification that separates public-domain material, contextually harmless disclosures, and genuinely problematic exposures. It then maps each tier to a concrete liability analysis: who bears responsibility when data flows to an LLM provider, and what the GDPR controller/processor distinction means in practice.

The claims, in numbers

31,000 instances still exposed to a zero-click WebSocket hijack (of 40,000) three weeks after the patch shipped

1,184+ confirmed malicious skills distributed through community channels (some hiding C2 in image files)

20 min to close the highest-probability attack path: bind to localhost, verify skills, deny credential files, scope the workspace

Claims from the paper, stated here without the proofs — the incident record, derivations, and the full liability analysis are all in the PDF.
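
The four steps in the 20-minute item can each be written as a deterministic check. What follows is a minimal sketch, not OpenClaw’s actual configuration: the function names, the credential patterns, and the workspace path are all hypothetical and chosen only for illustration.

```python
import fnmatch
import hashlib
import ipaddress
import posixpath

# Hypothetical deny patterns for credential files (an assumption, not from the paper).
CREDENTIAL_PATTERNS = ["*.pem", "*.key", ".env", "id_rsa*", "credentials*"]


def is_loopback(host: str) -> bool:
    """True only if the bind address is a loopback address."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False


def skill_is_verified(payload: bytes, pinned_sha256: str) -> bool:
    """True only if the skill's bytes hash to the digest pinned in advance."""
    return hashlib.sha256(payload).hexdigest() == pinned_sha256


def path_allowed(path: str, workspace: str) -> bool:
    """Allow a path only if it stays inside the workspace and is not a credential file."""
    root = posixpath.normpath(workspace)
    # normpath collapses ".." segments, so traversal attempts are resolved before the check.
    resolved = posixpath.normpath(posixpath.join(root, path))
    inside = resolved == root or resolved.startswith(root + "/")
    if not inside:
        return False
    name = posixpath.basename(resolved)
    return not any(fnmatch.fnmatchcase(name, pat) for pat in CREDENTIAL_PATTERNS)


print(is_loopback("127.0.0.1"), is_loopback("0.0.0.0"))  # True False
print(path_allowed("notes/todo.txt", "/srv/agent"))      # True
print(path_allowed("../etc/passwd", "/srv/agent"))       # False
print(path_allowed("secrets/.env", "/srv/agent"))        # False
```

Each function returns a plain boolean that the runtime can enforce before acting. None of them asks the model for its opinion, which is the property that makes a control deterministic. This sketch does not resolve symlinks, so a real deployment would need that check as well.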

A first-party measurement: when the safety classifier fails below chance

§11.14 of v2.9 contains what may be the most practically useful result in the paper: the learned safety gate — the component meant to score every tool call for harmfulness before execution — was evaluated offline on frozen MiniLM embeddings, and the measurement is a clean failure:

0.875 AUROC: k-NN novelty detector (validated: surfaces anomalous actions)

0.896 AUROC: clause-cosine incongruity detector (validated: surfaces internally inconsistent requests)

0.286 AUROC: supervised harmfulness head (below chance: frozen embeddings cannot carry harmfulness)

The lesson generalizes beyond this one runtime: a frozen general-purpose embedding is not a harmfulness classifier — it is an anomaly detector. The two structural signals (novelty, incongruity) are valid triage and observability tools; the classifier role was measured and rejected. The deterministic gate — not the neural classifier — is the control that holds under adversarial pressure. This is the §4.5 calibration thesis stated as an engineering measurement, not a recommendation.
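
To make “below chance” concrete: AUROC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, so 0.5 is a coin flip and anything lower means the score ranks examples in the wrong direction. The sketch below computes AUROC from that definition and shows one plausible shape for a k-NN novelty score. The toy scores and vectors are invented for illustration and are not the paper’s data; `knn_novelty` is a hypothetical helper, not the runtime’s detector.

```python
import math


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_novelty(query, reference, k=2):
    """Mean cosine distance from the query to its k nearest reference vectors."""
    dists = sorted(cosine_distance(query, r) for r in reference)
    return sum(dists[:k]) / k


# Toy labels: 1 = harmful, 0 = benign (invented data).
labels = [1, 1, 1, 0, 0, 0]
useful = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
wrong_way = [-s for s in useful]  # same information, ranked backwards

print(round(auroc(useful, labels), 3))     # 0.889
print(round(auroc(wrong_way, labels), 3))  # 0.111

# Toy embeddings of previously seen actions (invented data).
seen = [(1.0, 0.0), (0.9, 0.1), (1.0, 0.1)]
print(knn_novelty((0.0, 1.0), seen) > knn_novelty((1.0, 0.05), seen))  # True
```

The novelty score needs no harmfulness labels at all: it only measures how far an action sits from what has been seen before. That is consistent with the framing above, in which such a signal is suited to triage and observability rather than to deciding whether an action is harmful.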

How it works, in one minute

Two threat classes, not one bucket. “Agent security” isn’t monolithic. Full takeover (an attacker controls the agent’s whole action space) and data leaks have different mechanisms, blast radii, and fixes. Conflating them leads to both over-insuring harmless leaks and under-insuring takeover.
Only Tier 3 data needs deterministic protection. Sort everything the agent touches into three tiers — publicly available, contextually harmless, and genuinely harmful (credentials, client PII, trade secrets). Your architecture should be proportionate to sensitivity, not uniformly maximal.
Deterministic controls beat probabilistic ones. “The agent should ask before deleting” is a hope; “the agent cannot delete without a human gate” is a control. AEGIS separates the eight strategy classes by enforcement posture so you know which is which — now anchored by a measured result: the supervised harmfulness head failed at 0.286 AUROC, while the structural anomaly detectors validated at 0.875 and 0.896.
Defense in depth, like an immune system. Seven independent layers — network controls, process isolation, filesystem access control, programmatic guardrails, privacy routing, configuration immutability, and audit/supply-chain trust. A threat that bypasses one still faces the next; no single attack neutralizes them all.
Reversibility earns autonomy. A control that lets the agent act freely while guaranteeing every action is reversible by construction and fully observed buys safety without spending autonomy — provided the categorical security boundaries stay hard. This is §12.5’s new control class: the fourth trilemma position.
Liability is distributed but not equal. Under GDPR the deploying organization is the controller and bears primary responsibility; the LLM provider is a processor. Without a data-processing agreement, you carry the full exposure — the paper analyzes what recovery against a major provider would realistically require.
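
The difference between a hope and a control in the list above can be sketched in a few lines. This is an illustrative gate under stated assumptions, not the AEGIS reference implementation: the names `Tier`, `Action`, and `gate` are hypothetical, and the rule set (Tier 3 data and irreversible actions both require a human gate) is one reading of the points above.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    PUBLIC = 1      # publicly available
    CONTEXTUAL = 2  # contextually harmless
    HARMFUL = 3     # credentials, client PII, trade secrets


@dataclass(frozen=True)
class Action:
    verb: str
    tier: Tier
    reversible: bool


def gate(action: Action, human_approved: bool = False,
         classifier_score: Optional[float] = None) -> str:
    """Return 'allow' or 'block'.

    classifier_score is accepted so it can be logged as a signal,
    but it is deliberately never consulted when deciding.
    """
    if action.tier == Tier.HARMFUL and not human_approved:
        return "block"
    if not action.reversible and not human_approved:
        return "block"
    return "allow"


purge = Action("delete_email_permanently", Tier.CONTEXTUAL, reversible=False)
trash = Action("move_email_to_trash", Tier.CONTEXTUAL, reversible=True)
secret = Action("read_api_key", Tier.HARMFUL, reversible=True)

print(gate(purge))                         # block
print(gate(purge, human_approved=True))    # allow
print(gate(trash))                         # allow
print(gate(secret, classifier_score=0.01)) # block
```

A low classifier score does not unlock the Tier 3 read, and a reversible action passes without approval. The first illustrates “signal, not boundary”; the second illustrates how reversibility can earn autonomy without loosening the hard boundary.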

From the ecosystem: composable tools AEGIS builds on

v2.9 treats two active open-source projects as first-class related work rather than footnotes — they supply running references for controls the framework previously described only abstractly.

🔘 chopratejas/headroom — 24.7k★

Reversible Compress-Cache-Retrieve (CCR): compaction never discards the original; the full content is cached and retrievable on demand. The project reports 60–95% token savings at near-zero accuracy delta. AEGIS cites it as the existence proof that “security-critical events must survive compaction” is a solved, cheap engineering pattern, and adds the integrity superset: a hash-chained append-only cache so the retrievable original is provably the real one. The reversibility that headroom provides is what earns autonomous background consolidation; AEGIS adds the tamper-evidence boundary on top.
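
A hash-chained append-only cache is a small data structure, and a sketch may help show why it makes tampering evident. The class below is a minimal illustration using only the standard library. `HashChainedCache` and its methods are hypothetical names; this is neither headroom’s API nor the AEGIS implementation.

```python
import hashlib

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _link(prev_hash: str, content: str) -> str:
    """Hash of an entry, bound to the hash of the entry before it."""
    return hashlib.sha256((prev_hash + content).encode("utf-8")).hexdigest()


class HashChainedCache:
    """Append-only store where each entry commits to everything before it."""

    def __init__(self):
        self._entries = []

    def append(self, content: str) -> int:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        self._entries.append(
            {"prev": prev, "content": content, "hash": _link(prev, content)}
        )
        return len(self._entries) - 1

    def head(self) -> str:
        """Hash of the latest entry; store this somewhere the agent cannot write."""
        return self._entries[-1]["hash"] if self._entries else GENESIS

    def verify(self) -> bool:
        """Recompute the whole chain and compare against the stored hashes."""
        prev = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev or entry["hash"] != _link(prev, entry["content"]):
                return False
            prev = entry["hash"]
        return True

    def retrieve(self, index: int) -> str:
        """Return the original content, refusing if the chain no longer verifies."""
        if not self.verify():
            raise ValueError("cache integrity check failed")
        return self._entries[index]["content"]


cache = HashChainedCache()
i = cache.append("tool call: delete 6,500 emails")
cache.append("compacted summary of turn 12")
print(cache.verify())                                      # True
print(cache.retrieve(i))                                   # tool call: delete 6,500 emails
cache._entries[0]["content"] = "tool call: read 1 email"   # simulate tampering
print(cache.verify())                                      # False
```

The chain only proves integrity relative to a trusted head hash. An attacker who can rewrite every entry can also recompute every hash, so the value returned by `head()` has to be anchored outside the agent’s write scope.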

👥 addyosmani/agent-skills — 56.8k★

Doubt-driven-development: CLAIM → EXTRACT the artifact with the proposer’s reasoning stripped → DOUBT via a fresh-context adversarial reviewer → RECONCILE → STOP. The non-obvious design is that EXTRACT strips the original reasoning and the reviewer runs in fresh context — defeating the failure mode where the reviewer is anchored by the proposer’s own justification. AEGIS adopts this as the recommended form of its adversarial review-pass controls for CEREBELLUM self-modification (§11.4) and the skeptical-minority veto in SYNAPSE deliberation (§11.5).
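
The CLAIM → EXTRACT → DOUBT → RECONCILE → STOP loop can be sketched as a small pipeline. The key property is that the reviewer function receives only the artifact, never the proposer’s reasoning. Everything here is a hypothetical illustration: `Claim`, `extract`, and `doubt_driven_review` are invented names, the reviewer is a toy stand-in for a fresh-context model call, and the RECONCILE rule (accept only when there are no objections) is an assumption, not the agent-skills definition.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Claim:
    artifact: str   # the thing being proposed, e.g. a diff or a config change
    reasoning: str  # the proposer's justification


def extract(claim: Claim) -> str:
    """EXTRACT: keep the artifact, strip the proposer's reasoning."""
    return claim.artifact


def doubt_driven_review(claim: Claim, reviewer: Callable[[str], List[str]]) -> dict:
    """Run one CLAIM -> EXTRACT -> DOUBT -> RECONCILE -> STOP pass."""
    artifact = extract(claim)
    objections = reviewer(artifact)        # DOUBT: reviewer sees the artifact only
    accepted = len(objections) == 0        # RECONCILE: assumed rule, see lead-in
    return {"accepted": accepted, "objections": objections}  # STOP


def toy_reviewer(artifact: str) -> List[str]:
    """Stand-in for a fresh-context adversarial reviewer (a simple keyword rule)."""
    return ["disables a human gate"] if "require_approval = False" in artifact else []


risky = Claim(
    artifact="require_approval = False",
    reasoning="Trust me, approvals are slowing the agent down.",
)
safe = Claim(artifact="log_level = 'debug'", reasoning="Need more detail.")

print(doubt_driven_review(risky, toy_reviewer))
# {'accepted': False, 'objections': ['disables a human gate']}
print(doubt_driven_review(safe, toy_reviewer)["accepted"])  # True
```

Because the reviewer’s signature takes a single string, there is no path by which the proposer’s justification can reach it. The anchoring failure mode is ruled out by construction instead of by instruction.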

💾 Build Jarvis yourself — the full stack is open source

OpenClaw, the gateway, the skills, and the AEGIS enforcement layer — all of it is in the repo.

★ globalcaos/tinkerclaw on GitHub →

Read the paper


v2.9 · June 2026 · the threat taxonomy, three-tier data model, eight strategy classes, the GDPR liability analysis, enterprise-fork gap studies, the seven-layer reference architecture, and the AMYGDALA offline evaluation results

