arXiv · April 22, 2026 at 4:00 AM · 2 min read

Owner-Harm: A Missing Threat Model for AI Agent Safety

Abstract: Existing AI agent safety benchmarks focus on generic criminal harm (cybercrime, harassment, weapon synthesis), leaving a systematic blind spot for a distinct and commercially consequential threat category: agents harming their own deployers. Real-world incidents illustrate the gap: Slack AI credential exfiltration (Aug 2024), Microsoft 365 Copilot calendar-injection leaks (Jan 2024), and an unauthorized forum post by a Meta agent that exposed operational data (Mar 2026). We propose Owner-Harm, a formal threat model with eight categories of agent behavior that damage the deployer. We quantify the defense gap on two benchmarks: a compositional safety system achieves 100% TPR / 0% FPR on AgentHarm (generic criminal harm) yet only 14.8% (4/27; 95% CI: 5.9%-32.5%) on AgentDojo injection tasks (prompt-injection-mediated owner harm). A controlled generic-LLM baseline shows that the gap is not inherent to owner-harm (62.7% vs. 59.3%, delta 3.4 pp) but arises from environment-bound symbolic rules that fail to generalize across tool vocabularies. On a post-hoc 300-scenario owner-harm benchmark, the gate alone achieves 75.3% TPR / 3.3% FPR; adding a deterministic post-audit verifier raises overall TPR to 85.3% (+10.0 pp) and Hijacking detection from 43.3% to 93.3%, demonstrating strong layer complementarity. We introduce the Symbolic-Semantic Defense Generalization (SSDG) framework, which relates information coverage to detection rate. Two SSDG experiments partially validate it: context deprivation amplifies the detection gap 3.4x (R = 3.60 vs. R = 1.06), and context injection reveals that structured goal-action alignment, not text concatenation, is required for effective owner-harm detection.

Comments: 15 pages. Companion manuscript on per-decision proof-obligation synthesis (LSVJ-S) in preparation
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
ACM classes: I.2.11; D.4.6; I.2.4
Cite as: arXiv:2604.18658 [cs.CR] (or arXiv:2604.18658v1 [cs.CR] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.18658 (arXiv-issued DOI via DataCite, pending registration)
Submission history: From: Dario Zhang. [v1] Mon, 20 Apr 2026 10:11:26 UTC (26 KB)
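
The abstract reports the AgentDojo result as 14.8% (4/27; 95% CI: 5.9%-32.5%) without naming the interval method. As a minimal sketch, assuming a Wilson score interval (an assumption, not something the abstract states), the reported bounds can be reproduced from the raw counts. The function name `wilson_interval` is hypothetical and used here only for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence. The choice of the
    Wilson method is an assumption; the abstract does not name its method.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    # Center of the interval is pulled toward 0.5 relative to p_hat.
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


# Counts taken from the abstract: 4 detections out of 27 injection tasks.
low, high = wilson_interval(4, 27)
print(f"{4 / 27:.1%} (95% CI: {low:.1%}-{high:.1%})")
# prints: 14.8% (95% CI: 5.9%-32.5%)
```

The width of the interval reflects how small the sample is: with only 27 tasks, the data are compatible with a detection rate anywhere from roughly 6% to 32%, all of which remain far below the 100% TPR reported on AgentHarm.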

