A technical framework for governing agent restraint before execution.
Agent systems in 2026 have crossed a threshold: they are no longer interesting because they can draft text, but because they can take action. They can call tools, mutate files, message humans, operate APIs, trigger workflows, move money, and reshape production state. That shift creates a governance problem that prompt engineering alone cannot solve. If a model can propose actions, operators need a deterministic layer that decides whether those actions are authorized before execution. Rules of Engagement (ROE) is a policy framework built for that layer. It models proposed actions as structured intents, evaluates them against policy bundles with explicit rule precedence, and returns one of three outcomes—allow, deny, or escalate—with human-readable reasoning. ROE is grounded in the logic of military Rules of Engagement doctrine, not as branding, but as a source discipline for restraint, escalation, and proportionality. This paper explains the threat model, situates ROE against adjacent work such as OPA/Rego, Cedar, canary systems, and OpenClaw security skills, and walks through the core primitives of the policy language. It also presents worked examples, limitations, and open questions. The argument is simple: as agents become action-capable, authorization must move from prose and convention into an enforceable policy layer.
The last year of agent tooling made one point impossible to ignore: the useful agent is an action-taking agent. The moment a model can open a shell, call a payment API, modify a deployment manifest, create a webhook, or send a bulk message, the problem space changes. We are no longer debating whether the assistant produced a slightly wrong paragraph. We are asking whether it should have been allowed to touch the underlying system at all.
Existing agent safety discussions often mix together several different layers: model behavior, prompt design, user-interface affordances, auditing, and enforcement. Those layers matter, but they do not do the same job. Prompting can ask the model to be careful. Logging can tell us what happened after the fact. Alerting can summarize what went wrong. None of those mechanisms alone decides whether a pending action is authorized at the moment the action is about to fire.
ROE is a direct answer to that gap. It assumes an environment where agents are productive because they have tools, and where the operator needs a compact, inspectable, testable expression of restraint. Instead of trying to infer global safety from model rhetoric, ROE evaluates each proposed action against policy. The posture is intentionally conservative. If nothing authorizes the action, the action is denied.
ROE is aimed at a class of failures that appear when models are trusted with effectful operations. The framework is not a universal defense against all model risk, and it is not marketed that way. The specific threats it tries to reduce are the ones that become common when action paths exist.
First is prompt injection. A model connected to tools can be induced to reinterpret hostile instructions as legitimate operator intent. If the tool invocation is not independently governed, an injected instruction can pivot from “summarize this page” to “send this data to my webhook” or from “help the customer” to “issue the refund now.” ROE treats that problem as an authorization issue. Even if the model wants to comply, the pending action still has to match policy.
Second is scope creep. Agents routinely take one more step than the operator expected. A task begins as analysis and ends as mutation. A support response becomes a billing change. A file review becomes a recursive edit. Most of the time these failures are not theatrical; they are ordinary overreach. That is exactly why a standing policy layer matters.
Third is misuse of integrations. Modern agents inherit the power of the systems they can reach. If an environment exposes email, Slack, GitHub, Stripe, filesystem writes, and external automation endpoints, then one model mistake can chain into a multi-system incident. ROE is built to give those classes of actions explicit gates.
Fourth is supply-chain and plugin abuse. Recent reporting around malicious or over-permissioned agent extensions makes this threat concrete. Snyk’s ToxicSkills research highlighted how dangerous capability bundles can enter an agent stack under the banner of convenience.[5] Koi Security’s ClawHavoc reporting similarly showed how hostile or compromised package/plugin behavior can exploit trust in agent ecosystems.[6] The point is not that one incident should define the whole market. It is that the attack surface now includes the policy-free action paths exposed by tools, plugins, and automation glue.
ROE does not claim to eliminate these threats. It narrows the set of actions that can execute without explicit authorization.
Good systems already exist in adjacent spaces, and ROE should be judged against them fairly.
OPA / Rego demonstrated the value of policy-as-code for infrastructure, admission control, and runtime decisions.[7] Its strengths are expressive power, ecosystem maturity, and testability. Its weakness for this use case is not capability but fit. Rego is general-purpose enough that many agent operators would need to invent the ontology, evaluation contract, and action taxonomy themselves before they ever write a first useful rule. ROE is narrower and opinionated: it gives the operator a smaller vocabulary tailored to action authorization.
Cedar offers a modern authorization language with a clean model of principals, actions, and resources.[8] It is rigorous, composable, and far more formal than most ad hoc agent-control schemes. But again, it is broad. In agent environments, operators still have to decide how pending tool calls map into Cedar’s model and what escalation means at runtime. ROE leans into that opinionated adapter boundary and treats escalation as a first-class verdict, not an afterthought.
Canary systems, including Thinkst-style tripwires, are valuable because they detect misuse and surface it quickly.[9] They are not an authorization layer. They tell you a thing happened or was attempted. ROE’s job is to decide that the thing should not happen yet.
OpenClaw security skills and workflow guardrails already cover a meaningful part of the problem space in practice. They can constrain prompts, flag suspicious operations, and wrap recurring workflows with structure. ROE does not replace those mechanisms. The cleaner framing is that ROE becomes the explicit policy engine underneath effectful decisions while security skills remain detection, enrichment, and workflow context around the decision.
The strongest prior art already proves that policy and authorization work. ROE’s contribution is not “policy exists,” but a deliberately smaller model tuned for agent action restraint, default deny, and approval-aware execution.
At the center of ROE is the ActionIntent model. An adapter serializes a pending tool call or action proposal into a structured object with an action_class, a free-form parameter bag, and execution context such as agent id, session id, timestamp, and originating input. This structure matters because it turns an ambiguous prompt trace into a concrete authorization subject.
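As a concrete illustration, the ActionIntent structure described above could be sketched as a plain dataclass. The field `action_class` is named in the text; the other field names and the example values are illustrative assumptions, not the published ROE schema.

```python
# Sketch of an ActionIntent: a serialized pending action plus execution
# context. Field names other than action_class are assumptions.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ActionIntent:
    action_class: str                                     # e.g. "financial.refund"
    params: dict[str, Any] = field(default_factory=dict)  # free-form parameter bag
    agent_id: str = ""                                    # execution context
    session_id: str = ""
    timestamp: str = ""
    originating_input: str = ""  # the prompt/tool trace that produced this intent

# An adapter would build something like this from a pending tool call.
intent = ActionIntent(
    action_class="financial.refund",
    params={"order_id": "4821", "amount_usd": 120.0},
    agent_id="support-agent-7",
)
print(intent.action_class)  # financial.refund
```

The point of the structure is exactly what the text says: once the pending call is an object rather than a prompt trace, it can serve as an unambiguous authorization subject.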
The engine then evaluates that intent against loaded policies. A policy is a versioned rule bundle. Each rule declares which action classes it applies to, optional conditions, a decision, a human-readable reasoning template, and—if the decision is escalate—metadata describing the escalation requirement.
The evaluation order is strict:

1. Deny rules are evaluated first; any match terminates with a denial.
2. Escalate rules are evaluated next; a match pauses the action for approval or justification.
3. Allow rules are evaluated last; a match authorizes the action.
4. If nothing matches, the engine returns a default deny with explicit reasoning.
This precedence stack is simple enough to reason about during an incident. Deny rules win first. Escalate rules can pause the action for approval or justification. Allow rules authorize the remainder. If nothing matches, ROE returns a default deny with explicit reasoning. That last detail is crucial. A default deny that cannot be explained operationally becomes friction; a default deny with readable reasoning becomes a control an operator can actually use.
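The precedence stack can be sketched in a few lines. The rule and verdict shapes below are assumptions for illustration; only the ordering (deny, then escalate, then allow, then default deny with readable reasoning) comes from the text.

```python
# Minimal sketch of strict precedence: deny beats escalate beats allow,
# and an unmatched intent falls through to an explained default deny.
def evaluate(intent, rules):
    for decision in ("deny", "escalate", "allow"):  # strict precedence order
        for rule in rules:
            if rule["decision"] != decision:
                continue
            if intent["action_class"] not in rule["action_classes"]:
                continue
            cond = rule.get("condition")
            if cond is None or cond(intent):
                return {"decision": decision, "reasoning": rule["reasoning"]}
    # Nothing authorized the action: default deny, with a concrete reason.
    return {
        "decision": "deny",
        "reasoning": f"no rule authorizes action class {intent['action_class']!r}",
    }

rules = [
    {"decision": "escalate",
     "action_classes": ["financial.refund"],
     "condition": None,
     "reasoning": "refunds require human approval before execution"},
    {"decision": "allow",
     "action_classes": ["fs.read"],
     "condition": None,
     "reasoning": "read-only filesystem access is standing authority"},
]

print(evaluate({"action_class": "financial.refund"}, rules)["decision"])        # escalate
print(evaluate({"action_class": "external.webhook.create"}, rules)["decision"])  # deny
```

Note that the unmatched webhook intent is denied with a reason, not silently dropped: the default deny is itself a legible decision.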
The current v0.1 model centers on five authoring primitives: action-class matchers, optional conditions, a decision (allow, deny, or escalate), human-readable reasoning templates, and escalation metadata.
These are intentionally modest. ROE is designed so an operator can read a policy and understand what it does without becoming a language expert first.
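A single rule exercising each of the primitives named above might look like the following. The dict schema, operator encoding, and threshold are illustrative assumptions (the project's actual policies are written in YAML), not the real v0.1 grammar.

```python
# One illustrative rule touching each primitive: action-class match,
# condition, decision, reasoning template, escalation metadata.
rule = {
    "action_classes": ["financial.refund"],                         # what it governs
    "condition": {"param": "amount_usd", "op": ">", "value": 50},   # optional gate
    "decision": "escalate",                                         # allow | deny | escalate
    "reasoning": "refund for order {order_id} must be approved by a human",
    "escalation": {"approver_role": "operator", "timeout_s": 900},  # escalate-only metadata
}

def condition_holds(rule, params):
    # Evaluate the single supported comparison in this sketch.
    c = rule["condition"]
    if c["op"] == ">":
        return params.get(c["param"], 0) > c["value"]
    return False

print(condition_holds(rule, {"amount_usd": 120.0}))  # True
print(condition_holds(rule, {"amount_usd": 20.0}))   # False
```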
Every decision returns human-readable reasoning. That is not optional presentation sugar. It is part of the control surface. If a framework blocks a financial action and cannot say why in concrete language, the operator learns to route around it. If it says “refund request for order 4821 must be approved by a human before execution,” the control is legible.
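Rendering a reasoning template against the intent's parameters is what makes the verdict concrete rather than generic. The template syntax below is an assumption; the example sentence is the one quoted in the text.

```python
# Reasoning template rendered with intent parameters, so the operator sees
# a specific, legible explanation. Template syntax is an assumed sketch.
template = "refund request for order {order_id} must be approved by a human before execution"
params = {"order_id": "4821", "amount_usd": 120.0}

reason = template.format(**params)
print(reason)
# refund request for order 4821 must be approved by a human before execution
```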
ROE borrows from military Rules of Engagement doctrine because that doctrine has spent decades formalizing restraint under pressure. The software does not claim doctrinal authority, and the site does not lean on military aesthetics. The borrowing is conceptual and cited.
Positive Identification maps to the need to know what action is actually being proposed before authorizing it. Proportionality maps to the idea that not every action should be granted at the same threshold; the more blast radius, the more justification or approval is required. Escalation of Force maps cleanly onto the notion that some actions should pause and require explicit operator involvement before proceeding. Restricted Engagement Zones translate well into forbidden action spaces or destinations. Collateral Damage Estimation informs blast-radius conditions for file counts, dollar amounts, recipient counts, or other measurable effects.
The sister repository’s doctrine draft is careful about this mapping, and that restraint is correct. The goal is not to costume software in doctrine. The goal is to learn from a mature discipline that already knows how to express authority boundaries.
An agent is asked to triage a support ticket. The ticket body includes an injected instruction claiming a customer is pre-approved for a refund and urging the model to execute immediately. A naive workflow might collapse analysis and execution. Under ROE, the action path changes the moment the model proposes financial.refund. The financial standing policy matches and returns escalate. The refund is paused. The operator sees a concrete reason and decides whether the action is legitimate.
This matters because the protection does not depend on the model recognizing the injection perfectly. The authorization boundary is independent of the model’s confidence.
A research assistant agent is allowed to summarize web pages and maintain notes. It encounters malicious instructions that attempt to convince it to create a webhook and send harvested snippets to an external endpoint. If outbound automations are governed by a standing ROE template, the attempt to create or retarget a webhook is denied or escalated. The agent can still continue with non-effectful analysis, but it cannot silently add a new external pathway.
This is the agent equivalent of recognizing that a destination is inside a restricted engagement zone. The destination itself changes the authorization requirement.
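A restricted-zone check on outbound destinations could be as small as the sketch below: a webhook pointed anywhere outside an internal allowlist is denied, and even an allowed destination still escalates. The allowlist, hostnames, and verdict shape are illustrative assumptions.

```python
# Restricted-engagement-zone sketch for outbound automation: the
# destination itself changes the authorization requirement.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"hooks.internal.example.com"}  # assumed internal allowlist

def webhook_verdict(target_url):
    host = urlparse(target_url).hostname or ""
    # Creating or retargeting a webhook is never a plain allow here:
    # allowed destinations escalate, everything else is denied.
    return "escalate" if host in ALLOWED_HOSTS else "deny"

print(webhook_verdict("https://attacker.example.net/collect"))       # deny
print(webhook_verdict("https://hooks.internal.example.com/notify"))  # escalate
```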
A coding agent is asked to refactor a project. During execution, the plan drifts into a batch move touching dozens of files and then into a recursive delete of generated assets. If filesystem-destructive standing policy is in place, recursive delete can be denied outright, while large write sets above a threshold are escalated. The result is not “the model can never edit files.” It is “the model can edit files until the expected blast radius exceeds standing authority.”
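The filesystem standing policy described above reduces to two gates: recursive delete is denied outright, and write sets above a blast-radius threshold escalate. The action-class names `fs.delete.recursive` and the write-set threshold value are taken or assumed as follows: the class name appears in the text, the threshold and `fs.write.*` naming are illustrative assumptions.

```python
# Blast-radius sketch: deny recursive delete, escalate oversized write
# sets, allow ordinary edits. Threshold value is an assumption.
WRITE_SET_THRESHOLD = 25  # files touched before standing authority runs out

def fs_verdict(action_class, file_count):
    if action_class == "fs.delete.recursive":
        return "deny"
    if action_class.startswith("fs.write") and file_count > WRITE_SET_THRESHOLD:
        return "escalate"
    return "allow"

print(fs_verdict("fs.delete.recursive", 3))  # deny
print(fs_verdict("fs.write.batch", 60))      # escalate
print(fs_verdict("fs.write.batch", 4))       # allow
```

This is the "proportionality" mapping in miniature: the model can edit files until the expected blast radius exceeds standing authority.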
The matrix gives the quick scan: a filled dot (●) marks the policy families that engage a given pattern, and an open dot (○) marks those that do not. Use the pattern library for the exact YAML and the worked examples below for how those controls show up in practice.
| Attack pattern or control | Financial | External automation | Filesystem destructive |
|---|---|---|---|
| Prompt injection pivots into prohibited action | ● | ● | ● |
| Unauthorized money movement or refund | ● | ○ | ○ |
| Webhook or outbound exfiltration setup | ○ | ● | ○ |
| Mass outbound messaging abuse | ○ | ● | ○ |
| Recursive delete / oversized local blast radius | ○ | ○ | ● |
| Requires human approval before execution | ● | ● | ● |
ROE has important limitations, and pretending otherwise would make the framework less credible.
First, action taxonomy is hard. Someone has to decide what financial.refund, fs.delete.recursive, or external.email.bulk_send mean in a given environment. ROE can evaluate those classes deterministically, but it does not magically standardize a domain ontology for everyone.
Second, adapters are part of the security boundary. If a host environment serializes tool calls poorly, omits parameters, or routes around the interception path, policy quality does not matter; the integrity of the adapter path matters as much as the policy language. A documented infrastructure case study in the project materials shows this failure mode concretely: policy enforcement became unreliable because the same intended heartbeat inspection produced different upstream action shapes across runs, and narrow policy controls only behaved predictably after a move to a canonical, deterministic invocation.[10]
Third, default deny is correct for many sensitive contexts, but not every environment can tolerate the same operational friction. There will always be pressure to fail open in the name of continuity. ROE treats fail-open as an explicit opt-in, but the socio-technical pressure remains.
Fourth, human approval can become a rubber stamp if the workflow around it is weak. Escalation is not a silver bullet. It works only if approval is meaningful and operators can see enough context to make good decisions quickly.
Fifth, there is a broader open question about composition. As agent ecosystems mature, organizations will want standing ROE, mission-specific ROE, ephemeral overrides, identity-aware policy, and auditability across many agents. ROE v0.1 is intentionally smaller than that future. Whether the next step should look like a richer policy language, stronger adapter contracts, or tighter integration with host approval systems is still open.
The agent era has made action governance a first-order systems problem. The relevant question is no longer whether a model can be persuaded to say the right thing. It is whether the surrounding system can govern what the model is allowed to do. Rules of Engagement offers one answer: make restraint explicit, evaluate it before execution, and prefer denial to ambiguity when authority is missing.
ROE is deliberately opinionated. It assumes that authorization should be inspectable, reasoning should be readable, and escalation should be part of the primary decision model. That narrowness is the point. The fastest way to lose trust in an agent system is to let it do one thing it was never clearly allowed to do. Policy belongs in the action path.