Measuring Risk in Deployed AI Agents: The 30-Day Window

Risk quantification is one of the hardest problems in deploying AI agents at scale. Human intuition about what constitutes risky behaviour is hard to formalise. Statistical models are brittle under distribution shift. And the consequences of getting it wrong are asymmetric: false positives create operational disruption; false negatives allow genuine problems to continue undetected.

Despite these challenges, organisations deploying AI agents in regulated environments need a risk quantification framework. Regulators require it. Enterprise customers demand it. And the alternative — operating without any systematic risk assessment — is a liability that accumulates invisibly until something goes wrong.

This is a deep look at how risk scoring for AI agents works in practice: the technical design decisions, the calibration challenges, and what the risk score actually tells you.

Why Risk Scoring Is Different for AI Agents

Risk scoring is well-developed in several adjacent domains — credit risk, fraud detection, cybersecurity — and it is tempting to assume that the techniques from those domains transfer directly to AI agent monitoring. They do not, for several reasons.

AI agents exhibit emergent behaviours. A credit score predicts a borrower's behaviour based on their individual history and population-level patterns. AI agents are optimising processes — their behaviour is shaped by their training, their instructions, and the environment they operate in. Novel inputs produce novel outputs. An AI agent that has never encountered a specific type of market condition will produce outputs that its historical behaviour does not predict.

AI agent behaviour can shift rapidly. A human's credit risk changes slowly. An AI agent can go from normal to highly anomalous in a single request cycle if it receives a prompt injection attack, encounters a malformed input, or hits an edge case outside its training distribution. A risk model that uses only rolling averages over long time windows will miss rapid-onset anomalies.

AI agent risk is policy-relative. Whether a behaviour is risky depends on what the agent is authorised to do. An AI trading agent submitting 1,000 orders per second might be perfectly normal for a high-frequency market-making strategy and catastrophically wrong for an agent that is supposed to submit at most 10 orders per minute. The risk score must be calibrated against the specific agent's authorised parameters, not against population-level norms.
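One way to picture policy-relative scoring is to express the observed rate as a fraction of the agent's own authorised limit. The sketch below is illustrative only: `AgentPolicy`, `max_orders_per_minute`, and `policy_utilisation` are hypothetical names, not part of any documented Kakunin API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical per-agent authorisation record (an assumption for this sketch)."""
    agent_id: str
    max_orders_per_minute: float


def policy_utilisation(observed_per_minute: float, policy: AgentPolicy) -> float:
    """Return the observed rate as a fraction of the agent's authorised limit.

    Values above 1.0 mean the agent is exceeding what it is authorised to do.
    """
    if policy.max_orders_per_minute <= 0:
        raise ValueError("authorised limit must be positive")
    return observed_per_minute / policy.max_orders_per_minute


# The same observed rate reads very differently under two policies.
market_maker = AgentPolicy("mm-1", max_orders_per_minute=60_000)  # 1,000 per second
retail_agent = AgentPolicy("retail-1", max_orders_per_minute=10)

observed = 60_000  # orders per minute, i.e. 1,000 per second
print(policy_utilisation(observed, market_maker))
print(policy_utilisation(observed, retail_agent))
```

The market-making agent sits exactly at its limit, while the second agent is thousands of times over its own, even though the raw rate is identical.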

Counterfactual attribution is difficult. When a credit model flags a borrower as high-risk, the analyst can ask: which features drove the score? For AI agent risk, the causal chain from inputs to risk score is more complex. An anomalous event at 9:00am might only become significant in the context of a pattern that emerged over the previous 24 hours.

These characteristics shape every design decision in Kakunin's risk scoring model.

The 30-Day Rolling Window: Rationale

The choice of a 30-day rolling window for AI agent risk scoring is not arbitrary. It reflects a specific set of trade-offs.

A shorter window — say, 24 hours — produces a baseline that reflects only very recent history. This is responsive to rapid change, but it creates instability: an agent handling an unusually busy Monday will produce anomaly signals on subsequent quieter days, and vice versa. Short-window models generate more false positives and require more frequent manual review.

A longer window (say, 90 or 180 days) produces a more stable baseline. It is less sensitive to legitimate operational variation. But it is also slow to detect genuine behavioural drift. An agent that gradually shifts its behaviour over two months can still look "normal" relative to a six-month average, because the drifting period is averaged in with four months of older behaviour that dominates the baseline.

Thirty days represents the sweet spot for most regulated enterprise AI agent deployments. It is long enough to include full monthly business cycles (relevant for financial services operations that have monthly settlement rhythms), short enough to detect drift that occurs over days or weeks, and short enough that the computation is manageable at scale.

For specific deployment contexts, the 30-day window can be the wrong choice. High-frequency trading agents, which operate at millisecond timescales, need much shorter windows — perhaps 15 minutes of rolling history. Healthcare diagnostic agents that handle infrequent complex cases might need longer windows to build a meaningful baseline. Kakunin's implementation uses 30 days as the production default, with the understanding that different use cases may require different calibration.

The Risk Score Architecture: Components

Kakunin's risk score is a composite metric. It combines several component scores, each capturing a different dimension of agent behaviour.

Volume deviation score measures whether the agent's transaction or event volume is outside its historical range. It is computed as the z-score of the current volume relative to the rolling 30-day mean and standard deviation. An agent with a historical average of 100 transactions per hour that suddenly processes 500 per hour has a high volume deviation score. An agent with high historical variance (legitimately handling 50–500 per hour depending on market conditions) would need a much larger deviation to produce the same score.
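The z-score described above can be sketched in a few lines. This is a minimal illustration using Python's standard library, with short made-up hourly series standing in for the full 30-day history; the function name `volume_deviation` and the handling of a zero-variance baseline are assumptions for the example.

```python
import statistics
from typing import Sequence


def volume_deviation(history: Sequence[float], current: float) -> float:
    """Z-score of `current` against the mean and sample std dev of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mean = statistics.fmean(history)
    std = statistics.stdev(history)
    if std == 0:
        # Perfectly flat baseline: no deviation scores 0, any deviation is unbounded.
        if current == mean:
            return 0.0
        return float("inf") if current > mean else float("-inf")
    return (current - mean) / std


# Illustrative hourly transaction counts; production would span the 30-day window.
steady = [95, 100, 105, 98, 102, 100]    # low variance around 100 per hour
volatile = [50, 500, 120, 300, 80, 450]  # legitimately wide range

z_steady = volume_deviation(steady, 500)
z_volatile = volume_deviation(volatile, 500)
print(f"steady baseline:   z = {z_steady:.1f}")
print(f"volatile baseline: z = {z_volatile:.1f}")
```

The same reading of 500 per hour is an extreme outlier against the steady baseline and only a modest deviation against the volatile one, which is the behaviour the paragraph describes.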

Distribution shift score measures whether the types of actions the agent is taking are outside its historical distribution. If an agent typically splits its time 70% on transaction events and 30% on data access events, a shift to 20% transaction and 80% data access is flagged. This component catches the kind of behavioural drift that does not show up in volume metrics — an agent that is doing approximately the same number of things but doing different things.
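The text does not say which distance measure is used for this component. As one plausible choice, the sketch below uses total variation distance between the baseline and current action mixes; the function names and the choice of metric are assumptions for illustration.

```python
from typing import Dict, Mapping


def to_proportions(counts: Mapping[str, int]) -> Dict[str, float]:
    """Convert raw event counts per action type into proportions."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("need at least one event")
    return {kind: n / total for kind, n in counts.items()}


def distribution_shift(baseline: Mapping[str, float],
                       current: Mapping[str, float]) -> float:
    """Total variation distance: 0 for identical mixes, 1 for disjoint ones."""
    kinds = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) for k in kinds)


# The 70/30 to 20/80 example from the text, expressed as event counts.
baseline = to_proportions({"transaction": 700, "data_access": 300})
current = to_proportions({"transaction": 200, "data_access": 800})
shift = distribution_shift(baseline, current)
print(f"distribution shift = {shift:.2f}")
```

Note that both periods contain the same number of events, so a volume-based score would see nothing unusual here.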

Novelty score flags events that are categorically unlike the agent's historical behaviour. An agent that makes its first-ever call to a new API endpoint, accesses a data source it has never touched before, or encounters an error type it has never previously generated produces a high novelty score for those specific events. Novelty scoring is particularly useful for detecting prompt injection, where an attacker causes the agent to take actions that are completely outside its design intent.
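At its simplest, novelty detection is a membership test against everything the agent has done before. The sketch below assumes events can be reduced to a hashable key such as `(kind, target)`; the `NoveltyTracker` class and the batch score (fraction of never-seen keys) are hypothetical simplifications, not the production model.

```python
from typing import Hashable, Iterable, Set


class NoveltyTracker:
    """Remembers every event key seen so far and flags first occurrences."""

    def __init__(self, known: Iterable[Hashable] = ()) -> None:
        self._seen: Set[Hashable] = set(known)

    def observe(self, key: Hashable) -> bool:
        """Record `key` and return True if it had never been seen before."""
        is_novel = key not in self._seen
        self._seen.add(key)
        return is_novel

    def score_batch(self, keys: Iterable[Hashable]) -> float:
        """Fraction of events in the batch that were first-ever occurrences."""
        flags = [self.observe(k) for k in keys]
        return sum(flags) / len(flags) if flags else 0.0


# Historical behaviour: two known API endpoints.
tracker = NoveltyTracker(known=[("api", "/orders"), ("api", "/quotes")])

# New batch: two familiar calls, one new endpoint, one new data source.
batch = [
    ("api", "/orders"),
    ("api", "/admin/export"),
    ("datasource", "hr_records"),
    ("api", "/quotes"),
]
novelty = tracker.score_batch(batch)
print(f"novelty = {novelty:.2f}")
```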

Velocity score measures how quickly the agent's behaviour is changing, independent of whether it is currently anomalous. An agent whose behaviour is changing rapidly — even if each individual day still looks reasonable — is on a trajectory that warrants attention. Velocity scoring is the early warning mechanism that detects drift before it becomes an anomaly by absolute standards.
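One way to quantify velocity is the slope of a least-squares line through a daily metric, divided by the metric's mean so that the result reads as relative change per day. This is an assumed formulation for illustration; the text does not specify how velocity is actually computed.

```python
import statistics
from typing import Sequence


def velocity(daily_values: Sequence[float]) -> float:
    """Least-squares slope of the series, as a fraction of its mean per day."""
    n = len(daily_values)
    if n < 2:
        raise ValueError("need at least two days of data")
    y_mean = statistics.fmean(daily_values)
    if y_mean == 0:
        raise ValueError("mean of the series must be non-zero")
    x_mean = (n - 1) / 2
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return (numerator / denominator) / y_mean


stable = [100, 101, 99, 100, 102]     # ordinary day-to-day noise
drifting = [100, 110, 121, 133, 146]  # each day looks plausible on its own

v_stable = velocity(stable)
v_drifting = velocity(drifting)
print(f"stable:   {v_stable:+.3f} per day")
print(f"drifting: {v_drifting:+.3f} per day")
```

No single day in the drifting series would look alarming in isolation, but the fitted trend is more than an order of magnitude steeper than in the stable series.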

Authentication anomaly score specifically tracks deviations in the agent's authentication patterns — failure rates, unusual target systems, credential rotation patterns. Authentication anomalies are often the earliest detectable signal of credential compromise.
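For the failure-rate part of this component, a simple option is a one-proportion z-score comparing the recent failure rate with the baseline rate. The sketch below is an assumption about how such a check could look; the function name and the example numbers are made up for illustration, and it covers failure rates only, not target systems or rotation patterns.

```python
import math


def auth_failure_z(baseline_rate: float, failures: int, attempts: int) -> float:
    """Z-score of the observed auth failure rate against a baseline rate.

    Uses the normal approximation to the binomial, so it is only
    meaningful when `attempts` is reasonably large.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0.0 < baseline_rate < 1.0:
        raise ValueError("baseline_rate must be strictly between 0 and 1")
    observed = failures / attempts
    std_error = math.sqrt(baseline_rate * (1.0 - baseline_rate) / attempts)
    return (observed - baseline_rate) / std_error


# Baseline: 2% of authentication attempts fail.
z_normal = auth_failure_z(0.02, failures=4, attempts=200)    # 2% observed
z_elevated = auth_failure_z(0.02, failures=12, attempts=200)  # 6% observed
print(f"normal window:   z = {z_normal:.2f}")
print(f"elevated window: z = {z_elevated:.2f}")
```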

These components are combined into a single composite score using a weighted sum, with weights calibrated against the historical incident patterns in the training data. Volume deviation and novelty typically carry the highest weights because they are the most predictive of genuine problems; velocity carries a lower weight because it generates meaningful early warning but is less diagnostic on its own.
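A weighted sum of this kind can be sketched as follows. The weights here are invented for illustration and only respect the ordering described above (volume and novelty highest, velocity lowest); the sketch also assumes each component has already been normalised to the range 0 to 1.

```python
from typing import Mapping

# Hypothetical weights, not Kakunin's calibrated values.
WEIGHTS = {
    "volume": 0.30,
    "novelty": 0.30,
    "distribution_shift": 0.15,
    "authentication": 0.15,
    "velocity": 0.10,
}


def composite_score(components: Mapping[str, float]) -> float:
    """Weighted average of component scores, each clamped to [0, 1]."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    weighted = sum(
        WEIGHTS[name] * min(max(components[name], 0.0), 1.0) for name in WEIGHTS
    )
    # Dividing by the weight total keeps the result in [0, 1] even if
    # the weights are later re-tuned and no longer sum to exactly 1.
    return weighted / sum(WEIGHTS.values())


example = {
    "volume": 0.9,
    "novelty": 0.8,
    "distribution_shift": 0.4,
    "authentication": 0.2,
    "velocity": 0.5,
}
score = composite_score(example)
print(f"composite = {score:.2f}")
```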

Threshold Calibration: The Decision Science

The choice of threshold values — 0.75 for pre-warning, 0.85 for auto-revocation — reflects a specific risk tolerance and operational philosophy.

In any binary classification system, there is a fundamental trade-off between false positive rate and false negative rate. If you set the threshold very low (revoke at 0.5), you catch more genuine problems but you also generate more false revocations of agents that are behaving legitimately. If you set the threshold very high (revoke at 0.99), you virtually eliminate false revocations but allow many genuine problems to persist undetected.
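The trade-off can be made concrete by computing both error rates at several thresholds over a set of labelled scores. The eight labelled examples below are toy values invented purely to show the mechanics; they are not calibration data.

```python
from typing import List, Tuple


def error_rates(scored: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) at `threshold`.

    Each item is (risk_score, is_genuine_problem); a score at or above
    the threshold counts as a revocation.
    """
    false_pos = sum(1 for s, bad in scored if s >= threshold and not bad)
    false_neg = sum(1 for s, bad in scored if s < threshold and bad)
    negatives = sum(1 for _, bad in scored if not bad)
    positives = len(scored) - negatives
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr


# Toy labelled history: (score, was it a genuine problem?)
toy = [
    (0.20, False), (0.40, False), (0.55, False), (0.70, False),
    (0.60, True), (0.80, True), (0.90, True), (0.95, True),
]

for threshold in (0.50, 0.85, 0.99):
    fpr, fnr = error_rates(toy, threshold)
    print(f"threshold {threshold:.2f}: FPR = {fpr:.2f}, FNR = {fnr:.2f}")
```

Lowering the threshold trades missed problems for false revocations, and raising it does the reverse; where the balance should sit depends on the relative cost of each error in the deployment.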

The 0.85 threshold is calibrated for regulated enterprise deployments where:

False revocations are costly (operational disruption, customer impact, reputational risk)
Genuine problems above this threshold have high expected harm
The monitoring system provides a secondary layer of human oversight (the 0.75 pre-warning) before auto-revocation triggers

Different deployment contexts warrant different thresholds. An AI agent involved in healthcare triage, where an undetected failure could cause patient harm, might warrant a lower threshold — the cost of false revocations (operational disruption) is lower than the cost of missed detections (patient harm). An AI agent involved in low-stakes content recommendations, where failures are embarrassing but not harmful, might warrant a higher threshold.

For regulated operators, the threshold selection should be documented and justified in the risk management framework required under EU AI Act Article 9. The justification should explain why the chosen threshold is appropriate given the agent's specific risk profile and the consequences of both false positives and false negatives.

Reading the Risk Score in Practice

The risk score is most informative as a trajectory, not a point-in-time value. A score of 0.7 that has been stable for two weeks is very different from a score of 0.7 that was 0.2 last week and is trending upward at 0.05 per day.
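A rough way to compare the two cases is to extrapolate the current trend and ask how long until the score would reach the revocation threshold. The sketch assumes the trend stays linear, which real trajectories often will not; `days_to_threshold` is a hypothetical helper.

```python
from typing import Optional


def days_to_threshold(current: float, daily_change: float,
                      threshold: float = 0.85) -> Optional[float]:
    """Days until `threshold` is reached under a linear trend.

    Returns 0.0 if the score is already at or above the threshold and
    None if the score is flat or falling.
    """
    if current >= threshold:
        return 0.0
    if daily_change <= 0:
        return None
    return (threshold - current) / daily_change


stable = days_to_threshold(0.7, 0.0)   # flat for two weeks
rising = days_to_threshold(0.7, 0.05)  # climbing 0.05 per day
print("stable:", stable)
print("rising:", None if rising is None else round(rising, 1))
```

Under this assumption the stable agent never reaches 0.85, while the rising agent is about three days away from it, despite both showing 0.7 today.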

When an agent's score begins rising, the first question to ask is: what is driving it? The component scores tell you this. If the volume deviation component is the primary driver, you are looking at a volume change — the agent is doing more or less than usual. If the novelty component is driving, you are seeing the agent do things it has never done before. If the velocity component is primary, the agent's behaviour is changing rapidly across multiple dimensions.
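Identifying the driver amounts to ranking each component by its weighted contribution to the composite. The sketch below assumes a weighted-sum composite; the weights and component values are illustrative placeholders, not calibrated figures.

```python
from typing import List, Mapping, Tuple


def rank_drivers(components: Mapping[str, float],
                 weights: Mapping[str, float]) -> List[Tuple[str, float]]:
    """Return (component, weighted contribution) pairs, largest first."""
    contributions = {name: weights[name] * components[name] for name in weights}
    return sorted(contributions.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights and a snapshot of component scores.
weights = {"volume": 0.30, "novelty": 0.30, "distribution_shift": 0.15,
           "authentication": 0.15, "velocity": 0.10}
snapshot = {"volume": 0.2, "novelty": 0.9, "distribution_shift": 0.3,
            "authentication": 0.1, "velocity": 0.6}

ranked = rank_drivers(snapshot, weights)
primary, contribution = ranked[0]
print(f"primary driver: {primary} ({contribution:.2f} of the composite)")
```

Here novelty dominates, which would point the investigation toward actions the agent has never taken before rather than toward a change in load.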

The next question is: does this correspond to anything I know about? A volume spike on a day when the agent was legitimately handling higher load (a product launch, a market event, an end-of-month processing surge) is expected and explained. An unexplained volume spike is not.

If the elevated score cannot be explained by known operational context, it warrants investigation. The agent's full event log (described in the event ingestion documentation) provides the granular record needed to investigate specific anomalies.

The Risk Score and the Audit Trail

The risk score alone is not sufficient for regulatory purposes. What regulators and auditors need is the risk score in context — understanding what drove it, what decisions were made in response to it, and what the outcomes were.

Kakunin's audit trail links every risk score calculation to the specific events that fed into it. When a compliance team exports the audit trail for an investigation period, they can see not just the score trajectory but the underlying event stream that produced it. They can follow the chain from individual events through component scores to the composite risk score, and from the risk score to any alerts or revocations that were triggered.
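The linkage can be pictured as a chain of records that reference each other by identifier. The data model below is a hypothetical sketch for illustration; every class and field name is an assumption, and the actual schema is the one described in the compliance concepts documentation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    event_id: str
    kind: str


@dataclass(frozen=True)
class ComponentScore:
    name: str
    value: float
    event_ids: Tuple[str, ...]  # events that fed this component


@dataclass(frozen=True)
class CompositeScore:
    score_id: str
    value: float
    components: Tuple[ComponentScore, ...]


@dataclass(frozen=True)
class Alert:
    alert_id: str
    action: str
    score_id: str  # composite score that triggered the alert


def trace_alert(alert: Alert, scores: Dict[str, CompositeScore],
                events: Dict[str, Event]) -> List[Event]:
    """Follow an alert back to the events behind its composite score."""
    composite = scores[alert.score_id]
    ids = {eid for comp in composite.components for eid in comp.event_ids}
    return [events[eid] for eid in sorted(ids)]


events = {e.event_id: e for e in (
    Event("e1", "transaction"), Event("e2", "data_access"), Event("e3", "auth_failure"),
)}
composite = CompositeScore("s1", 0.88, (
    ComponentScore("novelty", 0.9, ("e2",)),
    ComponentScore("authentication", 0.7, ("e3", "e2")),
))
alert = Alert("a1", "auto_revocation", "s1")

evidence = trace_alert(alert, {"s1": composite}, events)
print([e.event_id for e in evidence])
```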

This linked audit trail is what satisfies MiCA Article 75's requirement for "sufficient detail to allow reconstruction." You are not just providing a number; you are providing the complete evidentiary chain that produced that number.

The compliance concepts documentation covers the specific data model linking events, component scores, composite scores, and alerts in the audit trail.

External Research on AI Risk Quantification

The academic literature on AI system risk quantification is developing rapidly. Several research directions are relevant to practical AI agent monitoring.

Behavioural anomaly detection for AI systems draws on the broader literature on out-of-distribution detection, which addresses the question of how to identify when a model is operating on inputs that are unlike its training data. Research from Hendrycks and Gimpel (2016) on baseline detectors for neural networks, and subsequent work on uncertainty quantification in deep learning, is directly applicable to the novelty scoring component.

The ENISA threat landscape for AI systems, updated annually, provides a threat taxonomy that is useful for structuring the event categories that feed into risk scoring. The 2024 edition specifically includes threats relevant to agentic AI systems.

For financial AI specifically, the EBA guidelines on internal governance include specific requirements for model risk management that are applicable to AI agent risk quantification. Treating the AI agent risk scoring model as a model that itself requires validation and governance is good practice.

From Risk Score to Operational Response

The risk score is the input to an operational response framework. Without a defined response framework, a sophisticated risk scoring system produces alerts that nobody acts on.

The operational response framework for AI agent risk has three tiers:

Tier 1 (0.3–0.75): Increased monitoring. No intervention required, but the on-call team is notified of the elevated score. The agent continues operating. The monitoring frequency increases — risk score updates every 30 seconds instead of every minute. The event log for this agent is flagged for manual review during the next business day.

Tier 2 (0.75–0.85): Pre-revocation alert. The on-call engineer is paged. The engineer has the context from the risk score component breakdown and the recent event log. They make a judgment call: is this a legitimate operational anomaly (clear the alert) or a genuine problem (manually revoke or escalate)? If no action is taken within a defined SLA (typically 15–30 minutes), the system escalates automatically.

Tier 3 (>0.85): Auto-revocation. The agent's certificate is revoked within 60 seconds. The on-call engineer receives a revocation notification with the full context. Their task at this point is incident response — understanding what happened, communicating to stakeholders, and managing the recovery process.
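The tier boundaries can be expressed as a small mapping function. The text leaves the handling of exact boundary values open, so this sketch assumes that a score of exactly 0.75 falls into Tier 2 and that only scores strictly above 0.85 trigger auto-revocation; the `Tier` enum and `response_tier` function are hypothetical names.

```python
from enum import Enum


class Tier(Enum):
    NORMAL = 0
    INCREASED_MONITORING = 1
    PRE_REVOCATION_ALERT = 2
    AUTO_REVOCATION = 3


def response_tier(score: float) -> Tier:
    """Map a composite risk score in [0, 1] to an operational response tier."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("risk score must be between 0 and 1")
    if score > 0.85:  # assumption: strictly above 0.85 revokes
        return Tier.AUTO_REVOCATION
    if score >= 0.75:  # assumption: 0.75 itself pages the on-call engineer
        return Tier.PRE_REVOCATION_ALERT
    if score >= 0.3:
        return Tier.INCREASED_MONITORING
    return Tier.NORMAL


for s in (0.10, 0.50, 0.75, 0.85, 0.90):
    print(f"{s:.2f} -> {response_tier(s).name}")
```

Whichever convention is chosen for the boundary values, it is worth writing down explicitly so that the runbook and the automation agree.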

The enforcement documentation includes the specific runbook for each tier, including the API calls for manual intervention and the escalation paths for each scenario.

For DevOps teams building the operational response capability, the /for-devops page covers the alert routing, on-call integration, and incident response patterns that make this framework operational.

---

Kakunin's risk scoring model uses a 30-day rolling window across five component scores: volume deviation, distribution shift, novelty, velocity, and authentication anomalies. The composite risk score ranges from 0 to 1, with a pre-revocation alert at 0.75 and auto-revocation at 0.85. See the event ingestion documentation or explore how compliance officers use risk monitoring.