
SOVEREIGN COGNITIVE SYSTEM

Cognitive SLA — why 99.9% uptime is not enough when AI supports decisions in your company

Mar 6 · 8 min read

Series: CDF 1.3.2 in practice — 6 articles on the methodology of sovereign AI implementations

This is the first of 6 articles on the CDF 1.3.2 methodology — a framework for implementing sovereign AI for regulated sectors, developed by allclouds.pl. The SAVANT series focuses on compliance, quality measurement, and supervision. The GENESIS series focuses on agent governance, scaling, and operations. CDF 1.3.2 is a proprietary methodology developed by allclouds.pl, based on ISO/IEC 42001:2023 and the EU AI Act.



The AI system has been running continuously for three months. Infrastructure monitoring is green: 99.97% uptime, latency below 200 milliseconds, no failures. And yet, one in every twenty responses in the customer service process is based on information that no source can confirm.


From a traditional SLA perspective, the system is healthy. From the perspective of an organization that makes decisions based on these responses, the system generates legal, financial, and reputational risk. This gap between "working" and "working correctly" is why CDF 1.3.2 introduces the concept of Cognitive SLA.


Where the classic SLA ends

Traditional IT metrics — uptime, latency, throughput — were created in a world where systems were supposed to be available and fast. If the server responds, the service works. This is a reasonable assumption for databases, web applications, or ERP systems.


But cognitive systems work differently. A language model or AI agent can be available, fast, and stable, yet hallucinate, rely on outdated knowledge, miscalibrate the confidence of its responses, or take contradictory actions in a multi-agent environment. None of these problems will appear in a classic infrastructure dashboard because classic metrics simply do not measure the quality of reasoning.


Appendix B of the CDF 1.3.2 methodology puts it bluntly: traditional IT SLA metrics are necessary but insufficient for cognitive systems. And it's hard to disagree with that when you consider that LLM and agent systems are probabilistic by nature — their responses will never be fully deterministic.


Three layers, not one

CDF 1.3.2 does not reject the classic SLA. It expands it with two additional layers, creating a three-level model.

| Layer | What it measures | Who is responsible |
|---|---|---|
| Infrastructure SLA | Uptime ≥99.9%, latency ≤200 ms, throughput | DevOps / Platform Team |
| Cognitive SLA | Reasoning Accuracy, Hallucination Rate, Knowledge Freshness, Confidence Calibration | CogOps / Knowledge Curator |
| Agent SLA | Coordination Effectiveness, Task Success Rate, Recovery Rate, Token Budget Compliance | Agent Lifecycle Engineer |

This distinction has important organizational implications. The infrastructure layer remains the responsibility of the platform team, but responsibility for reasoning quality and agent coordination shifts to new roles: the CogOps team, the knowledge curator, and the agent lifecycle engineer. In other words, Cognitive SLA changes not only what we measure, but also who is responsible for it.
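
To make the division of responsibilities concrete, the three layers can be expressed as a machine-readable structure in which every metric has an explicit owner. The Python sketch below is illustrative only; the class and variable names are ours, not part of CDF 1.3.2.

```python
from dataclasses import dataclass

# Illustrative sketch: the three SLA layers from the table above,
# expressed so that each metric carries an explicit owning role.
@dataclass(frozen=True)
class SlaLayer:
    name: str            # e.g. "Cognitive SLA"
    owner: str           # role accountable for the layer
    metrics: tuple       # metric names tracked in this layer

SLA_MODEL = (
    SlaLayer("Infrastructure SLA", "DevOps / Platform Team",
             ("System Availability", "Latency", "Throughput")),
    SlaLayer("Cognitive SLA", "CogOps / Knowledge Curator",
             ("Reasoning Accuracy Rate", "Hallucination Rate",
              "Knowledge Freshness Index", "Confidence Calibration")),
    SlaLayer("Agent SLA", "Agent Lifecycle Engineer",
             ("Coordination Effectiveness", "Task Success Rate",
              "Recovery Rate", "Token Budget Compliance")),
)

# Usage: look up who answers for a given metric.
owner_by_metric = {m: layer.owner for layer in SLA_MODEL for m in layer.metrics}
print(owner_by_metric["Hallucination Rate"])  # CogOps / Knowledge Curator
```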


We describe in detail how to manage agents whose quality is measured by Agent SLA in the article "Agent Governance — how to manage a swarm of 50 AI agents without losing control."


Five levels of agent autonomy

The scope and rigor of Cognitive SLA depends directly on how much autonomy agents have in the system. CDF 1.3.2 defines five levels of autonomy — from L0 to L4 — which determine the required scope of monitoring, audit frequency, and acceptable escalation thresholds.

| Level | Name | Description |
|---|---|---|
| L0 | Human-only | No AI agent; the human performs the entire task independently. |
| L1 | Human-in-the-loop | The agent makes recommendations, and the human approves each action before it is performed. |
| L2 | Human-on-the-loop | The agent acts independently; the human supervises and can intervene. |
| L3 | Supervised autonomy | The agent operates autonomously with periodic auditing and performance monitoring. |
| L4 | Full autonomy | The agent operates fully autonomously; permitted only for non-critical processes. |
The higher the level of autonomy, the more important Cognitive SLA and Agent SLA metrics become. At level L1, Confidence Calibration is crucial — because humans must assess whether recommendations can be trusted. At levels L2–L3, the importance of Agent Coordination and automatic escalation procedures increases. Level L4 requires full automation of monitoring, as humans do not participate in the ongoing decision-making process.
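
How this translates into day-to-day gating can be sketched in a few lines of code. The example below is a simplified illustration with function names of our own choosing; CDF itself only defines the levels and their semantics.

```python
from enum import IntEnum

# Illustrative sketch of how the L0-L4 autonomy levels could gate agent actions.
class AutonomyLevel(IntEnum):
    HUMAN_ONLY = 0         # L0 - no AI agent
    HUMAN_IN_THE_LOOP = 1  # L1 - human approves every action
    HUMAN_ON_THE_LOOP = 2  # L2 - human supervises and can intervene
    SUPERVISED = 3         # L3 - periodic audit and monitoring
    FULL = 4               # L4 - only for non-critical processes

def requires_human_approval(level: AutonomyLevel) -> bool:
    """At L0-L1 a human must act or approve before anything is executed."""
    return level <= AutonomyLevel.HUMAN_IN_THE_LOOP

def allowed_for_critical_process(level: AutonomyLevel) -> bool:
    """Full autonomy (L4) is permitted only for non-critical processes."""
    return level < AutonomyLevel.FULL

# Usage: an L2 agent may act on a critical process, but without per-action approval.
level = AutonomyLevel.HUMAN_ON_THE_LOOP
print(requires_human_approval(level), allowed_for_critical_process(level))  # False True
```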


Autonomy levels are one of the nine mandatory fields of the Agent Registry — the central registry of agents in CDF. We describe the full agent governance model, including interaction patterns and emergency procedures, in the article "Agent Governance — how to manage a swarm of 50 AI agents without losing control."


Seven metrics that define the quality of an AI system

In Phase 4 of the CDF — i.e., at the time of production deployment — specific Cognitive SLA metrics come into effect. Each of them has a defined goal, measurement method, and assigned level of responsibility.

| Metric | What it measures | Target | Measurement method |
|---|---|---|---|
| System Availability | Base platform availability | ≥99.9% | Infrastructure monitoring (Prometheus, Datadog) |
| Reasoning Accuracy Rate | Percentage of responses consistent with ground truth | ≥95% critical / ≥90% standard | Evaluation against a golden dataset + human review sampling |
| Hallucination Rate | Percentage of responses not confirmed in sources | ≤2% critical / ≤5% standard | Automatic verification against the knowledge base + random audit |
| Mitigation Response Time | Time from cognitive error detection to correction | ≤15 min critical / ≤4 h standard | Timestamping of alerts and incident closures |
| Knowledge Freshness Index | Knowledge base freshness within a defined window | ≥95% within a 7-day window | Comparison of document update dates against the time window |
| Agent Coordination | Percentage of multi-agent tasks completed without escalation | ≥85% | Agent orchestration logging (Immutable Audit Trail) |
| Confidence Calibration | Correlation of declared confidence with actual accuracy | r ≥ 0.85 | Statistical analysis: declared confidence vs. actual accuracy |
Two things are worth noting. First, the metrics distinguish between critical and standard processes, which means that error tolerance varies depending on the business context rather than being uniform across the entire system. Second, the system measures not only the accuracy of responses, but also how quickly the organization can respond to a detected error — Mitigation Response Time ≤15 minutes for critical processes is a very ambitious goal that forces automation of detection and corrective procedures.
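
A simple way to operationalize the critical/standard split is to keep both targets next to each metric and evaluate every measurement against the class of the process it came from. The sketch below is a minimal illustration with hypothetical names; the thresholds are taken from the table above.

```python
# Illustrative sketch: dual targets per metric, looked up by process criticality.
TARGETS = {
    "Reasoning Accuracy Rate": {"critical": (">=", 0.95), "standard": (">=", 0.90)},
    "Hallucination Rate":      {"critical": ("<=", 0.02), "standard": ("<=", 0.05)},
}

def metric_ok(metric: str, value: float, criticality: str) -> bool:
    """Return True if the measured value meets the target for the given process class."""
    op, threshold = TARGETS[metric][criticality]
    return value >= threshold if op == ">=" else value <= threshold

# Usage: a 3% hallucination rate passes for a standard process but fails for a critical one.
print(metric_ok("Hallucination Rate", 0.03, "standard"))  # True
print(metric_ok("Hallucination Rate", 0.03, "critical"))  # False
```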


Confidence Calibration — a metric that few people talk about

Of the seven Cognitive SLA indicators, one deserves a separate discussion because it is rarely seen in market practice: Confidence Calibration.


It answers the question: is a model that claims 90% confidence actually correct in 90% of cases? If not — if the model signals high confidence but is actually wrong much more often — the user loses the ability to meaningfully evaluate the response. They don't know when to trust the recommendation and when to verify it.


CDF measures this using a statistical correlation between declared confidence and actual accuracy, with a target of r ≥ 0.85. This is not an academic metric — it has a direct impact on whether AI supervisors can make accurate decisions about when to trust the system and when to seek additional verification.
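
One way to compute such a check, assuming access to declared confidence scores and correctness labels from human review or a golden dataset, is to bucket responses by declared confidence, compute the observed accuracy per bucket, and correlate the two. The sketch below illustrates that idea; the bucket width and variable names are ours, not the exact statistical procedure prescribed by CDF.

```python
import numpy as np

def calibration_correlation(confidence: np.ndarray, correct: np.ndarray,
                            n_bins: int = 10) -> float:
    """Pearson r between declared confidence and observed accuracy per bucket."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidence, bins) - 1, 0, n_bins - 1)
    mean_conf, mean_acc = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_conf.append(confidence[mask].mean())  # average declared confidence
            mean_acc.append(correct[mask].mean())      # empirical accuracy in the bucket
    return float(np.corrcoef(mean_conf, mean_acc)[0, 1])

# Usage: confidence declared by the model and 0/1 correctness labels from review.
conf = np.array([0.95, 0.90, 0.85, 0.60, 0.55, 0.30, 0.92, 0.40])
correct = np.array([1, 1, 1, 1, 0, 0, 1, 0])
r = calibration_correlation(conf, correct)
print(f"calibration r = {r:.2f}, target r >= 0.85: {'OK' if r >= 0.85 else 'BREACH'}")
```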


Confidence Calibration is directly related to the quality of human oversight. In the article "Human Competence Gate," we describe a mechanism that verifies whether the person approving an AI recommendation actually understands what they are approving — and how model confidence calibration affects that ability.


What happens when the metric drops

Numerical targets alone are not enough. What an organization does when a target is not met is equally important. That is why CDF defines a three-step escalation procedure.


Yellow — the metric falls below the target for 24 hours. An automatic alert is sent to the CogOps team. This is an early warning signal: something is happening that requires observation, possibly a configuration adjustment or a knowledge base refresh.


Orange — the metric remains below the target for 72 hours. Escalation to the architecture management level, mandatory root cause analysis. This is no longer a temporary drop, but a systematic problem that requires understanding the cause.


Red — metric below target for 7 days, or Hallucination Rate exceeds 5% in a critical process. The consequences are serious: Agent Kill-Switch activated, executive level notified, and remediation plan implemented within 48 hours.


This procedure is important for two reasons. First, it turns the general feeling of "something is not working" into a defined protocol with assigned roles, escalation, and response time. Second, it links Cognitive SLA with Agent Governance — the Red level can directly trigger the Kill-Switch, i.e., the physical or software shutdown of an agent or an entire swarm of agents with cascading notification of dependent nodes.
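
The ladder itself is easy to encode. The sketch below is a simplified illustration of the Yellow/Orange/Red logic described above, with hypothetical function names; a real deployment would wire it into alerting and the Kill-Switch rather than return a label.

```python
from datetime import timedelta
from typing import Optional

def escalation_level(time_below_target: timedelta,
                     hallucination_rate: float = 0.0,
                     critical_process: bool = False) -> Optional[str]:
    """Map how long a metric has been below target onto an escalation level."""
    # Red override: Hallucination Rate above 5% in a critical process.
    if critical_process and hallucination_rate > 0.05:
        return "RED"     # Kill-Switch, executive notification, 48 h remediation plan
    if time_below_target >= timedelta(days=7):
        return "RED"
    if time_below_target >= timedelta(hours=72):
        return "ORANGE"  # escalation to architecture level, mandatory root cause analysis
    if time_below_target >= timedelta(hours=24):
        return "YELLOW"  # automatic alert to the CogOps team
    return None          # within tolerance, no escalation

# Usage: a metric that has been below target for three days triggers Orange.
print(escalation_level(timedelta(days=3)))  # ORANGE
```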


CDF defines three types of Kill-Switch: Single Agent, Swarm, and Cognitive Circuit Breaker. Their full description, along with the Agent Governance Model, can be found in the article "Agent Governance — how to manage a swarm of 50 AI agents without losing control."


Cognitive Quality Reports — continuous evidence, not a one-time test

Cognitive SLA metrics are not measured once and then shelved. CDF provides monthly Cognitive Quality Reports that present metric results, reasoning quality trends, cognitive incidents, and optimization recommendations.


This is important from a management and compliance perspective. The organization does not have to rely on the supplier's declaration that "the system works well." It receives a cyclical, auditable report showing specific numbers: how accurate the reasoning was, how many hallucinations were detected, how quickly incidents were responded to, and whether the knowledge base was up to date.
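
Two of those ingredients, the Knowledge Freshness Index and the report record itself, can be sketched in a few lines. The example below uses made-up numbers and field names purely for illustration; it is not the CDF report template.

```python
from datetime import datetime, timedelta

def knowledge_freshness_index(last_updated: list,
                              now: datetime,
                              window: timedelta = timedelta(days=7)) -> float:
    """Fraction of knowledge base documents updated within the freshness window."""
    fresh = sum(1 for ts in last_updated if now - ts <= window)
    return fresh / len(last_updated) if last_updated else 0.0

# Illustrative monthly record; all figures are invented for the example.
monthly_report = {
    "period": "2025-03",
    "reasoning_accuracy_rate": 0.962,        # vs. target >= 0.95 for critical processes
    "hallucination_rate": 0.017,             # vs. target <= 0.02 for critical processes
    "mitigation_response_time_p95_min": 11,  # vs. target <= 15 min for critical processes
    "knowledge_freshness_index": knowledge_freshness_index(
        [datetime(2025, 3, 28), datetime(2025, 3, 30), datetime(2025, 3, 2)],
        now=datetime(2025, 3, 31),
    ),
    "cognitive_incidents": 2,
}
print(monthly_report["knowledge_freshness_index"])  # 2 of 3 documents fresh -> ~0.67
```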


In regulated environments—finance, public administration, energy, defense—such a report is not a luxury, but a necessity. Regulators are increasingly asking not whether AI is implemented, but how the organization measures and manages its quality over time.


Monthly Cognitive Quality Reports are part of the CogOps phase — a continuous maintenance service, which is described in detail in the article "Cognitive Operations — what happens after implementation, when most AI providers have long since left the building."


Why this changes the conversation with your AI provider

Most AI implementation contracts include infrastructure SLAs: uptime, support response time, maintenance windows. This is necessary but insufficient, because the entire quality description focuses on the technical layer.


Cognitive SLAs take the conversation to a higher level. Instead of asking, "Will the system be available?", an organization can ask, "What percentage of responses will be consistent with our knowledge base?", "How quickly will you respond to the detection of hallucinations in a critical process?", and "Who specifically is responsible for the quality of reasoning, and who is responsible for coordinating agents?"


These are questions that many providers do not have good answers to today. Not because they are dishonest, but because the AI implementation market still operates on infrastructure categories and has not developed universally accepted standards of reasoning quality.


A few questions worth asking

Before signing a contract for the implementation or maintenance of an AI system, it is worth checking:

  • Does the supplier measure the quality of reasoning or only the availability of the platform?

  • Are there separate metrics for critical and standard processes?

  • What is the procedure when the Hallucination Rate exceeds the set threshold?

  • Does the organization receive regular cognitive quality reports with specific figures?

  • Who on the supplier's side is responsible for reasoning quality — DevOps, data science, or a dedicated CogOps team?

  • Is there an emergency mechanism to stop the agent if the metrics fall permanently below the target?


If the answers to these questions are unclear or boil down to a general "we monitor the system," then we are most likely talking about a classic infrastructure SLA without a cognitive layer. This may be sufficient for simple support tools. For a system that participates in business processes and supports real decisions, it is usually not enough.


Cognitive SLAs do not replace traditional IT metrics. They complement them with what is most important and most difficult to measure in AI systems: the quality of reasoning, the speed of response to errors, and the ability to maintain that quality over time.


In the next article, we will show how CDF eliminates Pilot Purgatory and defines the path from pilot to production in 90 days — with exit criteria, Production Cost Model, and Scale Path Definition, which ensure that a prototype does not remain a prototype forever.

 
 
 

1 Comment


Marcin Kaźmirak
Apr 07

Cognitive SLA is a compelling approach to a problem that's easy to overlook. There's a natural tendency to assume that if a system is "running", it's running correctly — but with AI those two things can be entirely independent of each other. The fundamental difference from classical IT systems lies in the probabilistic nature of AI models: we never have full visibility into the reasoning an agent applied when making a specific decision. In traditional software, a bug is deterministic and reproducible — you can trace it and fix it. In AI, an error might surface once every twenty responses, in an unpredictable context, and remain completely invisible if you're only measuring uptime. It's good to see methodologies like CDF…
