Engineering · April 2026 · 10 min

What breaks in agentic systems at month seven, and the three structural gates we now require before any deployment.

Agentic AI systems fail in ways that demos do not reveal and evaluations rarely test. After eighteen months of production deployments, three failure modes have become reliably visible — none of them are about model capability, all of them are about architecture. Here is what we have learned to require before we approve a build.

By Hadia Aslam, Principal AI Engineer

The most expensive thing about a misbehaving agentic system is not the token cost. It is the cost of a loop that runs for four hours after the user has abandoned the task, calls a write endpoint thirty-seven times in nine minutes because it misread a rate-limit response, and leaves no structured trace of what it actually did. We have had two incidents like this in production across client deployments in the last eighteen months. Neither was catastrophic. Both were entirely avoidable, and both taught us something about the gap between a system that performs well in evaluation and a system that behaves safely under the conditions real work produces.

This piece is not an argument against agentic architectures. We use them, we build them, and in several client contexts they have delivered returns that no simpler architecture could match. It is an argument that agentic is not a free upgrade from retrieval or from a single-turn pipeline — that the shift to goal-directed, multi-step, tool-calling behaviour introduces a class of failure mode that is nearly invisible in demos and nearly invisible in evals, and that the right response is not more evaluation but different architecture.

§02 The first failure mode: cascade in the tool layer

An agentic system with five tools and a goal that spans twelve steps will, eventually, encounter a tool response it was not designed to handle. Not a hard error — a malformed JSON response is detectable and recoverable. The dangerous responses are the plausible-but-wrong ones: a pagination cursor that has expired and is silently returning results from the start of the set, a status code the model interprets as success but that actually indicates a queued state, an API that changed its response schema in a minor version bump and now returns the right fields with the wrong nesting. The model has been trained on patterns that make charitable interpretation the default. If the architecture has no explicit interruption semantics — points in the task graph where the model is required to surface uncertainty to a human before continuing — it will interpret ambiguity charitably and proceed. In a read-only task, that means confident answers with no grounding. In a write-path task, it means committed state that is partially wrong.
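A minimal sketch of the architectural response, assuming a hypothetical third-party extraction API (the schema and names below are illustrative, not a real interface): validate every tool response against an explicit contract before the model ever sees it, and treat anything that does not match as an interruption rather than as material for charitable interpretation.

```python
from dataclasses import dataclass

class SchemaDriftError(Exception):
    """Raised when a tool response no longer matches the expected envelope."""

@dataclass
class ExtractionResult:
    document_id: str
    clauses: list[dict]       # extracted clause records
    cursor: str | None        # pagination cursor, None when the set is exhausted

EXPECTED_KEYS = {"document_id", "clauses", "cursor"}

def parse_extraction_response(payload: dict) -> ExtractionResult:
    # Fail loudly on missing, extra, or re-nested fields rather than letting
    # the agent loop "interpret" a plausible-but-wrong envelope and proceed.
    if set(payload) != EXPECTED_KEYS:
        raise SchemaDriftError(f"unexpected envelope keys: {sorted(payload)}")
    if not isinstance(payload["clauses"], list):
        raise SchemaDriftError("'clauses' is no longer a list; schema may have changed")
    return ExtractionResult(
        document_id=payload["document_id"],
        clauses=payload["clauses"],
        cursor=payload["cursor"],
    )
```

The exception is the point: a schema change becomes a halt on day one, not a quality degradation discovered on day six.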

We have seen this in three separate production systems. In one case, an agentic pipeline processing legal document amendments had been running correctly for four months when a third-party extraction API changed its response envelope. The pipeline did not detect the schema change — it was not designed to detect schema change — and continued processing for six days before a human noticed the output quality had degraded. Eight hundred and forty-seven documents had been amended before detection. Correcting them took eleven working days.

§03 The second failure mode: goal drift at step forty

Long-horizon agentic tasks require the model to hold a goal stable across many steps and many tool outputs. Frontier models are very good at this for tasks that complete in under four or five minutes of wall-clock time. They are considerably less reliable for tasks that involve a hundred or more context events over a two-hour run. The goal is still in the context window; the model has not forgotten it in any mechanical sense. What happens instead is subtler: the framing of the goal shifts as context accumulates. A task stated as 'review the procurement contracts and flag any unusual indemnity clauses' has, by step sixty, been contextually reshaped by all the preceding tool outputs and intermediate reasoning. The model is still trying to be helpful; its working definition of 'unusual' has been silently recalibrated by everything it has already seen.

We have encountered this in two client deployments. In both cases the final output was logically coherent — the system was not hallucinating, it was not confused, it was producing internally consistent work. It had simply drifted from the original intent in a way that would be invisible to anyone who did not read the original instruction alongside the output. In one case a compliance review missed twelve clauses the client would have flagged because the model's definition of 'unusual' had shifted by clause forty-four. The review passed human QA for three weeks before the pattern was identified.

The model has not forgotten the goal. The framing of the goal shifts as context accumulates — silently, coherently, and in a way that passes human QA.
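One structural way to make that recalibration visible, sketched here as an assumption rather than a prescription (the ReviewedClause type and the fresh_judgement callable are hypothetical stand-ins): periodically re-score a sample of the agent's recent decisions against the verbatim original instruction in a fresh, context-free call and compare the answers.

```python
from dataclasses import dataclass

@dataclass
class ReviewedClause:
    clause_id: str
    text: str
    flagged: bool   # what the long-running agent decided

ORIGINAL_INSTRUCTION = (
    "Review the procurement contracts and flag any unusual indemnity clauses."
)

def audit_for_drift(reviewed: list[ReviewedClause], fresh_judgement, sample_size: int = 5):
    """Compare recent decisions against a context-free judgement.

    fresh_judgement(instruction, clause_text) -> bool stands in for a single
    model call made with no accumulated run history in its context.
    """
    disagreements = []
    for clause in reviewed[-sample_size:]:
        if fresh_judgement(ORIGINAL_INSTRUCTION, clause.text) != clause.flagged:
            disagreements.append(clause.clause_id)
    # A non-empty result suggests the working definition of 'unusual' has drifted.
    return disagreements
```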

§04 The third failure mode: silent partial completion

The loudest failure is the easiest to manage. An exception thrown, a timeout, a hard error surfaced to the user — these produce incidents that are logged, triaged, and resolved. The failure mode that is hardest to catch in a production agentic system is not the loud one. It is the system that completes seventy per cent of a task without error, encounters an obstacle it cannot resolve, and then — lacking explicit termination semantics — either enters a slow degraded loop or exits cleanly with a result that looks complete. The downstream system receives a response. The response looks correct. The missing thirty per cent does not become visible until something that depends on it fails.
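What termination semantics look like structurally, as an illustrative sketch (BatchOutcome and finalise are hypothetical names, not a framework API): make "done" a claim the agent has to substantiate item by item, so a seventy-per-cent run cannot return something that looks complete.

```python
from dataclasses import dataclass, field

@dataclass
class BatchOutcome:
    requested: list[str]                                     # every item the caller asked for
    completed: list[str] = field(default_factory=list)
    skipped: dict[str, str] = field(default_factory=dict)    # item id -> reason it was skipped

    @property
    def is_complete(self) -> bool:
        return set(self.completed) == set(self.requested) and not self.skipped

def finalise(outcome: BatchOutcome) -> BatchOutcome:
    # Termination semantics: an outcome with unaccounted gaps is an error the
    # loop raises now, not a success response for a downstream payment run to
    # discover later.
    unaccounted = set(outcome.requested) - set(outcome.completed) - set(outcome.skipped)
    if unaccounted:
        raise RuntimeError(f"agent exited with unaccounted items: {sorted(unaccounted)}")
    return outcome
```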

In four separate production deployments we have seen this pattern produce incidents requiring manual remediation. Two were in financial operations contexts — one a batch reconciliation pipeline, one an accounts payable agent. In both cases the agentic system had processed all the inputs it could, silently skipped the ones it could not, and returned an output that passed automated validation. The gap became visible downstream when payment runs produced numbers that did not reconcile. Neither incident resulted in financial loss. Both required substantial manual audit to determine which items had been processed and which had not.

§05 Three structural gates we require before approving any deployment

We now have a standard gate we apply before approving any new agentic deployment, internal or client-facing. It is three questions, and 'our evaluation suite covers this' is not an acceptable answer to any of them.

First: does the loop have explicit interruption semantics? By which we mean — are there points in the task graph where the model is architecturally required to surface uncertainty to a human before continuing, and is the model's ability to skip those points constrained by the architecture rather than by a prompt instruction? Prompt-level guardrails drift across model updates and across temperature variations. Tool interfaces do not. If the only thing preventing an agent from proceeding on ambiguous state is a paragraph in the system prompt, the system does not have interruption semantics — it has an instruction that will eventually be overridden by context.
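A minimal sketch of what "constrained by the architecture" can mean in practice; every name here (StepResult, HumanAck, run_step) is illustrative. The loop itself refuses to advance past an ambiguous step until an explicit human acknowledgement arrives, regardless of what the prompt says.

```python
import enum
from dataclasses import dataclass
from typing import Callable

class StepState(enum.Enum):
    OK = "ok"
    AMBIGUOUS = "ambiguous"   # stale cursor, schema drift, unexpected status...

@dataclass
class StepResult:
    state: StepState
    detail: str
    output: dict

@dataclass
class HumanAck:
    approved: bool
    note: str = ""

def run_step(name: str,
             execute: Callable[[], StepResult],
             request_review: Callable[[str, str], HumanAck]) -> StepResult:
    result = execute()
    if result.state is StepState.AMBIGUOUS:
        # The model cannot talk its way past this branch: continuation is a
        # property of the orchestrator, not of the system prompt.
        ack = request_review(name, result.detail)
        if not ack.approved:
            raise RuntimeError(f"step '{name}' halted pending human review: {result.detail}")
    return result
```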

Second: is the task boundary observable from outside the loop? The calling system or the human operator should be able to ask, at any moment, what the agent has completed, what it is currently doing, and what it believes remains. Not as a token-level trace — as a structured state representation in the domain vocabulary of the task. 'Processed 44 of 61 contracts; currently reviewing item 45; 7 items flagged for human review' is an observable state. 'Here is the tool call log' is a log. These are not the same thing, and the difference matters at two in the morning when a batch has been running for ninety minutes.
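The difference is easy to show. A sketch of a structured state record in the domain vocabulary of the contract-review example, kept outside the agent loop so an operator can query it at any moment; the field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewProgress:
    total_contracts: int
    completed_ids: list[str] = field(default_factory=list)
    current_id: str | None = None
    flagged_ids: list[str] = field(default_factory=list)

    def summary(self) -> str:
        # Observable state in the task's own vocabulary, not a token trace.
        return (
            f"Processed {len(self.completed_ids)} of {self.total_contracts} contracts; "
            f"currently reviewing {self.current_id or 'nothing'}; "
            f"{len(self.flagged_ids)} items flagged for human review"
        )
```

The agent loop updates this record as a side effect of each step; the calling system reads summary() without ever parsing the tool-call log.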

Third: what is the rollback state for every write operation the agent can perform? Soft deletes, event sourcing, an append-only staging table that an agent writes to before a human confirms promotion — the specific mechanism matters less than the existence of one. We have declined three agentic deployments in the last six months because the answer to this question was 'it would depend on the operation'. That is not an answer. That is a description of an incident waiting to happen.
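As one concrete shape of the staging-table option, a minimal sketch using SQLite (the table and column names are illustrative): the agent's only write path is an append to staging, and promotion into the live table is a separate, human-confirmed step, so the rollback for any agent write is simply never promoting it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE amendments_staging (
        id INTEGER PRIMARY KEY,
        document_id TEXT NOT NULL,
        amended_text TEXT NOT NULL,
        proposed_by TEXT NOT NULL,          -- which agent run proposed this row
        promoted INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE amendments (
        document_id TEXT NOT NULL,
        amended_text TEXT NOT NULL
    );
""")

def agent_propose(document_id: str, amended_text: str, run_id: str) -> None:
    # The only write the agent can perform: append to staging, never to live.
    conn.execute(
        "INSERT INTO amendments_staging (document_id, amended_text, proposed_by) "
        "VALUES (?, ?, ?)",
        (document_id, amended_text, run_id),
    )

def human_promote(staging_id: int) -> None:
    # Promotion is a deliberate, reviewable act; unpromoted rows are the
    # rollback state and can simply be discarded.
    row = conn.execute(
        "SELECT document_id, amended_text FROM amendments_staging "
        "WHERE id = ? AND promoted = 0",
        (staging_id,),
    ).fetchone()
    if row is None:
        raise ValueError(f"no unpromoted staging row with id {staging_id}")
    conn.execute("INSERT INTO amendments (document_id, amended_text) VALUES (?, ?)", row)
    conn.execute("UPDATE amendments_staging SET promoted = 1 WHERE id = ?", (staging_id,))
```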

§06 Why this is not a prompt problem

All three requirements are architectural, and that is the point. They can be specified, tested, and enforced at the infrastructure level in ways that prompt instructions cannot. The surface plausibility of language model output — the quality that makes demos compelling — also makes the absence of structural guarantees easy to miss. A system that can write a coherent paragraph explaining what it just did is not the same as a system with a structured, queryable state representation. A model that says 'I will pause if I am uncertain' is not the same as a system where pausing on uncertainty is enforced by a tool interface that cannot be bypassed.

The firms that will have reliable agentic systems in production in two years are mostly not the ones producing the best demos today. They are the ones doing the unglamorous work of specifying interruption conditions in their tool schemas, building explicit state machines instead of relying on prompt-level planning, and maintaining rollback capability as a first-class engineering concern. We have learned this from eighteen months of post-incident reviews. We would have preferred to learn it from first principles.

About the author
Hadia Aslam
Principal AI Engineer

Every piece in the Journal is written personally by a senior practitioner, drawing on the engagement that motivated it. No ghostwriters, no content team, no models. If a paragraph here resonates with a problem you are looking at, the author is the person to reply to — direct lines beat anonymous inboxes.

Get in touch with the practice