The journal
Engineering · May 2026 · 9 min

The latency budget is back, and the systems we built while it was gone are showing it.

For two years, response time was the most forgiven performance problem in enterprise software. Models were novel enough that waiting four seconds felt reasonable. In 2026, those systems sit on the critical path of synchronous workflows where four seconds compounds into thirty minutes of daily waiting. Here is what changed, where the architectural debt lives, and what we are rewriting across fourteen production deployments.

By Sher Ghan, Principal AI Engineer

The system had been in production for four months when the operations lead ran a time study. Her team of eleven reviewers was processing insurance claims using the model's output as their primary decision signal — the model surfaced relevant policy clauses, flagged anomalies in the claim history, and produced a one-line recommendation. Recall was 91 per cent. No hallucinations worth logging. User sentiment surveys from the first three months were the strongest we had seen in a production deployment of this kind. The time study found that mean response latency from the model was 4.3 seconds. The team processed ninety items per reviewer per morning shift. That is 387 seconds — roughly six and a half minutes — of accumulated waiting, scattered across two hours of work. By month four, reviewer throughput had fallen 9 per cent from the pilot baseline. Not because the model was wrong. Because the wait had changed how people worked.

We had not measured this. Neither had the client. Latency appeared in the evaluation framework as a column — we tested it, we reported it, and we noted that 4.3 seconds was within the range that pilot users had accepted. The pilot users were senior team members sitting in meeting rooms watching a demonstration. The morning-shift reviewer with a queue of ninety items and a lunch break at one was not in those rooms, and nobody had done the arithmetic.

§02 Why latency was the most forgiven problem in 2024

The honest account of why response time was not a priority in most 2024 enterprise AI deployments is this: the alternative to the model was not a faster system. It was the same human doing the task without assistance. Against that baseline, four seconds was nothing — the model was replacing work that took minutes per item, and the capability was remarkable enough that the wait felt like a fair price. The comparison class made the latency invisible.

That comparison class has shifted. Once a model is embedded in a workflow and trusted enough to be on the critical path, the user no longer compares it to the manual work it replaced. They compare it to every other tool they use in the same workflow — email clients, CRMs, search boxes, document editors. Those systems respond in under a second as a matter of course. The model now looks slow not against the task it replaced but against the environment it lives in. That shift happens quietly, sometime between the pilot and month four, and it is visible only in retrospect.

There was also an architectural reason. Most 2024 pilots were built asynchronous by default, or structured as supplementary tools the user could consult and disregard. You requested an analysis; the result arrived when it arrived; you read it when you were ready. The latency of an asynchronous call is barely a latency at all. The systems that moved from pilot to production in 2025 moved from supplementary to synchronous: the model's output became a gate, not a suggestion. The user cannot proceed until the model responds. In that context, 4.3 seconds is 4.3 seconds per item, compounded across every item in the queue before lunch.

§03 The arithmetic nobody ran

The number that matters is not the mean response time. It is the daily latency tax — the total time a user spends waiting for model responses across their workday. It is embarrassingly simple arithmetic that we have never, in six years of enterprise AI delivery, seen included in a pilot evaluation.

Ninety items at 4.3 seconds per item is 387 seconds — six minutes and twenty-seven seconds of waiting in a two-hour session. At forty tasks per day with twelve model calls per task, a 3.8-second mean is 30.4 minutes of daily waiting. Users do not consciously notice 3.8 seconds. They notice the pace of their day feeling slightly wrong. They notice the queue never quite clearing. They notice, by month three, that they are tired in a way they were not before. That is the daily latency tax, and it does not appear in any evaluation dashboard we have ever seen presented at a handover meeting.
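The arithmetic above fits in a few lines. This is a minimal sketch rather than a prescribed tool: the function name is ours, and it assumes every model call blocks the user for the full mean latency.

```python
def daily_latency_tax_seconds(calls_per_day: int, mean_latency_s: float) -> float:
    """Total seconds a user spends waiting on model responses in one day.

    Assumes each call is synchronous and the user waits for all of it.
    """
    return calls_per_day * mean_latency_s


# Claims review: 90 items per shift at 4.3 s per item.
claims_wait_s = daily_latency_tax_seconds(90, 4.3)

# Heavier workflow: 40 tasks per day, 12 model calls per task, 3.8 s mean.
heavy_wait_s = daily_latency_tax_seconds(40 * 12, 3.8)

print(f"claims review:  {claims_wait_s:.0f} s ({claims_wait_s / 60:.2f} min)")
print(f"heavy workflow: {heavy_wait_s:.0f} s ({heavy_wait_s / 60:.1f} min)")
```

The two lines printed are 387 s (6.45 min) and 1824 s (30.4 min), matching the figures in the text. The useful part is less the function than the habit of multiplying by call volume before a pilot is signed off.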

P95 latency is the other number that mattered and that we were not reporting. A system with a mean of 3.2 seconds and a p95 of 11.6 seconds is not a 3.2-second system. The items at the tail — the complex claims, the unusual documents, the edge cases that require more retrieval context — are the ones reviewers remember. A system that usually responds in three seconds and occasionally takes twelve will be perceived as slow by every user, because the slow moments are the moments they looked at a clock.
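To see how a mean hides the tail, here is a small sketch using the nearest-rank definition of a percentile. The latency values are synthetic and invented purely for illustration; they are not measurements from any deployment described here.

```python
import math
from statistics import mean


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in seconds (illustrative only):
# 18 routine responses and 2 slow tail responses.
latencies = [2.5] * 18 + [11.0, 12.0]

print(f"mean = {mean(latencies):.1f} s")
print(f"p95  = {percentile_nearest_rank(latencies, 95):.1f} s")
```

On this made-up sample the mean is 3.4 seconds while the p95 is 11.0 seconds. A dashboard showing only the first number would describe a system that none of its users would recognise.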


§04 Three decisions where the latency debt accumulated

Across the fourteen production systems we currently manage — eight we built, six we took over from other teams — three architectural decisions appear repeatedly in systems built during the don't-care phase. They were rational choices at the time. They are liabilities now.

The first is the absence of model routing. Every query, regardless of complexity, went to the frontier model, meaning whatever the headline model was when the system was built. This was defensible: the frontier model was the one that passed the evaluation, and routing adds development effort that was not in scope. In the claims system from the opening of this piece, we profiled three months of production queries after taking on management of the deployment. Sixty-four per cent were structural extraction tasks, such as "does this section contain a maximum liability cap" or "list the exclusions in clause seven". We ran a representative sample through a fine-tuned 8B-parameter model. Accuracy fell by 1.1 percentage points against the client's labelled set. Latency on that sample fell by 78 per cent. We shipped the routing layer in September 2025 and mean response time dropped from 4.3 seconds to 2.1.
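The shape of a routing layer can be sketched briefly. Everything named here is hypothetical: the `Router` class, the two model callables and the keyword-prefix classifier are stand-ins, and a real deployment would derive its classifier from profiled production queries rather than from string prefixes.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model handle: takes a prompt, returns a response string.
ModelFn = Callable[[str], str]


@dataclass
class Router:
    small_model: ModelFn                  # e.g. a fine-tuned extraction model
    frontier_model: ModelFn               # the larger general-purpose model
    is_extraction: Callable[[str], bool]  # decides which tier a query belongs to

    def tier(self, query: str) -> str:
        return "small" if self.is_extraction(query) else "frontier"

    def answer(self, query: str) -> str:
        model = self.small_model if self.tier(query) == "small" else self.frontier_model
        return model(query)


# Placeholder classifier (assumption): prefixes stand in for a trained
# classifier or rules derived from query profiling.
EXTRACTION_PREFIXES = ("does this section contain", "list the")


def looks_like_extraction(query: str) -> bool:
    return query.strip().lower().startswith(EXTRACTION_PREFIXES)


router = Router(
    small_model=lambda q: f"[small] {q}",        # stub in place of a real call
    frontier_model=lambda q: f"[frontier] {q}",  # stub in place of a real call
    is_extraction=looks_like_extraction,
)

print(router.tier("List the exclusions in clause seven"))
print(router.tier("Is this claim consistent with the claimant's history?"))
```

Keeping the classifier as an injected function means it can be swapped or evaluated against a labelled set without touching the dispatch logic, and the tier label is available for latency reporting.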

The second is the absence of caching at the retrieval layer. The embedding model ran on every query. The vector store executed a fresh similarity search on every query. In systems with a large, slow-moving corpus — regulatory guidance, internal policy libraries, product documentation — a significant fraction of queries are semantically near-identical to queries already answered. In a compliance assistant we took on in late 2024, semantic caching at the retrieval layer reduced mean latency from 4.9 seconds to 3.1 in one change — a 37 per cent reduction. The cache hit rate in month one was 29 per cent. By month eight, with the corpus stable, it was 41 per cent.
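A semantic cache at the retrieval layer can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the deployment above: the class and function names are ours, the 0.95 similarity threshold is a placeholder to be tuned on real traffic, the linear scan would be replaced by an index at scale, and there is no invalidation for when the corpus changes. Note that this version still embeds each query and saves only the vector-store search; skipping the embedding call as well would need an exact-match layer in front.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticRetrievalCache:
    """Reuses retrieval results for queries whose embeddings are near-identical."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold  # assumed value; tune against real traffic
        self._entries: list[tuple[Vector, list[str]]] = []
        self.hits = 0
        self.misses = 0

    def lookup(self, query_vec: Vector) -> Optional[list[str]]:
        best_docs, best_score = None, -1.0
        for vec, docs in self._entries:  # linear scan: fine for a sketch only
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_docs, best_score = docs, score
        if best_docs is not None and best_score >= self.threshold:
            self.hits += 1
            return best_docs
        self.misses += 1
        return None

    def store(self, query_vec: Vector, documents: list[str]) -> None:
        self._entries.append((query_vec, documents))


def retrieve(query_vec: Vector, cache: SemanticRetrievalCache,
             search: Callable[[Vector], list[str]]) -> list[str]:
    """Serve from the cache when possible, otherwise search and remember."""
    cached = cache.lookup(query_vec)
    if cached is not None:
        return cached
    docs = search(query_vec)
    cache.store(query_vec, docs)
    return docs


search_calls = []


def fake_search(vec: Vector) -> list[str]:  # stand-in for a vector store query
    search_calls.append(vec)
    return ["clause 7: exclusions"]


cache = SemanticRetrievalCache()
retrieve([1.0, 0.0], cache, fake_search)    # miss: runs the search
retrieve([0.99, 0.05], cache, fake_search)  # near-identical: served from cache
print(f"hits={cache.hits} misses={cache.misses} searches={len(search_calls)}")
```

The hit and miss counters are there on purpose: a cache hit rate is only reportable if the cache counts its own outcomes.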

The third is deferred streaming. In most pilots, streaming complicates evaluation — it is harder to score a response as it arrives than to score it once complete. The practical workaround was to disable streaming during the evaluation period and add a backlog note: streaming before scale. We have inherited five systems where that note was still in the backlog at month eight or later. Streaming does not change time to complete response. It changes perceived latency by making the experience of waiting disappear — the user starts reading at the first token, which arrives in a well-architected system in under 600 milliseconds. In user testing across two deployments, an 800-millisecond improvement in time to first token, with no change in total completion time, produced a 23-point improvement in the user satisfaction score for system speed. That improvement cost four days of engineering. The backlog note had been sitting for five months.

§05 What the instrumentation now requires

We now specify latency instrumentation as a delivery requirement, agreed at design time and shipped as part of the deployment package — not added when the operations lead runs a time study.

The mandatory metrics are p95 and p99 latency alongside mean, broken out by query complexity tier where routing is in place. Distributions, not single numbers. The p95 defines the experience for the difficult cases. The p99 is what a reviewer tells a manager about when the system comes up in conversation.
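A per-tier breakdown of this kind needs very little code. The sketch below uses the interpolated percentiles from Python's standard library; the tier labels and the record format of (tier, latency in seconds) are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, quantiles


def latency_summary(records):
    """records: iterable of (tier, latency_seconds) pairs.

    Returns, per tier, the sample count, mean, p95 and p99.
    """
    by_tier = defaultdict(list)
    for tier, latency in records:
        by_tier[tier].append(latency)

    summary = {}
    for tier, values in by_tier.items():
        # quantiles(n=100) returns the 99 cut points p1..p99 and needs at
        # least two samples per tier.
        cuts = quantiles(values, n=100, method="inclusive")
        summary[tier] = {
            "count": len(values),
            "mean": mean(values),
            "p95": cuts[94],
            "p99": cuts[98],
        }
    return summary


# Synthetic records for illustration: latencies 0.0 s to 10.0 s in 0.1 s steps.
records = [("extraction", i / 10) for i in range(101)]
for tier, stats in latency_summary(records).items():
    print(tier, stats)
```

Percentile definitions differ slightly between libraries and monitoring tools, so whichever method is chosen should be written into the delivery requirement and kept fixed across reports.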

Time to first token is tracked separately from total completion time, and both are reported in the monitoring dashboard. In streaming architectures these are different measurements with different causes. High time to first token usually means the retrieval layer is slow or the model call setup has unnecessary serial steps. A large delta between time to first token and total completion time usually points to generation throughput, which may be limited by the model, the inference provider's current load, or network transfer. Separating these numbers is what makes the fix tractable.
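Capturing both numbers can be as simple as wrapping the token stream. This is a minimal sketch with hypothetical names: `TimedStream` is ours, the clock starts when the wrapper is constructed (so it should be created at the moment the request is sent), and `fake_stream` stands in for a real streaming model response.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


class TimedStream:
    """Wraps a token stream and records time to first token and total time."""

    def __init__(self, tokens: Iterable[str],
                 clock: Callable[[], float] = time.perf_counter) -> None:
        self._tokens = tokens
        self._clock = clock
        self._start = clock()  # construct this when the request is sent
        self.ttft_s: Optional[float] = None
        self.total_s: Optional[float] = None

    def __iter__(self) -> Iterator[str]:
        for token in self._tokens:
            if self.ttft_s is None:
                self.ttft_s = self._clock() - self._start
            yield token
        # Set only once the stream has been fully consumed.
        self.total_s = self._clock() - self._start


def fake_stream() -> Iterator[str]:
    time.sleep(0.05)  # stands in for retrieval and model call setup
    yield "Recommend"
    time.sleep(0.02)  # stands in for generation
    yield " approval"


stream = TimedStream(fake_stream())
text = "".join(stream)
generation_s = stream.total_s - stream.ttft_s
print(f"ttft={stream.ttft_s:.3f}s total={stream.total_s:.3f}s generation={generation_s:.3f}s")
```

The difference between the two readings is the generation phase, which is the delta described above. Injecting the clock keeps the wrapper testable without real delays.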

Client-perceived latency is measured from the browser, not just through server-side instrumentation. We have found gaps of 400 to 900 milliseconds between server-measured and client-perceived latency in three separate production systems, caused by application-layer delays: a spinner that starts 200 milliseconds late, a React state update blocking on an unnecessary dependency, a response sitting in a queue on the application server before being pushed to the client. These are not model problems. They do not appear in server metrics. They appear when a user says the system feels slow and the server logs say it is not.

§06 What this is not

This is not an argument for on-device models, for smaller models as a general policy, or for trading accuracy for speed. Model routing is not a performance compromise — it is the application of the right model to the right task, and the right model for a structural extraction task is not the largest available. The task determines the model. The model determines the latency. When the task requires a frontier model, use one.

Nor is it an argument that the frontier models are the problem. They are slower than smaller alternatives for a reason — the additional capacity is real, and the tasks that genuinely need it are real. The problem is not using frontier models. The problem is using them for tasks that do not need them, not measuring the cost of doing so, and then discovering the accumulated cost when the operations lead runs a time study.

The systems built in 2024 without latency discipline are not broken. They are exactly as fast as they were designed to be, which was as fast as they needed to be to pass a pilot review. Retrofitting is feasible — we have done it thirteen times in the last year — but it is not free. The routing layer, the caching layer, the streaming interface, the instrumentation: each is roughly a week of engineering added against a system with production traffic. The same work, specified at design time, is two to three days. We know this with some precision now because we have done it both ways.
