There is a particular kind of call we have learned to dread. The client says the system feels like it is underperforming — outputs are not quite right, someone in the team has lost confidence, there is talk of returning to manual review for the difficult cases. You pull up the evaluation dashboard. The headline metric looks normal. You ask when the evaluation was last run; the answer is automatic, weekly. You ask against what data; the answer is the same labelled set from launch. That is the moment the problem comes into focus.
After eleven versions of that conversation across our production AI engagements — some in systems we built, several in systems we took over from other vendors for ongoing management — the underlying pattern has become familiar enough to write down. The evaluation a team runs at launch measures the gap between the system and the world at the moment of launch. By month twelve, both the system and the world have shifted. The measurement no longer captures the gap between them. The number looks healthy because the evaluation is no longer examining what it was designed to examine. That is not a model problem. It is a maintenance problem, and most organisations working with production AI have not yet developed the discipline to solve it.
§02 Why the launch evaluation becomes the wrong measurement
There are two distinct routes by which a launch benchmark degrades, and it is worth distinguishing between them because they have different remedies.
The first is the more legible: the world changes and the evaluation corpus does not. Source data shifts — regulatory guidance updates, document formats evolve, the organisation's own vocabulary drifts — and the labelled examples from eighteen months ago no longer represent the inputs the system actually processes. Recall falls. Precision erodes. The benchmark, still running automatically against the original set, still reports a number that looks like health. We encountered this directly in late 2024, when we were engaged to manage a post-launch optimisation programme for a document review pipeline in pharma — not one we built — that had been in production for eight months. The original evaluation suite comprised 240 labelled documents drawn from the firm's 2022 regulatory submissions. The pipeline's recall against that set was 94.3 per cent. When we reconstructed a fresh evaluation using 200 documents sampled from the preceding nine months of 2024 submissions, recall was 71.8 per cent. Nobody had sampled from the recent corpus. Nobody had considered it necessary. The system had been missing more than one in four relevant passages for at least four months before we identified it.
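The remedy is not a new metric; it is re-running the old one against data the system sees now. A minimal sketch of that check, assuming a simple representation in which human labels and pipeline outputs are sets of relevant passage identifiers per document (the names and toy data here are illustrative, not the client's schema):

```python
def recall(labels: dict, outputs: dict) -> float:
    """Fraction of human-labelled relevant passages that the pipeline actually flagged.

    labels:  doc_id -> set of passage ids a reviewer marked as relevant
    outputs: doc_id -> set of passage ids the pipeline flagged
    """
    found = sum(len(labels[d] & outputs.get(d, set())) for d in labels)
    total = sum(len(labels[d]) for d in labels)
    return found / total if total else 0.0

# Toy stand-ins: the launch corpus labelled at design time, and a fresh sample
# labelled from recent submissions. In practice each holds hundreds of documents.
launch_labels = {"doc_a": {"p1", "p2"}, "doc_b": {"p3"}}
fresh_labels = {"doc_c": {"p4", "p5"}, "doc_d": {"p6", "p7"}}

launch_outputs = {"doc_a": {"p1", "p2"}, "doc_b": {"p3"}}  # looks healthy
fresh_outputs = {"doc_c": {"p4"}, "doc_d": {"p6"}}          # quietly missing passages

print(f"recall on launch corpus: {recall(launch_labels, launch_outputs):.1%}")
print(f"recall on fresh sample:  {recall(fresh_labels, fresh_outputs):.1%}")
```

The arithmetic is trivial; the discipline that was missing is the fresh sample itself: labels drawn from the documents the system processes now rather than the ones it was validated on.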
The second mechanism is subtler, and in our experience considerably more common. When a model pipeline is good enough that practitioners genuinely rely on it, it changes what they do. What they do produces the data. The data is what future evaluations are eventually run against. Over time, the human work product downstream of the system begins to reflect the system's prior outputs — not through negligence, but through trust. The loop closes quietly, and nothing in the dashboard reports that it has.
§03 Evaluation closure: when the system reshapes its own ground truth
We encountered the second mechanism clearly in a contract analysis deployment for a financial services client, built and launched in February 2025. The launch evaluation was constructed against 300 contracts, annotated by the client's legal team during the design phase — before the system existed, when the annotations were genuinely independent. By October of that year, the same legal team had largely stopped producing first-pass annotations for ambiguous clauses. The system flagged them; the lawyers reviewed and, more often than not, confirmed. By January 2026, when we ran a scheduled review of the evaluation programme, the corpus was still the 300 contracts from February. Any attempt to refresh it would have required those same lawyers to annotate independently — but their annotation instincts had been shaped by nine months of reviewing the system's flags first. We were, in effect, evaluating the system against its own prior outputs. It scored well. That is precisely the problem.
We call this evaluation closure — the condition in which the evaluation has, through ordinary use of the system, lost its independence from the system itself. It does not require a bad system. It requires a system that is used. The threshold for it to occur is not failure; it is competence.
“Evaluation closure does not require a bad system. It requires a system that is used. The threshold for it to occur is not failure — it is competence.”
§04 The sectors where this accumulates fastest
Legal document analysis, regulatory compliance, and clinical decision support are the three domains where we have seen evaluation closure develop most quickly. What they share is a structural feature: the cost of being wrong is high enough that practitioners develop thorough review habits early in the system's life. They validate closely. They defer to the system where it has proven reliable. They annotate disagreements when they arise and simply confirm the system's output when they do not. That carefulness — the quality that makes the deployment feel like a success — is exactly what accelerates the problem. The more thoroughly a practitioner reviews and confirms the system's outputs, the more those outputs propagate into the ground truth that any future evaluation would draw on.
The sectors with the slowest evaluation closure are, perhaps counterintuitively, the high-volume, low-stakes ones: logistics classification, basic document routing, data enrichment at scale. In those settings, the human review loop is thin enough that ground truth stays relatively independent of what the system produces. The lesson is uncomfortable. The harder you have worked to build trust in a high-stakes deployment — the more thoroughly practitioners have integrated the system into their daily work — the more urgently you need an evaluation programme maintained independently of the workflow it supports.
§05 The organisational failure that enables it
Every engagement where we have traced this problem to its root has revealed the same structural absence: nobody owns the evaluation after launch. There is usually a team that built and validated the launch evaluation. That team completed a project. The evaluation is a deliverable — it appears in the sign-off documentation, it validates the system's readiness, it is approved and filed. After launch, it runs on a schedule. Nobody has a brief to ask whether the sample is still representative, whether the labels from launch reflect how practitioners currently work, or whether the world the evaluation was built on still resembles the world the system now operates in. The system continues to run; the metric continues to report; the project is considered delivered.
This is not a criticism of any particular team. It is a description of a gap in the ownership model that nearly every organisation working with production AI has not yet filled. The software engineering discipline has change management, dependency tracking, and security patching as named maintenance activities with named owners and defined cadences. Enterprise AI evaluation does not yet have an equivalent. The teams responsible for it move on when the project closes, because that is how project delivery is structured. The gap persists until something downstream fails, which is usually the first time anyone goes looking for who was supposed to be watching.
§06 Two requirements before any system leaves our hands now
We have settled on two requirements that go into every production deployment we take responsibility for, written into the delivery agreement before a line of code is in production.
The first is a named owner for the evaluation — not the system, the evaluation — with an explicit brief to review the sampling assumptions every ninety days and flag any evidence of population shift or annotation drift. This is not a data science role. It is a quality role, and the person doing it needs to be close enough to the domain to notice when the inputs the system receives at month fourteen look materially different from what it was designed for. In most enterprise contexts that person already exists. It is whoever owns quality for the underlying workflow. They have simply never previously been asked to own the evaluation, because the evaluation has never previously been treated as an ongoing responsibility.
The second requirement is a population drift monitor, specified and built before launch. Once per month, we compute a distributional comparison between the data the system processed in its first eight weeks and the data it processed in the most recent eight. We are not testing model performance against a labelled set. We are testing whether the world the system operates in has moved far enough from the world it was built for to make the launch evaluation stale. If the divergence exceeds the threshold agreed at design time, the evaluation is flagged for immediate review. The engineering cost of adding this to a deployment is roughly three days. We have included it in every system we manage since the 2024 pharma incident, and it has triggered a review twice — both times before the client noticed anything wrong.
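The statistic itself is not the point; any standard distributional distance works, provided it runs on unlabelled production data and has an agreed threshold attached. As a minimal sketch, assuming the inputs can be summarised by a numeric feature such as document length (the real monitors compare several features, and the threshold is whatever was fixed at design time), a population stability index between the two windows looks like this:

```python
import numpy as np

def population_stability_index(baseline, recent, bins=10):
    """PSI between a baseline window and a recent window of one numeric feature.

    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate.
    """
    edges = np.histogram_bin_edges(np.concatenate([baseline, recent]), bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    recent_pct = np.histogram(recent, bins=edges)[0] / len(recent)
    # Clip empty bins so the log term stays finite.
    base_pct = np.clip(base_pct, 1e-6, None)
    recent_pct = np.clip(recent_pct, 1e-6, None)
    return float(np.sum((recent_pct - base_pct) * np.log(recent_pct / base_pct)))

# Illustrative data only: tokens per document in the first eight weeks of
# production versus the most recent eight.
rng = np.random.default_rng(0)
first_window = rng.normal(loc=1200, scale=200, size=5000)
recent_window = rng.normal(loc=1500, scale=350, size=5000)  # the world has moved

DRIFT_THRESHOLD = 0.25  # stand-in for the value agreed at design time
psi = population_stability_index(first_window, recent_window)
if psi > DRIFT_THRESHOLD:
    print(f"PSI {psi:.2f} exceeds {DRIFT_THRESHOLD}: flag the evaluation for review")
```

Whatever form the comparison takes, it runs against unlabelled production data, which is what lets it keep working after the launch labels have gone stale.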
Neither requirement prevents every form of evaluation closure. There are modes of drift — particularly in high-trust settings where practitioners have shifted from independent annotation to ratifying the system's prior judgements — that no sampling programme fully catches. But between them, the named owner and the drift monitor address the two most common causes of silent evaluation rot: the team that built the evaluation moved on, and nobody noticed when the world the system serves began to change. Addressing both is not technically demanding. It is organisationally demanding — which is exactly why it does not happen unless it is written down before launch, with a name attached to it.
