The context window is not a retrieval architecture.

The pitch arrives in a recognisable shape. The client has a large internal corpus — policy libraries, contract archives, product documentation, regulatory filings — and the engineer presenting the architecture says: we do not need a retrieval pipeline. The new models handle a million tokens. We will load the corpus into context. The demo is compelling, because the demo always is: the document loads, the model reads it, the question gets answered. In the meeting room this looks like an architectural decision made. It is a deferral.

We have now seen this approach attempted — in varying degrees of seriousness, from proof-of-concept to fully shipped — in fourteen of the thirty-one enterprise AI engagements we have run or inherited since early 2025. In three, the team had already shipped to production by the time we were engaged to manage or optimise the system. What we found across those three deployments was consistent enough to write down: costs running at twelve to fifteen times what a retrieval architecture would have cost; latency floors that had invisibly suppressed usage; and a quality failure that had been passing QA for months because nobody had thought to test for it.

#02 The reasoning is not stupid

The instinct to reach for large context has become more defensible with each model generation, and dismissing it without engaging with why would be a different kind of carelessness. Context windows expanded fast and in ways that matter: GPT-4.1 and Gemini 2.5 Pro crossed a million tokens in early 2025; the Anthropic models extended to 200,000 tokens before that. Alongside the raw expansion, the model families improved materially at what had been the decisive counterargument: the 'lost in the middle' failure, where models trained on long contexts reliably underweighted information appearing far from both ends of the prompt. That failure is no longer as clean an argument as it was in 2023. The models genuinely improved at long-range attention, and it is reasonable for an engineer to notice this and revise their priors accordingly.

What the demo does not test is the production regime: a corpus that grows as the organisation produces documents; query volumes that compound cost and latency at scale; tail-case queries requiring synthesis across many passages at varied positions; and the debugging task that begins when a user reports a confidently wrong answer and nobody can point to which part of the million-token context the model was attending to when it produced it. Each of those is a production problem. None of them appears in a forty-five-minute demonstration against a curated set of examples.

#03 The cost arithmetic, and the behaviour it suppresses

The first problem to become legible after launch is money. A million-token frontier-model call costs between £0.08 and £0.25 per query, depending on provider and input-to-output ratio. A retrieval-augmented call against the same model, passing 8,000 to 12,000 relevant tokens, costs between £0.002 and £0.008. Taken at their midpoints, those ranges imply a ratio of roughly thirty to one; in the deployments we have measured, the realised ratio has run at twelve to fifteen to one.

In an internal analytics assistant we took over management of in the first quarter of this year, the team had built a 340,000-token context from the company's consolidated knowledge base — internal policy documents, historical analysis, product specifications, and two years of archived reports. Mean call cost was £0.19. Volume was approximately 4,400 calls per day across a team of 190 analysts. Monthly API spend was £24,700. We rebuilt on a pgvector retrieval index. Mean context per call fell to 9,800 tokens. Monthly API spend became £1,890. The thirteen-to-one cost reduction paid for the retrieval engineering in twelve working days.
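The arithmetic behind those figures can be checked in a few lines. This is a minimal sketch using only the numbers quoted above; the function names and the 30-day month are assumptions made for illustration, which is why the estimate lands a little above the reported £24,700 (the call volume is itself approximate).

```python
# Figures quoted in the text. The 30-day month is an assumption.
CALLS_PER_DAY = 4_400
MEAN_CALL_COST_GBP = 0.19
DAYS_PER_MONTH = 30

LONG_CONTEXT_MONTHLY_GBP = 24_700  # reported monthly spend before the rebuild
RETRIEVAL_MONTHLY_GBP = 1_890      # reported monthly spend after the rebuild


def estimated_monthly_spend(calls_per_day: float, cost_per_call: float, days: int) -> float:
    """Estimate monthly API spend from volume and mean per-call cost."""
    return calls_per_day * cost_per_call * days


def cost_ratio(before: float, after: float) -> float:
    """How many times more expensive the first figure is than the second."""
    return before / after


estimate = estimated_monthly_spend(CALLS_PER_DAY, MEAN_CALL_COST_GBP, DAYS_PER_MONTH)
ratio = cost_ratio(LONG_CONTEXT_MONTHLY_GBP, RETRIEVAL_MONTHLY_GBP)
saving = LONG_CONTEXT_MONTHLY_GBP - RETRIEVAL_MONTHLY_GBP

print(f"Estimated long-context spend: £{estimate:,.0f} per month")
print(f"Reported cost ratio: {ratio:.1f} to 1")
print(f"Reported monthly saving: £{saving:,}")
```

The ratio is computed from the two reported monthly bills rather than from per-call prices, so it reflects whatever call volume each system actually carried.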

What the client had not measured, and what we measured after the rebuild, was the secondary effect of that cost and slowness on usage behaviour. The long-context system was slow enough and carried enough implied friction that the team had unconsciously throttled their engagement with it. Analysts who might naturally have made seven or eight model calls to work through a problem were making three. After the retrieval rebuild, mean calls per analyst session climbed from 3.1 to 6.8. The change in usage patterns had downstream effects on output quality that did not appear in any cost analysis but were visible in the work the analysts produced.

#04 Latency at the tail

Cost and latency compound separately. A 300,000-token prefill takes time regardless of how the provider bills it. Typical first-token latency for a 300,000 to 400,000-token context against a frontier model runs between 9 and 24 seconds under normal load — at peak we have measured 31 seconds on a congested inference endpoint. Teams manage this with streaming, and streaming genuinely helps for open-ended question-and-answer interactions where the user reads as the model writes. It does not help for tasks requiring a complete result before the user can proceed: 'enumerate every clause in this contract that places an obligation on the counterparty.' The user is waiting for the enumeration. Streaming it does not change when it is available.

The subtler latency problem is that long-context architectures embed a constant floor latency that compounds invisibly across a working day. An analyst making 6.8 calls per session at a mean first-token latency of 14 seconds accumulates 95 seconds of waiting before reading a word of output. At the retrieval system's mean first-token latency of 1.4 seconds, the same 6.8 calls accumulate 9.5 seconds. Neither number sounds significant in isolation. Over twelve weeks, one produces a tool that feels responsive and the other produces a tool that feels like effort — and the difference in how it feels is the difference in how much it gets used.
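The accumulation is simple multiplication, but it is worth writing down because it is the number nobody puts on a dashboard. A minimal sketch, using the session figures quoted above; the function name is illustrative.

```python
def waiting_seconds(calls_per_session: float, first_token_latency_s: float) -> float:
    """Total time spent waiting for a first token across one working session."""
    return calls_per_session * first_token_latency_s


CALLS_PER_SESSION = 6.8  # mean calls per analyst session after the rebuild

long_context_wait = waiting_seconds(CALLS_PER_SESSION, 14.0)  # mean long-context latency
retrieval_wait = waiting_seconds(CALLS_PER_SESSION, 1.4)      # mean retrieval latency

print(f"Long context: {long_context_wait:.1f} s of waiting per session")
print(f"Retrieval:    {retrieval_wait:.1f} s of waiting per session")
```

The sketch counts first-token latency only. For tasks that need the complete result before the user can proceed, the wait is longer than this in both architectures.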

#05 The quality failure that passes QA

The cost and latency problems surface quickly enough that a reasonably attentive operations function will find them. The quality failure is subtler, and it is the one that will have been running for months before anyone identifies it — because it is the failure mode that passes standard evaluation.

Frontier model attention is not uniform across a million-token context. The models improved substantially at the pathological middle-distance problem, but improvement is not uniformity, and in a large document corpus you rarely have one relevant passage. You have several, at different depths, sometimes contradicting each other. The model must reconcile them. When it does so correctly, the result looks like good synthesis. When it does so incorrectly, the result looks exactly like good synthesis.

In a compliance review assistant we were engaged to audit in late 2025 — not a system we built — the long-context architecture had been passing QA for eight months against an evaluation suite drawn from the same corpus the system served. We ran an adversarial evaluation: we identified twelve facts in the corpus for which contradictory information existed at materially different positions in the context window, then queried the system on each. The model returned the correct fact in 68 per cent of cases, returned the contradicting fact without flagging the contradiction in 24 per cent, and produced a synthesis incorporating elements of both in 8 per cent. The QA suite had not tested for contradiction handling, and the corpus had not been audited for internal contradiction. Neither is unusual. Neither would have been a design gap in a retrieval-augmented architecture, where the retrieval layer assembles the passages the model needs and surfaces contradictions by the fact of returning them alongside each other. We rebuilt the retrieval layer with pgvector and a semantic re-ranker. The same adversarial set, evaluated against an 8,200-token retrieved context, returned correct contradiction flagging in 91 per cent of cases.
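An evaluation of this kind can be tallied with very little machinery. The following is a hedged sketch, not the harness used in the audit: `ContradictionProbe`, `classify`, and `tally` are hypothetical names, and the substring matching is a deliberately naive stand-in for whatever grading a real evaluation would use. Answers that mention both values are routed to a separate bucket because a string match cannot tell a flagged contradiction from a blended synthesis; those need a human reader.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ContradictionProbe:
    question: str
    correct_value: str        # the value the corpus owner confirms is current
    contradicting_value: str  # the conflicting value found elsewhere in the corpus


def classify(answer: str, probe: ContradictionProbe) -> str:
    """Naive substring grading; an assumption made for illustration only."""
    text = answer.lower()
    has_correct = probe.correct_value.lower() in text
    has_conflict = probe.contradicting_value.lower() in text
    if has_correct and has_conflict:
        return "both_values"  # human review: flagged contradiction or blend?
    if has_correct:
        return "correct"
    if has_conflict:
        return "contradicting"
    return "neither"


def tally(probes: list[ContradictionProbe], answers: list[str]) -> dict[str, float]:
    """Share of answers falling into each outcome bucket."""
    if len(probes) != len(answers) or not probes:
        raise ValueError("need one answer per probe, and at least one probe")
    counts = Counter(classify(a, p) for p, a in zip(probes, answers))
    return {label: n / len(probes) for label, n in counts.items()}


# Invented example data, not drawn from the audited system.
probe = ContradictionProbe("What is the notice period?", "30 days", "45 days")
example = tally(
    [probe] * 4,
    [
        "The notice period is 30 days.",
        "The notice period is 45 days.",
        "Two policies conflict: one says 30 days, another says 45 days.",
        "The corpus does not say.",
    ],
)
print(example)
```

The useful part is the probe set, not the grader: each probe pairs a question with two values known to coexist in the corpus, which is exactly what a QA suite drawn from representative queries will not contain.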


#06 When a large context window is the right answer

This is not an argument against large context windows as a capability. There are tasks for which they are not merely useful but necessary: multi-document legal analysis requiring deep cross-referencing across instruments that cannot be meaningfully chunked; code comprehension across large interdependent repositories where the model needs line-level context from files that do not surface in semantic search; clinical timelines where medications, diagnoses, test results, and clinical notes must be held simultaneously, and where fragmenting them would destroy the clinical picture. In those settings, a large context window is what enables the task.

The point is to reach for large context when the task requires it rather than because it is the path of least architectural resistance. In the analytics assistant above, we profiled 90 days of production queries after the rebuild. Seventy-eight per cent retrieved twelve or fewer distinct source passages. Ninety-four per cent retrieved thirty or fewer. The knowledge base contained 1,400 documents. Ninety-four per cent of queries were being answered with material from fewer than three per cent of the corpus. A 340,000-token context had been loaded on 4,400 calls per day to find answers that lived, on average, in 0.9 per cent of what it contained.
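A profile like this needs only a log of how many distinct passages each query drew on. A minimal sketch, assuming such a log exists as a list of integers; the function names and sample data are hypothetical, and treating one passage as roughly one document is a simplification made to compare against the 1,400-document corpus.

```python
def share_at_or_below(passage_counts: list[int], limit: int) -> float:
    """Fraction of queries that drew on `limit` or fewer distinct passages."""
    if not passage_counts:
        raise ValueError("passage_counts must not be empty")
    return sum(1 for n in passage_counts if n <= limit) / len(passage_counts)


def corpus_fraction(passages: int, corpus_size: int) -> float:
    """Passages used as a fraction of the corpus (one passage ~ one document)."""
    return passages / corpus_size


CORPUS_DOCUMENTS = 1_400  # from the text

# Invented sample log: distinct passages retrieved per query.
sample_counts = [3, 8, 12, 20, 45]

print(f"Share using <= 12 passages: {share_at_or_below(sample_counts, 12):.0%}")
print(f"Share using <= 30 passages: {share_at_or_below(sample_counts, 30):.0%}")
print(f"12 passages of the corpus:  {corpus_fraction(12, CORPUS_DOCUMENTS):.2%}")
print(f"30 passages of the corpus:  {corpus_fraction(30, CORPUS_DOCUMENTS):.2%}")
```

Twelve passages against 1,400 documents is about 0.86 per cent and thirty is about 2.1 per cent, which is where the "fewer than three per cent" and roughly 0.9 per cent figures come from.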

#07 Three questions before we accept a long-context design

We now ask three questions before accepting a long-context design into any system we will be responsible for in production.

What fraction of the context is expected to be relevant to a typical query? If the answer is below 15 per cent, the system is using the model's attention mechanism to perform retrieval rather than performing it efficiently outside the model. At a fifteen-to-one cost ratio, with first-token latency floors in the 9-to-24-second range, a system that retrieves answers from 0.9 per cent of its context is paying an appropriate price for roughly none of its queries.

Can the system show its working? A long-context call returns a response; it does not return a citation. The model cannot tell the user which passage it was attending to when it produced a particular sentence, which means the user cannot audit the answer, the engineer cannot debug a failure from first principles, and the evaluation cannot distinguish retrieval from confabulation. Retrieval gives you citations without additional engineering. Long context never will.

What happens at p95 query complexity? QA suites run on representative queries. Long-context quality problems manifest at the tail: queries requiring synthesis across many passages at varied positions, queries where relevant material is contradicted elsewhere in the context, queries on information that genuinely resides in the middle of a very long document. If the evaluation has not included systematic adversarial middle-context testing and contradiction probing, it has not tested the failure mode it is most likely to encounter.
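Systematic middle-context testing mostly comes down to placing the same known fact at controlled depths and asking the same question each time. The following is one hedged way to build those probes; `place_at_depth` and `depth_sweep` are hypothetical helpers, depth is measured in paragraphs rather than tokens for simplicity, and the call to the model under test is left out because it depends on the provider.

```python
def place_at_depth(filler: list[str], needle: str, depth: float) -> list[str]:
    """Return a copy of `filler` with `needle` inserted at a relative depth in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = int(depth * len(filler))  # 0.0 puts it first, 1.0 puts it last
    return filler[:index] + [needle] + filler[index:]


def depth_sweep(
    filler: list[str],
    needle: str,
    depths: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[float, str]:
    """Build one context per depth, identical except for where the needle sits."""
    return {d: "\n\n".join(place_at_depth(filler, needle, d)) for d in depths}


# Invented example data.
paragraphs = [f"Filler paragraph {i}." for i in range(10)]
fact = "The retention period for audit logs is seven years."

contexts = depth_sweep(paragraphs, fact)
for depth, context in contexts.items():
    position = context.split("\n\n").index(fact)
    print(f"depth {depth:.2f}: fact is paragraph {position} of {len(paragraphs) + 1}")
```

Contradiction probing follows the same shape with two needles instead of one, placed at materially different depths, so that the evaluation covers both the position effect and the reconciliation behaviour.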

The pattern we have seen is repeatable. Demo runs on fifty to a hundred documents; model performs well; team concludes retrieval is unnecessary; system goes to production against the full, growing corpus; costs compound; quality erodes at the tail; the client notices the tail and asks why. We have rebuilt four long-context deployments to retrieval architectures since January 2025. In all four, the rebuild took between three and five weeks, produced a system that was cheaper, faster, and more auditable than what it replaced, and freed the team from managing the compounding symptoms of a foundational decision they had not realised they were making. We would have preferred to spend those weeks on new work.