The Journal
Engineering · February 2026 · 9 min read

The case for retrieval-first architectures over fine-tuning, in seven failed projects.

Fine-tuning a frontier model looks like the obvious move when the demo data is yours. It is almost never the right one. A note from the engineering bench on why we keep reaching for retrieval, what fine-tuning is actually good for, and the projects that taught us the difference.

By Hadia Aslam, Principal AI Engineer

The instinct, when a client says 'our data is special', is to fine-tune. The instinct is wrong more often than it is right. Over the last three years we have worked on twenty-two projects where fine-tuning a frontier model was on the table. We chose to fine-tune in five. Of those five, two delivered enough lift to justify the cost; three quietly came back to a retrieval-augmented architecture within twelve months.

This piece is not a polemic against fine-tuning. There are problems for which it is the only sensible answer. It is, however, an argument that fine-tuning should not be the default — and a description of the seven projects that made us believe that.

§02 What retrieval gets you that fine-tuning does not

A retrieval-augmented system separates the model from the knowledge. The model does the language; the index does the memory. That separation is what makes the system maintainable. Knowledge changes weekly in most enterprises. Models change yearly. If you have welded those two timescales together by fine-tuning, every knowledge update is now a model update — and every model update is now a regression risk to your knowledge.

Retrieval also gives you something fine-tuning never will: citations. The system can show its working. In any context where someone might one day ask 'why did the model say that?', the answer cannot be 'because the weights say so'. It has to be a page, a clause, a paragraph. Retrieval gives you that for free.
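The shape of that separation can be sketched in a few lines. This is a toy, not an implementation: the corpus, the document ids, and the keyword-overlap retriever are all stand-ins (a production system would use a vector store and a real ranking model). The point it illustrates is structural — the passage ids travel with the answer, so the citation comes out of the pipeline for free.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    doc_id: str  # e.g. a clause or page reference — this is the citation
    text: str

# A toy in-memory index; in production this would be a vector store.
CORPUS = [
    Passage("policy-v3 §4.2", "Refunds are issued within 14 days of a valid claim."),
    Passage("policy-v3 §7.1", "Claims must be filed within 90 days of purchase."),
]

def retrieve(query: str, corpus: list[Passage], k: int = 2) -> list[Passage]:
    """Rank passages by naive keyword overlap with the query (a placeholder
    for whatever similarity search the real index performs)."""
    terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda p: len(terms & set(p.text.lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer_with_citations(query: str) -> dict:
    """Build the model's context from retrieved passages and keep their
    ids alongside the answer as citations."""
    hits = retrieve(query, CORPUS)
    context = "\n".join(f"[{h.doc_id}] {h.text}" for h in hits)
    # `context` would be handed to the language model; the ids stay attached
    return {"context": context, "citations": [h.doc_id for h in hits]}

result = answer_with_citations("When are refunds issued after a claim?")
```

Updating the knowledge here means updating `CORPUS`; the model never changes. That is the separation of timescales in code.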

§03 Project 03 — the pharma client we should have known better with

An early engagement, eighteen months in. We fine-tuned a 7B-parameter model on a regulatory submission corpus. The results in evaluation were exquisite. The results in production lasted four weeks, which is how long it took for the regulator to publish a guidance update. We had welded the model to a snapshot of the world. The retrofit to a retrieval-augmented architecture took six weeks; the fine-tune was abandoned.


§04 Project 11 — when fine-tuning genuinely was the answer

A trading desk needed sub-50ms latency for a structured extraction task on a small, slow-moving schema. Retrieval-augmented architectures could not hit the latency budget; the round-trip to a vector store was the bottleneck. We fine-tuned a small open-source model on roughly 90,000 labelled examples and ran it on-premise. Eighteen months later it is still in production and the case for it is unchanged. The schema has barely moved.

The pattern: fine-tune when the schema is stable, the latency budget is tight, and the task is narrow. Reach for retrieval in every other case.

§05 The decision tree we use now

Three questions, in order. Will the knowledge change inside the next twelve months? If yes, retrieve. Is the latency budget under one hundred milliseconds? If yes, fine-tuning is on the table. Will someone, sometime, need to know why the model said something? If yes, retrieve.
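Because the questions are asked in order, the tree is small enough to write down. A minimal sketch — the function name, parameters, and return strings are illustrative, not anything we ship:

```python
def architecture_choice(
    knowledge_changes_within_year: bool,
    latency_budget_ms: float,
    needs_explanations: bool,
) -> str:
    """The three questions, asked in order; the first decisive answer wins."""
    if knowledge_changes_within_year:
        return "retrieve"
    if latency_budget_ms < 100:
        return "fine-tune is on the table"
    if needs_explanations:
        return "retrieve"
    return "either; decide on cost"

# Project 11's profile: stable schema, tight latency, narrow task.
choice = architecture_choice(
    knowledge_changes_within_year=False,
    latency_budget_ms=50,
    needs_explanations=False,
)
```

Note that the first question gates the second: a tight latency budget only argues for fine-tuning when the knowledge is already stable.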

Two of those three answers point to retrieval in roughly nine out of ten enterprise contexts. That is not because retrieval is fashionable. It is because most enterprises have knowledge that moves and a regulator who asks questions.

§06 A note on the middle ground

There is a third option people often miss: a fine-tuned reranker on top of a retrieval pipeline. It gives you the latency floor of fine-tuning without welding the knowledge to the weights. We use it more often than we use either pure approach. It is unglamorous. It is also, in our hands, the highest-precision option for most enterprise retrieval problems.
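The two-stage shape is the whole idea: a cheap, recall-oriented first pass over the index, then a learned scorer reordering the survivors. A minimal sketch, with a toy term-count scorer standing in for the fine-tuned reranker (in practice that scorer would be a cross-encoder trained on your own relevance labels; every name here is illustrative):

```python
def first_stage(query: str, corpus: list[str], k: int = 20) -> list[str]:
    """Stage 1: cheap, recall-oriented retrieval. Placeholder keyword
    overlap stands in for a vector-store lookup."""
    terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def toy_scorer(query: str, doc: str) -> float:
    """Stand-in for the fine-tuned reranker: here, just term counts."""
    return sum(doc.lower().count(t) for t in query.lower().split())

def rerank(query: str, candidates: list[str], scorer) -> list[str]:
    """Stage 2: the learned scorer reorders the first-stage candidates.
    Only this small model is fine-tuned; the index stays independent."""
    return sorted(candidates, key=lambda c: scorer(query, c), reverse=True)

corpus = ["alpha retrieval notes", "beta reranking guide", "gamma misc"]
top = rerank("reranking guide", first_stage("reranking guide", corpus), toy_scorer)
```

The knowledge still lives in the index and can change weekly; only the scorer carries trained weights, and retraining it does not touch the corpus.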

About the author
Hadia Aslam
Principal AI Engineer

Pieces in the Journal are written by senior practitioners on the work that prompted them. If a paragraph here resonates with a problem you are looking at, the author is the person to reply to — direct lines beat anonymous inboxes.

Get in touch with the practice