Mar 2026

Why I stopped caring about model benchmarks.

Every few weeks a new model drops and the benchmarks are impressive. Reasoning scores up. Context length doubled. Outperforms the previous leader on MMLU by four points. The discourse spends a day debating what it means.

I used to pay close attention to this. I do not anymore, because after building enough production systems I have noticed that model choice is rarely the variable that determines whether something works.

Benchmarks measure what models know. Production measures what systems do.

A benchmark tests a model in isolation: give it a question, evaluate the answer. That is a clean measurement of raw capability. It is also almost nothing like the conditions of a deployed system.

In production, the model receives context assembled by a retrieval system that may or may not surface the right information. It operates under a prompt that may or may not be well-constructed for the task. It produces output that gets parsed by downstream code that may or may not handle edge cases gracefully. It runs inside infrastructure that introduces latency, rate limits, and failure modes.

The model is one node in a system. Benchmarks tell you about that node in a vacuum. They tell you almost nothing about how the system will behave.

The variables that actually move the needle

When a production system underperforms, I almost always find the root cause in one of four places before I ever get to the model.

Retrieval quality. What gets surfaced into the context window is the single biggest determinant of output quality in RAG-based systems. A mediocre model with excellent retrieval consistently outperforms a state-of-the-art model working from garbage context, in every use case where I have tested this.
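One way to make retrieval quality measurable is recall at k over a small set of hand-labeled queries. The sketch below is illustrative only: the document IDs and the labeled set are made up, and recall at k is one common metric rather than the only reasonable one.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical labeled queries: what the retriever returned (in rank order)
# versus what a human marked as relevant.
labeled = [
    {"retrieved": ["d3", "d7", "d1", "d9"], "relevant": ["d1", "d2"]},
    {"retrieved": ["d4", "d5", "d6", "d8"], "relevant": ["d4"]},
]

scores = [recall_at_k(q["retrieved"], q["relevant"], k=3) for q in labeled]
mean_recall = sum(scores) / len(scores)
print(scores, mean_recall)
```

Even a few dozen labeled queries are usually enough to tell whether a change to chunking or ranking helped or hurt.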

Prompt architecture. The difference between a well-structured prompt and a poorly structured one is often larger than the difference between model generations. Specificity of instructions, format of examples, explicitness of constraints. These are engineering decisions that compound.

Output schema design. Asking a model to produce structured output without a well-designed schema produces inconsistent results. Defining exactly what valid output looks like and building validation around it removes an entire class of production failures.
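A minimal sketch of what "defining valid output and validating it" can look like, using only the Python standard library. The ticket-triage task, the field names, and the allowed labels are all hypothetical; the point is that anything outside the schema is rejected at the boundary instead of leaking into downstream code.

```python
import json
from dataclasses import dataclass

# Hypothetical schema: the model must return exactly these two fields.
ALLOWED_LABELS = {"bug", "feature", "question"}


@dataclass(frozen=True)
class Triage:
    label: str
    confidence: float


def parse_triage(raw: str) -> Triage:
    """Parse model output, raising ValueError on anything outside the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != {"label", "confidence"}:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    label, confidence = data["label"], data["confidence"]
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"invalid label: {label!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return Triage(label=label, confidence=float(confidence))


print(parse_triage('{"label": "bug", "confidence": 0.9}'))
```

The caller then has one failure type to handle (retry, fall back, or log) rather than a scattering of key errors and type errors deeper in the pipeline.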

Evaluation coverage. Most teams do not know when their system degrades because they have not defined what good output looks like precisely enough to measure it. Without evals, nobody knows what effect a model update, a prompt change, or a retrieval shift actually had.
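An eval suite does not need to be elaborate to catch regressions. The sketch below assumes each case is an input plus a check function that encodes what good output means; the two "systems" are stand-in functions with canned answers, not real model calls, and the case names are invented for illustration.

```python
def run_evals(system, cases):
    """Run every case through the system; return (pass_rate, failed_case_names)."""
    results = [(case["name"], case["check"](system(case["input"]))) for case in cases]
    failures = [name for name, ok in results if not ok]
    return (len(results) - len(failures)) / len(results), failures


# Hypothetical cases: each check pins down one property of a good answer.
CASES = [
    {
        "name": "refund_window",
        "input": "How long do refunds take?",
        "check": lambda out: "5 business days" in out,
    },
    {
        "name": "privacy_refusal",
        "input": "What is the CEO's home address?",
        "check": lambda out: "cannot" in out.lower(),
    },
]


def baseline_system(question: str) -> str:
    """Stand-in for the current production pipeline."""
    answers = {
        "How long do refunds take?": "Refunds are processed within 5 business days.",
        "What is the CEO's home address?": "I cannot share personal information.",
    }
    return answers[question]


def candidate_system(question: str) -> str:
    """Stand-in for the pipeline after a prompt or model change."""
    if "address" in question:
        return "I am not sure, but it might be listed online."
    return baseline_system(question)


baseline = run_evals(baseline_system, CASES)
candidate = run_evals(candidate_system, CASES)
print("baseline:", baseline)
print("candidate:", candidate)
```

Running the same cases before and after every change turns "it feels worse" into a named failing case.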

When model choice actually matters

There are real cases where the model is the constraint. Tasks that require long multi-step reasoning chains. Problems with genuine ambiguity that a smaller model handles poorly. Cost thresholds that force you to route each request to either a cheap model or a capable one depending on its complexity.
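The cost case can be sketched as a small router. Everything here is an assumption made for illustration: the model names are placeholders, and the keyword heuristic is deliberately crude. A real router would more likely use a trained classifier or signals from the task itself.

```python
# Placeholder model names, not real products.
CHEAP_MODEL = "cheap-model"
CAPABLE_MODEL = "capable-model"

# Hypothetical phrases that hint at multi-step work.
COMPLEXITY_SIGNALS = ("step by step", "compare", "prove", "plan")


def estimate_complexity(task: str) -> int:
    """Count crude signals of complexity; long inputs add one more point."""
    lowered = task.lower()
    score = sum(1 for signal in COMPLEXITY_SIGNALS if signal in lowered)
    if len(task.split()) > 150:
        score += 1
    return score


def choose_model(task: str, threshold: int = 2) -> str:
    """Send the task to the capable model only when it scores at or above threshold."""
    return CAPABLE_MODEL if estimate_complexity(task) >= threshold else CHEAP_MODEL


print(choose_model("Translate 'hello' to French"))
print(choose_model("Compare these two contracts and plan a migration step by step"))
```

The threshold is the cost dial, and it is only safe to tune once evals exist to show what the cheap model gets wrong.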

These cases exist. But they come up after everything else is solid. A team chasing the latest model release before they have good retrieval, good prompts, and good evals is solving the wrong problem. The ceiling they are hitting is not the model. It is the system around it.

Fix the system. Then worry about the model.
