Latency is architectural

Most latency comes from retrieval hops, long prompts, and serial tool calls. The model call is rarely the slow part. The pipeline is the bottleneck. Optimise orchestration, not just the model.

Engineers often assume the model is the slow part. It usually is not. The real drag comes from the machinery wrapped around it.

Retrieval hops cost more than you expect

Every vector search, metadata filter, re‑rank, and chunk stitch is another network hop. Do that a few times and half your latency budget has vanished before the model has even seen a token. It is the old "too many microservices" problem wearing a new badge.

Too many microservices

A system begins tidy, then grows arms and legs. Someone adds a retriever. Someone adds a re‑ranker. Someone adds a metadata filter. Someone adds a chunk stitcher. Each piece looks harmless. Each piece solves a problem. But once they are strung together, the whole thing slows to a crawl.

RAG pipelines follow the same pattern. Instead of ten microservices, you now have ten retrieval hops. Instead of service chatter, you have index chatter. Instead of JSON bouncing around a cluster, you have embeddings and chunks being passed across the network. The labels have changed, but the behaviour has not.

In a microservice stack, services talk to each other all day long. They pass JSON around, wait for replies, retry on failure, and generally keep the network busy. That is service chatter.

In a RAG stack, the same noise comes from your retrieval layer. The actors are different, but the behaviour is the same. Your vector index, keyword index, metadata store, and re‑ranker all talk to each other. They pass embeddings, scores, filters, and chunks back and forth. Each hop is another round trip. Each hop adds delay. Each hop adds another place for things to wobble.

It is chatter because none of it is real work from the user’s point of view. The user wants an answer. The system spends most of its time gossiping between indexes about which chunk might be relevant. It is busy, but not productive.

The point is simple. You have replaced one kind of internal noise with another. The labels have changed, but the cost has not. If you let the retrieval layer grow without discipline, it will behave exactly like an over‑eager microservice mesh. It will talk too much, wait too long, and slow everything down.

Every hop adds latency. Every hop adds a failure mode. Every hop adds mental overhead. Hop latency accumulates across the end-to-end pipeline. The job becomes debugging the plumbing rather than improving the product. The system becomes sluggish, brittle, and full of odd surprises.
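
To make the accumulation concrete, here is a minimal sketch of a latency budget. The hop names, the millisecond figures, and the 250 ms budget are all hypothetical, chosen only for illustration. Measure your own pipeline before drawing conclusions.

```python
# Hypothetical per-hop latencies in milliseconds (illustrative, not measured).
HOPS_MS = {
    "vector_search": 40,
    "metadata_filter": 15,
    "re_rank": 60,
    "chunk_stitch": 10,
}

BUDGET_MS = 250  # assumed end-to-end latency budget for one request


def serial_latency_ms(hops: dict[str, int]) -> int:
    """Hops executed one after another: the delays simply add up."""
    return sum(hops.values())


def budget_share(hops: dict[str, int], budget_ms: int) -> float:
    """Fraction of the budget consumed before the model is called."""
    return serial_latency_ms(hops) / budget_ms


total = serial_latency_ms(HOPS_MS)
share = budget_share(HOPS_MS, BUDGET_MS)
print(f"retrieval hops: {total} ms ({share:.0%} of a {BUDGET_MS} ms budget)")
```

With these made-up numbers, four modest hops consume half the budget before the model has seen a token.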

The lesson is the same as it was during the microservice boom. Keep the number of moving parts low. Keep the boundaries clear. Keep the data local whenever you can. If you do not, the pipeline will drag, no matter how fast the model is.

Leaving the process costs you

Vector search is typical for RAG, but it is not the only culprit. Any retrieval layer that reaches across the network will cost you time. It does not matter whether you use a vector index, a keyword index, a hybrid index, or a bespoke store. If you have to leave the process, hit a service, wait for it to return, and then stitch the results back together, you will pay for it in latency.

Long prompts are silent killers

Sending 200,000 tokens into a model is not free. As of April 2026, GPT-5.5 is USD 5.00 per 1 million tokens, so 200,000 tokens cost USD 1. That might not sound like much, but if your whole AI system, made up of multiple pipelines, calls OpenAI a thousand times in an eight-hour period, that is one call roughly every 29 seconds, costing USD 1,000 per day. As you introduce features that rely on AI, this cost can balloon.
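
The cost arithmetic can be written as a small sketch. The price and call volume are the figures quoted in the paragraph above, treated here as inputs. The function name is illustrative, and the calculation covers input tokens only.

```python
def prompt_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of a single call (output tokens are ignored)."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Figures taken from the text; substitute your own.
TOKENS_PER_CALL = 200_000
USD_PER_MILLION = 5.00
CALLS_PER_DAY = 1_000

per_call = prompt_cost_usd(TOKENS_PER_CALL, USD_PER_MILLION)
per_day = per_call * CALLS_PER_DAY
print(f"USD {per_call:.2f} per call, USD {per_day:,.2f} per day")
```

Because the cost is linear in prompt length, trimming the prompt to a tenth of its size cuts this line of the bill to a tenth as well.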

You pay for tokenisation, network transfer, and ingestion. It is the equivalent of posting a novel every time you want a paragraph back. Shorter prompts are not only cheaper, they are faster and far easier to reason about.

Cloud costs balloon because the pricing model rewards scale until it punishes you. Everything looks cheap at the start. A few API calls here, a small vector index there, a modest GPU for a prototype. Then the system goes live, traffic rises, and the bill climbs faster than the usage graph.

The pattern is predictable. You pay for every hop, every lookup, every token, every gigabyte, and every idle minute. The cloud does not care whether the work was useful. It charges for activity, not value.

RAG pipelines are especially prone to this. Retrieval is chatty. Each query touches several indexes. Each index has its own storage, compute, and network fees. The model call is only one line on the invoice. The real cost comes from the scaffolding wrapped around it.

Costs balloon because the architecture balloons. More hops. More services. More indexes. More caching layers. More background jobs. More monitoring. More logs. Every piece adds a little cost. Together they add a lot.

The cloud makes it easy to scale up, but it does not make it easy to scale down. Once the system is busy, you pay for the peaks, not the averages. You pay for the buffers, the replicas, and the safety margins. You pay for the comfort of not waking up at three in the morning.

The cloud invoice is driven by the highest sustained load, not the gentle baseline you see on a dashboard.

Cloud platforms charge for capacity, not comfort. When traffic spikes, the system scales out. Extra replicas spin up. Buffers grow. Queues stretch. More storage is touched. More network is consumed. The platform does not scale back the instant the spike ends. It holds the extra capacity for safety, stability, and headroom. You pay for that headroom.

The average load might look modest, but the cloud does not bill you on the average. It bills you on the resources that were provisioned to survive the worst ten minutes of the day. If your peak is ten times your baseline, your bill will reflect the peak, not the baseline.
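
A rough sketch of the peak-versus-baseline effect follows. Every number here is hypothetical, and the model is deliberately crude: it assumes capacity sized for the peak is held all day, which overstates the cost on platforms that scale back quickly.

```python
import math

# Hypothetical figures for illustration only.
BASELINE_RPS = 20            # typical requests per second
PEAK_RPS = 200               # worst ten minutes of the day (10x baseline)
RPS_PER_REPLICA = 25         # assumed capacity of one replica
USD_PER_REPLICA_HOUR = 0.50  # assumed price


def replicas_needed(rps: float, rps_per_replica: float) -> int:
    """Smallest whole number of replicas that can absorb the load."""
    return math.ceil(rps / rps_per_replica)


def daily_cost(replicas: int, usd_per_replica_hour: float) -> float:
    """Cost of keeping that many replicas provisioned for 24 hours."""
    return replicas * usd_per_replica_hour * 24


baseline_cost = daily_cost(replicas_needed(BASELINE_RPS, RPS_PER_REPLICA), USD_PER_REPLICA_HOUR)
peak_cost = daily_cost(replicas_needed(PEAK_RPS, RPS_PER_REPLICA), USD_PER_REPLICA_HOUR)
print(f"sized for baseline: USD {baseline_cost:.2f}/day, sized for peak: USD {peak_cost:.2f}/day")
```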

The only defence is discipline. Keep the design lean. Keep the hops few. Keep the data local. Keep the retrieval tight. Keep the prompts short. Keep the pipeline simple. If you do not, the cloud bill will grow faster than the user base, and it will not stop until you force it to.

Serial tool calls turn your pipeline into treacle

If your workflow is LLM → tool → LLM → tool → LLM, you have built a queue, not a pipeline. Everything waits for everything else. It is the same anti‑pattern that made synchronous RPC chains painful in the early microservice era.

A queue and a pipeline look similar on a whiteboard, but they behave very differently once traffic hits them. The distinction matters, because one keeps work moving and the other forces everything to wait its turn.

A queue is a stop‑start system. Each step blocks until the previous step has finished. Nothing can overtake anything else. If one stage slows down, the entire flow backs up behind it. This is what happens when you chain LLM calls and tools in a strict sequence. The second LLM call cannot begin until the tool has replied. The tool cannot run until the first LLM call has finished. The whole thing becomes a single‑file line.

A pipeline is a flow system. Work moves through independent stages that can run at the same time. Stage one can process the next item while stage two handles the previous one. Throughput rises because the stages overlap. The system does not wait for each piece to finish before starting the next. This is how high‑volume systems stay fast even when individual steps are slow.

A queue waits for the whole journey. A pipeline hands work off and moves on.

The handoff is the key. Once a stage can pass work downstream and start the next item without waiting, you have built a pipeline, not a queue.
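
The difference can be sketched with `asyncio`. The stage function below is a stand-in that merely sleeps, and the two-stage layout is an assumption made for illustration. The serial version finishes both stages for an item before touching the next one, while the pipelined version hands each item to the second stage through an `asyncio.Queue` and moves on.

```python
import asyncio
import time

STAGE_DELAY = 0.05  # seconds; stand-in for an LLM or tool call


async def stage(item: int) -> int:
    """Hypothetical stage: sleeps to simulate I/O-bound work."""
    await asyncio.sleep(STAGE_DELAY)
    return item


async def run_serial(items: list[int]) -> list[int]:
    """Each item finishes both stages before the next item starts."""
    results = []
    for item in items:
        x = await stage(item)
        results.append(await stage(x))
    return results


async def run_pipelined(items: list[int]) -> list[int]:
    """Stage one hands work to stage two through a queue and moves on."""
    handoff: asyncio.Queue = asyncio.Queue()
    results: list[int] = []

    async def stage_one() -> None:
        for item in items:
            await handoff.put(await stage(item))
        await handoff.put(None)  # sentinel: no more work

    async def stage_two() -> None:
        while (x := await handoff.get()) is not None:
            results.append(await stage(x))

    await asyncio.gather(stage_one(), stage_two())
    return results


def timed(coro) -> tuple[list[int], float]:
    """Run a coroutine to completion and report elapsed wall-clock time."""
    start = time.perf_counter()
    out = asyncio.run(coro)
    return out, time.perf_counter() - start


serial_out, serial_s = timed(run_serial([1, 2, 3, 4]))
piped_out, piped_s = timed(run_pipelined([1, 2, 3, 4]))
print(f"serial: {serial_s:.2f}s  pipelined: {piped_s:.2f}s")
```

With four items and two stages, the serial run pays for eight stage delays, while the pipelined run pays for roughly five, because the stages overlap. The gap widens as the number of items grows.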

The problem with LLM → tool → LLM → tool → LLM is that it behaves like a queue. Every step waits for the previous one. There is no overlap, no parallelism, and no slack. One slow tool call stalls the entire chain. It is the same pattern that made synchronous RPC chains painful in early microservice designs. The system is busy, but nothing is flowing.

The lesson is simple. If you want speed, build a pipeline. If you build a queue, do not be surprised when everything crawls.

Orchestration overhead accumulates

Glue code, JSON wrangling, retries, fallbacks, schema checks, and all the other dull bits. Each one is tiny. Each one feels harmless. Together they slow the system more than any single model call ever will.

The overhead hides in plain sight. A few milliseconds to validate a schema. A few more to serialise a payload. A few more to deserialise it. A few more to retry a flaky call. A few more to merge two partial results. None of these steps look expensive on their own. They are not. The cost comes from the fact that you do them on every request, across every stage, under load.

This is why orchestration overhead is so deceptive. It does not arrive as one big hit. It arrives as a hundred small ones. It is death by a thousand cuts. The pipeline spends more time preparing to do work than doing the work.

The worst part is that this overhead grows with complexity. Add one more tool call, and you add one more round of serialisation. Add one more fallback, and you add one more branch to evaluate. Add one more schema, and you add one more validation pass. The system becomes a tangle of tiny chores.

This is usually where the real time goes. Not in the model. Not in the vector search. Not in the database. In the glue. In the stitching. In the invisible admin that surrounds every step. The only fix is discipline: fewer hops, fewer formats, fewer retries, fewer moving parts. The less you orchestrate, the faster everything becomes.
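
One way to find out where the time goes is to measure the glue directly. The sketch below is an assumption-laden illustration: the handler, the stage names, and the placeholder "model call" are all hypothetical. It wraps each step in a timing context manager and accumulates totals across many requests.

```python
import json
import time
from collections import defaultdict
from contextlib import contextmanager

timings_ms: dict[str, float] = defaultdict(float)


@contextmanager
def timed(stage: str):
    """Accumulate wall-clock time spent in a named stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] += (time.perf_counter() - start) * 1000


def handle_request(payload: dict) -> dict:
    """Hypothetical request handler with each glue step timed separately."""
    with timed("serialise"):
        wire = json.dumps(payload)
    with timed("deserialise"):
        parsed = json.loads(wire)
    with timed("validate"):
        if not isinstance(parsed.get("query"), str):
            raise ValueError("query must be a string")
    with timed("model_call"):
        answer = {"answer": parsed["query"].upper()}  # placeholder for the real call
    return answer


for _ in range(1000):
    handle_request({"query": "why is my pipeline slow?", "chunks": ["a"] * 50})

# Print stages from most to least expensive.
for stage, ms in sorted(timings_ms.items(), key=lambda kv: -kv[1]):
    print(f"{stage:<12} {ms:8.2f} ms")
```

The absolute numbers from a toy like this mean little. The point is the habit: per-stage totals under realistic load show whether the glue or the model dominates.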

The model is rarely the bottleneck

Modern inference is GPU‑accelerated and heavily optimised. Your RAG stack is a distributed system full of I/O, hops, and blocking calls. Optimising the model while ignoring the pipeline is like tuning the engine while the tyres are flat. The power is there, but the car still drags.

Modern LLM inference is brutally efficient. The kernels are fused. The memory access patterns are tuned. The batching is tight. The GPUs run flat out. The model is rarely the slow part. It is the most optimised component in the entire stack, because it has to be. Vendors pour millions into shaving microseconds from calculation paths.

Your RAG pipeline is the opposite. It is a distributed system stitched together from storage calls, network hops, serialisation steps, retries, and blocking operations. Every part of it waits for something else. Every hop crosses a boundary. Every boundary adds latency. The model is a rocket engine bolted to a shopping trolley.

This is why polishing the model is the wrong instinct. You can shave 10 percent off inference time and never notice it, because the pipeline is burning that time several times over in glue code and I/O. The GPU is idle while your retriever fetches chunks. The retriever is idle while your re‑ranker waits for a schema check. The re‑ranker is idle while your orchestrator serialises JSON. The whole system is dominated by the slowest, least optimised parts.

The handbrake is the pipeline. The bonnet is the model. Shining the bonnet does not make the car move. Releasing the handbrake does. If you want real speed, you fix the hops, the queues, the blocking calls, the retries, the formats, and the orchestration. That is where the time goes. That is where the wins are.

Throughput beats single‑query latency

In a real system, throughput matters more than shaving a few milliseconds off a single request.
Throughput keeps queues short, users calm, and servers steady.
A system that flows well will always outperform a system that only looks fast in isolation.

A design that includes:

  • parallel retrieval
  • batched vector queries
  • cached embeddings
  • pre‑computed context
  • non‑blocking tool calls

will outrun a "fast" single‑query setup every day of the week.
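
Two of the items in that list, parallel retrieval and cached embeddings, can be sketched together. Everything here is a stand-in: `embed` fakes an embedding model, `search` fakes an index lookup with a short sleep, and the index names are invented for the example.

```python
import asyncio
from functools import lru_cache

EMBED_CALLS = 0  # counts how often the (fake) embedding model is invoked


@lru_cache(maxsize=10_000)
def embed(text: str) -> tuple[float, ...]:
    """Hypothetical embedding function; a real one would call a model."""
    global EMBED_CALLS
    EMBED_CALLS += 1
    return (float(len(text)), float(sum(map(ord, text)) % 97))


async def search(index: str, vector: tuple[float, ...]) -> list[str]:
    """Hypothetical index lookup; the sleep stands in for a network hop."""
    await asyncio.sleep(0.05)
    return [f"{index}:chunk-for-{int(vector[0])}"]


async def retrieve(query: str) -> list[str]:
    vector = embed(query)  # cache hit on repeated queries
    # Independent lookups run concurrently instead of one after another.
    hits = await asyncio.gather(
        search("vector", vector),
        search("keyword", vector),
        search("metadata", vector),
    )
    return [chunk for group in hits for chunk in group]


chunks = asyncio.run(retrieve("reset my password"))
asyncio.run(retrieve("reset my password"))  # second call reuses the embedding
print(chunks, "embedding calls:", EMBED_CALLS)
```

The three lookups cost about one hop of wall-clock time rather than three, and the repeated query never reaches the embedding function a second time.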

Think like a backend engineer, not a demo builder.
Design for flow, not fireworks.

Evaluation must be continuous

LLM behaviour drifts. Model updates shift outputs. Data changes. Prompt templates evolve. Retrieval indexes age. Static tests decay. Continuous evaluation with real traffic patterns is the only stable approach.

LLMs are not fixed points. They are moving systems. Vendors update weights. Safety layers change. Tokenisers shift. Even subtle adjustments can alter how a model interprets a prompt or ranks retrieved context. A test that passed last month can fail today without any change in your code.

Your data is not fixed either. Documents are added, removed, rewritten, or re‑indexed. Embeddings drift as models change. Metadata grows stale. A retrieval query that once surfaced the right chunk may surface something weaker six weeks later. The index ages, and the quality of the answer ages with it.

An embedding turns a piece of text into a list of numbers, arranged so that texts with similar meaning end up close together.
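
"Close together" is commonly measured with cosine similarity. The sketch below uses toy three-dimensional vectors invented for illustration. Real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related items score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```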

Prompt templates evolve as well. You tweak wording. You add guardrails. You change formatting. You introduce new variables. Each change shifts behaviour in ways that are hard to predict. A small edit can ripple through the entire pipeline.

Static tests cannot keep up with this movement. They freeze expectations in time. They assume the system is stable. It is not. The tests decay because the system they measure is drifting underneath them. A green test suite can give a false sense of confidence while the live system quietly degrades.

The only reliable approach is continuous evaluation with real traffic patterns. You must measure quality under the same conditions the system actually faces: real prompts, real retrieval noise, real user phrasing, real edge cases, real load. That evaluation has to be automated and run against live conditions. This is the only way to detect drift early and correct it before it becomes visible to users.
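
A minimal sketch of what such a check might look like is below. The class name, the `baseline`, `window`, and `tolerance` parameters, and the example scores are all hypothetical. The quality score itself would come from whatever evaluator you already trust, applied to sampled live traffic.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Rolling quality check over sampled live traffic (illustrative only)."""

    def __init__(self, baseline: float, window: int = 100, tolerance: float = 0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically

    def record(self, score: float) -> None:
        self.scores.append(score)

    def drifted(self) -> bool:
        """True once the window is full and its mean is below baseline - tolerance."""
        if len(self.scores) < self.scores.maxlen:
            return False
        return mean(self.scores) < self.baseline - self.tolerance


monitor = DriftMonitor(baseline=0.90, window=5)
for score in [0.91, 0.89, 0.90, 0.92, 0.90]:
    monitor.record(score)
healthy = monitor.drifted()

for score in [0.80, 0.78, 0.82, 0.79, 0.81]:
    monitor.record(score)
degraded = monitor.drifted()
print(healthy, degraded)
```

A rolling window is only one possible design. It trades sensitivity for stability: a larger window ignores noise but reacts to drift more slowly.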

The system is alive. The evaluation must be alive with it.

Guardrails must be layered

No single guardrail is enough. Combine input checks, retrieval filters, prompt constraints, output checks, and post‑processing. Each layer catches different failures. One layer alone invites outages.

Guardrails fail for different reasons. Input checks catch malformed or hostile queries, but they cannot see what retrieval will surface. Retrieval filters remove unsafe or irrelevant chunks, but they cannot stop a prompt template from mis‑framing the task. Prompt constraints shape model behaviour, but they cannot guarantee the model will obey them under stress. Output checks catch violations after the fact, but they cannot prevent the model from producing them in the first place. Post‑processing can clean up structure, but it cannot repair a fundamentally wrong answer.

Each layer has blind spots. Each layer has failure modes. Each layer protects a different part of the system. When you stack them, the gaps do not align. When you rely on one, the gaps are exposed.

This is why single‑layer safety is fragile. A lone input filter cannot stop a retrieval glitch. A lone output checker cannot stop a prompt injection. A lone prompt template cannot stop a malformed chunk. A lone retrieval filter cannot stop a model hallucination. Outages happen when one layer is asked to do the job of five.

A robust system uses layered defence:

  • input validation to reject malformed or hostile queries
  • retrieval filtering to control what context enters the model
  • prompt constraints to shape behaviour and reduce ambiguity
  • output checks to enforce structure and detect violations
  • post‑processing to normalise, redact, or correct

None of these layers is perfect. Together they are resilient. That is the point. Modern LLM systems fail in many small ways, not one big way. The only stable approach is to catch small failures early, often, and repeatedly across the pipeline.
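
The five layers can be sketched as a chain of small functions. All names, rules, and thresholds below are hypothetical and far simpler than production checks would be. The model is passed in as a plain callable so the sketch stays self-contained.

```python
import re
from typing import Callable


class GuardrailError(ValueError):
    """Raised when any layer rejects the request or the response."""


def validate_input(query: str) -> str:
    """Input validation: reject empty, oversized, or obviously hostile queries."""
    if not query.strip() or len(query) > 2_000:
        raise GuardrailError("input: empty or too long")
    if "ignore previous instructions" in query.lower():
        raise GuardrailError("input: possible prompt injection")
    return query.strip()


def filter_chunks(chunks: list[str], blocked: set[str]) -> list[str]:
    """Retrieval filter: drop chunks containing blocked markers."""
    return [c for c in chunks if not any(b in c for b in blocked)]


def check_output(answer: str) -> str:
    """Output check: refuse to return an empty answer."""
    if not answer.strip():
        raise GuardrailError("output: empty answer")
    return answer


def redact(answer: str) -> str:
    """Post-processing: mask anything that looks like an email address."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[redacted]", answer)


def answer_with_guardrails(query: str, chunks: list[str], model: Callable[[str], str]) -> str:
    query = validate_input(query)
    context = filter_chunks(chunks, blocked={"INTERNAL ONLY"})
    # Prompt constraint: the instruction line narrows what the model may use.
    prompt = "Answer using only the context.\n" + "\n".join(context) + f"\nQ: {query}"
    return redact(check_output(model(prompt)))


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "Contact alice@example.com for access."


result = answer_with_guardrails(
    "Who grants access?",
    ["Access is granted by the platform team.", "INTERNAL ONLY: salaries"],
    fake_model,
)
print(result)
```

Each function is trivial on its own. The value is in the stacking: the email address slipped past every earlier layer and was still caught by the last one.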

The future is orchestration

The next wave is not bigger models. It is coordination across many specialised models. It is managing context across workflows. It is building predictable tool‑calling chains. LLMs are components now. The engineers who master orchestration will shape what comes next.

The era of single‑model systems is ending. One large model trying to do everything is slow, expensive, and brittle. The future is a network of smaller, focused models: one for retrieval, one for classification, one for planning, one for extraction, one for reasoning, one for generation. Each model does one job well. The value comes from how they work together.

This shift changes the engineering challenge. It is no longer about squeezing more tokens per second out of a GPU. It is about coordinating dozens of moving parts without losing context, consistency, or latency. You must track state across hops. You must pass partial results between models. You must ensure that tools are called in the right order, with the right schema, at the right time. You must keep the pipeline flowing even when individual components fail or drift.

Context management becomes a first‑class problem. You cannot rely on a single prompt to hold everything. You need shared memory, structured state, and workflow‑level constraints. You need to decide what each model should know, what it should not know, and how to hand off information cleanly. The system must behave like a team, not a monolith.

Tool‑calling becomes a discipline of its own. You need predictable chains, clear contracts, and stable interfaces. You need to design workflows that are parallel where possible, serial only where necessary, and resilient everywhere. The orchestration layer becomes the real engine of the system.
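
A "clear contract" can be as simple as declaring what a tool requires and checking it before the call is made. The sketch below is one possible shape, not a standard API: `ToolContract`, `lookup_order`, and the argument names are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolContract:
    """Hypothetical contract: a tool's name, required arguments, and handler."""
    name: str
    required: dict[str, type]
    handler: Callable[..., Any]

    def call(self, args: dict[str, Any]) -> Any:
        """Validate arguments against the contract, then invoke the handler."""
        missing = self.required.keys() - args.keys()
        if missing:
            raise ValueError(f"{self.name}: missing arguments {sorted(missing)}")
        for key, expected in self.required.items():
            if not isinstance(args[key], expected):
                raise TypeError(f"{self.name}: {key} must be {expected.__name__}")
        # Pass only the declared arguments; extras from the model are dropped.
        return self.handler(**{k: args[k] for k in self.required})


def lookup_order(order_id: str) -> dict:
    """Stand-in tool; a real one would query an order system."""
    return {"order_id": order_id, "status": "shipped"}


TOOLS = {
    "lookup_order": ToolContract("lookup_order", {"order_id": str}, lookup_order),
}

# Arguments as a model might emit them in a tool call.
result = TOOLS["lookup_order"].call({"order_id": "A-1001"})
print(result)
```

Failing fast at the boundary keeps a malformed tool call from travelling further down the chain, where it would be harder to diagnose.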

This is why the next wave belongs to engineers who understand distributed systems, workflow design, and pipeline optimisation. The models are powerful, but the power is unlocked only when they are coordinated. The future is not a bigger brain. It is a well‑run organisation of smaller brains working together.

Conclusion

Latency in LLM systems is dominated by architecture, not model speed. Most of the delay comes from retrieval hops, network boundaries, prompt expansion, and token‑level generation, so performance improves when you redesign the pipeline, not when you tweak the prompt. Once you see this, it becomes obvious that long prompts, scattered retrieval, and unnecessary round‑trips are the real cost drivers, and that reducing latency means reducing work, not asking the model to work faster.

The practical conclusion is that throughput and batching matter more than single‑query latency, retrieval must be minimised and localised, and prompts must be aggressively shortened. Systems that treat latency as an architectural problem become predictable and scalable; systems that treat it as a model problem stay slow no matter which model they plug in.

You can process the same amount of data with fewer hops, fewer round‑trips, fewer tokens, fewer retrieval calls, fewer prompt expansions, and fewer model invocations.

It is not about shrinking the task. It is about shrinking the machinery required to accomplish it.

You keep the data volume the same, but you redesign the path so the system touches that data:

  • fewer times
  • in fewer places
  • with fewer transformations
  • with fewer tokens
  • with fewer model calls

Same data, less orchestration. That is why latency drops.
