Large language models behave like probabilistic components, not deterministic functions. The interface is the token stream, not the text you see on screen. If you do not design at the token level, you will ship systems that behave differently under load, in production, or with slightly altered inputs.

Token boundaries decide structure, predictability, and stability. Two prompts that look identical to you can tokenize differently and produce different behaviour. This is why reliable AI systems depend on structured interfaces, strict schemas, and disciplined prompt design rather than intuition or style.

The real interface: tokens, not text

Tokens shape what you can build. They decide how much context you can fit in, how fast the model responds, and how predictable the output is.

Token boundaries also change how the model interprets structure, so a difference that is invisible on screen can still alter the output.

When you design prompts, AI input or output schemas, or retrieval pipelines, you are really designing token flows. If you ignore tokens, you end up shipping features that behave one way in tests and another way in production.

Prompt A: "Summarize the user login flow."

Prompt B: "Summarise the user login flow."

To a human, the difference is not consequential. To a tokenizer, there is a critical difference.

"Summarize" and "Summarise" break into different token sequences.

The model’s internal statistics for each spelling differ.

The model may shift tone, structure, or level of detail.

And downstream formatting can change because the token pattern changed.

Prompt A: "List the steps to deploy the service."

Prompt B: "List the steps to deploy the service ."

The only difference is a space before the full-stop.

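With some tokenizers, Prompt A ends with a single token for "service."

Prompt B ends with two tokens: "service" and "." The exact split depends on the tokenizer's vocabulary, but the two sequences differ.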

That tiny shift can change the model’s prediction path.
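The effect can be sketched with a toy tokenizer. This is an illustration only: the vocabulary below is made up, and `toy_tokenize` is a hypothetical greedy longest-match function, not how any production tokenizer is implemented. It shows how one extra space changes the token sequence.

```python
# Toy greedy longest-match tokenizer over a tiny, made-up vocabulary.
# Real tokenizers use learned vocabularies; this only illustrates the effect.
TOY_VOCAB = ["deploy", " the", " service.", " service", " .", ".", " "]


def toy_tokenize(text: str, vocab: list[str] = TOY_VOCAB) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Prefer the longest vocabulary entry that matches at position i;
        # fall back to a single character if nothing matches.
        match = max(
            (v for v in vocab if text.startswith(v, i)),
            key=len,
            default=text[i],
        )
        tokens.append(match)
        i += len(match)
    return tokens


tokens_a = toy_tokenize("deploy the service.")
tokens_b = toy_tokenize("deploy the service .")
print(tokens_a)
print(tokens_b)
```

With this vocabulary the first prompt ends in one fused token and the second ends in two, so everything the model predicts next is conditioned on a different sequence.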

The model is not the system

Most failures blamed on models come from everything wrapped around them. In practice, the weak points look very familiar to any engineer who has shipped a distributed system.

Retrieval pipelines drift because indexes age, embeddings shift, and data freshness is rarely monitored. A model can only answer the question you actually retrieved, not the one you meant to retrieve.

Prompt templates collapse under odd inputs because they are often treated as static strings instead of executable logic. One unexpected newline or a missing field can break the entire chain of reasoning. Data freshness and data cleansing are key here.
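One way to treat a template as logic rather than a static string is to validate and clean its inputs before rendering. The sketch below is a minimal example, assuming a hypothetical `SUMMARY_TEMPLATE` and field names; it uses only the standard library's `string.Template`.

```python
import string

# Hypothetical template; the field names are illustrative only.
SUMMARY_TEMPLATE = string.Template(
    "Summarise the ticket below for $audience.\n"
    "Ticket:\n$ticket_text"
)


def render_prompt(template: string.Template, fields: dict[str, str]) -> str:
    """Render a prompt, failing loudly on missing or empty fields."""
    # Collapse runs of whitespace (including stray newlines) to one space.
    cleaned = {name: " ".join(value.split()) for name, value in fields.items()}
    empty = [name for name, value in cleaned.items() if not value]
    if empty:
        raise ValueError(f"empty template fields: {empty}")
    try:
        return template.substitute(cleaned)
    except KeyError as exc:
        # substitute() raises KeyError when a placeholder has no value.
        raise ValueError(f"missing template field: {exc.args[0]}") from exc


prompt = render_prompt(
    SUMMARY_TEMPLATE,
    {"audience": "an on-call\n\nengineer", "ticket_text": "Login  fails\n"},
)
print(prompt)
```

A missing field now raises an error at render time instead of silently producing a malformed prompt.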


Guardrails

Guardrails miss edge cases because they rely on pattern matching, not semantic guarantees. A single unhandled phrasing can bypass a rule that looked airtight in testing.

Imagine you build a guardrail that blocks requests containing "delete all users". It works in tests, so you ship it.

Then a real user sends: "can you delete all the users" or "please delete every user" or "remove all user accounts"

Your guardrail only catches the exact phrase it was written for. It matches strings, not meaning. One phrasing slips through, and the model executes a path you thought was protected.
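The failure is easy to reproduce. The snippet below is a deliberately naive sketch of the guardrail described above (the function name `naive_guardrail` is illustrative), run against the same four phrasings.

```python
BLOCKED_PHRASES = ["delete all users"]


def naive_guardrail(request: str) -> bool:
    """Return True if the request should be blocked (substring match only)."""
    text = request.lower()
    return any(phrase in text for phrase in BLOCKED_PHRASES)


requests = [
    "delete all users",
    "can you delete all the users",
    "please delete every user",
    "remove all user accounts",
]
blocked = {request: naive_guardrail(request) for request in requests}
for request, is_blocked in blocked.items():
    print(f"{is_blocked!s:5}  {request}")
```

Only the first request is blocked; the other three express the same intent and pass straight through.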

Many guardrails end up acting like string comparisons even when they use embeddings or classifiers. They match surface patterns, not intent. If the phrasing shifts, the guardrail often fails.

For example, a rule might block "delete all users" because that exact pattern was seen during testing. But the same system may allow "remove every user account" because the embedding distance is just far enough to slip past the threshold.

This is the same failure mode as brittle input validation. If your rules depend on matching specific strings or narrow patterns, you get a system that behaves safely in tests and unpredictably in production.

You cannot solve this by telling the model “if a request is like 'delete all users', refuse to do it”. That feels intuitive, but it fails for the same reason input‑validation-by-string-match fails in any other system.

A prompt can describe the rule, but it cannot enforce the rule. The model will try to follow the instruction, but it has no semantic guarantee. It can still be persuaded, confused, or bypassed by a phrasing it has not seen before.

To actually solve this, you need layered controls outside the model:

Treat the model as untrusted. Never let it directly execute destructive actions. Put a permission layer between the model and anything irreversible.
Normalise user input before it reaches the model. Collapse phrasing, remove fluff, and classify intent. This gives you a stable signal instead of raw text.
Use a separate classifier or rules engine to detect dangerous intent. This component should be simpler, more predictable, and easier to test than the model itself.
Require explicit confirmation for destructive operations. The model can propose an action, but a deterministic system must approve it.
Log every step. When something slips through, you need to see the input, the normalised form, the classification result, and the model’s output.

The prompt can express the policy, but the system must enforce it. If you rely on the model alone, you are depending on pattern matching. If you build a layered pipeline, you get behaviour you can reason about, test, and trust.
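A minimal sketch of the enforcement side, assuming the model's output has already been parsed into a structured action. The action names, the `ProposedAction` type, and the `authorise` function are all hypothetical; the point is that the decision is made by deterministic code, not by the model.

```python
from dataclasses import dataclass

# Hypothetical action names; a real system would define its own.
DESTRUCTIVE_ACTIONS = {"delete_users", "drop_table"}

# Every decision is recorded so a slip can be traced afterwards.
audit_log: list[tuple[str, bool]] = []


@dataclass(frozen=True)
class ProposedAction:
    name: str        # structured action proposed by the model
    confirmed: bool  # set only by a deterministic confirmation step


def authorise(action: ProposedAction, caller_permissions: set[str]) -> bool:
    """Deterministic gate: the model proposes, this function decides."""
    allowed = action.name in caller_permissions
    if action.name in DESTRUCTIVE_ACTIONS and not action.confirmed:
        allowed = False
    audit_log.append((action.name, allowed))
    return allowed
```

However the request was phrased, a destructive action only runs if the caller holds the permission and an explicit confirmation was recorded outside the model.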

Observability

Observability is weak because most systems log the request and the response, but not the context, the retrieval set, the template expansion, or the decoding parameters. Without those, debugging an LLM system is guesswork.
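As a rough sketch, a trace record for a single LLM call might capture each of those pieces in one structured log line. The `LLMTrace` type and its field names are assumptions for illustration, not a standard format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LLMTrace:
    request_id: str
    template_name: str          # which template was expanded
    rendered_prompt: str        # the prompt after template expansion
    retrieved_ids: list[str]    # identifiers of the retrieval set
    decoding: dict[str, float]  # e.g. temperature and top_p
    response: str = ""


def to_log_line(trace: LLMTrace) -> str:
    """Serialise the whole trace as one JSON line for later debugging."""
    return json.dumps(asdict(trace), sort_keys=True)


trace = LLMTrace(
    request_id="req-001",
    template_name="summary_v1",
    rendered_prompt="Summarise the ticket below for an engineer.",
    retrieved_ids=["doc-12", "doc-98"],
    decoding={"temperature": 0.0, "top_p": 1.0},
    response="Login fails after password reset.",
)
log_line = to_log_line(trace)
print(log_line)
```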

An LLM is at the centre of a much larger system

The LLM is only one component. The system around it decides whether your product behaves like a tool or a slot machine. Engineers who treat the whole pipeline as a software system, not a magic box, build reliable systems.

Determinism is a design choice

LLMs are probabilistic, but stability is possible. Temperature and top‑p control variance. Structured outputs reduce drift. Deterministic decoding is often more reliable than clever prompts. Treat randomness as a resource you allocate.

Temperature stretches or compresses the probability distribution. Top‑p chops off the tail of the distribution.

Temperature

As temperature increases, the LLM becomes more willing to pick lower‑probability tokens, which effectively means the "token candidate set" gets larger.

More accurately, low‑probability tokens get boosted, high‑probability tokens get flattened.

This means the model is less confident, more tokens become available, and the sampling process has more room to explore. The next token is drawn from a wider effective set.
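The usual way temperature is applied is to divide the logits by the temperature before the softmax. The sketch below assumes that formulation; the logit values are made up for illustration.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative values only
cold = softmax_with_temperature(logits, 0.5)  # sharper distribution
warm = softmax_with_temperature(logits, 2.0)  # flatter distribution
print([round(p, 3) for p in cold])
print([round(p, 3) for p in warm])
```

At the lower temperature the top token takes most of the probability mass; at the higher temperature the mass spreads towards the less likely tokens.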

Top-p

Top‑p (also called nucleus sampling) restricts the model to sampling only from the smallest set of tokens whose cumulative probability is ≥ p.

Think of it as a probability mass cutoff.

Example

Suppose the model predicts the next‑token distribution like this:

Token	Probability	Cumulative
A	0.40	0.40
B	0.25	0.65
C	0.15	0.80
D	0.10	0.90
E	0.05	0.95
F	0.05	1.00

Sorted by probability, cumulative mass builds like this:

A → 0.40
A+B → 0.65
A+B+C → 0.80
A+B+C+D → 0.90
A+B+C+D+E → 0.95
A+B+C+D+E+F → 1.00

Now apply top‑p:

top‑p = 0.5

Working down the ordered Probability column above, we include tokens until the cumulative probability is ≥ 0.5. Token A alone reaches only 0.40; adding B brings the total to 0.65, which satisfies the condition, so we stop descending the column.

With top-p = 0.5, only tokens A and B are allowed.

For top‑p = 0.8

Include tokens until cumulative ≥ 0.8 → A + B + C. Only A, B, C are allowed.

top‑p = 0.95

Include tokens until cumulative ≥ 0.95 → A + B + C + D + E. Tokens A to E allowed; F is excluded.

When top‑p = 1.0

No restriction — all tokens allowed.
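The worked example can be reproduced with a few lines of code. This is a minimal sketch of the selection step only (the function name `nucleus` is illustrative); a real decoder would also renormalise the kept probabilities and then sample from them.

```python
def nucleus(probs: dict[str, float], top_p: float) -> list[str]:
    """Return the smallest set of tokens whose cumulative probability >= top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept: list[str] = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append(token)  # the top-ranked token is always kept first
        cumulative += p
        # Small tolerance so floating-point rounding does not add an extra token.
        if cumulative >= top_p - 1e-9:
            break
    return kept


distribution = {"A": 0.40, "B": 0.25, "C": 0.15, "D": 0.10, "E": 0.05, "F": 0.05}
for p in (0.5, 0.8, 0.95, 1.0):
    print(p, nucleus(distribution, p))
```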

Passing temperature and top-p to OpenAI

In calling OpenAI, you can pass this:

{
  "model": "gpt-4.1",
  "messages": [
    { "role": "user", "content": "Explain temperature and top-p." }
  ],
  "temperature": 0.0,
  "top_p": 1.0
}

The last two fields directly control the sampling behaviour.

You are telling the model:

"Always pick the highest‑probability token. No randomness."

This is the closest thing to true determinism.

With temperature set to 0.0, the highest‑probability token is guaranteed to be selected, as long as the decoding method is greedy and no other randomness is introduced by the API or framework.

In an LLM, the decoder is the component that turns the model’s probability distribution into tokens.

Top‑p cannot exclude the highest‑probability token. The nucleus, the group of tokens built cumulatively above, always starts with the most likely token, so that token is inside the nucleus for any value of p. With temperature equal to 0.0 and greedy decoding, top‑p therefore has no effect on which token is chosen. Setting top_p to 1.0 simply makes it explicit that no truncation is applied.

Temperature = 0.0 and top_p = 1.0 is the strictest, safest deterministic configuration.

Context windows are not memory

AI vendors such as Anthropic and OpenAI control the LLM's window size, but you control how effectively you use it.

OpenAI's GPT‑5.4 has a 1,050,000‑token context window. GPT‑5.2, GPT‑5.1, and GPT‑5.1 Codex Max have 400,000‑token windows.

The window size is fixed at training time. Changing it requires retraining or re‑architecting the model, which only the vendor can do.

The vendor sets the ceiling. You decide how close you get to it. A 1M‑token window sounds like "great, I can dump everything in." But that is the wrong mental model.

The engineer decides:

how much of the window to fill
how aggressively to compress
how to structure retrieval
how to order information
how to avoid interference
how to budget tokens across system prompts, instructions, schemas, and retrieved docs

The vendor gives you the maximum. You determine the effective window.
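Budgeting can be made explicit in code. The sketch below assumes a crude estimate of about four characters per token and hypothetical per-section budgets; a real system would count tokens with the tokenizer for its model.

```python
# Hypothetical per-section budgets, in tokens.
TOKEN_BUDGET = {
    "system_prompt": 500,
    "instructions": 300,
    "schema": 200,
    "retrieved_docs": 3000,
}


def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token (an assumption)."""
    return (len(text) + 3) // 4


def budget_overruns(sections: dict[str, str],
                    budget: dict[str, int] = TOKEN_BUDGET) -> dict[str, int]:
    """Return how many tokens each section is over its budget (0 if within)."""
    overruns = {}
    for name, text in sections.items():
        used = estimate_tokens(text)
        overruns[name] = max(0, used - budget[name])
    return overruns


sections = {"system_prompt": "x" * 400, "retrieved_docs": "y" * 16000}
print(budget_overruns(sections))
```

Checking the budget before each call turns "how much of the window to fill" into a decision you can test rather than a side effect of whatever was concatenated.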

A large window looks powerful, yet it behaves nothing like a bigger RAM module. The more of the window you fill, the more information the model has to scan and reconcile, often far more than it can reliably use. The signal‑to‑noise ratio drops, and the model starts leaning on familiar statistical patterns instead of the details that matter.

Position inside the window matters more than the raw size. Early and late tokens are not treated equally, and different models weight them differently. There is no guarantee that the most recent content is the content the model will use. This is why a model given a long prompt can ignore the last instruction you added.

Large windows also increase interference. When you pack in too much material, similar concepts begin to blur. Two sections that look distinct to you can collide inside the model’s internal representation. The output feels vague or inconsistent even though the inputs look clean.

Retrieval quality beats window size

This is why retrieval quality beats window size. Retrieval gives you control over what enters the window and where it goes. A large window without retrieval is just a bigger bucket. A smaller window with good retrieval is a structured workspace.

Retrieval here is any selection of data performed before the prompt is sent to the LLM. This may be a classic RAG pipeline, where a document store is searched, the results are chunked, and the chunks are passed to an LLM that is instructed to restrict its analysis to the supplied data.

But retrieval here is more general than RAG. It refers to the smart selection of data for an LLM to process. Retrieval may bring data back from a SQL, Graph or NoSQL query, or it may be the smart selection of summaries or user's notes pulled from storage.

The opposite of retrieval is dumping everything in raw.

The most reliable mental model is to treat the window as a scratchpad. It is a temporary working area, not a knowledge store. You place only what the model needs for the current task, in the order that helps it reason. If you treat the window like long‑term memory, you get unpredictable behaviour. If you treat it like a scratchpad, you get control.
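A minimal sketch of the scratchpad approach: pick the most relevant chunks that fit a fixed token budget and leave everything else out. The `Chunk` type, the scores, and the token counts are assumptions for illustration; the scores would come from whatever retriever you use.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from a retriever (higher is better)
    tokens: int   # token count for this chunk


def select_chunks(chunks: list[Chunk], token_budget: int) -> list[Chunk]:
    """Greedily keep the highest-scoring chunks that fit within the budget."""
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + chunk.tokens <= token_budget:
            selected.append(chunk)
            used += chunk.tokens
    return selected


candidates = [
    Chunk("login flow overview", score=0.9, tokens=300),
    Chunk("full auth service changelog", score=0.8, tokens=500),
    Chunk("password reset steps", score=0.4, tokens=200),
]
chosen = select_chunks(candidates, token_budget=600)
print([c.text for c in chosen])
```

The greedy rule is one simple choice among many; the design point is that selection and ordering are decided by your code before anything reaches the window.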

LLMs compress patterns, not facts

When an LLM is trained, the input training data is measured in terabytes. The output is billions of weights that encode the statistical structure of that data. Those weights are the model. They encode: patterns (common sequences, phrasing, structures, and correlations); relationships (semantic similarity, analogies); generalisation behaviour (moving between examples via statistical interpolation); and task-relevant transformations that assist with instruction following, data formatting, and conversational norms.

LLMs do not store data; they are not databases. They store weights that represent patterns from the training data.

Many different training examples can be represented internally by the same (or very similar) set of weights.

As different examples can be represented by the same weights, LLMs have a tendency to hallucinate. Hallucinations are baked into the design of LLMs.

Training takes terabytes of text, applies billions of updates to a fixed‑size model, and outputs weights that approximate the training data.

In doing this, the transformation is many‑to‑one (different examples collapse together) and irreversible, as you cannot reconstruct the original training data from the weights. More importantly, the output is statistical: the weights encode likelihoods, not facts.

Because of this, the model cannot store exact information. It can only store patterns.

Where patterns overlap, details are lost. Where details are lost, the model fills in the gaps.

That filling‑in is what we call hallucination. The many-to-one transformation also explains why rare facts vanish and plausible but false details appear.
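A toy analogy makes the many-to-one effect concrete. The sketch below is not how an LLM works internally; it "trains" a bigram table (which word may follow which) on two made-up sentences and then lists every sentence the table considers valid.

```python
from collections import defaultdict

training = ["paris is in france", "berlin is in germany"]

# "Train": keep only which word may follow which. The original sentences
# are discarded, so the summary is lossy and many-to-one.
follows: dict[str, set[str]] = defaultdict(set)
for sentence in training:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for current, nxt in zip(words, words[1:]):
        follows[current].add(nxt)


def generate_all(prefix=("<s>",), max_len=6):
    """Enumerate every sentence the bigram table considers valid."""
    last = prefix[-1]
    if last == "</s>":
        yield " ".join(prefix[1:-1])
        return
    if len(prefix) > max_len:
        return
    for nxt in sorted(follows[last]):
        yield from generate_all(prefix + (nxt,), max_len)


generated = list(generate_all())
novel = [s for s in generated if s not in training]
print(novel)
```

Both training sentences share the pattern "is in", so the table cannot tell which city belongs to which country. It produces fluent sentences that never appeared in the data and are false.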

A fluent answer is not necessarily a correct one. A fluent answer should not be over-trusted.

An LLM is not a database or lookup table. It is a function approximator trained on vast data, forced to compress it into a limited parameter space (weights), and optimised for prediction, not truth.

Prompting is programming

Prompts act like programs for a probabilistic interpreter. Because they are written in natural language, prompts are prone to the mistakes humans make in any written instructions: ambiguity; not being explicit about what is required; not stating what is not required; and failing to mention who the output is for.

Structure beats style. A structured prompt is a foundation for a robust interface; an unstructured one is built on shifting sand.

Constraints

Constraints beat persuasion. Constraining your LLM is essential. It is not about "being firm" with the model. It is about shaping the space of valid outputs so the model cannot wander.

In a prompt, when you say:

"Please answer carefully."; "Try not to hallucinate."; "Make sure you follow the instructions."; "Be precise."

You are appealing to behaviour the model cannot guarantee, because persuasion relies on the model choosing to comply. "Please answer carefully" is a request. The LLM should "try not to hallucinate". What if it does? You have not said. This is like neglecting to define an else on an if.

Persuasion is weak because it competes with every other pattern the model has learned.

Constraints, by contrast, reshape the output space.

A constraint is something that reduces the degrees of freedom the model has when generating.

Examples of constraints include requiring the LLM to return its result in a defined schema, assigning roles with explicit boundaries (such as 'system', 'user', and 'assistant'), and stating ordering rules such as "must cite X before Y".

Instead of trying to "convince" the model to behave, you reduce the possibility of misbehaviour as close to zero as you can.

Schemas beat prose. Treat prompts as code and debug them as code. Systems behave better when you design prompts as logic, not decoration.
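A schema is only a constraint if something checks it. The sketch below validates a model's JSON output against a small hypothetical schema (the field names are illustrative) and rejects anything that does not match, which supplies the "else" branch that a polite request leaves undefined.

```python
import json

# Hypothetical output schema: field name -> required Python type.
REQUIRED_FIELDS = {"summary": str, "confidence": float, "sources": list}


def parse_model_output(raw: str) -> dict:
    """Parse and validate model output; raise ValueError on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise ValueError(f"field {name!r} must be {expected_type.__name__}")
    extra = set(data) - set(REQUIRED_FIELDS)
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    return data


good = parse_model_output(
    '{"summary": "Login fails.", "confidence": 0.9, "sources": ["doc-12"]}'
)
print(good["summary"])
```

On a `ValueError` the calling code can retry, fall back, or escalate; the choice is made in code rather than left to the model.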

Conclusions

Tokens drive behaviour. If you want a dependable LLM system, you must engineer your solution to be aware of token‑level effects, not the surface text you see.

The brittle parts are not the models. They are the retrieval paths, templates, guardrails, and data plumbing wrapped around them. That is where systems break and, over time, silently degrade.

Guardrails only work when they are deterministic constraints, not suggestions the model may or may not follow.

Observability has to expose every transformation in the chain — prompts, expansions, retrieval sets, decoding parameters, and outputs — or you cannot debug real failures.

Context windows are scratchpads, not memory. Treat them as temporary workspaces and nothing more.

Retrieval quality dominates correctness. Window size is secondary. If the retrieved evidence is weak, the answer will be weak.

Hallucination is not a bug. It is a structural consequence of pattern‑based generation. You mitigate it with system design, not trust.

And prompting only becomes stable when you treat it as programming with constraints, not persuasion.


What software engineers need to know about LLMs

Jh Evans
