Large language models (LLMs) are disrupting the software engineering industry. Executives and software engineers now have a tool at their disposal that is so general in its scope that it can be dedicated to almost any task. LLMs are the ultimate "jack of all trades". It is our job to get the most from them.
The real interface: tokens, not text
Tokens shape what you can build. They decide how much context you can fit in, how fast the model responds, and how predictable the output is.
Token boundaries also change how the model interprets structure. Two prompts that look identical to you may tokenize differently and produce different behaviour.
When you design prompts, AI input or output schemas, or retrieval pipelines, you are really designing token flows. If you ignore tokens, you end up shipping features that behave one way in tests and another way in production.
Prompt A: "Summarize the user login flow."
Prompt B: "Summarise the user login flow."
To a human, the difference is not consequential. To a tokenizer, there is a critical difference.
"Summarize" and "Summarise" break into different token sequences.
The model’s internal statistics for each spelling differ.
The model may shift tone, structure, or level of detail.
And downstream formatting can change because the token pattern changed.
Or consider a change in spacing:
Prompt A: "List the steps to deploy the service."
Prompt B: "List the steps to deploy the service ."
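The only difference is a space before the full stop.
How the ending splits depends on the tokenizer, but the two prompts do not end in the same tokens.
In Prompt A the full stop follows "service" directly; in Prompt B the tokenizer sees a space and then a full stop, which is typically encoded differently.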
That tiny shift can change the model’s prediction path.
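A toy sketch of the effect, using a greedy longest-match tokenizer over a hand-picked vocabulary. Both `VOCAB` and `tokenize` are hypothetical and far simpler than a production tokenizer, so the exact splits a real model sees will differ. The point is only that near-identical strings produce different token sequences.

```python
# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = {
    "Summar", "ize", "ise", "List", " the", " steps", " to",
    " deploy", " service", " user", " login", " flow", ".", " .",
}


def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                break
        else:
            j = i + 1  # nothing matched: fall back to one character
        tokens.append(text[i:j])
        i = j
    return tokens


a = tokenize("List the steps to deploy the service.", VOCAB)
b = tokenize("List the steps to deploy the service .", VOCAB)
print(a[-2:])  # [' service', '.']
print(b[-2:])  # [' service', ' .']
print(tokenize("Summarize", VOCAB), tokenize("Summarise", VOCAB))
```

To inspect real splits, run the same two prompts through the tokenizer that ships with the model you are using.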
The model is not the system
Most failures blamed on models come from everything wrapped around them. In practice, the weak points look very familiar to any engineer who has shipped a distributed system.
Retrieval pipelines drift because indexes age, embeddings shift, and data freshness is rarely monitored. A model can only answer the question you actually retrieved, not the one you meant to retrieve.
Prompt templates collapse under odd inputs because they are often treated as static strings instead of executable logic. One unexpected newline or a missing field can break the entire chain of reasoning. Data freshness and data cleansing are key here.
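One way to treat a template as executable logic is to validate and normalise its inputs before rendering. The sketch below is a minimal illustration; the field names in `REQUIRED_FIELDS` and the template wording are assumptions, not a recommended format.

```python
REQUIRED_FIELDS = ("user_name", "question")  # hypothetical field names


def render_prompt(fields: dict[str, str]) -> str:
    """Build a prompt from validated, normalised fields instead of raw string pasting."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name, "").strip()]
    if missing:
        # Fail loudly rather than sending a half-filled template to the model.
        raise ValueError(f"missing prompt fields: {missing}")
    # Collapse newlines and runs of whitespace so odd input cannot reshape the template.
    clean = {name: " ".join(fields[name].split()) for name in REQUIRED_FIELDS}
    return f"User: {clean['user_name']}\nQuestion: {clean['question']}\nAnswer concisely."


print(render_prompt({"user_name": "Ada", "question": "How do I\n\nreset   my password?"}))
```

A missing field now raises an error at render time, where it can be tested, instead of surfacing later as a strange model answer.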
Guardrails
Guardrails miss edge cases because they rely on pattern matching, not semantic guarantees. A single unhandled phrasing can bypass a rule that looked airtight in testing.
Imagine you build a guardrail that blocks requests containing "delete all users". It works in tests, so you ship it.
Then a real user sends: "can you delete all the users" or "please delete every user" or "remove all user accounts"
Your guardrail only catches the exact phrase it was written for. It matches strings, not meaning. One phrasing slips through, and the model executes a path you thought was protected.
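The failure is easy to reproduce. The sketch below assumes the guardrail is a plain substring check; `BLOCKED_PHRASES` and `naive_guardrail` are illustrative names.

```python
BLOCKED_PHRASES = ["delete all users"]


def naive_guardrail(request: str) -> bool:
    """Return True when the request is blocked. Matches strings, not meaning."""
    text = request.lower()
    return any(phrase in text for phrase in BLOCKED_PHRASES)


for request in [
    "delete all users",              # blocked
    "can you delete all the users",  # passes
    "please delete every user",      # passes
    "remove all user accounts",      # passes
]:
    print(f"{request!r}: blocked={naive_guardrail(request)}")
```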
Many guardrails end up acting like string comparisons even when they use embeddings or classifiers. They match surface patterns, not intent. If the phrasing shifts, the guardrail often fails.
For example, a rule might block "delete all users" because that exact pattern was seen during testing. But the same system may allow "remove every user account" because the embedding distance is just far enough to slip past the threshold.
This is the same failure mode as brittle input validation. If your rules depend on matching specific strings or narrow patterns, you get a system that behaves safely in tests and unpredictably in production.
You cannot solve this by telling the model “if a request is like 'delete all users', refuse to do it”. That feels intuitive, but it fails for the same reason input‑validation-by-string-match fails in any other system.
A prompt can describe the rule, but it cannot enforce the rule. The model will try to follow the instruction, but it has no semantic guarantee. It can still be persuaded, confused, or bypassed by a phrasing it has not seen before.
To actually solve this, you need layered controls outside the model:
- Treat the model as untrusted. Never let it directly execute destructive actions. Put a permission layer between the model and anything irreversible.
- Normalise user input before it reaches the model. Collapse phrasing, remove fluff, and classify intent. This gives you a stable signal instead of raw text.
- Use a separate classifier or rules engine to detect dangerous intent. This component should be simpler, more predictable, and easier to test than the model itself.
- Require explicit confirmation for destructive operations. The model can propose an action, but a deterministic system must approve it.
- Log every step. When something slips through, you need to see the input, the normalised form, the classification result, and the model’s output.
The prompt can express the policy, but the system must enforce it. If you rely on the model alone, you are depending on pattern matching. If you build a layered pipeline, you get behaviour you can reason about, test, and trust.
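A minimal sketch of the enforcement side, covering three of the controls: the permission layer, explicit confirmation, and logging. Input normalisation and intent classification are left out. The action names, `ProposedAction`, and `authorize` are hypothetical; a real system would plug in its own permission model.

```python
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

DESTRUCTIVE_ACTIONS = {"delete_users", "drop_table"}  # hypothetical action names


@dataclass(frozen=True)
class ProposedAction:
    name: str        # what the model wants to do
    confirmed: bool  # set by a human or deterministic step, never by the model


def authorize(action: ProposedAction, user_permissions: set[str]) -> bool:
    """Deterministic gate between the model's proposal and execution."""
    if action.name not in user_permissions:
        decision = False             # caller lacks the permission
    elif action.name in DESTRUCTIVE_ACTIONS:
        decision = action.confirmed  # destructive: needs explicit confirmation
    else:
        decision = True
    log.info("action=%s confirmed=%s allowed=%s", action.name, action.confirmed, decision)
    return decision


print(authorize(ProposedAction("delete_users", confirmed=False), {"delete_users"}))  # False
```

However the request was phrased, the model can only ever produce a `ProposedAction`; whether it runs is decided by code you can unit test.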
Observability
Observability is weak because most systems log the request and the response, but not the context, the retrieval set, the template expansion, or the decoding parameters. Without those four, debugging an LLM system is guesswork.
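One possible shape for such a log record is sketched below. The `LLMTrace` fields are illustrative assumptions; the idea is simply that every model call emits one structured line containing everything needed to replay it.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LLMTrace:
    """One record per model call; field names are illustrative."""
    request: str                # raw user input
    retrieved_ids: list[str]    # which documents entered the context
    rendered_prompt: str        # the template after expansion
    decoding: dict[str, float]  # temperature, top_p, and similar
    response: str = ""
    context_tokens: int = 0


def to_log_line(trace: LLMTrace) -> str:
    """Serialise the full trace as one JSON line for a log pipeline."""
    return json.dumps(asdict(trace), sort_keys=True)


trace = LLMTrace(
    request="How do I reset my password?",
    retrieved_ids=["kb-12", "kb-40"],
    rendered_prompt="Context: ...\nQuestion: How do I reset my password?",
    decoding={"temperature": 0.0, "top_p": 1.0},
    response="Use the reset link on the login page.",
    context_tokens=182,
)
print(to_log_line(trace))
```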
An LLM is at the centre of a much larger system
The LLM is only one component. The system around it decides whether your product behaves like a tool or a slot machine. Engineers who treat the whole pipeline as a software system, not a magic box, are the ones who build reliable systems.
Determinism is a design choice
LLMs are probabilistic, but stability is possible. Temperature and top‑p control variance. Structured outputs reduce drift. Deterministic decoding is often more reliable than clever prompts. Treat randomness as a resource you allocate.
Temperature stretches or compresses the probability distribution. Top‑p chops off the tail of the distribution.
Temperature
As temperature increases, the LLM becomes more willing to pick lower‑probability tokens, which effectively means the "token candidate set" gets larger.
More accurately, low‑probability tokens get boosted, high‑probability tokens get flattened.
This means the model is less confident, more tokens become available, and the sampling process has more room to explore. The next token is drawn from a wider effective set.
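The flattening can be seen in a few lines of arithmetic. This is a sketch of the standard softmax-with-temperature calculation; the logits are made-up scores for three tokens, and `softmax_with_temperature` is an illustrative helper.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is treated as greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As the temperature rises, the top token's share falls and the bottom token's share grows, which is the wider effective set described above.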
Top-p
Top‑p (also called nucleus sampling) restricts the model to sampling only from the smallest set of tokens whose cumulative probability is ≥ p.
Think of it as a probability mass cutoff.
Example
Suppose the model predicts the next‑token distribution like this:
| Token | Probability | Cumulative |
|---|---|---|
| A | 0.40 | 0.40 |
| B | 0.25 | 0.65 |
| C | 0.15 | 0.80 |
| D | 0.10 | 0.90 |
| E | 0.05 | 0.95 |
| F | 0.05 | 1.00 |
Sorted by probability, cumulative mass builds like this:
- A → 0.40
- A+B → 0.65
- A+B+C → 0.80
- A+B+C+D → 0.90
- A+B+C+D+E → 0.95
- A+B+C+D+E+F → 1.00
Now apply top‑p:
top‑p = 0.5
Working down the table, we include tokens until the cumulative probability is ≥ 0.5. Token A alone reaches 0.40, which is not enough; adding B reaches 0.65, so we stop there. Only tokens A and B are allowed.
top‑p = 0.8
Include tokens until the cumulative probability is ≥ 0.8: A + B + C. Only A, B, and C are allowed.
top‑p = 0.95
Include tokens until the cumulative probability is ≥ 0.95: A + B + C + D + E. Tokens A to E are allowed; F is excluded.
top‑p = 1.0
No restriction: all tokens are allowed.
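The worked example can be reproduced with a short function. This is a minimal sketch: `nucleus` and `PROBS` are illustrative names, and the small tolerance is an assumption added to absorb floating-point rounding in the running sum.

```python
def nucleus(probs: dict[str, float], top_p: float) -> list[str]:
    """Return the smallest set of tokens whose cumulative probability reaches top_p."""
    kept, cumulative = [], 0.0
    # Walk tokens from most to least probable, as in the table above.
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append(token)
        cumulative += p
        if cumulative >= top_p - 1e-9:  # small tolerance for floating-point sums
            break
    return kept


PROBS = {"A": 0.40, "B": 0.25, "C": 0.15, "D": 0.10, "E": 0.05, "F": 0.05}
for p in (0.5, 0.8, 0.95, 1.0):
    print(p, nucleus(PROBS, p))
# 0.5 ['A', 'B']
# 0.8 ['A', 'B', 'C']
# 0.95 ['A', 'B', 'C', 'D', 'E']
# 1.0 ['A', 'B', 'C', 'D', 'E', 'F']
```

In a real sampler the surviving probabilities are then renormalised so they sum to 1 before a token is drawn.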
Passing temperature and top-p to OpenAI
When calling the OpenAI API, you can pass a request body like this:
{
  "model": "gpt-4.1",
  "messages": [
    { "role": "user", "content": "Explain temperature and top-p." }
  ],
  "temperature": 0.0,
  "top_p": 1.0
}
The last two fields directly control the sampling behaviour.
You are telling the model:
"Always pick the highest‑probability token. No randomness."
This is the closest thing to true determinism.
With temperature set to 0.0, the highest‑probability token is guaranteed to be selected, as long as the decoding method is greedy and no other randomness is introduced by the API or framework.
In an LLM, the decoder is the component that turns the model’s probability distribution into tokens.
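A minimal sketch of that decoding step, assuming the distribution has already been computed. `pick_token` is a hypothetical helper, and it assumes any temperature scaling was applied to `probs` beforehand; real decoders work on logits over a full vocabulary.

```python
import random


def pick_token(probs: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Greedy when temperature is 0; otherwise sample from the given distribution."""
    if temperature == 0.0:
        # Deterministic: always the highest-probability token.
        return max(probs, key=probs.get)
    # probs is assumed to be temperature-scaled already; draw one token at random.
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


probs = {"A": 0.40, "B": 0.25, "C": 0.15, "D": 0.10, "E": 0.05, "F": 0.05}
rng = random.Random()
print({pick_token(probs, 0.0, rng) for _ in range(20)})  # {'A'}
```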
Top‑p cannot exclude the highest‑probability token. The nucleus, the group of tokens built cumulatively above, always starts from the most probable token, so that token is always inside it. With temperature equal to 0.0, top‑p therefore does not change which token is chosen; setting it to 1.0 simply removes one more variable to reason about.
Temperature = 0.0 and top_p = 1.0 is the strictest, safest deterministic configuration.
Context windows are not memory
AI vendors such as Anthropic and OpenAI control the LLM's window size, but you control how effectively you use it.
OpenAI's GPT‑5.4 has a 1,050,000‑token context window. GPT‑5.2, GPT‑5.1, and GPT‑5.1 Codex Max have 400,000‑token windows.
The window size is fixed at training time. Changing it requires retraining or re‑architecting the model, which only the vendor can do.
The vendor sets the ceiling. You decide how close you get to it. A 1M‑token window sounds like "great, I can dump everything in." But that is the wrong mental model.
The engineer decides:
- how much of the window to fill
- how aggressively to compress
- how to structure retrieval
- how to order information
- how to avoid interference
- how to budget tokens across system prompts, instructions, schemas, and retrieved docs
The vendor gives you the maximum. You determine the effective window.
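Budgeting can be made explicit in code. The sketch below keeps the fixed parts of the prompt whole and adds retrieved documents until the budget runs out. `count_tokens` is a crude word-count stand-in for a real tokenizer, and every name and number here is an assumption for illustration.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_budget(
    fixed_parts: list[str], docs: list[str], window: int, reserve_for_output: int
) -> list[str]:
    """Keep fixed parts whole, then add retrieved docs in order until the budget runs out."""
    budget = window - reserve_for_output - sum(count_tokens(p) for p in fixed_parts)
    if budget < 0:
        raise ValueError("fixed prompt parts alone exceed the window")
    selected = []
    for doc in docs:  # docs are assumed to be sorted by relevance
        cost = count_tokens(doc)
        if cost > budget:
            break
        selected.append(doc)
        budget -= cost
    return selected


fixed = ["You are a support assistant.", "Answer in JSON."]  # 5 + 3 tokens
docs = ["a b c d", "e f g", "h i j k l"]
print(fit_to_budget(fixed, docs, window=20, reserve_for_output=6))  # ['a b c d']
```

Setting `window` well below the vendor's maximum is how you choose an effective window rather than inheriting one.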
A large window looks powerful, yet it behaves nothing like a bigger RAM module. The more of the window you fill, the more information the model has to scan and reconcile, and it can end up with far more than it can reliably use. The signal‑to‑noise ratio drops, and the model starts leaning on familiar statistical patterns instead of the details that matter.
Position inside the window matters more than the raw size. Early and late tokens are not treated equally, and different models weight them differently. There is no guarantee that the most recent content is the content the model will use. This is why a model given a long prompt often ignores the last instruction you added.
Large windows also increase interference. When you pack in too much material, similar concepts begin to blur. Two sections that look distinct to you can collide inside the model’s internal representation. The output feels vague or inconsistent even though the inputs look clean.
Retrieval quality beats window size
This is why retrieval quality beats window size. Retrieval gives you control over what enters the window and where it goes. A large window without retrieval is just a bigger bucket. A smaller window with good retrieval is a structured workspace.
Retrieval here means any selection of data that happens before the prompt is sent to the LLM. It may be a classic RAG pipeline, where a document store is searched, the results are chunked, and the chunks are passed to an LLM that is instructed to restrict its analysis to them.
But retrieval here is more general than RAG. It refers to the smart selection of data for an LLM to process. Retrieval may bring data back from a SQL, graph, or NoSQL query, or it may be the smart selection of summaries or users' notes pulled from storage.
The opposite of retrieval is dumping everything in raw.
The most reliable mental model is to treat the window as a scratchpad. It is a temporary working area, not a knowledge store. You place only what the model needs for the current task, in the order that helps it reason. If you treat the window like long‑term memory, you get unpredictable behaviour. If you treat it like a scratchpad, you get control.
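A small sketch of the scratchpad approach: select only the relevant chunks, then place them in a deliberate order. The keyword-overlap `score` is a toy assumption standing in for whatever search, query, or embedding lookup a real system uses.

```python
def score(query: str, chunk: str) -> int:
    """Toy relevance score: number of query words that also appear in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def build_context(query: str, chunks: list[str], k: int) -> str:
    """Select up to k relevant chunks and place them before the question."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    relevant = [c for c in ranked[:k] if score(query, c) > 0]
    return "\n".join(relevant + [f"Question: {query}"])


chunks = [
    "password reset requires email verification",
    "the office is closed on friday",
    "reset links expire after one hour",
]
print(build_context("how do I reset my password", chunks, k=2))
```

The unrelated chunk never enters the window, so it cannot interfere with the answer.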
LLMs compress patterns, not facts
When an LLM is trained, the input training data is measured in terabytes. The output is billions of weights that encode the statistical structure of that data. Those weights are the model. They capture patterns (common sequences, phrasing, structures, and correlations); relationships (semantic similarity, analogies); generalisation behaviour (moving between examples via statistical interpolation); and task‑relevant transformations that assist with instruction following, data formatting, and conversational norms.
LLMs do not store data; they are not databases. They store weights that represent patterns from the training data.
Many different training examples can be represented internally by the same (or very similar) set of weights.
Because different examples can be represented by the same weights, LLMs have a tendency to hallucinate. Hallucinations are baked into the design of LLMs.
Training takes terabytes of text, applies billions of updates to a fixed‑size model, and outputs weights that approximate the training data.
This transformation is many‑to‑one (different examples collapse together) and irreversible, as you cannot in general reconstruct the original training data from the weights. More importantly, the output is statistical: the weights encode likelihoods, not facts.
Because of this, the model cannot store exact information. It can only store patterns.
Where patterns overlap, details are lost. Where details are lost, the model fills in the gaps.
That filling‑in is what we call hallucination. The many-to-one transformation also explains why rare facts vanish and plausible but false details appear.
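A deliberately crude toy makes the mechanism visible. The "model" below keeps only next-word counts, not sentences, so a rare fact is overwritten by the common pattern. The training sentences are invented, and a real LLM conditions on far more context than one word; this only illustrates the many-to-one collapse.

```python
from collections import Counter, defaultdict

TRAINING = [
    "alice lives in paris",
    "bob lives in paris",
    "dana lives in paris",
    "carol lives in rome",  # the rare fact
]

# "Training": keep only next-word counts per preceding word, not the sentences.
counts: dict[str, Counter] = defaultdict(Counter)
for sentence in TRAINING:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1


def complete(prompt: str) -> str:
    """Return the most likely next word given only the last word of the prompt."""
    last = prompt.split()[-1]
    return counts[last].most_common(1)[0][0]


print(complete("carol lives in"))  # paris
```

The answer is fluent and statistically well supported, and it is wrong for Carol.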
A fluent answer is not necessarily a correct one, and it should not be over‑trusted.
An LLM is not a database or lookup table. It is a function approximator trained on vast data, forced to compress it into a limited parameter space (weights), and optimised for prediction, not truth.
Prompting is programming
Prompts act like programs for a probabilistic interpreter. Because they are written in natural language, prompts are prone to the mistakes humans make in written instructions: being ambiguous; not being explicit about what is required; not stating what is not required; and failing to mention who the output is for.
Structure beats style. A structured prompt is a foundation for a robust interface, while an unstructured one is built on shifting sand.
Constraints
Constraints beat persuasion. Constraining your LLM is essential. It is not about "being firm" with the model. It is about shaping the space of valid outputs so the model cannot wander.
In a prompt, when you say "Please answer carefully.", "Try not to hallucinate.", "Make sure you follow the instructions.", or "Be precise.", you are appealing to behaviour the model cannot guarantee, because persuasion relies on the model choosing to comply. "Please answer carefully" is a request. The LLM should "try not to hallucinate". What if it does? You have not said. This is like neglecting to define an else on an if.
Persuasion is weak because it competes with every other pattern the model has learned.
Constraints, by contrast, reshape the output space.
A constraint is something that reduces the degrees of freedom the model has when generating.
Examples of constraints include requiring the LLM to return its result in a schema, assigning content to roles with explicit boundaries such as 'user', 'system', or 'assistant', or specifying that the LLM "must cite X before Y".
Instead of trying to "convince" the model to behave, you reduce the possibility of misbehaviour to as close to zero as you can.
Schemas beat prose. Treat prompts as code and debug them as code. Systems behave better when you design prompts as logic, not decoration.
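A schema only constrains the system if something checks it. The sketch below rejects any output that is not JSON with exactly the expected keys and types, which supplies the missing "else" branch. The `SCHEMA` fields are hypothetical, and a production system might use a schema library instead of hand-written checks.

```python
import json

SCHEMA = {"summary": str, "steps": list, "confidence": float}  # hypothetical output schema


def parse_model_output(raw: str) -> dict:
    """Accept the output only if it is JSON with exactly the expected keys and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != set(SCHEMA):
        raise ValueError("output keys do not match the schema")
    for key, expected in SCHEMA.items():
        if not isinstance(data[key], expected):
            raise ValueError(f"field {key!r} should be {expected.__name__}")
    return data


good = '{"summary": "Deploy the service", "steps": ["build", "ship"], "confidence": 0.9}'
print(parse_model_output(good)["steps"])  # ['build', 'ship']
```

When validation fails, the caller decides what happens next: retry, fall back, or surface an error. That decision lives in code, not in the prompt.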
Conclusions
- Tokens drive model behaviour, so any dependable LLM system must be engineered around token‑level effects rather than surface text.
- The fragile parts of the stack are the retrieval, templates, guardrails, and data plumbing wrapped around the model, not the model itself.
- Guardrails only become reliable when enforced by deterministic system logic instead of relying on the model’s cooperation.
- Observability must reveal every transformation in the pipeline to make failures diagnosable.
- Context windows function as short‑lived workspaces rather than any form of memory.
- Retrieval quality has a larger impact on correctness than window size.
- Hallucination is an unavoidable consequence of pattern compression and must be mitigated through system design rather than trust.
- Prompting only becomes stable when treated as programming with explicit constraints instead of attempts at persuasion.