Phroneses.com

Why Junior Engineers Matter More as AI Expands

2026-05-27T00:00:00+00:00

The Adaptation of the Junior Engineer in an AI‑Accelerated Profession

The landscape has shifted. AI can generate code at a pace that would have been unthinkable a few years ago, but speed is not the work.

Speed cannot decide what should exist, why it matters, or whether it is safe. The belief that a junior can lean on AI and bypass the discipline is a misreading of the craft.

Early‑career engineers are needed more than ever because the judgement required to guide, verify, and constrain AI now sits at the centre of the role.

The junior position is not disappearing. It is being reshaped. AI has lowered the cost of producing code, but it has raised the cost of understanding what that code means. The work has not become smaller; it has become sharper, with an additional focus.

The organisations that recognise this early will keep their engineering discipline intact. The ones that do not will discover that AI exposes weaknesses in thinking faster than they can respond.

The Changing Weight of the Work

Typing has never been the job. It was simply the visible part of it. The real work — analysis, verification, risk thinking, system reasoning, and safety — has always carried the weight. AI accelerates the mechanical layer and exposes the cognitive one. Juniors now meet the deeper parts of the discipline sooner, and the expectations rise accordingly.

This shift is not cosmetic. It is economic. When code becomes cheap, correctness becomes expensive. The cost of a faulty assumption, a missed constraint, or a silent failure grows. The value of the junior engineer lies in their ability to prevent these errors before they harden into production.

AI Introduces New Types of Failure

When using an LLM in a pipeline, AI introduces new categories of failure: output-level instability, and behavioural-level instability.

Output-level Instability

LLMs are non-deterministic, probability machines.

Because of this schema drift, hallucinations, and silent truncation of results, can all ocur. The junior staff member will need to develop skills in detecting and handling all of these. These are changes in the way the LLM might respond to your system so your calling system must be robust to such variety.

Behavioural-level Instability

Across multiple LLM calls, even if the shape of the output result is the same, the behaviour of the LLM may change internally.

Given an identical prompt, "Extract the customer’s job title", and the same input, "My name is Helen and I work as a senior analyst at JPMG", the first call may return "senior analyst", the second may return "analyst", and the third may return "Senior Analyst".

In this case, all data passed to the LLM (the prompt and the input) and the output schema (a string in each case) remain the same. However, a change in the LLM’s internal behaviour has produced different outputs. Juniors need to be attuned to this possibility and know how to address it.

The Organisational Obligation

None of this works if organisations cling to the old model. Juniors cannot develop judgement in an old environment optimised for throughput. They need structured mentorship, slower reviews, and the psychological safety to test their reasoning.

Juniors need decision‑rights that are clear, not implied. Decision-rights are an understanding between the junior and their colleagues on what they can decide for themselves, and what they cannot and must seek input to resolve.

Juniors need leaders who understand that judgement is not taught by accident.

If the system does not adapt, the junior cannot.

Emerging Responsibilities

The adapted junior role becomes more investigative and more integrative. The work stretches across definition, verification, safety, and coherence.

Problem framing becomes central. Before any code is generated, the junior and their team must be clear on what the business is trying to achieve.
Constraint recognition grows in importance. Boundaries, risks, and compliance obligations must be surfaced early.
AI‑guided exploration replaces manual iteration. The junior evaluates options rather than producing them from scratch.
Verification discipline becomes essential. Plausible output is not enough. It must be correct, safe, and aligned with intent. AI can generate as much code as you want. But is it the right code? Determining whether generated code is the right code is part of the junior's role, supported by their team, the development process and wider engineering leadership.
Integration awareness develops sooner. Systems fail at the seams, not in isolation. The junior must develop skills to be aware of this and build solutions that are hardened to failure.
Operational literacy becomes expected. Logs, metrics, observability, and incident handling enter the junior toolkit.
Documentation clarity gains weight. Decisions must be legible and reproducible. "The AI did it" is not a defence.

Should your organisation invoke an LLM as part of a processing pipeline, token-level reasoning becomes a topic that needs addressing. Even with an identical prompt, an LLM's internal behaviour may vary unless steps are taken to constrain temperature, top-p, and top-k. However, even if these values are set to 0, 0, and 1 respectively (so that the LLM chooses the highest-probability next token), the quality of the response may decrease. This decrease is due to multiple factors: the LLM becoming overly literal when processing the prompt, and becoming less robust to ambiguous input. The LLM may fail on a task requiring synthesis or nuance as these require variety over the next token, not always the highest‑probability one.

These responsibilities demand human judgement. AI cannot supply it.

Failure‑Mode Literacy

Engineering maturity is measured by how you handle failure, not how quickly you produce output. Juniors must learn to read failure modes: what breaks, why it breaks, and how the system behaves under stress.

This is where judgement is forged.

Evaluating LLM output

Both output-level and behaviour-level instability require your junior to learn the discipline of evaluating model behaviour, not just observing it.

LLM output must be tested for schema reliability, instruction adherence, grounding fidelity, and deterministic stability. Behaviour must be measured over time so that drift is detected early rather than discovered in production.

Evaluation becomes part of the junior role because correctness is now the expensive part of the work. AI accelerates your ability to produce code, so humans must strengthen verification.

Juniors often see AI‑generated artefacts first, which means they become the first line of defence against drift, hallucination, and structural failure.

The junior role is not shrinking, it is moving closer to the centre of the system.

Schema reliability

Schema reliability is the stability of the output structure across calls. It asks whether the model returns the same shape every time. A reliable schema preserves field names, nesting, ordering, and types. When schema reliability is weak, downstream systems break: parsers fail, validators reject output, and silent truncation corrupts results. Juniors must learn to detect when the structure shifts, even subtly, because structural instability will cause production failure.

Instruction adherence

Instruction adherence is the model’s ability to follow the constraints it was given. It measures whether the output respects required fields, forbidden content, formatting expectations, safety constraints, and domain‑specific rules. Weak adherence produces plausible but incorrect output that appears compliant but violates intent. Juniors must learn to test adherence explicitly, because LLMs often drift away from constraints under load, ambiguity, or long contexts.

Grounding fidelity

Grounding fidelity is the degree to which the model’s output remains anchored to the provided context, data, or retrieval results. High fidelity means the model stays within the evidence; low fidelity means it fabricates, embellishes, or substitutes. This is the core defence against hallucination. Juniors must learn to check whether each claim in the output can be traced back to a source. Without grounding fidelity, correctness becomes guesswork and organisational risk increases.

Deterministic stability

Deterministic stability is the consistency of the model’s behaviour under identical conditions. It measures whether repeated calls with the same prompt, same context, and same parameters produce meaningfully similar results. Instability here signals deeper behavioural drift: model updates, sampling variance, context‑window rollover, or upstream nondeterminism. Juniors must learn to monitor this stability because unpredictable behaviour, even within a fixed schema, undermines trust in the system.

Once evaluation becomes routine, the next layer of responsibility emerges. Understanding how AI‑driven behaviour interacts with organisational risk, regulation, and safety boundaries becomes a concern.

Compliance and Safety

AI introduces new liabilities. Licensing, data handling, regulatory expectations, model hallucinations, and architecture all sit inside the junior’s world now. The business must help them to learn to recognise unsafe output and understand the organisational risk attached to it. Secure by default is no longer a slogan; it is a habit.

Once an LLM becomes part of your production pipeline, it represents a system-level reliability concern. Junior colleagues will need to understand retrieval hops, orchestration cost, and architectural latency.

Creation vs Integration

Many teams still confuse "using a chatbot to generate new code" with "running an LLM inside a production pipeline". These are not the same problem: the former accelerates creation, while the latter introduces system‑level reliability concerns that juniors must learn to evaluate.

But even chatbot‑generated code is not free. It must still be evaluated to answer the question: "is adding this code into our system the right thing to do?"

The distinction matters because both activities demand judgement, but pipeline integration demands system‑level reasoning and reliability awareness.

The Apprenticeship Model Returns

AI compresses the early stages of skill acquisition because the novice to intermediate gap is mostly about knowledge access, pattern exposure, and basic scaffolding.

A novice must learn vocabulary, syntax, idioms, and the shape of common solutions ("house rules"). An LLM can supply this information instantly: it provides examples, explanations, and templates on demand. This removes much of the friction that traditionally slows early progress, so with AI the distance between novice and intermediate shrinks.

But the intermediate to senior gap is not reduced, because seniority is not a knowledge problem. It is a judgement problem formed through apprenticeship: pairing, review, reflection, and exposure to real events on real systems under real constraints.

Senior engineers develop taste, trade‑off literacy, failure intuition, and a sense of responsibility for long‑term consequences. These abilities cannot be acquired through text prediction alone. They come from lived experience with real systems, real failures, and real organisational pressures.

AI accelerates learning, but senior judgement is produced by responsibility, constraint, and lived experience. These are conditions that AI cannot inhabit. The craft remains intact because the essence of mastery is grounded in practice shaped by real systems, real failures, and real organisational pressures, not by information alone.

Juniors must learn the difference between additive work (generating new code), and transformative work (modifying existing systems). To transform an existing system safely requires judgement. Your organisation will need to support your junior colleague in developing that judgement given your company's unique codebase, infrastructure and culture.

A New Path to Seniority

Seniority emerges from judgement, not keystrokes. The route to senior for the junior shifts toward structure, risk, and operational thinking.

Architecture literacy develops earlier. Patterns and constraints become part of daily reasoning.
Risk thinking becomes foundational. Engineers learn to anticipate failure and design for recovery.
Review competence shifts from syntax to structure. The question becomes: does this code make sense?
Operational competence becomes core. Observability and incident handling help to shape judgement.
Decision clarity becomes a differentiator. Seniors articulate reasoning, not just outcomes.
Cross‑functional communication grows in importance. Complexity must be translated into clarity.

Juniors are ideally placed to contribute to AI-augmented team processes: reviewing AI-generated artefacts, maintaining team-level shared understanding, and helping to ensure coherence across accelerated workflows.

The work becomes less about producing code and more about shaping the conditions in which code can be trusted.

The Cultural Shift

High‑pace environments often reward noise. AI accelerates that tendency. But the teams that thrive will be the ones that reward clarity instead. Juniors need a culture that values slow thinking at the right moments, not constant motion.

Expectations of juniors will vary depending on the AI‑maturity of your organisation.

In low‑maturity environments, juniors are forced to compensate for weak processes, unclear decision‑rights, and inconsistent use of AI.

In high‑maturity environments, juniors grow faster because the system around them is stable: prompts are versioned, retrieval is predictable, evaluation is routine, and model updates are treated as engineering events. The culture determines whether AI becomes an accelerant for judgement or a multiplier of confusion.

Practical First Steps for Juniors

Learn to articulate intent before touching a tool.
Practise verifying AI output with suspicion and skepticism, not trust.
Build small systems and observe how they behave under load.
Document decisions as if someone else must rely on them.
Study failure modes; they teach more than success ever will.

Practical First Steps for Leaders

Define decision‑rights explicitly. What can a junior decide for themself?
Slow down reviews to create space for reasoning.
Pair juniors with seniors intentionally, not incidentally.
Treat AI as an accelerator, but only within well‑understood and defined boundaries.
Build a culture where clarity is rewarded and noise is not.

AI is a tool. How can you best use that tool to help the junior do their best work? AI is not a replacement for the junior but an assistant.

The Evolving Value of the Junior Engineer

Juniors become force multipliers. They use AI to explore the solution space, stress‑test assumptions, and verify generated artefacts. They learn system thinking earlier and contribute meaningfully sooner. But only if the organisation supports them.

Ask not what your junior can do for you — ask what you can do for your junior.

Final Thoughts

Engineering is not being erased. It is being reweighted. Humans decide what should exist, why it matters, and whether it is safe. AI writes the code. The profession continues to evolve, but its centre of gravity remains the same: judgement, clarity, and the ability to read systems before safely changing them.

When Urgency is High but Progress is Slow

2026-05-26T00:00:00+00:00

When urgency rises faster than progress

Leaders often find themselves in a situation where urgency keeps increasing but progress does not follow. The pace is high, the pressure is real, yet the work feels harder to move forward. This is not a failure of intent. It is a sign that the operating conditions around the leader have shifted in ways that are not immediately visible.

Do you recognise this in your own environment? The symptoms are familiar: unclear ownership, AI‑driven noise, delivery friction, and teams struggling to make sound decisions at speed. These pressures do not call for more effort or inspiration. They call for structure, judgement, and operating clarity that can be applied tomorrow.

The thinking behind phroneses is built for this reality. It treats leadership as a system: decision‑rights, flow, constraints, and the conditions that allow teams to move with confidence when complexity rises. This is not a framework or a slogan. It is a way of seeing the organisation that makes the next step clearer and the work easier to lead.

When leaders adopt this way of thinking, the effect is immediate. Noise reduces. Decisions sharpen. Ownership becomes clearer. Progress becomes steadier because the system becomes easier to understand and easier to shape.

As this clarity strengthens, the role of leadership becomes clearer too. The energy shifts from reacting to pressure toward creating the conditions that allow teams to thrive. That is where your real leverage sits, and where you will have the most impact.

Before You Adopt AI in Engineering, Answer These Five Questions

2026-05-24T00:00:00+00:00

Executive Summary

AI is already reshaping your delivery workflows, whether you see it or not. If you do not lead it, it will reshape them badly. This article gives executives a stage‑aligned diagnostic to identify their real maturity, expose hidden risks, and steer AI adoption with intent rather than drift.

What This Is Not

Not a hype piece
Not a vendor framework
Not a technical guide
Not a generic AI playbook
Not a promise of productivity

This is a leadership instrument for understanding and directing AI adoption.

The Problem in One Sentence

Most organisations believe they are progressing in AI; their workflows show they are still in unmanaged use.

AI Adoption Maturity Model

Curiosity → Ad‑hoc → Uncoordinated → Stabilisation → Integration → Reconfiguration

Each stage includes: - Stage signal: what you see - Failure mode: what breaks if you stay here - Leadership responsibility: what executives must do

Stage 0 — Experimentation

Stage signal: Small groups test AI tools in isolation; nothing links to delivery.
Failure mode: No patterns survive; no organisational learning occurs.
Leadership responsibility: Do not mistake curiosity for capability. If you stay here, AI adoption will happen without you.

Stage 1 — Unmanaged Individual Use

Stage signal: Engineers use AI daily but invisibly; quality drifts; no review.
Failure mode: Shadow workflows reshape delivery without oversight.
Leadership responsibility: Surface usage and risk before anything scales. If you stay here, quality and security will drift invisibly.

Stage 2 — Team‑Level Awareness

Stage signal: Teams feel friction: uneven output, duplicated prompts, unclear fixes.
Failure mode: Teams believe they are maturing; leaders believe it even more.
Leadership responsibility: Establish boundaries and shared expectations. If you stay here, teams will burn time managing friction instead of delivering.

Stage 3 — Organisational Alignment

Stage signal: Workflows stabilise; AI review stages and documentation improve.
Failure mode: Premature scaling without observability or constraints.
Leadership responsibility: Standardise workflows and measure impact. If you stay here, AI will outgrow your controls.

Stage 4 — Integrated AI Engineering

Stage signal: AI is a system component with constraints, observability, governance.
Failure mode: Drift and quality collapse if leadership attention drops.
Leadership responsibility: Maintain discipline; treat AI as infrastructure.

Stage 5 — Organisational Redesign

Stage signal: Processes, roles, and flow reshape around AI‑accelerated work.
Failure mode: Redesign without stability leads to chaos.
Leadership responsibility: Rebuild systems deliberately, not reactively.

Common Misdiagnoses

Executives repeatedly misread their organisation’s maturity in predictable ways:

Mistaking Stage 1 for Stage 3
Mistaking individual speed for organisational capability
Mistaking experimentation for adoption
Mistaking friction for progress
Mistaking tool usage for system change

If any of these appear familiar, your organisation is exposed to silent quality drift, security risk, and delivery incoherence.

Five Essential Questions for Engineering and Executive Leadership

These questions are the diagnostic. If you cannot answer one cleanly, you are not at the stage you think you are.

1. What AI use already exists, and which maturity stage does it actually represent?

Stage signal:

0–1: Usage is invisible, individual, unreviewed
2: Teams feel friction but cannot coordinate
3+: Workflows, review steps, and boundaries are explicit

Executive signal: If you cannot see AI use, you cannot govern it. Invisible use is the most dangerous form of adoption because it reshapes delivery without review or audit.

Leadership action: Surface all usage, tools, risks, and drift before scaling anything.

2. Where does AI reduce cognitive load or cycle time for whole teams, not just individuals?

Stage signal:

0–1: Productivity is anecdotal and personal
2: Teams see uneven output and duplicated effort
3: Shared workflows show measurable improvement
4–5: AI contributes to throughput as part of the system

Executive signal: Individual acceleration is not organisational capability. Individual use without team coherence increases delivery variance.

Leadership action: Identify where AI improves team‑level flow; ignore individual anecdotes.

3. What controls, review steps, and boundaries are required at our current stage?

Stage signal:

0–1: No guardrails; risk accumulates quietly
2: Teams ask for boundaries but cannot define them
3: Review steps and constraints become standardised
4: Governance and observability are built into the system

Executive signal: Scaling without controls guarantees failure. Missing controls at Stage 1 allows unreviewed changes into critical workflows.

Leadership action: Match controls to your actual stage, not your aspirations.

4. Which organisational foundations must be strengthened before we can safely move to the next stage?

Stage signal:

0–2: Documentation, testing, ownership, architecture inconsistent
3: Foundations stabilise because AI workflows depend on them
4–5: Strong foundations multiply value; weak ones collapse instantly

Executive signal: AI amplifies whatever environment it enters. Weak foundations are already being stressed by AI‑accelerated work.

Leadership action: Ensure the environment is AI‑compatible: clarity, ownership, documentation, testing, and architecture must be strong enough to absorb AI‑accelerated change.

5. How will leadership set expectations and pace adoption so it matches our capacity to absorb change?

Stage signal:

0–1: Expectations inflated; progress invisible
2: Teams feel strain; leaders misread friction as maturity
3: Communication grounded in measurable workflows
4–5: AI adoption becomes organisational change, not tooling

Executive signal: Most organisations believe they are at Stage 3 while operating at Stage 1–2. Pacing is a leadership responsibility, not a technical one.

Leadership action: Set expectations that match reality; pace adoption deliberately.

Leadership Imperative

AI adoption is already happening inside your organisation. Your only choice is whether it reshapes your workflows with structure or erodes quality, coherence, and trust without it.

If You Only Do One Thing

Identify your true maturity stage. Everything else depends on that.

Agents Cannot Maintain Systems: The Additive–Transformative Gap in LLM Software Delivery

2026-05-21T00:00:00+00:00

This article explains why current LLMs cannot safely modify real software systems, despite impressive code‑generation demos.

Table of contents

The Promise of Automated Software Delivery

In 2026, the automated software delivery dream is for an agent to:

read a repository
understand project structure
plan a multi‑step change
write code, tests, and docs
run the code and fix its own mistakes
produce a PR‑ready diff

The first three tasks are additive; the last three are transformative. The first three add information without changing the behaviour of the system: they require reading, mapping, and planning, but not altering any existing causal structure in the codebase.

Applying new code is self-contained, additive work; modifying an existing system is transformative work that requires an understanding of dependencies, invariants, and consequences. This distinction — additive vs transformative — is the core reason current LLMs can assist but cannot autonomously deliver software.

Parts of the above can be done but only for tightly controlled demos on simple code that is tens of lines long, not on real-world repositories with thousands of lines of code that has existed for years where dozens of people have updated it.

What the Labs Have Actually Delivered

The agentic work of OpenAI, Google, Cognition Labs, GitHub (Microsoft), Sourcegraph, JetBrains, Replit, Amazon, Meta, and Anthropic, that is listed in Further Reading, was published in 2023 and 2024.

Depending on where you look, you may have been given another impression: that "agents are here". However, reality tells a different story.

Agents are improving, but are not reliable, not autonomous, and not production‑safe.

LLMs can assist with software delivery, but they cannot own it.

Why is this?

LLMs generate statistically plausible continuations of text. This works well for self-contained tasks like writing a function or drafting documentation because these are pattern‑extension problems. But pattern‑matching is not system understanding, and plausibility is not correctness.

Software systems are causal: components depend on each other, invariants constrain behaviour, and changes propagate through the system. The moment a task stops being self‑contained and becomes system‑dependent — requiring dependency coherence, persistent state, or awareness of how changes ripple through a real codebase — pattern‑matching is no longer sufficient.

Currently, LLMs can imitate the shape of engineering work, but they cannot maintain a stable internal representation of a system that must be coherently changed, and that gap is exactly why LLMs fail the moment the task becomes system‑level.

Persistent state creates temporal dependencies

A self‑contained task has no past and no future. A system‑dependent task does.

As soon as a change depends on:

previous writes
accumulated data
cached values
long‑lived objects
external system state

any agentic model must reason about how the system got here and how it will behave after the change.

LLMs cannot maintain that internal causal chain.

Writing code to Agentic Systems: The Fundamental Gap

The gap becomes clear when you compare two activities: writing new code and modifying an existing system.

Code generation is local and additive: the model extends a pattern without needing to understand the system.

But agentic work is global and transformative: the LLM must change the system itself, which requires understanding dependencies, invariants, interactions, and downstream consequences.

This is causal reasoning, not pattern extension. LLMs predict tokens, not consequences — and that is why the leap from writing code to producing a safe, system‑aware PR‑ready diff is not incremental but a shift into a fundamentally different problem space.

Producing a PR‑ready diff (the section in question)

A pull request (PR) is a piece of code that will change a system.

For that change to be safe, the change must respect the system's current architecture, its intent, and all downstream consequences.

Software engineers work hard to ensure that such a change is safe through testing and their own judgement and experience before having a collegue review the change.

Applying a change is no longer pattern-matching but understanding causal behaviour: how will the system change if this PR is applied?

The correctness of the PR depends on understanding the whole system, not just generating text.

The LLM must change the system, which requires understanding dependencies, invariants, interactions and consequences, all of which demand causal reasoning, not pattern matching.

Pattern‑matching can write code; only causal reasoning can maintain systems.

What can I do?

Confirm for yourself any claim that you see. Define your own realistic real-world repository to work on, one that is thousands of lines of code, that has supported past real-world work patterns.

Having your own results, applied to your own repository will tell you volumes more than any press release or online anecdote.

For the moment:

treat agentic AI as a strategic direction
treat current tools as assistants, not engineers
invest in clarity, architecture, and test discipline
expect progress, but not miracles
do not plan delivery pipelines around unproven capabilities

Maintain human judgement as the centre of the system.

The dream is intact. The evidence is not yet here.

Why this matters: code is cheap, judgement is not

LLM-augmented software delivery does not remove engineering.

It moves engineering up a level.

Humans need to focus on:

intent
constraints
architecture
correctness
safety
trade‑offs

The desired end state is not "AI writes code" but AI maintains systems. If we get there, humans will still need to maintain intent.

The consequence of an agentic system is not to remove engineering, but to elevate it, so that teams spend less time on mechanical construction and more time on judgement, alignment, and shaping the environment in which agents operate.

The organisations that benefit most will be those that treat agentic development not as automation, but as a structural shift in how software is conceived, validated, and maintained.

Final Thought

Until AI can reason causally about systems, human judgement remains the foundation of software delivery.

The Promise of Automated Software Delivery
What the Labs Have Actually Delivered
Why is this?
Persistent state creates temporal dependencies
Writing code to Agentic Systems: The Fundamental Gap
Producing a PR‑ready diff (the section in question)
What can I do?
Why this matters: code is cheap, judgement is not
Final Thought
Related Work
Table of Contents
Further Reading

When Code Is Cheap, Judgement Matters More

2026-05-20T00:00:00+00:00

Table of contents

SDD Is a Symptom, not a Methodology

Getting software delivered has always required a specification.

Having a clear specification of what is required is essential.

Writing such a spec is a collaborative effort:

Product owns the business intent
Engineering owns the technical constraints
Design owns the interaction and behaviour

The spec is a shared artefact formed through deliberate thinking and judgement. It must embody strategy and confirm that what is to be built is relevant.

The software industry now suggests that having a specification will make AI tooling more reliable. No. And this is not new.

A clear spec has always meant that the outcome is more likely to be successful.

SDD for AI-augmented teams is just a 30-year-old idea in a sparkly jacket.

What is new

SDD is not new. But the context is.

SDD is being reframed as:

a way to generate code from structured specs
a way to constrain AI agents
a way to reduce non‑determinism
a way to enforce governance in AI‑augmented pipelines

This reframing gives the impression that SDD is a new discipline rather than a new label for long‑standing engineering practice.

The spec is not the goal. Working software is.

Regardless of who writes the spec, you will need to iterate: build, release, gather user and market feedback, and steer with additional thinking and judgement.

SDD Surfaces When Teams Confront Ambiguity

SDD appears when teams realise:

their requirements are too vague
their systems are too implicit
their data contracts are too loose
their AI tooling is too unpredictable

SDD is the label people reach for when they need clarity, structure and determinism.

You do not need SDD. You need clarity, structure and determinism.

Write a spec, get the code for free?

The assumption in tech currently seems to be, write a spec, feed it into an AI and get out all the code you need for free.

Writing the spec requires deliberate thinking and judgement by Product, Engineering, and Design. You cannot automate this.

The Limits of the "Spec → Code" Argument

Taking the "spec → code" argument to its logical conclusion: why not use AI to automate the generation of the spec? Why stop at generating code? We could use AI to generate the company's vision and strategy so vision → strategy → spec → code can be AI generated?

Because large language models are probabilistic pattern-matching processes, domains that are less pattern rich than the unambiguous grammar of a computer programming language or a mathematical formula will be less well modeled by an LLM.

In 2026, LLMs are experiencing major leaps forward since the initial revolution started, but over time, the incremental improvements and the size of the leap forward will lessen as all the low-hanging innovation fruit is quickly consumed, and we realise the fundamental limits of pattern matching.

Well engineered code cannot be seen

"Marley was dead: to begin with."

These six words start A Christmas Carol by Charles Dickens. And what they achieve is beyond just the words.

Dickens uses the line to establish an absolute fact the reader must accept, because the entire supernatural and moral structure of the story depends on Marley being unquestionably dead. Without that certainty, the ghost would not be a ghost, Scrooge’s transformation would lose force, and the story’s logic would collapse. The sentence subtly fixes the rules of the world before the plot begins.

Well engineered code is the same; it embodies a team's judgement beyond the text that can be seen.

To capture every eventuality in a specification would require anticipating everything. Humans are not good at this, which is why incremental delivery is essential.

We forget that any sufficiently detailed spec is the code.

In addition, code executes within a much larger environment. Aligning code to work within a changing environment requires judgement from across the organisation, not only from engineering.

Juniors are Not Doomed

Before LLMs, a junior software engineer would traditionally have been given a task that was self-contained: fixing bugs or delivering straightforward features. This reduced the risk to the business and ensured that the engineer could get up to speed with house rules: how code was delivered; what to expect from a PR; who to seek help from.

This familiarisation is part of the 70% of the job. The junior will use their judgement, with feedback, to contribute to the understanding that product, engineering and design collaborate to achieve. This is how the junior engineer learns and gains experience by doing the whole software engineering cycle, end-to-end.

With large language models, the 30% of the job is likely to change. But the 70% will remain the same. The 70% cannot be fully automated by LLMs as it requires judgement.

Good engineering is more than what you can see in the code. Marley may be dead but the role of the junior is not.

When Code Becomes Cheap

AI is now part of software engineering. The question is not whether we use it, but whether we use it well.

Writing the code is the last step once the team has gained a good understanding of what is required. Without clarity, our current use of AI is to produce more code that is not needed or will not be used.

If AI makes the cost of writing code essentially zero, we need to ensure that the code that is written is exactly what is required for the business, given the singular context of the business within its market.

The quick win for AI companies has been to demonstrate how suited their LLMs are to code generation. But like any tool, its value depends entirely on how we choose to use it.

A business should not define itself by how much code can be generated but by the quality of its products; leadership must recognise that rushing out large quantities of code will dilute that quality.

Leadership should focus on clarity, structure and determinism so that the product being designed and built is what the organisation genuinely needs.

If AI reduces the cost of producing code, leadership must raise the standard of what is worth producing. The responsibility for clarity increases as the cost of execution falls.

AI changes the economics of code, not the fundamentals of engineering.

SDD Is a Symptom, not a Methodology
What is new
SDD Surfaces When Teams Confront Ambiguity
Write a spec, get the code for free?
The Limits of the "Spec → Code" Argument
Well engineered code cannot be seen
Juniors are Not Doomed
When Code Becomes Cheap
Related Work
Table of Contents
Further Reading

The Missing Structure Agile Cannot Fix

2026-05-19T00:00:00+00:00

Table of contents

Agile Is Not Enough: Delivery Is a Network

Agile is not the missing layer. Structural clarity is.

Agile is one part of a larger system. Software delivery behaves like a network, and that network depends on structure. When ownership, boundaries, and decision‑rights are unclear, signals drift and intent loses its path. Structural clarity is what allows the whole system to function with purpose rather than friction. Agile is one part of that system.

Structural clarity means defining who owns what, who decides what, and where each team’s authority begins and ends. These are the elements that give the network shape.

Modern delivery is a set of interconnected nodes carrying intent, decisions, and constraints. When the structure is weak, the network compensates through effort instead of design. Teams work harder, not faster. Progress slows.

You have seen this pattern. Stand‑ups increase, backlogs are refined, reporting expands, yet progress slows. This is not something engineering teams can fix on their own. The slowdown comes from missing links in the network. Signals do not flow, decisions do not propagate, and intent cannot reach the places that need it.

A familiar scenario illustrates the point. Delivery begins to slip. Leaders assume the issue sits within engineering, so the response is to "do Agile better": tighten ceremonies, rewrite backlogs, add coaches, increase cadence.

But the intended fix does not work because the problem is not at the team level. Strategy is unclear, ownership is fragmented, and decision‑rights are undefined. Agile cannot compensate for structural gaps. The method is sound; the layer above it is not.

Without defined pathways, even strong teams stall.

1. Agile’s Place in the Structure

Software delivery is a system of interdependent functions: strategy, product, architecture, engineering, risk, governance, and operations.

Agile supports one part of this system (engineering), but it cannot replace the structural clarity that allows the whole network to function.

Agile supports the engineering team‑level execution node of the delivery network.

Iteration
Local planning and prioritisation
Team‑level coordination and communication
Short feedback loops
Making work visible

Teams and leaders that rely on Agile alone eventually discover that the real issues sit above the methodology.

This is consistent with the Agile Manifesto, which never claimed to define an organisational model.

2. What Agile Actually Covers

Agile was designed for a narrow and valuable purpose: to help teams work iteratively, plan locally, maintain short feedback loops, and keep work visible. Agile excels at:

Iteration
Team‑level coordination
Local prioritisation

These are important behaviours, but they do not define the structure of the wider delivery network. Agile does not establish ownership, define decision‑making, architectural boundaries, or cross‑team interfaces.

The Scrum Guide reinforces this: Scrum is a lightweight framework for team‑level delivery, not an organisational blueprint.

3. The Delivery Network

Delivery is a network of connected disciplines:

Strategy sets direction.
Product defines value.
Architecture shapes boundaries.
Engineering execution turns intent into working systems.
Quality assurance verifies behaviour, protects quality, and prevents regressions.
DevOps automates delivery, helps to accelerate flow, and connects build to run.
Risk and governance ensure safety and compliance.
Platform operations keep the environment stable.
Organisational clarity ties these layers together.

These functions fail not in isolation, but at their intersections. The issue is the structure between them, not any one discipline.

Agile touches only one node in this network (engineering execution). The rest require structure, ownership, and judgement.

As Team Topologies argues, flow depends more on team boundaries, communication paths, and interaction modes than on any single methodology.

4. Why Agile Cannot Fix Structural Problems

A familiar failure mode appears across organisations.

A team is asked to deliver a critical change. Strategy is ambiguous. Architecture is drifting. No one owns the interface between two systems that must integrate. Risk has not defined acceptable limits. Governance expects updates but has not clarified decision-rights.

The team runs sprints, holds stand‑ups, and updates its work board.
But nothing moves.

The network is miswired. Agile cannot repair the topology.

This is the same lesson illustrated in The Phoenix Project: local team optimisation cannot compensate for system‑level dysfunction.

Agile works at the team-level, whereas issues are at the level above.

5. What Agile Does Not Cover

Agile influences parts of the system, but it does not define them. It does not cover:

Operating model design
Decision-rights
Ownership boundaries
Architectural coherence
Risk posture
Budgeting and portfolio management
Hiring and capability development
Cross‑team alignment
Quality engineering
Capacity planning

These responsibilities sit above the delivery team. They require leadership, not ceremonies.

6. The Missing Layer: Structural Clarity

The missing layer is structural clarity. Organisations need:

Clear ownership
Clear decision‑making
Clear constraints
Clear operating models
Clear interfaces between teams

These elements create the conditions in which Agile can work as intended. Without them, Agile becomes noise layered on top of confusion.

This mirrors the argument in Good Strategy / Bad Strategy: clarity, coherence, and focus matter more than any specific process.

7 How the Network Behaves When Structure Exists

When organisations define structural clarity, the network changes character. Ownership becomes visible. Decisions move without friction. Boundaries stop shifting. Teams know where their responsibility ends and another begins. Cross‑team work relies on defined interfaces rather than personal negotiation. Flow improves because intent and decisions no longer leak between gaps in the structure. Agile starts to work as intended, not because the method changed, but because the environment finally supports it.

The deeper shift is cultural. Slowdowns are no longer treated as engineering problems. Teams stop compensating through effort. Leaders stop reaching for Agile process as the universal fix. The organisation begins to behave like a system rather than a collection of disconnected parts.

Structural clarity does not make teams better. It removes the conditions that force them to work against the system.

8. Conclusion

Agile is not wrong. It is incomplete.
Software delivery requires clarity, structure, and judgement. Agile is a component.

Clarity is the network.

Before assuming Agile is the problem, ask one question:

Is the network around the team structured well enough for any methodology to work at all.

For a deeper explanation of the structural layer that Agile depends on, see the Leadership OS guide.

Agile Is Not Enough: Delivery Is a Network
Related Work
Table of Contents
Further Reading

Designing Prompts for Modern AI Systems

2026-05-11T00:00:00+00:00

Table of contents

AI in 2026 demands more from you than simple instructions. Modern systems can plan, critique, revise, and work across long context windows. They are no longer moved by vague guidance such as "be clear" or "add detail". They need a defined environment to operate within.

Modern prompting is about shaping the system, not decorating the request. When you set the frame, the workflow, and the output contract, the model gains the structure it needs to behave predictably. You do this once, and the benefits carry through every answer. You set the constraints. The model works inside them on your behalf.

If you do this, just once, your AI output will be steady and structured, and you will find it much quicker and easier to work with. When you tell the AI how to respond, you apply guardrails for the system to work within. Guardrails set by you, not the AI.

1. Start with the system, not the request

AI has advanced quickly. Its answers can now be broad, deep, and varied. To keep that power under control, you begin by defining the frame the model must work within. This frame sets the role, the tone, the limits, and the rules for handling uncertainty. It is the foundation the rest of the prompt stands on.

Most prompt failures do not come from unclear questions. They come from the model having no stable footing. Without a frame, the AI will guess at how formal to be, how cautious to be, and how much structure to use. Those guesses shift from run to run, which leads to drift and inconsistency.

A system frame removes that guesswork. It tells the model what it is, how it should behave, and what matters most. It defines what is in scope, what is out of scope, and how to respond when the request touches the edges. With this in place, the rest of the prompt becomes lighter and more reliable.

The frame does not need flourish. It needs clarity, discipline, and a steady tone. With that foundation, the model behaves less like a pattern generator and more like a tool working inside a defined brief.

In practice, the system frame is the architecture behind the output. It does not need flourish or personality. It needs to state the role, the rules, and your expectations.

SYSTEM FRAME
You are an analytical engine. You work with steady reasoning, cautious claims, and plain structure. When the request is unclear, you pause and ask for what is missing. You avoid invention and keep within the boundaries set for you.

TASK
Summarise the key points from the supplied text in three short sections.

OUTPUT CONTRACT
Produce:

Context
Reasoning
Conclusion

Rules:
If the request is ambiguous, list interpretations and ask for clarification.
If information is missing, state what is missing before answering.
Do not invent facts.
Keep the final answer concise and structured.

WORKFLOW

Identify assumptions.
Plan the answer.
Produce the answer.
Critique it for clarity and accuracy.
Produce a revised final version.

The AI is told "You are an analytical engine" as that gives the model a defined role to work from. Without a role, the model guesses at how formal to be, how cautious to be, and how much structure to use. A simple line such as "You are an analytical engine" sets the tone and keeps the behaviour plain, steady, and predictable. It avoids personality, avoids flourish, and keeps the work focused on reasoning rather than style.

If you do not supply the role, the AI will provide one; and that one will vary, creating work for you.

How to minimise the work you need to do and have the AI manage and apply the prompt is dealt with in the section Having the AI Manage the Prompt Template.

2. Define the output contract

Modern models behave more reliably when you specify the shape of the answer: structure, scope, exclusions, formatting, and the rules for handling missing or ambiguous information. This is far stronger than broad guidance such as "be concise".

When you define the output contract, you are not telling the model what to think. You are telling it what form the answer must take. This removes a large amount of guesswork. Modern systems have wide latitude in how they respond, and if you do not narrow that down, they will choose a structure for you. That choice will vary from run to run, which means more tidying and more checking on your side.

An output contract fixes the frame. It tells the model which sections to produce, how to handle gaps, and how to behave when the request is unclear. It also removes the temptation to drift into style, flourish, or padding. You are giving the model the rails to run on.

A good contract does four things. It sets the structure. It sets the limits. It sets the rules for uncertainty. And it sets the standard for brevity. Once these are in place, the model has far less room to wander. You get answers that are steadier, easier to scan, easier to compare, and easier to work with.

The contract also acts as a safeguard. By telling the model what to do when information is missing, you prevent it from filling the gaps with invention. By telling it how to behave when the request is ambiguous, you prevent it from guessing. These two points alone remove a large share of common errors.

In short, the output contract is the quiet discipline behind the work. It keeps the model inside the brief, keeps the structure predictable, and keeps the answer focused on what you asked for rather than what the model feels like producing.

3. Use decomposition as a control mechanism

Modern models already break tasks into steps, but the steps they choose may not match the work you want done. Light guidance prevents the model from wandering and keeps the task anchored to your brief.

When you state the assumptions the model is allowed to make, you draw a clear line between what is permitted and what is not. This stops the model from filling empty spaces with guesses. Large models are inclined to complete patterns, and if you do not show them where the firm ground ends, they will supply their own footing.

A natural extension of this is to make the model aware of what is missing. Once the assumptions are set, the next step is to mark the gaps. This creates a smooth handover from what the model may rely on to what it must not invent. By pointing out missing information, you show the model where the edges of the task sit. When the model knows what is absent, it is less likely to drift into speculation or produce material that does not belong in the answer. You are giving it a map of the gaps so it does not try to fill them on its own.

Together, these two steps act as guardrails. They keep the work inside the brief, reduce the chance of invention, and ensure that the model stays within the limits you have set.

You can also break the task into a simple chain such as understanding → planning → execution. This mirrors what the model already does internally, but it makes the process explicit. When the steps are explicit, the model is less likely to skip ahead or solve the wrong problem.

Breaking the interaction into smaller stages also helps with scope. By naming the steps, you give the model a narrow lane to work in. It cannot jump to conclusions, and it cannot pad the answer with material that does not serve the task. The work stays tidy, and the output stays close to what you asked for.

In short, decomposition is a practical form of control. It does not restrict the model’s ability to give a good answer, but it does restrict where the model goes to supply that answer. This keeps the work steady, predictable, and within scope, so that it remains relevant to what you are doing.

4. Add a self-critique loop

Modern models benefit from a short cycle of controlled refinement. Once the first version of the answer is produced, a brief review stage forces the model to check its own work against the constraints you have set. This is not a call for hidden reasoning. It is a prompt to tighten the output.

A review step also encourages the model to correct small slips in structure, scope, or tone. It is easier for the model to adjust an existing draft than to produce a perfect answer in one pass. The revision stage gives it a second chance to align with the brief.

This process also reduces noise. When the model has been told that its work will be checked and refined, it tends to produce cleaner first drafts. The revision step becomes a light polish rather than a rescue job.

In practice, this creates a steady rhythm: draft, inspect, refine. It keeps the work within bounds and produces answers that are clearer, more accurate, and easier for you to use.

5. Stack roles for higher-quality output

Layered roles give you steadier output because each stage is handled by a specialist rather than a single broad persona. Modern models respond well to this division of labour. It narrows the scope of each step and reduces the chance of drift away from what you want.

A domain expert handles the substance. An editor handles clarity and structure. A risk assessor checks for overreach, missing information, and unwarranted certainty. A summariser produces a clean final version. Each role has a narrow brief, which keeps the work tidy and keeps the answer aligned with the task.

Here is an example prompt using layered roles:

ROLES

Domain Expert
Provide the technical or factual core. Stay within verified information. State assumptions and mark gaps.

Editor
Reshape the expert output into clear, plain structure. Remove padding. Ensure each section answers the brief.

Risk Assessor
Check for overreach, ambiguity, or missing information. Flag anything that exceeds the evidence. Recommend corrections.

Summariser
Produce a concise final version that reflects the corrections and stays within scope.

WORKFLOW

Domain Expert produces the initial draft.
Editor restructures and clarifies it.
Risk Assessor reviews for accuracy and limits.
Summariser produces the final answer.

OUTPUT CONTRACT

Context
Reasoning
Conclusion

Rules
No invention. Mark missing information. Keep the answer within scope. Maintain plain structure.

6. Treat the context window as working memory

As of April 2026, modern models dedicate roughly 200,000 to 1,000,000 tokens to representing your instructions. This space acts as working memory. It can hold definitions, constraints, examples, running notes, previous outputs, and a living brief. With this in place, the model behaves more like a stateful collaborator than a stateless assistant.

This working memory is what the model can track across prompts. When you define what belongs in this state, you save time. You do not need to repeat your requirements. The model carries them forward and maintains the structure you set.

7. Use agentic prompting patterns

Static prompts assume a fixed path from question to answer. Modern systems are closer to small agents: they can plan, choose actions, call tools, and adjust their output based on intermediate results. This is often called agentic behaviour. The system selects and sequences actions to achieve an objective, rather than following a single linear path.

Giving the model a workflow such as Plan → Act → Observe → Revise makes this explicit. In the planning phase, the model outlines what it intends to do, which tools it may need, and what a good outcome looks like. In the action phase, it carries out the steps, including any tool calls. In the observation phase, it inspects the result against the plan and the constraints. In the revision phase, it adjusts the answer and produces a clean final version.

Using a workflow saves time and reduces the need for repeated corrections. The final answer remains tidy. The planning and checking happen in the background or in short, structured notes, while the output stays compact and readable. You gain the benefit of step-by-step reasoning without having to sift through a long chain of output.

Tool use fits naturally into this pattern. In the Plan step, the model decides whether tools are needed and why. In the Act step, it calls them. In the Observe step, it checks whether the tool results answer the question. If tools are not needed, the model should say so plainly and proceed with reasoning instead of forcing a tool into the workflow.

In this context, agentic means that the system behaves as a goal directed process. The model can plan, choose among available capabilities, and adapt its path based on intermediate results, rather than producing a single static completion from a prompt.

8. Make the model identify ambiguity before answering

One of the most effective techniques is to require the model to surface all plausible interpretations before it attempts an answer. This forces the model to slow down, map the possible meanings, and avoid locking itself into the first pattern it detects. Large models tend to commit early unless guided.

This step also exposes hidden ambiguity. When the model lists the possible readings, you can see whether the task is underspecified, whether key terms are unclear, or whether the scope could be read in more than one way. This gives you a chance to correct the course before any work is done.

If more than one interpretation exists, the model should ask for clarification. This prevents mis-scoping, reduces the chance of error, and removes the need for the model to guess. Guessing is where most drift begins.

The technique also improves consistency. When the model is told to check for multiple readings, it becomes less likely to produce answers that are confident but misaligned. It treats ambiguity as a signal to pause rather than a gap to fill.

In practice, this turns ambiguity into a controlled step rather than a source of error. The model identifies the forks in the road, confirms which path is correct, and only then proceeds with the task.

Doing this will save you a great deal of time.

9. Adapt prompts to the model

Different models excel in different areas, and a good prompt acknowledges this rather than assuming a single uniform capability. Some models are strongest at structure: they produce clean sections, tidy formatting, and predictable layouts. Others are stronger at reasoning: they handle multi step logic, edge cases, and constraint checking with more stability. Some specialise in compression: they can distil long material into tight summaries without losing meaning. Others lean toward style: they generate fluent prose but may drift if not anchored.

A well designed prompt sets expectations that match these tendencies. If the model is strong at structure, you can lean on explicit output contracts. If it is strong at reasoning, you can give it more analytical work and tighter constraints. If it excels at compression, you can trust it with dense source material. If it is style heavy, you can counterbalance that with stricter rules and clearer boundaries.

The point is not to flatter the model. It is to shape the workflow so that the model’s strengths are used deliberately and its weaknesses are contained. This reduces variability, improves reliability, and produces output that is more consistent across your prompts.

Even if you stick to one model or one vendor, recognising that you may one day use a different system helps sharpen your expectations and improves the way you design prompts for the model you use.

In the same way customer service varies across vendors, so does AI interaction.

10. Include safety and uncertainty rules

Modern models behave more reliably when you tell them not only what to do, but what to avoid. Negative guidance is a form of operational discipline. It removes entire classes of failure rather than correcting them after the fact.

Clear avoidance rules stop the model from drifting into areas that carry higher risk: speculation, overreach, sensitive claims, or invented detail. Without these boundaries, the model will often fill gaps with confident but unreliable material. Stating what must not happen is as important as stating what must.

Escalation rules serve a different purpose. They tell the model when to stop and hand control back to the user. This is essential for tasks involving uncertainty, missing information, or sensitive domains. When the model knows when to escalate, it avoids guessing, avoids false precision, and avoids treating ambiguity as something to be patched over.

Uncertainty handling is another pillar. Models respond well when instructed to mark unknowns, list assumptions, and request clarification instead of improvising. This keeps the work inside the evidence and prevents the model from manufacturing answers to maintain fluency.

Sensitive topics require explicit treatment. If you tell the model how to handle them, it will follow the procedure rather than rely on its own processing. This reduces variability and keeps the output aligned with your standards rather than the model’s defaults.

Taken together, these measures form a small operational framework. They are not decoration. They are the guardrails that keep your AI output predictable, bounded, and safe to use in structured workflows.

A modern prompt template

A compact structure that works across the latest models:

ROLES

Domain Expert: Provide the factual and technical core. State assumptions and mark gaps.
Editor: Reshape the material into clear, plain sections. Remove padding and repetition.
Risk Assessor: Check for overreach, missing information, and unwarranted certainty. Flag issues.
Summariser: Produce a concise final version that reflects all corrections and stays within scope.

TASK
Describe the task in one or two sentences. State the objective, the audience, and any hard limits on scope.

OUTPUT CONTRACT
Produce the answer in the following sections:

Context
Reasoning
Conclusion

UNCERTAINTY AND AMBIGUITY

List plausible interpretations of the request before answering.
If more than one interpretation exists, ask for clarification instead of guessing.
State what information is missing and how it affects the answer.
Mark assumptions clearly and keep them minimal.

SAFETY, LIMITS, AND ESCALATION

Do not invent facts. If evidence is missing, say so.
Avoid speculation, sensitive claims, and advice outside the brief.
Escalate to the user when the task is out of scope or under specified. Explain why and what is needed.
Treat sensitive topics with extra care. Prefer to mark limits rather than improvise.

WORKFLOW (AGENTIC)

Plan: Identify the goal, constraints, and any tools or references that may be needed.
Act: Produce the initial answer according to the output contract.
Observe: Review the draft for clarity, accuracy, scope, and alignment with the rules.
Revise: Produce a refined final version that corrects issues and tightens the structure.

STYLE RULES

Keep the final answer concise, structured, and free of padding.
Use only British English.
Do not include hidden reasoning or chain of thought in the final answer.

BEHAVIOUR
These rules apply to every response in this session unless explicitly revoked. If the request conflicts with these rules, explain the conflict and ask how to proceed.

Having the AI Manage the Prompt Template

You managing the above template is too much. Therefore, once you have it in a form you are happy with and which is effective for your needs, you tell the AI the template and before you start your session you prompt with this:

Reconstruct the full analytical‑engine template from your prior description. Restate it to me for confirmation. Once confirmed, enforce it automatically for the rest of the session. If any request conflicts with the template, pause and ask how to resolve the conflict.

Summary

Modern prompting is not about clever wording. It is about defining the system, setting the output contract, controlling the workflow, managing ambiguity, and using the context window as working memory. This will help produce reliable output from modern AI systems.

1. Start with the system, not the request
2. Define the output contract
3. Use decomposition as a control mechanism
4. Add a self-critique loop
5. Stack roles for higher-quality output
6. Treat the context window as working memory
7. Use agentic prompting patterns
8. Make the model identify ambiguity before answering
9. Adapt prompts to the model
10. Include safety and uncertainty rules
A modern prompt template
Having the AI Manage the Prompt Template
Summary
Related Work
Table of Contents

How AI Works

2026-05-06T00:00:00+00:00

Table of contents

How large language models actually work, and why they are not miniature humans

Large language models such as GPT‑5.4, Claude Opus 4.6, and DeepSeek R1 are now everyday tools. Yet the way they work is often misunderstood.

We misunderstand AI because we mistake fluency for thought. When a system produces coherent language, we instinctively assume intention, understanding and agency behind it. This article explains why that instinct misleads us, and why clarity about what these systems are — and are not — is essential for using them wisely.

LLMs do not think, they do not understand, and they do not learn in any human sense. What they do is process language at scale.

This article explains how that works, what is inside these systems, and why their behaviour can look intelligent even when no intelligence is present.

The key to understanding these systems is to see them as statistical tools, not miniature minds.

How an LLM processes what you type

Tokens

An LLM begins by breaking what you type into tokens. A token is a small unit of text. It may be a whole word, part of a word, or punctuation. Tokens are not ideas or concepts. They are fragments chosen because they appear often in text and can be handled efficiently by the model.

Each token has a unique number. The token for "king" might be 99. The token for "queen" might be 24521. At this stage, your prompt is turned into the same token numbers for the same text.

Tokens turn your text into numbers the model can work with.

Tokens on their own do not help the model process language. A token ID like 99 or 24521 is just a label. The model cannot compute with these integers because they do not contain any information about how the token is used or how it relates to other tokens.

To make computation possible, the model converts each token ID into a list of numbers. This list is called an embedding. It places the token as a point in a space where the model can perform computation. Think of the points in the space as the rooms of a house.

These lists are not chosen by hand. They are learned during training. As the model trains, the lists are adjusted so that tokens used in similar contexts move closer together in this space (like adjacent rooms in a house). They move closer because doing so reduces the model’s prediction error. This proximity is not meaning in a human sense. It is a statistical structure that allows the model to compute relationships between tokens.

Two lists that are close together represents statistical similarity of how that token was used in the training data.

Lists of numbers represent a point in space

The model uses each token number to look up a list of numbers that represents that token. These lists are learned during training. No one chooses them by hand.

For the token "king", the list might look like:

[0.12, 0.44, 0.91, ..., 0.03]

This list is a position in a mathematical space. You can think of each number as a step along a corridor. You take the first step, and go through door number 12, then the next (door 44), and so on until you reach a final position (door 3). That position is the model's internal representation of the token.

For the token "queen", the list might be:

[0.12, 0.44, 0.91, ..., 0.02]

The final step is slightly different, and the final position is close to the position for "king" (door 2 for "queen", door 3 for "king").

This closeness reflects how often the two words appear in similar contexts in the training data.

These lists of numbers are part of the model’s parameters.

The rest of the parameters determine how these positions influence one another as the model processes text. They shape how patterns combine, how relationships are detected and how the model transforms one set of token positions into the next. These parameters do not add meaning. They provide the machinery that lets the model apply statistical patterns to the text you give it.

These parameters set up the internal machinery the model uses to process and transform text.

Moving about the space

To show how the model captures patterns, imagine a simple three‑number space:

king = [10, 7, 3] man = [ 6, 2, 1]

queen = [10, 7, 6] woman = [ 6, 2, 4]

If we subtract man from king, we get:

\([10−6, 7−2, 3−1] = [4, 5, 2]\)

This is the direction from "man" to "king". If we then add "woman":

\([4, 5, 2] + [6, 2, 4] = [10, 7, 6]\)

This lands us at the position for "queen".

The model has captured a pattern. The statistical difference between "king" and "man" resembles the difference between "queen" and "woman".

The model does not know why. The LLM's program has only calculated that these differences behave in similar ways across the training data.

Why this works

This works because "king" and "man" differ in consistent ways across the training data. "Queen" and "woman" differ in similar ways. The model adjusts its internal numbers so that these differences become similar directions in the space. The model has found a pattern and matched it.

Humans then interpret this similarity as understanding.

The model reflects these similarities because they appear consistently across the text it was trained on.

It is all in the training data

Text contains stable patterns. These patterns describe roles, relationships, contrast, categories, analogies and grammatical structure.

During training, the model adjusts itself so that tokens used in similar contexts end up near one another, and tokens used in contrasting contexts end up separated in consistent ways.

This produces directions, distances, clusters and angles. These geometric features are the model's internal map of the statistical structure of language. Because language has structure, the model can represent it mathematically.

The model can represent these structures only because language itself contains stable patterns.

The human role in meaning

The model’s internal space is not a map of concepts. It is a map of statistical regularities. The structure becomes meaningful only when a human interprets it. We project categories, intentions and explanations onto patterns that were never designed to carry them. The model provides form; we provide significance. This distinction is not only philosophical, it is the boundary between what the system can do and what we imagine it can do.

We supply the intelligence

The distance between "king" and "man" is a statistical outcome. The distance between "queen" and "woman" is another. These two outcomes are similar. That similarity is the pattern the model has detected.

The model is not reasoning. It does not understand. It does not manipulate ideas. It follows the geometry that training has produced. If a direction has been useful for predicting text in the past, the model will use it again.

The geometry captures statistical qualities of human text. These include:

similarity of tone
proximity of commonly associated words
regular contrasts between categories
recurring relationships between ideas
typical structures of phrasing

The model does not reason about these qualities. It only reflects the statistics of its training data.

Tokens that appear in similar contexts end up close together. Tokens that contrast end up separated. Groups of related tokens form clusters. Repeated differences become directions. Angles reflect how often patterns co‑occur or diverge.

For example, words like "cat", "dog" and "hamster" end up near one another because they appear in similar kinds of sentences.

When the model generates text, it moves through this space by following these patterns. Humans then read the output and recognise tone, relatedness, contrast and structure.

The model is not producing meaning. It is reproducing geometry. We are the ones interpreting that geometry as meaning.

It is us that supply the I in AI.

The model provides structure, but humans provide interpretation.

This geometric structure is simply a way of organising statistical patterns so the model can use them efficiently.

To understand how this internal space is created, we need to look at the billions of parameters inside the model.

What is in the billions of parameters

To understand how the model builds and moves through its geometric space, it helps to look at what that is based on.

After training, an LLM contains billions of parameters. These parameters are numerical values that shape how the model transforms text. Together they define the structure of the internal space: the directions that matter, the distances between tokens, the clusters that form, and the angles that represent relationships.

When the model processes a prompt, it moves through this space by following the statistical structure represented in these parameters.

DeepSeek R1 has 671 billion parameters. ChatGPT‑5.4 may have over 2 trillion. More parameters mean greater capacity to represent and combine statistical patterns.

More parameters increase capacity, not understanding.

Parameters do not contain knowledge

The billions of parameters inside an LLM are often described as if they contain knowledge. They do not. They represent statistical consistencies extracted from large amounts of text.

During training, the model adjusts its parameters to capture patterns in how language is used. Humans use language in standard ways, directed by grammar, style, topic associations and the common ways that ideas appear together.

The parameters form a space where patterns that frequently co‑occur in text end up close to one another. This allows the model to produce text that resembles human writing. It does not give the model the ability to reason or understand.

For example, if the training data contains mixed statements about a historical date, the model may confidently produce the wrong one because it is reflecting the statistical blend it has seen.

Parameters cannot store precise facts. They store tendencies, associations and relationships. If a fact appears often and consistently in the training data, the model may reproduce it. If the data is mixed or inconsistent, the model reflects that uncertainty. This is why LLMs can produce confident errors. They are not recalling facts. They are replaying patterns.

These parameters are shaped during training, which is the process that gives the model its statistical structure.

The model reflects the patterns in its data, not stored facts or understanding.

What training actually does

Training is repeated large‑scale error‑correction. The model predicts the next token, checks whether it was right, and adjusts its parameters to reduce the difference. This cycle repeats billions of times across vast amounts of text. The result is a system that becomes increasingly accurate at predicting what comes next.

The model does not form concepts. It does not build a picture of the world. It does not develop intentions or goals. It becomes more accurate at predicting the next token.

Fine‑tuning and alignment add further adjustments. These make the model follow instructions more reliably and avoid harmful output. They do not create understanding. They refine the statistical patterns the model uses.

Training shapes the parameters so the model becomes better at predicting what comes next.

Why this is not human learning

Human learning draws on perception, memory, experience and intention. Humans form abstractions, build mental models and develop goals. Human learning is grounded in the body and the world.

LLM training is none of these things. It is a mathematical optimisation process. The model does not know what it is doing. It does not know that it is doing anything at all.

The model’s improvement is mechanical, not cognitive.

Is the output a simulation of intelligence?

LLM output can appear intelligent because it resembles the writing of people who were thinking when they produced the original text. If you ask for advice, the model generates text that resembles advice. If you ask for an explanation, it generates text that resembles an explanation. The appearance of reasoning comes from the patterns in the training data, not from any understanding in the model. The model produces sequences that look thoughtful because thoughtful sequences are common in the text it has seen.

The resemblance is superficial. The model does not understand the text it produces. It does not know whether a statement is true or false. It only reflects that certain sequences of tokens tend to follow others.

The appearance of intelligence comes from the patterns in human writing, not from the model itself.

Are humans interpreting the output as intelligent

Humans are skilled at projecting meaning onto language. When we read coherent text, we assume intention behind it. We assume a mind. We assume agency. This is a natural response, but it can mislead us when dealing with LLMs.

The model does not intend anything. It generates plausible continuations of text. The sense of intelligence comes from the reader, not the machine. The machine provides form. The human provides interpretation.

Our instinct to attribute intention makes the output seem smarter than it is.

This distinction matters because it prevents us from assuming abilities the model does not have.

What this means for us

An LLM is possible because we can statistically model features of language that matter to humans.

LLMs are powerful tools for generating language. They are not thinking machines. Their strengths lie in pattern reproduction. Their weaknesses lie in the absence of understanding. They can assist with tasks that depend on language, but they cannot replace human judgement.

A clear grasp of how these systems work helps avoid confusion. It prevents anthropomorphism. It supports responsible use. It keeps expectations grounded in what the technology can actually do, rather than what it appears to do.

The more plainly we describe these systems, the easier it becomes to use them well and to avoid treating them as something they are not.

In the end, an LLM is a system that maps patterns in language and reproduces them at scale. It does not think or understand. It follows geometry shaped by training, and we interpret that geometry as meaning. Knowing this helps us use these systems effectively, without expecting them to behave like people or to possess abilities they do not have.

All of this leads to a simple conclusion: understanding these limits helps us use LLMs effectively and responsibly.

Why clarity matters

LLMs are powerful because language has structure, not because the systems understand it. They reproduce patterns we find meaningful, and we supply the meaning. When we keep that distinction clear, we avoid treating statistical machinery as a mind, and we avoid outsourcing judgement to a system that has none. Practical wisdom begins with seeing these systems as they are, not as we are tempted to imagine them.

How large language models actually work, and why they are not miniature humans
How an LLM processes what you type
- Tokens
- Lists of numbers represent a point in space
Moving about the space
Why this works
It is all in the training data
- The human role in meaning
We supply the intelligence
What is in the billions of parameters
- Parameters do not contain knowledge
What training actually does
Why this is not human learning
Is the output a simulation of intelligence?
Are humans interpreting the output as intelligent
What this means for us
- Why clarity matters
Related Work
Table of Contents

Team AI is the Next Step Beyond Cut-and-Paste AI

2026-05-06T00:00:00+00:00

Table of contents

This is a shorter, more general version of the original article which focuses on how software delivery occurs and how Team AI can unleash more benefits.

Team AI Is the Next Step Beyond the Cut‑and‑Paste Era

Most organisations now use individual AI tools. People rely on them to tidy up documents, summarise meetings, draft messages, and speed up small tasks. These tools are handy, but the gains are limited. They help the person using them, not the team they sit within.

The next step is not bigger models or cleverer prompts. The next step is team‑level AI — systems that work on the shared activity that shapes how a group performs. Individual AI is a private assistant. Team AI becomes part of the operating rhythm.

The limits of individual AI

Individual AI only sees what one person sees. It has access to their notes, their tasks, their inbox, and their immediate concerns. It cannot see shared priorities, past decisions, emerging risks, or the dependencies that affect everyone else.

This is why the cut‑and‑paste era of AI has reached its ceiling. People are now quicker at the edges of their job, but the centre — the shared work — remains unchanged. Delays, misunderstandings, rework, duplicated effort, and drift between teams all persist when AI is confined to individuals.

A team does not slow down because one person works slowly. It slows down because people wait for clarity, alignment, decisions, or information that sits between them. Individual AI cannot fix that.

Where team AI makes the difference

Team AI works on the shared system: the plans, decisions, knowledge, risks, coordination, and communication that hold a team together. It strengthens the connective tissue rather than the individual muscles.

A team‑level AI can:

keep shared information consistent
surface risks before they grow
maintain a single view of decisions and their reasoning
reduce ambiguity in plans and documents
highlight blockers and dependencies
keep people aligned without constant meetings
support onboarding by holding the team’s collective memory

These are structural improvements, not personal conveniences. When the shared work becomes clearer and faster, the whole team moves more smoothly. The gains compound because they affect everyone, not just the person using the tool.

Why this matters now

Most organisations have already taken the easy wins from individual AI. The novelty has faded. The returns are flattening. People are quicker at producing text, but the organisation is not quicker at producing outcomes.

The real bottlenecks are collective. They sit in the gaps between people. This is where time is lost and where mistakes creep in. It is also where AI has the most leverage, but only if applied at the level of the team.

Team AI is not about replacing judgement. It is about keeping the shared system coherent so people can make better decisions with less friction.

The shift ahead

The organisations that move next will treat AI as part of how the team works, not as a personal tool. They will use it to maintain shared understanding, reduce waiting, and keep work flowing. They will treat AI as a steady presence that supports the group, not a gadget for individuals.

The cut‑and‑paste era of AI was a useful start. But the real gains come when AI stops being a private assistant and becomes part of the team’s operating model.

Team AI is the next step. It is the only way to see meaningful, sustained improvement — not in how fast individuals work, but in how well the team works together.

Team AI Is the Next Step Beyond the Cut‑and‑Paste Era
Related Work
Table of Contents

AI Engineering must be Team-Based to See Significant ROI

2026-05-05T00:00:00+00:00

Table of contents

Modern software teams are already moving faster because individual engineers use AI. Yet the real gains are still ahead. The biggest improvements do not come from speeding up coding. They come from speeding up the work that happens between people. That is where most of the time is lost, and where AI has the greatest leverage when applied at the level of the team.

A software engineer using AI increases their coding speed by 30 to 75 percent. But coding is only 30 percent of the job. The remaining 70 percent is the work that makes coding possible, safe, and correct. This work is shared, and it is deeply tied to the rest of the team.

Requirements, clarification and planning (15 to 20 percent)
Meetings and coordination (10 to 15 percent)
Code review (10 to 15 percent)
Debugging, testing, and validation (15 to 20 percent)
DevOps, tooling, and environment work (5 to 10 percent)
Documentation and knowledge work (5 to 10 percent)

These figures come from McKinsey, GitHub, Stripe, and Harris Poll. They show that most of an engineer’s time is spent on team‑level activities.

Modern Software is delivered by Teams

These twelve activities shape team throughput. Every delivery team performs them, and they determine how quickly and safely software moves from idea to production.

Task	Activities	Purpose
1. Understand and Shape Work	- Product discovery - Prioritisation - Requirements shaping - Trade off decisions - Roadmapping - Forecasting	This is where the team decides what to build and why.
2. Plan and Coordinate Delivery	- Sprint planning - Iteration planning - Capacity planning - Cross team alignment - Risk identification - Risk mitigation	This is the team level coordination layer.
3. Design the Solution	- Architecture design - System design - API design - Interface design - Technical decisions - Design documentation	This is where the team decides how to build it.
4. Build the Solution	- Coding - Test creation - Refactoring - Local environment work	This is the implementation phase.
5. Validate and Integrate	- Code reviews - Automated testing - Manual testing - Integration workflows - Merge workflows	This is the quality and integration gate.
6. Iterate and Fix	- Debugging - Fixing test failures - Addressing review comments - Retesting	This is the iteration loop.
7. Deploy and Operate	- Release management - Monitoring - Observability - Incident response - On call operations	This is the operational responsibility layer.
8. Learn and Improve	- Retrospectives - Post incident reviews - Process improvement - Tooling upgrades	This is how the team improves its delivery system.
9. Maintain Flow	- Manage work in progress - Unblock teammates - Reduce handoff delays - Remove bottlenecks	This is the team’s ability to maintain throughput.
10. Manage Team Knowledge	- Documentation - Architecture knowledge - Domain knowledge - Onboarding new engineers	This is the team’s collective memory.
11. Communicate and Align	- Stakeholder updates - Status reports - Cross team communication - Decision logging	This is the communication layer that keeps the system coherent.
12. Govern and Ensure Compliance	- Security reviews - Regulatory compliance - Data governance - Risk management	This is essential in regulated, cloud native environments.

These twelve activities define how modern software is delivered. Every engineer contributes to them, but not in equal measure. To understand where AI creates leverage, we need to look at how an engineer’s time maps onto this system. That is what the next section describes.

What an Engineer Does

The work of an engineer is given in the Engineer Time column, their work feeding into the team activities described in column two.

Engineer Time	Team Activities	Why this is Necessary
Requirements, clarification, planning	1. Understand and Shape Work; 2. Plan and Coordinate; 3. Design the Solution; 11. Communicate and Align	Engineers must understand the problem, shape requirements, and make trade offs before design.
Meetings and coordination	2. Plan and Coordinate; 9. Maintain Flow; 11. Communicate and Align; 12. Govern and Ensure Compliance	Coordination keeps work flowing, dependencies managed, and compliance aligned.
Coding	4. Build the Solution	Engineers turn all the work thus far into working computer code, using business infrastructure, processes and standards.
Code review	5. Validate and Integrate; 6. Iterate and Fix; 10. Manage Team Knowledge	Code review is the quality gate, integration control point, and knowledge sharing mechanism.
Debugging, testing, validation	4. Build the Solution; 5. Validate and Integrate; 6. Iterate and Fix; 7. Deploy and Operate	Debugging and validation dominate the iteration loop and ensure correctness end to end.
DevOps, tooling, environment work	4. Build the Solution; 7. Deploy and Operate; 8. Learn and Improve; 9. Maintain Flow	Tooling and environment work underpin build stability, deployment reliability, and flow.
Documentation and knowledge work	1. Understand and Shape Work; 3. Design the Solution; 10. Manage Team Knowledge; 11. Communicate and Align	Documentation is the team’s shared memory and design clarity mechanism.

The two hghlighted rows show the "coding" step, that is predominantly done by the software engineer alone.

Coding is the final expression of a much larger collaborative effort. The other 70 percent of the role ensures that what is coded is the right thing, built the right way, that is safe to run in production.

Software Engineer Adoption of AI is Individual

Developers are adopting AI tools on their own, at scale, and ahead of their organisations. JetBrains reports that 90 percent of developers now use at least one AI tool at work, and 74 percent have adopted specialised assistants independently. GitHub finds the same pattern: engineers use AI to improve their own speed and reduce cognitive load, not to change team workflows.

The result is a widening gap between personal productivity and the unchanged delivery system that the individuals operate within.

Accelerate One, Accelerate Many

When AI speeds up one engineer, it speeds up the interactions around them: reviews, iteration loops, testing throughput, coordination, and decision making. These effects compound across the delivery system.

Yet individual AI only improves the local interactions that depend on that engineer. Team level AI improves the global interactions that depend on shared context, shared artefacts, and shared decision making.

A team benefits from individual uplift, but several categories of work cannot be improved by individual tools alone.

Section Title	Activities	Summary
Individual AI cannot see or manage the team’s shared context	An engineer’s AI assistant only sees: - the engineer’s code - the engineer’s tasks - the engineer’s local context It cannot see: - the team’s backlog - the team’s dependencies - the team’s decisions - the team’s risks - the team’s architecture - the team’s workflow state Without this shared view, individual AI cannot improve: - planning - coordination - cross team alignment - decision logging - risk management	These are team level responsibilities, and they remain untouched.
Individual AI cannot improve the quality of shared artefacts	Even if every engineer uses AI, the team still has: - unclear requirements - inconsistent designs - missing decision records - uneven documentation - fragmented knowledge A team level AI can: - rewrite requirements for clarity - detect ambiguity across stories - maintain design consistency - summarise decisions - keep documentation aligned	This is a different category of improvement.
Individual AI cannot reduce waiting time between roles	Most delays in delivery come from: - waiting for a review - waiting for clarification - waiting for a decision - waiting for a fix - waiting for alignment A team level AI can: - answer clarifying questions - surface missing information - propose decisions - highlight blockers - keep flow moving	This is where the real throughput gains lie.
Individual AI cannot coordinate across roles	A delivery team includes: - product - design - QA - DevOps - security - architecture A team level AI can: - translate between roles - maintain shared understanding - track dependencies - keep everyone aligned	This is essential for predictable delivery.
Individual uplift is local; team uplift is structural	Individual AI improves: - how fast a person works Team level AI improves: - how the team works The first is additive. The second is multiplicative.	Team‑level improvements are multiplicative because they affect several people across the team’s communication network, not just the individual who uses the tool.

A team cannot reach the next level of performance without AI that operates on the shared system, not just the individuals within it.

When every member of the delivery team becomes faster and clearer in their part of the system, the throughput of the whole team increases non linearly.

Team Throughput

Team throughput is shaped by the slowest interaction in the workflow. Delivery moves when shared activities move: reviews, fixes, integration, decisions, documentation, coordination, and onboarding.

Onboarding shows this clearly. A new engineer becomes productive when they understand the system, the domain, the architecture, the conventions, and the team’s way of working. These are team level artefacts. AI helps only when the team applies it to the shared knowledge and processes that support this learning.

AI Acceleration

AI can speed up every shared activity listed above. These activities are constraints that the whole team depends on. When they move, the system moves. The effect is non linear because software delivery is dominated by interaction rather than individual effort.

Faster reviews, clearer decisions, and quicker coordination reduce the waiting time between people, which shortens the entire cycle.

Example: How reduced waiting shortens the cycle

Imagine a team working on a small feature. The work passes through five steps:

Write the change
Wait for review
Apply fixes
Wait for approval
Merge and test

Without team level AI

Writing the change: 3 hours
Waiting for review: 1 day
Fixing comments: 1 hour
Waiting for approval: half a day
Merging and testing: 2 hours

The total time is not the 6 hours of work. It is the 1.5 days of waiting wrapped around it.

Team level AI reduces waiting

Team level AI helps the reviewer by summarising the change, checking for risks, and drafting comments. It helps the author by preparing fixes and clarifications, and by coordinating activity through the five stages.

The waiting times drop:

Writing the change: 3 hours
Waiting for review: 2 hours
Fixing comments: 30 minutes
Waiting for approval: 1 hour
Merging and testing: 2 hours

The work is still roughly 6 hours, but the waiting has fallen from 1.5 days to about 5 hours. With an 8 hour day, the cycle drops from 18 hours to 11.

Reducing idle time is key

The work has not changed. The gain comes from removing the idle time between people. Reducing waiting shortens the whole cycle. This is where team level AI has its strongest effect. It acts on the delays that dominate delivery, not the small pockets of individual effort.

When these delays shrink, the system moves more quickly. Reviews happen sooner, decisions are clearer, fixes flow more easily, and work spends less time sitting in queues. The improvements are non linear because the team is no longer held back by the slowest interaction.

AI Benefits at the Team Level

The gains that matter most cannot be achieved through individual AI use alone. Individual uplift improves personal speed, but it does not change the structure of the team’s workflow or the quality of the shared artefacts that the team relies on.

Team level performance improves only when AI is applied directly to the collective work: shaping requirements, coordinating plans, reviewing code, integrating changes, resolving ambiguity, documenting decisions, and keeping flow steady.

These activities form the delivery system. Improving them requires AI that operates at the level of the team rather than the individual.

Why Team AI is Necessary

Individual uplift improves the outputs that flow into team interactions. It does not improve the interactions themselves. The main bottlenecks in delivery are the points where people must work together: clarifying requirements, resolving ambiguity, negotiating trade offs, coordinating across roles, and maintaining shared understanding.

Individual AI helps a person contribute more quickly. Team level AI improves the clarity, accuracy, and speed of the shared work that binds the team together. This is where the real gains lie.

Team level AI

A team level AI agent can work on the shared system:

rewrite requirements for clarity
maintain architecture knowledge
surface risks
detect ambiguity
summarise decisions
generate consistent patterns
keep the team aligned
handle coordination and scheduling

Individual AI cannot do this because it has no view of the team’s shared context.

Individual AI cannot coordinate across roles

A delivery team includes product, design, QA, DevOps, security, architecture, and delivery management. Each role uses different tools and produces different artefacts. Individual AI tools do not coordinate across these boundaries.

A team level AI agent can maintain shared context, track dependencies, surface risks, ensure consistency, support the Agile process, and reduce coordination friction.

Team level uplift is a multiplier

Individual uplift is additive. It makes each person faster, but it does not change the structure of the system. Team level uplift is multiplicative. It changes the structure of the system, reduces shared constraints, collapses waiting time, improves flow, and increases throughput across the whole team.

This is why team level AI is required to unlock the full return on investment.

Conclusion

The shift to AI in software engineering will not be won through individual adoption alone. Teams already feel the lift from faster coding and quicker local tasks, but the real gains come when AI is applied to the shared work that governs how delivery actually happens. The constraints that slow teams down are collective, and so the improvements that matter must be collective as well.

The organisations that move first will be the ones that treat AI as part of their delivery system, not as a personal tool. They will use it to keep work flowing, reduce waiting, maintain shared understanding, and support the decisions that shape the product. Once AI is embedded at this level, the team’s throughput changes in a way that individual uplift can never reach.

The opportunity is simple. Teams that adopt AI together will outpace those that adopt it alone. The sooner a team treats AI as part of its operating model, the sooner it sees the return that individual tools cannot deliver.

Modern Software is delivered by Teams
What an Engineer Does
Software Engineer Adoption of AI is Individual
Accelerate One, Accelerate Many
Team Throughput
AI Acceleration
- Example: How reduced waiting shortens the cycle
AI Benefits at the Team Level
Why Team AI is Necessary
Team level AI
Individual AI cannot coordinate across roles
Team level uplift is a multiplier
Conclusion
Related Work
Table of Contents
Further Reading

Team-Based AI Engineering is Next Step After Individual AI for Coding

2026-05-05T00:00:00+00:00

Table of contents

Requirements, clarification and planning (15 to 20 percent)
Meetings and coordination (10 to 15 percent)
Code review (10 to 15 percent)
Debugging, testing, and validation (15 to 20 percent)
DevOps, tooling, and environment work (5 to 10 percent)
Documentation and knowledge work (5 to 10 percent)

These figures come from McKinsey, GitHub, Stripe, and Harris Poll. They show that most of an engineer’s time is spent on team‑level activities.

Modern Software is delivered by Teams

These twelve activities shape team throughput. Every delivery team performs them, and they determine how quickly and safely software moves from idea to production.

Task	Activities	Purpose
1. Understand and Shape Work	- Product discovery - Prioritisation - Requirements shaping - Trade off decisions - Roadmapping - Forecasting	This is where the team decides what to build and why.
2. Plan and Coordinate Delivery	- Sprint planning - Iteration planning - Capacity planning - Cross team alignment - Risk identification - Risk mitigation	This is the team level coordination layer.
3. Design the Solution	- Architecture design - System design - API design - Interface design - Technical decisions - Design documentation	This is where the team decides how to build it.
4. Build the Solution	- Coding - Test creation - Refactoring - Local environment work	This is the implementation phase.
5. Validate and Integrate	- Code reviews - Automated testing - Manual testing - Integration workflows - Merge workflows	This is the quality and integration gate.
6. Iterate and Fix	- Debugging - Fixing test failures - Addressing review comments - Retesting	This is the iteration loop.
7. Deploy and Operate	- Release management - Monitoring - Observability - Incident response - On call operations	This is the operational responsibility layer.
8. Learn and Improve	- Retrospectives - Post incident reviews - Process improvement - Tooling upgrades	This is how the team improves its delivery system.
9. Maintain Flow	- Manage work in progress - Unblock teammates - Reduce handoff delays - Remove bottlenecks	This is the team’s ability to maintain throughput.
10. Manage Team Knowledge	- Documentation - Architecture knowledge - Domain knowledge - Onboarding new engineers	This is the team’s collective memory.
11. Communicate and Align	- Stakeholder updates - Status reports - Cross team communication - Decision logging	This is the communication layer that keeps the system coherent.
12. Govern and Ensure Compliance	- Security reviews - Regulatory compliance - Data governance - Risk management	This is essential in regulated, cloud native environments.

What an Engineer Does

The work of an engineer is given in the Engineer Time column, their work feeding into the team activities described in column two.

Engineer Time	Team Activities	Why this is Necessary
Requirements, clarification, planning	1. Understand and Shape Work; 2. Plan and Coordinate; 3. Design the Solution; 11. Communicate and Align	Engineers must understand the problem, shape requirements, and make trade offs before design.
Meetings and coordination	2. Plan and Coordinate; 9. Maintain Flow; 11. Communicate and Align; 12. Govern and Ensure Compliance	Coordination keeps work flowing, dependencies managed, and compliance aligned.
Coding	4. Build the Solution	Engineers turn all the work thus far into working computer code, using business infrastructure, processes and standards.
Code review	5. Validate and Integrate; 6. Iterate and Fix; 10. Manage Team Knowledge	Code review is the quality gate, integration control point, and knowledge sharing mechanism.
Debugging, testing, validation	4. Build the Solution; 5. Validate and Integrate; 6. Iterate and Fix; 7. Deploy and Operate	Debugging and validation dominate the iteration loop and ensure correctness end to end.
DevOps, tooling, environment work	4. Build the Solution; 7. Deploy and Operate; 8. Learn and Improve; 9. Maintain Flow	Tooling and environment work underpin build stability, deployment reliability, and flow.
Documentation and knowledge work	1. Understand and Shape Work; 3. Design the Solution; 10. Manage Team Knowledge; 11. Communicate and Align	Documentation is the team’s shared memory and design clarity mechanism.

The two hghlighted rows show the "coding" step, that is predominantly done by the software engineer alone.

Software Engineer Adoption of AI is Individual

The result is a widening gap between personal productivity and the unchanged delivery system that the individuals operate within.

Accelerate One, Accelerate Many

A team benefits from individual uplift, but several categories of work cannot be improved by individual tools alone.

Section Title	Activities	Summary
Individual AI cannot see or manage the team’s shared context	An engineer’s AI assistant only sees: - the engineer’s code - the engineer’s tasks - the engineer’s local context It cannot see: - the team’s backlog - the team’s dependencies - the team’s decisions - the team’s risks - the team’s architecture - the team’s workflow state Without this shared view, individual AI cannot improve: - planning - coordination - cross team alignment - decision logging - risk management	These are team level responsibilities, and they remain untouched.
Individual AI cannot improve the quality of shared artefacts	Even if every engineer uses AI, the team still has: - unclear requirements - inconsistent designs - missing decision records - uneven documentation - fragmented knowledge A team level AI can: - rewrite requirements for clarity - detect ambiguity across stories - maintain design consistency - summarise decisions - keep documentation aligned	This is a different category of improvement.
Individual AI cannot reduce waiting time between roles	Most delays in delivery come from: - waiting for a review - waiting for clarification - waiting for a decision - waiting for a fix - waiting for alignment A team level AI can: - answer clarifying questions - surface missing information - propose decisions - highlight blockers - keep flow moving	This is where the real throughput gains lie.
Individual AI cannot coordinate across roles	A delivery team includes: - product - design - QA - DevOps - security - architecture A team level AI can: - translate between roles - maintain shared understanding - track dependencies - keep everyone aligned	This is essential for predictable delivery.
Individual uplift is local; team uplift is structural	Individual AI improves: - how fast a person works Team level AI improves: - how the team works The first is additive. The second is multiplicative.	Team‑level improvements are multiplicative because they affect several people across the team’s communication network, not just the individual who uses the tool.

A team cannot reach the next level of performance without AI that operates on the shared system, not just the individuals within it.

When every member of the delivery team becomes faster and clearer in their part of the system, the throughput of the whole team increases non linearly.

Team Throughput

Team throughput is shaped by the slowest interaction in the workflow. Delivery moves when shared activities move: reviews, fixes, integration, decisions, documentation, coordination, and onboarding.

AI Acceleration

Faster reviews, clearer decisions, and quicker coordination reduce the waiting time between people, which shortens the entire cycle.

Example: How reduced waiting shortens the cycle

Imagine a team working on a small feature. The work passes through five steps:

Write the change
Wait for review
Apply fixes
Wait for approval
Merge and test

Without team level AI

Writing the change: 3 hours
Waiting for review: 1 day
Fixing comments: 1 hour
Waiting for approval: half a day
Merging and testing: 2 hours

The total time is not the 6 hours of work. It is the 1.5 days of waiting wrapped around it.

Team level AI reduces waiting

The waiting times drop:

Writing the change: 3 hours
Waiting for review: 2 hours
Fixing comments: 30 minutes
Waiting for approval: 1 hour
Merging and testing: 2 hours

The work is still roughly 6 hours, but the waiting has fallen from 1.5 days to about 5 hours. With an 8 hour day, the cycle drops from 18 hours to 11.

Reducing idle time is key

AI Benefits at the Team Level

These activities form the delivery system. Improving them requires AI that operates at the level of the team rather than the individual.

Why Team AI is Necessary

Individual AI helps a person contribute more quickly. Team level AI improves the clarity, accuracy, and speed of the shared work that binds the team together. This is where the real gains lie.

Team level AI

A team level AI agent can work on the shared system:

rewrite requirements for clarity
maintain architecture knowledge
surface risks
detect ambiguity
summarise decisions
generate consistent patterns
keep the team aligned
handle coordination and scheduling

Individual AI cannot do this because it has no view of the team’s shared context.

Individual AI cannot coordinate across roles

A team level AI agent can maintain shared context, track dependencies, surface risks, ensure consistency, support the Agile process, and reduce coordination friction.

Team level uplift is a multiplier

This is why team level AI is required to unlock the full return on investment.

Conclusion

Modern Software is delivered by Teams
What an Engineer Does
Software Engineer Adoption of AI is Individual
Accelerate One, Accelerate Many
Team Throughput
AI Acceleration
- Example: How reduced waiting shortens the cycle
AI Benefits at the Team Level
Why Team AI is Necessary
Team level AI
Individual AI cannot coordinate across roles
Team level uplift is a multiplier
Conclusion
Related Work
Table of Contents
Further Reading

Global AI Trends 2024–2025

2026-05-04T00:00:00+00:00

Table of contents

Global Trends in AI

Artificial intelligence has entered a new phase. It is no longer a pilot or proof of concept. AI is core infrastructure; a technology that shapes how economies operate and how firms compete.

Evidence from the Microsoft AI Economy Institute (AIEI), Stanford HAI, and McKinsey shows rapid adoption and a widening gap between leaders and others. What follows is a concise summary of the period from 2024 to 2025, based solely on verified and reliable evidence.

The global evidence shows fast adoption, rising capability, and a widening gap between regions. These patterns set the context for the country level picture, where the United States remains a major driver of development, investment, and commercial uptake.

Global picture

Global adoption and diffusion

The AIEI reports that roughly one in six people worldwide used a generative AI tool in the second half of 2025. The same study states that 24.7 percent of the working age population in the Global North used generative AI tools, compared with 14.1 percent in the Global South. The AIEI attributes this gap to differences in infrastructure, skills, and policy readiness.

Commercial traction and investment

The State of AI Report 2025 notes that 44 percent of United States businesses paid for AI tools in 2025, up from 5 percent in 2023. UNCTAD in its 2023 Technology and Innovation Report confirms strong global growth in AI related companies and investment, especially in economies with established technology sectors and supportive policy environments.

Conclusions

The global evidence points to three clear conclusions.

First, AI use is now widespread. McKinsey reports that 88 percent of firms use AI in at least one function, though most have yet to scale it across the enterprise.

Second, capability continues to rise. Stanford HAI shows sharp year‑on‑year improvements in benchmark performance and a steep fall in model‑usage costs.

Third, investment is concentrated. The United States leads private AI investment, with China closing the performance gap in model quality.

In the Future

The verified evidence suggests three grounded developments.

First, wider business uptake is likely. McKinsey finds most organisations are still in pilot mode, implying further diffusion as workflows are redesigned.

Second, capability gaps between regions may widen. The AIEI reports higher adoption in the Global North, driven by infrastructure and skills, and Stanford HAI shows the United States and China pulling ahead in model development.

Third, investment patterns point to continued commercialisation. Stanford HAI records strong private investment in generative AI, with the United States far ahead of other economies.

These trends indicate a maturing technology, uneven readiness across regions, and a period where firms that can integrate AI into workflows will move faster than those still experimenting.

North America

United States

The State of AI Report 2025 reports that United States organisations continue to lead in frontier model (LLM) development and commercialisation. The AIEI diffusion study places the United States 24th globally for working age usage of generative AI tools, at 28.3 percent. The Federal Reserve Board in its 2026 FEDS Note reports high AI adoption in United States professional services and financial services.

Canada and Mexico

Statistics Canada reports that 12.2 percent of Canadian firms used AI to produce goods or deliver services in 2025, with a further 14.5 percent planning to adopt AI within the following year.

This reflects a steady rise in enterprise use rather than a population level diffusion measure.

Broader policy material, including the Pan Canadian Artificial Intelligence Strategy and the work of institutes such as Amii, Mila, and Vector, confirms an active national ecosystem but does not provide quantified adoption metrics.

Mexico

The OECD reports that around 20 percent of Mexican firms use at least one AI technology, but this is a general AI adoption figure, not a generative AI diffusion metric and is not tied to 2024 to 2025 specifically.

Conclusions

The United States stands out for commercial uptake. In the U.S., public uptake is clearly more advanced, with clearer evidence of scale and investment.

Canada’s AI uptake is driven mainly by firms rather than the general population. The Statistics Canada figures point to a measured, incremental pattern of adoption, with a clear pipeline of organisations preparing to introduce AI into their operations. The wider national ecosystem is active, but the absence of quantified diffusion data means the scale of use beyond the enterprise level cannot be assessed.

Mexico’s position is different. The OECD figure shows that a notable share of firms use at least one AI technology, but the measure is broad and not tied to generative AI or the 2024–2025 period. The available evidence therefore gives a sense of adoption but not its depth, maturity, or rate of change.

Looking to the Future

Canada and Mexico

The verified material suggests that Canada’s enterprise‑level adoption is likely to continue rising, given the proportion of firms planning to adopt AI and the presence of established research institutes. The lack of population‑level data remains a gap, limiting visibility of wider diffusion.

Mexico’s general adoption figure indicates that AI is present across parts of the economy, but the absence of more granular or time‑specific data makes it hard to track progress or compare with other regions. Both countries would benefit from more consistent measurement to understand how adoption evolves over time.

The United States

The United States shows a more advanced stage of AI commercialisation than its neighbours. The scale of paid use indicates that AI has moved beyond trial activity and is now embedded in day‑to‑day business operations. This reflects a market where firms are not only experimenting but committing resources and integrating AI into core workflows.

The strength of the U.S. research and investment base reinforces this position. A large share of global private investment, combined with a concentration of leading model developers, gives the U.S. a structural advantage. This creates a feedback loop: strong domestic capability supports commercial uptake, and commercial uptake in turn drives further capability.

Public use also appears more developed. Higher adoption levels across the Global North, combined with the U.S. role as a major producer and buyer of AI systems, point to a broader diffusion of tools into everyday work and consumer contexts.

Taken together, the evidence shows an economy where AI is already part of the operational fabric, supported by deep investment, strong research output, and a business environment that moves quickly from experimentation to deployment.

How U.S. businesses can build on their current position

The evidence shows that the United States holds two structural advantages: strong commercial uptake and deep private investment. China, by contrast, leads in large‑scale deployment in specific sectors and in state‑directed industrial programmes. These differences shape how firms in each country can move.

For U.S. businesses, the main advantage is speed. The high rate of paid use means firms are already integrating AI into everyday operations. This allows them to refine workflows, build internal capability, and compound gains earlier than competitors. The depth of private investment also gives U.S. firms access to a broad supply of models, tooling, and infrastructure, which lowers the cost of experimentation and adoption.

China’s strength lies in coordinated deployment across priority sectors. This creates scale quickly, but it also means firms operate within a more directed innovation environment. U.S. firms, by contrast, benefit from a more open commercial ecosystem, where competition between providers drives rapid improvement in tools and services.

The practical insight is that U.S. businesses can move faster because the commercial environment rewards early adoption and continuous iteration. They can integrate AI into products and operations without waiting for sector‑level programmes or central coordination. This gives them room to differentiate on execution, workflow design, and customer experience.

In short, the U.S. position allows firms to take advantage of a mature market, strong investment flows, and a competitive supply base, while China’s model favours rapid scaling within targeted sectors. Each system has its strengths, but the U.S. environment gives individual firms more freedom to act and adapt.

Europe, Middle East and Africa

Europe

Euronews in 2026, reporting on Eurostat generative AI usage data, identifies Norway, Ireland, France, and Spain as leaders in individual level adoption. Euronews also reports that countries with strong digital infrastructure, sustained skills investment, and mature employer practices show the highest usage. The same reporting highlights Europe as an active digital governance environment, although specific AI laws are not detailed in the confirmed sources.

United Kingdom

The United Kingdom appears consistently in major global analyses as a leading centre for AI research, policy development, and commercial activity.

The State of AI Report 2025 highlights the United Kingdom's role in research of frontier models (LLMs) and safety research. UNCTAD in its 2023 Technology and Innovation Report places the United Kingdom among economies with strong technology sectors and supportive policy environments.

Middle East

The AIEI diffusion study identifies the United Arab Emirates as the leading country per capita globally for working age usage of generative AI tools, at 64.0 percent in late 2025. The same study places Singapore second globally at 60.9 percent. The AIEI attributes these results to early investment in infrastructure, skills, and government adoption.

Africa

The AIEI diffusion study reports that AI adoption in the Global North has grown nearly twice as fast as in the Global South. Africa is considered part of the Global South. The AIEI attributes lower adoption in the Global South to differences in infrastructure, skills, and policy readiness.

Conclusions

The direction of travel across Europe, the Middle East, and Africa differs markedly from the paths taken in the United States and China. Europe’s leading adopters show a pattern built on long‑term institutional strength: digital infrastructure, skills pipelines, and employer practices that support steady, broad‑based uptake. This creates a slower but more stable trajectory, shaped by governance and capability rather than market speed.

The United Kingdom follows a related but distinct route. Its position is driven by research depth, frontier model work, and policy activity. This gives the UK influence in shaping standards and governance, even if its commercial scale is smaller than that of the United States.

The Middle East, led by the UAE, shows a different model again. High usage levels reflect rapid state‑led investment and fast public‑sector adoption. This is a top‑down route to diffusion, where national strategy translates quickly into workforce behaviour.

Africa’s position reflects structural constraints. Lower adoption is tied to infrastructure, skills, and policy readiness. The pattern is one of uneven capacity rather than lack of interest or activity.

Looking to the Future

Europe is likely to continue along an institution‑led path, deepening adoption as digital foundations and skills programmes mature. The UK’s research and policy strengths position it to shape governance debates and influence global practice. The Middle East is set to maintain rapid uptake where government investment remains strong. Africa’s progress will depend on improvements in infrastructure and skills, which remain the main barriers to wider diffusion.

Contrast with the United States and China

The United States moves through commercial scale. Its advantage lies in rapid enterprise uptake, strong private investment, and a competitive market that rewards early adoption. Europe, by contrast, advances through governance, skills, and institutional capacity. The UK sits between the two: commercially active but anchored in research and policy.

China’s path is driven by coordinated deployment across priority sectors. This creates scale quickly, but within a more directed innovation environment. The Middle East mirrors the speed but not the structure: uptake is fast, but driven by targeted national investment rather than sector‑level industrial planning.

In Africa, adoption is limited by structural factors, not by market dynamics or state‑led programmes. Its direction is one of gradual capacity building rather than rapid scaling.

Taken together, EMEA’s direction is shaped by institutions, governance, and state‑led investment, while the United States advances through market scale and China through coordinated deployment. Each region moves, but for different reasons and at different speeds.

Asia

China

The State of AI Report 2025 notes that Chinese frontier model developers such as DeepSeek, Qwen, and Kimi have closed much of the performance gap with leading United States models on reasoning and coding tasks.

South Korea

The AIEI diffusion study highlights South Korea's rise from 25th to 18th place globally in 2025, driven by policy, improved Korean language model performance, and consumer facing features.

India and Japan

India and Japan do not appear in the confirmed AI diffusion rankings published by the AIEI. The AIEI study provides quantified usage data only for countries that reached the global leaderboard, and neither India nor Japan is listed.

Singapore

The AIEI diffusion study ranks Singapore second globally for working age usage of generative AI tools, at 60.9 percent. The AIEI links this to early investment in digital infrastructure, AI skilling, and government adoption.

Conclusions

Asia shows several distinct paths that differ from both the United States and China’s own internal model. China’s frontier developers have narrowed the performance gap with leading U.S. systems, signalling a region where capability is rising quickly and where model development is becoming more competitive. This marks China as a major technical actor rather than only a large‑scale adopter.

South Korea’s movement up the global diffusion rankings reflects a different dynamic: steady policy support, improved local‑language model performance, and consumer‑facing features that drive everyday use. This is a pattern of uptake built on national coordination and product relevance rather than frontier model competition.

Singapore sits at the opposite end of the spectrum from most of the region. Its very high usage levels show what early investment in infrastructure, skills, and government adoption can achieve. It is a small but highly capable market where diffusion is broad and rapid.

India and Japan’s absence from the confirmed diffusion rankings highlights a lack of comparable usage data rather than a lack of activity. Without quantified metrics, their position in the regional landscape cannot be assessed in the same way as China, South Korea, or Singapore.

Looking to the Future

China is likely to continue strengthening its position in model development, given the narrowing performance gap and the scale of its domestic ecosystem.

South Korea’s trajectory suggests further gains where policy, language models, and consumer products continue to align.

Singapore’s early‑investment model gives it room to maintain high usage levels as tools mature.

India and Japan’s future visibility depends on the availability of consistent diffusion data.

Contrast with the United States and China

The United States advances through commercial scale and rapid enterprise adoption. China advances through coordinated capability building and sector‑led deployment. Much of Asia outside China follows neither path.

South Korea and Singapore show targeted national strategies that drive uptake through infrastructure, skills, and consumer‑level features rather than market competition or industrial planning.

Taken together, Asia presents a mixed picture: China as a rising technical competitor to the United States, South Korea and Singapore as fast‑moving national adopters, and other major economies with limited measurable diffusion.

This stands in contrast to the U.S. model of commercial scale and China’s model of coordinated deployment.

Australasia

Australia and New Zealand

The Australian Bureau of Statistics reports that 24 percent of Australian businesses used AI technologies in 2023 to 2024. For New Zealand, Digital Skills Aotearoa states that 19 percent of organisations were using AI tools in 2023.

Conclusions

Australia and New Zealand show a measured but steady pattern of enterprise‑level AI uptake. The figures point to two economies where adoption is present across a meaningful share of organisations, but not yet at the scale seen in the most rapidly diffusing countries. The pattern is one of gradual integration rather than rapid acceleration, shaped by existing digital capability and sector composition.

The evidence also suggests that both countries are moving from early experimentation into more routine operational use. The adoption levels recorded indicate that AI is no longer confined to isolated pilots but is beginning to appear in day‑to‑day business activity. What remains less clear is the depth of use within firms and the extent to which adoption is spreading beyond early movers.

Looking to the Future

The available data points to a likely continuation of this steady trajectory. Both economies have the digital foundations and organisational structures to support further uptake as tools mature and become easier to integrate. The current adoption levels suggest room for growth, particularly as more firms shift from exploration to implementation.

Future progress will depend on how quickly organisations can build skills, update processes, and adapt workflows to make effective use of AI. More consistent measurement would also help clarify how adoption evolves across sectors and firm sizes.

Overall, Australasia appears set for continued, incremental growth in AI use, driven by practical business needs and supported by existing digital capability.

Latin America

The OECD reports that around 20 percent of Mexican firms use at least one AI technology. Approximately 15 percent of Brazilian firms report the use of AI tools. In Chile, OECD statistics show that 12 percent of firms use AI technologies. Beyond these three countries, the Inter American Development Bank notes rising AI use across Latin America, especially in financial services and agriculture, but the IDB does not publish national percentages.

Conclusions

Latin America shows a pattern of steady but uneven enterprise‑level adoption. The available figures point to a region where AI use is present across major economies but varies widely in scale. Mexico, Brazil, and Chile each show meaningful uptake, yet none approach the levels seen in the fastest‑moving countries globally. The broader regional picture, drawn from IDB material, suggests that adoption is strongest in sectors with clear operational gains, notably financial services and agriculture. This indicates a practical, needs‑driven approach rather than a technology‑led surge.

The absence of consistent national metrics beyond the three reported countries highlights a measurement gap. It is difficult to assess the depth or spread of adoption across the region without comparable data, and the evidence that does exist points to early‑stage integration rather than widespread diffusion.

Looking to the Future

The current pattern suggests that Latin America is likely to continue along a sector‑led path, with adoption growing where AI delivers immediate operational value. Financial services and agriculture are well placed to deepen their use, given the early signs of traction. Broader uptake will depend on improvements in digital infrastructure, skills, and measurement, which remain uneven across the region.

More consistent reporting would help clarify how adoption evolves and where gaps remain. As tools become easier to deploy and integrate, there is room for growth across a wider range of sectors, but the pace will depend on the underlying capacity of firms and national digital systems.

Overall, the region shows early movement, concentrated in specific industries, with scope for further progress as capability and measurement improve.

Cross cutting themes

Infrastructure and skills as foundations

The AIEI diffusion study states that countries investing early in digital infrastructure, AI skilling, and government adoption now lead global usage rankings.

Uneven diffusion and a widening divide

The AIEI highlights a widening divide between the Global North and the Global South, with adoption in the Global North growing nearly twice as fast.

Commercial traction and enterprise demand

The State of AI Report 2025 and UNCTAD 2023 both point to strong commercial traction and rising enterprise demand.

Governance, safety, and regulation

The State of AI Report 2025 notes active regulatory developments and growing attention to risks associated with highly capable AI systems.

Conclusion

AI progress in 2024–2025 is accelerating, but unevenly. The UAE and Singapore show what coordinated national strategy and real‑world deployment can achieve, while the US, China and Europe continue to shape the frontier through research, investment and commercialisation.

The emerging divide is not East vs West, it is between nations operationalising AI at scale and those still discussing its potential.

Global Trends in AI
Global picture
North America
Europe, Middle East and Africa
Asia
Australasia
Latin America
- Conclusions
- Looking to the Future
Cross cutting themes
Conclusion
Related Work
Table of Contents
Further Reading

Global AI Trends 2024–2025

2026-05-04T00:00:00+00:00

Table of contents

Global Trends in AI

Artificial intelligence has entered a new phase. It is no longer a pilot or proof of concept. AI is core infrastructure; a technology that shapes how economies operate and how firms compete.

Global picture

Global adoption and diffusion

Commercial traction and investment

Conclusions

The global evidence points to three clear conclusions.

First, AI use is now widespread. McKinsey reports that 88 percent of firms use AI in at least one function, though most have yet to scale it across the enterprise.

Second, capability continues to rise. Stanford HAI shows sharp year‑on‑year improvements in benchmark performance and a steep fall in model‑usage costs.

Third, investment is concentrated. The United States leads private AI investment, with China closing the performance gap in model quality.

In the Future

The verified evidence suggests three grounded developments.

First, wider business uptake is likely. McKinsey finds most organisations are still in pilot mode, implying further diffusion as workflows are redesigned.

Third, investment patterns point to continued commercialisation. Stanford HAI records strong private investment in generative AI, with the United States far ahead of other economies.

These trends indicate a maturing technology, uneven readiness across regions, and a period where firms that can integrate AI into workflows will move faster than those still experimenting.

North America

United States

Canada and Mexico

Statistics Canada reports that 12.2 percent of Canadian firms used AI to produce goods or deliver services in 2025, with a further 14.5 percent planning to adopt AI within the following year.

This reflects a steady rise in enterprise use rather than a population level diffusion measure.

Mexico

Conclusions

The United States stands out for commercial uptake. In the U.S., public uptake is clearly more advanced, with clearer evidence of scale and investment.

Looking to the Future

Canada and Mexico

The United States

How U.S. businesses can build on their current position

Europe, Middle East and Africa

Europe

United Kingdom

The United Kingdom appears consistently in major global analyses as a leading centre for AI research, policy development, and commercial activity.

Middle East

Africa

Conclusions

Looking to the Future

Contrast with the United States and China

In Africa, adoption is limited by structural factors, not by market dynamics or state‑led programmes. Its direction is one of gradual capacity building rather than rapid scaling.

Asia

China

South Korea

The AIEI diffusion study highlights South Korea's rise from 25th to 18th place globally in 2025, driven by policy, improved Korean language model performance, and consumer facing features.

India and Japan

Singapore

Conclusions

Looking to the Future

China is likely to continue strengthening its position in model development, given the narrowing performance gap and the scale of its domestic ecosystem.

South Korea’s trajectory suggests further gains where policy, language models, and consumer products continue to align.

Singapore’s early‑investment model gives it room to maintain high usage levels as tools mature.

India and Japan’s future visibility depends on the availability of consistent diffusion data.

Contrast with the United States and China

South Korea and Singapore show targeted national strategies that drive uptake through infrastructure, skills, and consumer‑level features rather than market competition or industrial planning.

This stands in contrast to the U.S. model of commercial scale and China’s model of coordinated deployment.

Australasia

Australia and New Zealand

Conclusions

Looking to the Future

Overall, Australasia appears set for continued, incremental growth in AI use, driven by practical business needs and supported by existing digital capability.

Latin America

Conclusions

Looking to the Future

Overall, the region shows early movement, concentrated in specific industries, with scope for further progress as capability and measurement improve.

Cross cutting themes

Infrastructure and skills as foundations

The AIEI diffusion study states that countries investing early in digital infrastructure, AI skilling, and government adoption now lead global usage rankings.

Uneven diffusion and a widening divide

The AIEI highlights a widening divide between the Global North and the Global South, with adoption in the Global North growing nearly twice as fast.

Commercial traction and enterprise demand

The State of AI Report 2025 and UNCTAD 2023 both point to strong commercial traction and rising enterprise demand.

Governance, safety, and regulation

The State of AI Report 2025 notes active regulatory developments and growing attention to risks associated with highly capable AI systems.

Conclusion

The emerging divide is not East vs West, it is between nations operationalising AI at scale and those still discussing its potential.

Global Trends in AI
Global picture
North America
Europe, Middle East and Africa
Asia
Australasia
Latin America
- Conclusions
- Looking to the Future
Cross cutting themes
Conclusion
Related Work
Table of Contents
Further Reading

AI and Brands: A Practical Framework for Protecting and Strengthening Brand Equity

2026-04-28T00:00:00+00:00

Table of contents

AI and Brands: A Practical Framework for Protecting and Strengthening Brand Equity

Artificial intelligence is reshaping how organisations operate, communicate, and compete. For brand‑led companies, the central question is not whether to adopt AI, but how to do so without weakening the brand assets that drive long‑term equity. Evidence from early adopters across consumer goods, luxury, retail, financial services, and hospitality shows a consistent pattern: AI creates value when it strengthens precision, consistency, and operational control. It destroys value when it introduces noise, dilutes identity, or automates interactions that depend on human judgement.

This paper outlines a pragmatic framework for leaders who want to deploy AI responsibly. It focuses on brand integrity, operational discipline, and governance. The goal is to help organisations adopt AI in a way that protects their distinctiveness and enhances long‑term brand value.

1. Protect the Brand's Voice

Brand equity is built on consistent language, narrative structure, and creative identity. AI systems that generate content without guardrails often drift toward generic phrasing and inconsistent tone. This risk increases when organisations use public large language models trained on broad internet data.

Leaders should ensure that AI reinforces the brand's established voice rather than reinterpreting it. This requires controlled training data, clear tone guidelines, and human review for all customer‑facing outputs.

2. Prioritise Precision Over Scale

Many AI deployments focus on volume: more content, more interactions, more automation. Evidence from Harvard Business Review (2023) shows that this approach often reduces quality and erodes brand trust. High‑performing organisations use AI to improve accuracy, consistency, and operational foresight, not to increase output indiscriminately.

Precision‑oriented use cases include demand forecasting, inventory optimisation, quality control, and internal decision support.

3. Keep AI Invisible to the Customer

Customer experience research as reported in Journal of Service Research (2022) shows that trust, empathy, and discretion are strongest when interactions are human‑led. AI should support frontline teams with insight and preparation, not replace them. Automated customer communication often feels transactional and reduces perceived brand value.

AI is most effective when it enhances human performance without becoming visible to the customer.

4. Avoid Generic Models and Generic Content

Public models and automated content tools tend to produce language that is interchangeable across brands. This undermines differentiation and introduces tone drift. Organisations that rely on generic AI systems risk losing control of their narrative and weakening their competitive position.

Brand‑aligned AI requires private models, curated training data, and strict governance.

5. Pilot in Low‑Exposure Domains First

The most successful AI programmes begin with internal, low‑risk domains where accuracy and operational efficiency can be measured objectively. These include forecasting, supply chain optimisation, service diagnostics, and workflow scheduling.

Early pilots should focus on measurable improvements and operational fit before any customer‑facing deployment.

6. Build Private, Controlled Models

Brand language, archives, and internal knowledge are strategic assets. They should be treated as intellectual property and protected accordingly. Private models trained on controlled datasets reduce the risk of data leakage, tone drift, and unpredictable behaviour.

A smaller, well‑governed model is often more effective than a large, public one.

7. Maintain Human Authority

AI can analyse patterns and surface insights, but final decisions should remain human‑led. This is especially important in areas involving brand expression, creative direction, and customer relationships.

Human oversight ensures accountability, protects brand integrity, and prevents over‑automation.

8. Govern Early and Rigorously

Effective AI governance requires clear rules for data handling, model updates, access control, and auditability. Organisations that establish governance early experience fewer failures and lower reputational risk.

Governance should include tone standards, review processes, and regular evaluation of model behaviour.

9. Reject AI That Competes With Brand Craft

AI‑generated creative outputs, automated engagement systems, and public authentication tools for goods (such as Entrupy) often conflict with the brand's identity and expertise. These systems can erode trust, reduce perceived quality, and create a false sense of modernity.

AI should never replace the craft, judgement, or creative leadership that define the brand.

10. Use AI to Strengthen What Makes the Brand Distinctive

The purpose of AI is not to transform a brand into an "AI‑driven" organisation. The purpose is to deepen the qualities that already differentiate the brand: coherence, precision, reliability, and long‑term equity.

AI should act as a precision instrument that enhances operational discipline and brand consistency.

Conclusion

AI can strengthen a brand when deployed with discipline, clarity, and strong governance. It can weaken a brand when used without boundaries or when adopted for speed rather than strategic fit. Industry leaders who treat AI as a tool for precision, not automation, will protect their brand identity while gaining measurable operational advantage.

AI and Brands: A Practical Framework for Protecting and Strengthening Brand Equity
1. Protect the Brand's Voice
2. Prioritise Precision Over Scale
3. Keep AI Invisible to the Customer
4. Avoid Generic Models and Generic Content
5. Pilot in Low‑Exposure Domains First
6. Build Private, Controlled Models
7. Maintain Human Authority
8. Govern Early and Rigorously
9. Reject AI That Competes With Brand Craft
10. Use AI to Strengthen What Makes the Brand Distinctive
Conclusion
Related Work
Table of Contents
Further Reading

AI for Luxury Watchmaking: Discipline Over Display

2026-04-28T00:00:00+00:00

Table of contents

Luxury watchmaking faces pressure to adopt AI at the pace of mass‑market retail, yet most AI trends undermine the very qualities that define a maison: scarcity, discretion, and narrative integrity. This piece argues for a disciplined, tightly governed approach in which AI behaves like a precision instrument — strengthening forecasting, consistency, atelier operations, and clienteling — while avoiding automation that dilutes tone or erodes craft. The maisons that lead will be those that adopt AI with restraint, clarity, and long‑term intent, not speed.

AI for Luxury Watchmaking: Precision Over Hype

Luxury watchmaking has always balanced heritage and innovation. AI is now unavoidable, and many maisons feel pressure to adopt it quickly. This piece outlines where AI strengthens a watch manufacturer’s competitive position, and where it introduces unnecessary risk.

The Industry Tension: Innovation Without Dilution

Luxury watchmaking operates under a structural tension. A maison must preserve the integrity of its craft, its archives, and its creative identity, while the wider market moves at a pace set by digital platforms, globalised retail, and increasingly data‑driven competitors. The pressure to demonstrate technological progress is real, and the risk of adopting the wrong technology is equally real.

AI is often presented as a universal solution, although most proposals are designed for mass‑market retail and not for a sector that trades on scarcity, discretion, and long‑term brand equity. Many AI deployments introduce operational noise, dilute the maison’s voice, or create a level of automation that conflicts with the expectations of collectors and high‑net‑worth clients. The industry has seen a wave of generic chatbots, automated outreach tools, and broad language models that promise efficiency and deliver inconsistency.

The central question is not "Should we use AI" but "Where does AI reinforce what makes us rare". The answer lies in a disciplined approach that focuses on precision, control, and selective adoption. AI can support a maison when it strengthens the elements that define luxury watchmaking: exacting standards, consistent execution across global markets, and the ability to anticipate client needs without compromising the human relationship.

The tension is therefore not between tradition and technology. The tension is between technology that respects the craft and technology that erodes it. AI can help a maison operate with greater foresight, greater consistency, and greater control over its identity. AI can also undermine the maison if it is deployed without clear boundaries. The opportunity lies in identifying the narrow set of use cases where AI behaves like a precision instrument rather than a mass‑market automation tool.

A realistic approach recognises that AI is most valuable when it is invisible to the client, tightly governed, and aligned with the maison’s long‑term positioning. The maisons that succeed will be those that adopt AI with restraint, clarity, and a focus on reinforcing the qualities that already set them apart.

Where AI Strengthens a Watch Maison

Protecting Brand Voice and Heritage

AI can act as a controlled reference system for maison language. It can ensure that every market, boutique, and partner uses the same terms, descriptions, and narrative structure that the atelier would use. This reduces drift, removes local improvisation, and protects the tone that collectors recognise.

A fine‑tuned internal model can map archive material, historical catalogues, and technical glossaries into a consistent linguistic standard. This creates a single source of truth for product descriptions, press notes, and after‑sales communication.

Off‑the‑shelf chatbots introduce inconsistency and generic luxury phrasing. They also risk accidental disclosure of internal language patterns. A maison should avoid them entirely.

Precision Forecasting for Limited Editions

AI can analyse historical demand, collector behaviour, macroeconomic signals, and secondary‑market patterns to support decisions on production volumes. This reduces the risk of over‑allocation and under‑allocation, and it protects the reputation of the maison.

A transparent model can show which variables drive demand. This allows leadership to justify decisions with evidence rather than instinct alone. It also supports more disciplined release planning.

Opaque models that cannot explain their recommendations should be avoided. A maison needs clarity, not guesswork wrapped in mathematics.

Strengthening Clienteling Without Massification

AI can support client advisors with discreet and context‑aware insights. These insights can include purchase history, service intervals, collector preferences, and upcoming milestones. The aim is to help the advisor prepare, not to automate the interaction.

AI can also identify subtle behavioural patterns, such as a client who only responds to in‑person appointments or a collector who follows a specific complication family. This allows advisors to act with greater precision.

Automated outreach that feels transactional undermines the human relationship. A maison should avoid any system that sends messages without human review.

Atelier and After‑Sales Efficiency

AI can support predictive maintenance for complications and movements. It can identify early signs of wear from service records, images, and bench data. This allows the atelier to plan work more effectively.

AI can optimise scheduling for watchmakers by matching complexity, parts availability, and historical repair times. This reduces idle time and improves throughput without compromising craftsmanship.

AI‑assisted diagnostics can shorten the time between intake and assessment. The watchmaker still makes the final decision. Human judgement remains essential for quality control.

Provenance, Traceability, and Anti‑Counterfeit Measures

AI‑enhanced image recognition can authenticate watches from micro‑ details that are invisible to the naked eye. This strengthens provenance checks and reduces reliance on manual inspection alone.

Provenance systems can combine blockchain records and AI anomaly detection to flag suspicious transfers or listings. This protects both the maison and the collector.

Public‑facing "AI authentication apps" undermine exclusivity and create false confidence. A maison should avoid them. Authentication should remain controlled, discreet, and expert‑led.

What Luxury Watch Brands Should Ignore For Now

Luxury watchmaking gains nothing from technology that creates noise, dilutes identity, or introduces operational risk. Several AI trends are highly visible and highly unsuitable for a maison that trades on precision, scarcity, and long‑term equity.

One trend is the push toward generic generative‑AI content. This includes automated product descriptions, automated social posts, and automated campaign copy. These systems produce language that feels interchangeable across brands. They flatten tone, remove nuance, and replace the maison’s voice with a synthetic approximation. For a sector that relies on narrative integrity, this is a direct threat.

Or consider the rise of fully automated customer service. Many vendors promote AI as a replacement for human interaction. This may work in mass‑market retail, although it is unsuitable for luxury. Automated systems struggle with discretion, context, and emotional intelligence. They also create a visible gap between the client and the maison at the exact moment when trust matters most.

Lastly, the deployment of broad, ungoverned language models is proving more popular. These models are often trained on public data and they behave in ways t#hat are difficult to predict. They can leak internal phrasing, drift in tone, and generate outputs that conflict with brand standards. They also introduce data‑handling risks that are incompatible with the privacy expectations of high‑net‑worth clients.

A maison that values long‑term equity should treat these trends with caution. They offer speed, although they do not offer precision. They signal modernity, although they do not strengthen the qualities that make a luxury watchmaker distinctive. The disciplined path is to ignore these trends and focus on AI that enhances control, consistency, and craft.

Generic generative‑AI marketing content should be avoided. It produces language that feels interchangeable with mass‑market retail and it erodes the distinct tone that collectors expect. It also creates a false sense of digital progress without improving any core capability.

AI‑designed watches should be avoided. They conflict with the creative identity of the maison and they reduce design to pattern matching. A watch is an expression of craft, not an output of algorithmic experimentation.

Broad and ungoverned LLM deployments should be avoided. They risk data leakage, tone drift, and inconsistent behaviour across markets. They also create dependencies that are difficult to unwind.

A disciplined maison ignores these trends and focuses on AI that strengthens precision, consistency, and long‑term brand integrity.

A Practical, Low‑Risk AI Roadmap for a Watch Maison

Establish a Brand‑Aligned AI Charter

A maison needs a clear charter before it adopts any AI system. The charter defines what AI must never do, such as dilute tone, automate client relationships, or expose internal language patterns. It also defines what AI should do, such as improve forecasting, strengthen consistency, and support atelier operations. Every decision should be anchored in heritage, precision, and discretion. This prevents drift and keeps the programme focused on long‑term equity rather than short‑term experiments.

Build a Controlled and Private Model

A maison should build a controlled model that is trained on its own archives, glossaries, and tone guidelines. This creates a private linguistic and operational asset that reflects the identity of the brand. The model should remain behind the firewall and should be treated as intellectual property. A small and well‑governed model is easier to audit, easier to update, and less likely to behave unpredictably. This approach avoids the risks associated with broad public models.

Pilot in Non‑Customer‑Facing Domains

The safest starting point is to pilot AI in areas that do not touch the client. Forecasting, atelier scheduling, and after‑sales diagnostics are ideal candidates. These domains benefit from pattern recognition and data analysis, and they allow the maison to test accuracy, governance, and operational fit without reputational exposure. Early pilots should focus on measurable improvements, such as reduced turnaround time or more accurate allocation planning. This builds internal confidence before any client‑facing deployment.

Introduce AI to Clienteling as a Silent Partner

When the maison is ready to extend AI to the client experience, it should do so with restraint. AI should act as a silent partner that supports the advisor with insights, not scripts. It can highlight service intervals, collector preferences, and relevant milestones. It should never generate messages on its own. The advisor remains the author of every interaction. This preserves the human relationship and ensures that the maison’s tone remains intact.

Establish Governance Early

Governance is essential from the outset. Every client‑facing output should receive human review. Every model decision should have an audit trail. Tone and accuracy checks should be conducted regularly. The maison should also define clear rules for data handling, model updates, and access control. Strong governance prevents drift, protects client privacy, and ensures that AI remains aligned with the values of the brand.

A disciplined roadmap allows a maison to adopt AI without compromising craft, identity, or exclusivity. The goal is not to automate luxury. The goal is to use AI to strengthen the qualities that already make the maison distinctive.

The Competitive Advantage: AI as a Precision Instrument

The maisons that will lead are not the maisons that adopt AI at speed. They are the maisons that adopt AI with discipline, clear boundaries, and a focus on long‑term equity. Speed creates noise. Discipline creates advantage.

AI should behave like a fine tool on a watchmaker’s bench. It should be precise, reliable, and invisible to the client. The value comes from quiet improvements in forecasting, consistency, and operational control, not from visible automation or digital theatrics.

A disciplined maison uses AI to strengthen the elements that already define its position: exacting standards, coherent global execution, and a client experience built on trust. AI can support these strengths by reducing variance, improving anticipation, and protecting the maison’s voice across markets.

The goal is not to become an "AI‑driven brand". The goal is to use AI to deepen what already makes the maison exceptional. When AI is treated as a precision instrument, it enhances craft rather than competes with it.

Closing Thought

Luxury watchmaking has survived every major technological shift through careful selection and disciplined restraint. AI is no different. The value lies in choosing the narrow set of applications that strengthen craft, consistency, and control, and ignoring the noise that surrounds the wider market.

When applied with purpose and respect for the métier, AI becomes an instrument of precision. It sharpens forecasting, protects identity, and supports the atelier without altering the essence of the work. It remains silent, reliable, and firmly under human direction.

A maison that treats AI in this way preserves heritage while gaining a measurable operational advantage. The craft stays intact. The identity remains coherent. The technology serves the brand, not the other way round.

AI for Luxury Watchmaking: Precision Over Hype
The Industry Tension: Innovation Without Dilution
Where AI Strengthens a Watch Maison
What Luxury Watch Brands Should Ignore For Now
A Practical, Low‑Risk AI Roadmap for a Watch Maison
The Competitive Advantage: AI as a Precision Instrument
Closing Thought
Related Work
Table of Contents

10 Everyday AI Workflows That Save Hours

2026-04-26T00:00:00+00:00

Table of contents

Artificial intelligence is a practical tool that speeds up routine thinking tasks. These ten workflows show how everyone can use it to save minutes every day. Those minutes add up into hours each week. And practise will make you prompt perfect.

1. Turn messy notes into clean summaries

Example
You paste a rambling 500‑word meeting transcript. The system produces a clear summary with action points.

Example prompt
"Here are my messy meeting notes. Please summarise the key decisions and list the action items clearly."

2. Draft emails from bullet points

Example
You write a few rough points. The system turns them into a polished email.

Example prompt
"Turn these bullet points into a polite, professional email: apologise for delay and ask for feedback by this Friday."

3. Explain complex topics in plain English

Example
You paste a confusing medical letter. The system rewrites it in simple, accurate language.

Example prompt
"Rewrite this in plain English for a non‑expert reader. Keep it accurate but simple. Do not add anything to the content."

4. Create quick plans for travel, meals, or events

Example
You request a two‑day trip plan. The system provides a structured itinerary with alternatives.

Example prompt
"Plan a two‑day trip to Edinburgh with indoor options if it rains. Include timings."

5. Turn long articles into short takeaways

Example
You paste a long news article. The system produces a five‑point summary.

Example prompt
"Summarise this article into five key points and give me a one‑sentence takeaway."

6. Brainstorm ideas when you feel stuck

Example
You need a name for a community newsletter. The system generates several options.

Example prompt
"Give me ten name ideas for a friendly community newsletter about local events."

7. Rewrite text in different tones

Example
You paste a blunt message. The system rewrites it in a more diplomatic tone.

Example prompt
"Rewrite this message to be polite and constructive while keeping the meaning."

8. Extract key information from documents

Example
You upload a contract. The system identifies renewal dates, obligations, and risks.

Example prompt
"Extract the key dates, obligations, and cancellation terms from this contract. Do not invent anything. Only use the data I have provided to you."

9. Create checklists from goals

Example
You want to declutter your house. The system turns this into a room‑by‑room checklist.

Example prompt
"Turn this goal into a step‑by‑step checklist: declutter my entire house this month."

10. Turn data into quick insights

Example
You paste a small spreadsheet of expenses. The system highlights trends and suggests improvements.

Example prompt
"Here is my monthly spending data. Identify trends and suggest three ways to reduce costs. Use only the data I have provided to you."

Conclusion

Begin with one or two workflows and expand from there. Small time savings accumulate quickly, and these tools can help you stay organised, informed, and in control.

1. Turn messy notes into clean summaries
2. Draft emails from bullet points
3. Explain complex topics in plain English
4. Create quick plans for travel, meals, or events
5. Turn long articles into short takeaways
6. Brainstorm ideas when you feel stuck
7. Rewrite text in different tones
8. Extract key information from documents
9. Create checklists from goals
10. Turn data into quick insights
Conclusion
Related Work
Table of contents

Building Safe, Compliant and Sustainable LLM Systems

2026-04-26T00:00:00+00:00

Table of contents

Building Safe, Compliant, and Sustainable LLM Systems

Large language models have introduced a profound shift in how software systems are conceived, built, and governed.

LLMs behave differently from traditional software, they introduce new categories of operational and regulatory risk, and they demand a level of architectural discipline that many organisations have not yet developed. Senior engineering leaders must therefore approach LLM adoption not as a technical experiment, but as a strategic transformation that affects safety, compliance, cost control, and organisational design.

This article sets out the principles, mandates, measurements, processes, and governance structures required to build reliable, auditable, and economically sustainable LLM systems. It is written for leaders who must ensure that their organisations deploy these technologies with clarity, discipline, and long‑term resilience.

Why LLM Systems Behave Differently from Traditional Software

Traditional software is deterministic. Given the same inputs, it produces the same outputs. Its behaviour is governed by explicit logic, and its failure modes are generally predictable. LLM systems are different. They are probabilistic, context‑sensitive, and heavily influenced by the data and instructions that surround them. Their behaviour can drift over time as models are updated, retrieval indexes age, and prompts evolve.

This difference has significant implications. An LLM system is not a single component but a pipeline of retrieval, orchestration, context assembly, and model inference. Most of the risk lies not in the model itself, but in the machinery wrapped around it. The system behaves more like a distributed workflow, where each step introduces latency, ambiguity, and potential failure. This is why LLM systems require a different form of engineering discipline and a different form of leadership oversight.

What This Means for Safety, Compliance, and Cost

Because LLM systems are probabilistic and context‑dependent, they introduce safety risks that cannot be addressed by persuasion or by relying on the model to behave. Safety requires layered controls, deterministic boundaries, and independent checks. Compliance requires observability across the entire pipeline, not just the final output. Cost control requires architectural discipline, because most expenditure arises from retrieval hops, long prompts, and orchestration overhead rather than from the model itself.

The business consequences are clear. Without strong governance, an LLM system can drift into non‑compliant behaviour, generate outputs that cannot be audited, or accumulate cloud costs that grow faster than the user base. Leaders must therefore treat LLM systems as operational assets that require continuous monitoring, disciplined design, and explicit accountability.

What Leaders Must Mandate

Senior leaders must set the tone and direction. The following mandates are essential:

The organisation must treat LLM systems as engineered pipelines, not magical components.
Safety must be enforced through layered controls outside the model.
Retrieval must be disciplined, localised, and monitored for freshness.
Prompts must be treated as executable logic, not prose.
Observability must capture every transformation, including retrieval sets, template expansions, and decoding parameters.
Latency and cost must be managed through architectural simplification, not through attempts to accelerate the model.
Continuous evaluation must be mandatory, because behaviour drifts over time.

These mandates establish the foundation for predictable, compliant, and economically sustainable systems.

What Teams Must Measure

Measurement is essential for control. Teams must track:

Retrieval quality and freshness, because stale or irrelevant context is a major source of error.
Latency across the entire pipeline, not just the model call.
Prompt length and token usage, because long prompts silently inflate cost and delay.
Orchestration overhead, including serial tool calls and unnecessary network hops.
Behavioural drift, measured through continuous evaluation against real traffic.
Safety violations caught by guardrails, and those that slipped through.
Cloud expenditure broken down by retrieval, orchestration, and inference.

These measurements allow leaders to understand where risk accumulates and where costs originate.

What Processes Must Change

LLM systems require new processes that reflect their probabilistic nature and their architectural complexity. Traditional software processes are insufficient. Organisations must introduce:

Continuous evaluation pipelines that run against real user traffic patterns.
Retrieval monitoring processes that detect index drift and data staleness.
Prompt review processes that treat prompts as code and enforce structure.
Safety review processes that test layered guardrails under varied phrasing.
Cost review processes that examine token usage, retrieval hops, and orchestration patterns.
Incident response processes that include retrieval logs, template expansions, and decoding parameters.

These processes ensure that the system remains stable, compliant, and economically viable over time.

What Architectural Principles Must Be Enforced

Architectural discipline is the strongest determinant of safety, reliability, and cost. Leaders must enforce the following principles:

Latency is architectural. Most delay comes from retrieval hops, network boundaries, and orchestration overhead.
Retrieval must be minimal, local, and purposeful. Excessive retrieval behaves like an over‑eager microservice mesh.
Prompts must be short, structured, and treated as logic.
Context windows are scratchpads, not memory. Only relevant information should enter them.
Safety must be enforced through deterministic layers, not through persuasive instructions.
Pipelines must avoid serial tool chains that behave like queues.
Orchestration must be simplified wherever possible, because overhead accumulates across every request.

These principles reduce risk, improve predictability, and control cost.

What Governance Structures Must Be Introduced

Governance is essential for organisations that wish to deploy LLM systems at scale. Leaders must introduce:

A cross‑functional LLM governance board that oversees safety, compliance, and cost.
A prompt governance process that ensures consistency, clarity, and auditability.
A retrieval governance process that monitors data freshness, index quality, and access control.
A safety governance framework that defines layered guardrails and tests them regularly.
A cost governance framework that tracks expenditure and enforces architectural discipline.
A model update governance process that evaluates behavioural drift before deployment.

These structures ensure that the organisation maintains control over systems that are inherently probabilistic and prone to drift.

Conclusion

LLM systems offer extraordinary potential, but they demand a level of discipline, governance, and architectural clarity that many organisations have not yet developed. They behave differently from traditional software, and they introduce new categories of risk that cannot be managed through persuasion or intuition. Senior leaders must therefore mandate strong architectural principles, enforce rigorous measurement, introduce new processes, and build governance structures that ensure safety, compliance, and cost control.

The organisations that succeed will be those that treat LLM systems as engineered pipelines, that design for predictability and auditability, and that recognise that the true challenges lie not in the model, but in the machinery that surrounds it.

Building Safe, Compliant, and Sustainable LLM Systems
Related Work
Table of Contents

Evaluating AI Systems: Metrics that Matter

2026-04-26T00:00:00+00:00

Table of contents

This article presents metrics that matter to help you evaluate an LLM for programmatic use.

Metrics to Evaluate AI Systems

1. Evaluation as an Engineering Discipline

Evaluating an AI system differs from evaluating deterministic software. LLMs generate tokens based on probability, so behaviour varies across runs and model updates. Effective evaluation focuses on observable behaviour, failure modes, and interface stability. The aim is to measure real system behaviour, not synthetic benchmarks.

2. The Evaluation Surface Area An AI system exposes a wide surface area.

Some parts are controlled by the model, such as token prediction, internal weights, and sampling. Other parts are controlled by you, including prompt structure, constraints, retrieval inputs, output formats, and integration. Good evaluation measures the combined behaviour of both sides.

3. Core Metrics for Programmatic Use

Systems that call an LLM as a component must measure schema reliability, instruction adherence, deterministic stability, and latency. Schema reliability covers valid JSON, field completeness, and type correctness. Instruction adherence measures how well the model follows constraints. Deterministic stability checks variance under fixed sampling. Latency covers time to first token, total response time, and variability.

4. Metrics for RAG Systems

RAG adds new evaluation needs. Grounding fidelity measures alignment between claims and retrieved documents. Fidelity is about how faithfully the model sticks to the source material. Citation accuracy checks that references are correct and not invented. Retrieval quality evaluates recall, precision, and chunking impact. These metrics show whether the system uses retrieval effectively.

5. Metrics for Public‑Facing Systems

Public‑facing systems require safety and behavioural stability. Safety metrics measure disallowed or high‑risk content and consistency across paraphrased prompts. Behavioural stability measures tone consistency, avoidance of persona drift, and predictability across varied inputs.

6. Metrics for Reasoning Systems

Reasoning systems must evaluate logical consistency, task breakdown, and error sensitivity. Logical consistency checks for contradictions. Task breakdown measures whether sub‑tasks are identified and ordered correctly. Error sensitivity evaluates behaviour under incomplete or conflicting information.

7. Failure Mode Analysis

Evaluation must include attempts to trigger failure modes. Boundary tests check for fabricated tools or capabilities. Hallucination tests examine behaviour under missing, conflicting, or overloaded context. Prompt dilution tests measure behaviour when constraints overlap or when the system prompt becomes long.

8. Longitudinal Metrics

AI systems change over time, so evaluation must track drift. Model update drift measures behavioural changes after updates and detects regressions. Prompt stability metrics measure sensitivity to small edits or ordering changes. Longitudinal evaluation ensures stability as the model evolves.

9. Practical Evaluation Framework

A practical framework includes unit tests for prompt layers, integration tests for retrieval, and end‑to‑end tests for workflows. Golden sets provide curated inputs with expected outputs for regression detection. Failure logging categorises schema errors, grounding failures, reasoning failures, and safety violations.

10. Evaluation as Ongoing Engineering Work

Evaluation is continuous. AI systems require ongoing measurement because their behaviour is probabilistic and subject to change. Metrics must reflect real failure modes and integration points.

A structured evaluation framework produces systems that behave predictably, integrate cleanly, and remain stable over time.

Conclusion

Evaluating AI systems is not a narrow task.

It spans deterministic correctness, probabilistic behaviour, grounding, safety, reasoning, retrieval, latency, and long‑term drift.

The surface area is far larger than that of conventional software components, because an AI system is not only the model but also the constraints, prompts, retrieval pipeline, and integration code wrapped around it.

A structured evaluation framework is therefore essential.

Programmatic use requires metrics for schema reliability, instruction adherence, deterministic stability, and latency.

RAG systems add grounding fidelity, citation accuracy, and retrieval quality.

Public‑facing systems require safety and behavioural stability.

Reasoning systems require checks for logical consistency, task decomposition, and error sensitivity.

Failure mode analysis must deliberately probe boundary violations, hallucination conditions, and prompt dilution.

Longitudinal metrics must track drift across model updates and prompt changes.

A practical framework must combine unit tests for prompt layers, integration tests for retrieval, end‑to‑end workflow tests, golden sets, and structured failure logging.

The conclusion is unavoidable: this is not work that can be handled as a side‑task by feature developers. The evaluation load is continuous, specialised, and multi‑disciplinary. It requires expertise in retrieval, safety, reasoning, software correctness, and long‑term system behaviour. It requires adversarial testing, regression detection, and maintenance of a living evaluation suite. The cost of inadequate evaluation is high: schema failures, grounding errors, safety issues, reasoning faults, and silent regressions, any one of which may lead to a lack of compliance and statutory exposure.

AI evaluation is its own engineering discipline. It requires a dedicated team with clear ownership, specialised tooling, and ongoing responsibility for ensuring that AI systems behave predictably, integrate cleanly, and remain stable over time.

Metrics to Evaluate AI Systems
Conclusion
Related Work
Table of Contents

How to Evaluate the Output of an AI Chat Session

2026-04-26T00:00:00+00:00

Table of contents

How to Evaluate the Output of an AI Chat Session

Introduction

Many people now use chat systems powered by artificial intelligence for writing, research, planning, or quick explanations. These systems can be helpful, but their output varies in quality. Some responses are clear and accurate, while others may be incomplete, misleading, or overly confident. Understanding how to evaluate what you receive makes the experience more efficient and safer.

A simple example shows why this matters. Someone might ask a chat system for a summary of a historical event and receive a clear explanation. The same person might then ask for a legal interpretation and receive an answer that sounds confident but is not reliable. The difference is not always obvious from the tone of the response.

Start With the Purpose of the Conversation

It helps to keep in mind what you are trying to achieve. A chat system can produce ideas, drafts, explanations, or examples very quickly. It is less reliable when the task requires specialist judgement, up‑to‑date facts, or precise interpretation.

For instance, asking for help brainstorming a travel itinerary is usually safe. Asking for a diagnosis based on symptoms is not. The system may sound equally confident in both cases, so the purpose of the conversation matters.

Check Whether the Output Matches the Question

Sometimes a chat system answers a slightly different question from the one you asked. This can happen when the prompt is broad or when the system tries to guess your intent.

A simple way to check is to read the answer and ask whether it addresses the specific point you raised. If you ask for "three reasons why a bridge design failed" and receive a general explanation of bridge engineering, the output is not wrong, but it is not what you asked for.

Look for Verifiable Details

Useful responses often contain information that can be checked. This might be a definition, a date, a description of a process, or a reference to a known concept. When a response includes details that can be confirmed, it becomes easier to judge its reliability.

For example, if you ask about how a particular sensor works, a good answer might describe the physical principle behind it. If the answer instead gives vague phrases such as "advanced technology" or "cutting edge performance", it may not be providing real information.

Notice When the System Sounds Certain

Chat systems often express ideas in a confident tone, even when the underlying information is uncertain. This is a normal behaviour of the technology, but it means that confidence should not be taken as a sign of accuracy.

A relatable example is when someone asks for the opening hours of a local shop. The system may provide a clear answer, but unless it has access to current information, the hours may be outdated or incorrect. The tone does not reflect the reliability.

Compare the Output With What You Already Know

If the response touches on a topic you understand, a quick comparison can reveal whether the system is on the right track. If something feels inconsistent with your knowledge, it may be worth checking further.

For instance, if you ask about a programming concept you use regularly and the answer describes it in an unfamiliar way, that is a signal to verify the information.

Ask for Clarification or a Different Angle

If a response seems incomplete or unclear, asking the system to explain the idea in a different way can help. Many people find that asking for an example, a step‑by‑step explanation, or a simpler description reveals whether the system actually captured the idea.

A practical example is when someone asks for an explanation of a financial term. If the first answer feels abstract, asking for "a simple example using everyday numbers" often makes the concept clearer.

Be Cautious With Sensitive or High‑Impact Topics

Some areas require extra care. These include medical advice, legal interpretation, financial decisions, and safety‑critical information. Chat systems can generate plausible text in these areas, but plausibility is not the same as accuracy.

A symptom checker example illustrates this. A system may describe a condition in a way that sounds precise, but it cannot assess real‑world risk or context. In such cases, the output should be treated as general information, not as a basis for action.

Look for Signs of Fabrication

Chat systems sometimes produce details that sound real but are not. These may include invented citations, incorrect statistics, or descriptions of events that never occurred. This behaviour is not intentional, but it can mislead readers who assume the information is factual.

A common example is when someone asks for a reference to a scientific paper and receives a title and author that look plausible but do not exist. Checking the reference quickly reveals the issue.

Use the System as a Tool, Not an Authority

A chat system can be a helpful assistant for drafting, exploring ideas, or learning about a topic. It is less suited to acting as a final source of truth. Treating it as a tool rather than an authority helps keep expectations realistic and reduces the risk of relying on incorrect information.

Conclusion

Evaluating the output of an AI chat session is a practical skill. Paying attention to the purpose of the conversation, the clarity of the answer, the presence of verifiable details, and the sensitivity of the topic can make the experience more effective and safer. With a few simple habits, it becomes easier to recognise when the system is providing useful insight and when additional checking is needed.

How to Evaluate the Output of an AI Chat Session
Related Work
Table of Contents

How to Use AI Safely and Effectively

2026-04-26T00:00:00+00:00

Table of contents

Recent headlines have shown the same unsettling pattern.

An AI system confidently generated legal cases that never existed, as reported when UK courts received filings built on fictitious case law (The Guardian, Scottish Legal News).

Health researchers have warned that AI can give medical guidance that is not just inaccurate but dangerously misleading. A British Medical Journal article as reported in the Independent stated that 20% of AI medical answers were "highly problematic".

And tech reporters have documented AI‑generated news summaries that included entirely fabricated headlines and events (Sky News).

In every case, the system generated output that communicated total confidence. In every case, the AI was wrong. Fluency is not understanding. Appearing proficient is not accuracy. This confusion is exactly where the real risk lies.

Give Clear Instructions

AI works best when you tell it exactly what you want. It does not infer your intentions or read between the lines. The output you see is a statistical software prediction based on patterns in the training data of the AI. The clearer your request, the better the output.

Start by stating your goal. Instead of asking, "Tell me about climate change," try: "Give me a 150‑word summary of the main causes of climate change for a general audience." A specific target gives the system's statistical pattern-matching something concrete to aim at.

Set the format you want. Simple instructions like "Give me three options," "Write this as a short email," or "List the steps in order" immediately improve the result. Format acts as a constraint, and constraints make the output sharper.

Define the audience. AI changes tone and detail depending on who you say it is for: beginners, executives, customers, or the general public. A single line about the audience can transform the clarity of the answer.

If accuracy matters, add constraints such as "Use widely accepted information," "If you’re unsure, say so," or "Do not invent details." These reduce the risk of confident mistakes.

Clear instructions make the output better and safer, but they do not eliminate the risk of mistakes. Even with perfect prompts, a system can still deliver something that sounds certain but is completely wrong.

The AI is not weighing evidence or checking facts. AI is programmed to produce an answer that appears most likely based on patterns in its training data. When those patterns point in the wrong direction, the result is a confident mistake. Your prompt has to help the AI navigate any bias or missing data in its training data. Think of your prompt as you nudging the AI in the direction you want to go.

When your task is large, break it into smaller steps. Ask for an outline first, then expand each section. AI performs far better when guided step‑by‑step.

Clear instructions don’t just improve the output, they keep you in control of the process.

Provide Enough Context

AI performs noticeably better when it has the background information it needs, such as who the audience is, what the situation involves, or what constraints apply.

When context is missing, the system often fills in the gaps with incorrect predictions that will look like guesses, and recent reporting shows how easily this can go wrong. The Guardian found that Google AI Overviews gave misleading health advice because the AI responded without understanding the medical circumstances involved, including a case where it advised pancreatic cancer patients to avoid high fat foods, which experts described as really dangerous. This is dangeous advice as some who suffer from pancreatic cancer are malnourished and consuming fat can be a nutritionally efficient way to ingest energy.

Check the Output Carefully

AI is not a source of truth, it is a generator of plausible answers, so treat every response as a draft, not a verdict.

Read the answer to then ask basic questions: Does this match what you already know, does it contradict trusted sources, does anything feel too neat or too extreme?

For factual topics, spot check key claims against reputable outlets or official documentation, especially numbers, names, dates, web links, and legal or medical details.

For writing tasks, look for invented quotes, fake references, or details that are oddly specific without any support.

If something important hinges on the answer, ask the system to show its reasoning, to list uncertainties, or to offer alternative possibilities.

The core habit is simple: never confuse a confident tone with a reliable answer. Once you see the answer you can ask the AI more questions to check the reliability of that answer. This is especially important if you are going to do something that relies on that answer.

Use AI for the Right Tasks

AI is most effective when the work involves drafting, summarising, organising ideas, exploring options, or speeding up early stage thinking.

AI can turn rough notes into a clean paragraph, reshape a long document into a shorter one, or generate several ways to frame a problem so you can choose the best one.

AI is also useful for outlining reports, comparing approaches, rewriting for different audiences, or helping you see alternatives you might not have considered. These are tasks where speed and structure matter more than perfect accuracy. You can make text accurate later.

AI is far less reliable when the task requires expert judgment, real world verification, or precise factual detail, so keep it focused on the parts of the job where it can genuinely help rather than the parts where it can get you into trouble.

Keep in mind that AI is not thinking. AI does not check for truth. It generates plausible text based on its training data.

Avoid Using AI for Judgement or Decisions

AI cannot weigh values, consequences, or ethics, and it cannot understand the human context that sits behind real decisions.

AI can offer options, outline trade offs, or summarise information, but it cannot decide what matters most, what is acceptable, or what is fair. Those choices rely on experience, responsibility, and an understanding of people, none of which an AI possesses.

Use AI to support your thinking, not to replace it. Human judgement must stay in charge, especially when the outcome affects safety, wellbeing, trust, or the outcome has long term consequences.

Be Cautious with Personal or Sensitive Information

Treat AI tools the same way you would treat an online form or an email to someone you do not know.

Do not share details that could identify you, expose someone else, or create problems if they were ever seen by the wrong person. This includes financial information, medical records, passwords, private conversations, or anything that involves children, colleagues, or business clients.

Keep the boundary simple. If you would hesitate before typing it into a website, keep it out of an AI prompt. The safest approach is to describe the situation in general terms and remove anything that is not essential to the task. This protects your privacy and prevents sensitive information from being handled in ways you cannot control.

Compare Answers with Reliable Sources

Treat AI output as a starting point, not a final answer, and cross check anything that matters with sources you trust.

This is especially important for facts that are time sensitive, technical, or likely to change. A quick comparison with reputable news outlets, official guidance, or well established reference material can reveal errors that are easy to miss when the writing sounds polished.

This habit is not about distrusting the tool, it is about protecting yourself from mistakes that come from outdated information, missing context, or confident AI guesses. When accuracy matters, a second source is not optional, it is part of the process.

Keep an Eye Out for Gaps or Oddities

A useful habit when reading AI generated answers is to notice when something feels slightly off. This might be an explanation that is too vague, a claim that is oddly specific without support, or a confident statement that does not match what you know.

When you see these signs, pause and ask a follow up question or check the detail elsewhere.

Recent reporting shows how easily small oddities can signal a deeper problem. The Guardian described how a senior European journalist was suspended after using AI tools to summarise material and then publishing quotes that the people involved had never said. The investigation found dozens of invented statements that looked polished and authoritative but were entirely false, and the journalist admitted he had fallen into the trap of trusting text that only sounded right.

Examples like this show why readers should stay alert to gaps, inconsistencies, or moments when an answer feels too neat. These are cues to check the AI's output.

Stay Aware of the Limits of AI

AI does not understand meaning, it has no lived experience, and it cannot draw on intuition or common sense.

AI works by recognising patterns in data and producing text that fits those patterns, not by grasping the reality behind the words. This means it can miss context, overlook nuance, or present something that sounds authoritative without any understanding.

AI cannot feel uncertainty, it cannot judge what is important, and it cannot tell when it has made a mistake. Keeping these limits in mind helps you use the tool for what it is good at and avoid expecting it to behave like a person.

Give Clear Instructions
Provide Enough Context
Check the Output Carefully
Use AI for the Right Tasks
Avoid Using AI for Judgement or Decisions
Be Cautious with Personal or Sensitive Information
Compare Answers with Reliable Sources
Keep an Eye Out for Gaps or Oddities
Stay Aware of the Limits of AI
Related Work
Table of Contents
Further Reading

Latency is architecural

2026-04-26T00:00:00+00:00

Table of contents

Latency is architectural

Most latency comes from retrieval hops, long prompts, and serial tool calls. The model call is rarely the slow part. The pipeline is the bottleneck. Optimise orchestration, not just the model.

Engineers often assume the model is the slow part. It usually is not. The real drag comes from the machinery wrapped around it.

Retrieval hops cost more than you expect

Every vector search, metadata filter, re‑rank, and chunk stitch is another network hop. Do that a few times and half your latency budget has vanished before the model has even seen a token. It is the old "too many microservices" problem wearing a new badge.

Too Many microservices

A system begins tidy, then grows arms and legs. Someone adds a retriever. Someone adds a re‑ranker. Someone adds a metadata filter. Someone adds a chunk stitcher. Each piece looks harmless. Each piece solves a problem. But once they are strung together, the whole thing slows to a crawl.

RAG pipelines follow the same pattern. Instead of ten microservices, you now have ten retrieval hops. Instead of service chatter, you have index chatter. Instead of JSON bouncing around a cluster, you have embeddings and chunks being passed across the network. The labels have changed, but the behaviour has not.

In a microservice stack, services talk to each other all day long. They pass JSON around, wait for replies, retry on failure, and generally keep the network busy. That is service chatter.

In a RAG stack, the same noise comes from your retrieval layer. The actors are different, but the behaviour is the same. Your vector index, keyword index, metadata store, and re‑ranker all talk to each other. They pass embeddings, scores, filters, and chunks back and forth. Each hop is another round trip. Each hop adds delay. Each hop adds another place for things to wobble.

It is chatter because none of it is real work from the user’s point of view. The user wants an answer. The system spends most of its time gossiping between indexes about which chunk might be relevant. It is busy, but not productive.

The point is simple. You have replaced one kind of internal noise with another. The labels have changed, but the cost has not. If you let the retrieval layer grow without discipline, it will behave exactly like an over‑eager microservice mesh. It will talk too much, wait too long, and slow everything down.

Every hop adds latency. Every hop adds a failure mode. Every hop adds mental overhead. Hop latency accumulates in the end-to-end-pipelines. The job becomes debugging the plumbing rather than improving the product. The system becomes sluggish, brittle, and full of odd surprises.

The lesson is the same as it was during the microservice boom. Keep the number of moving parts low. Keep the boundaries clear. Keep the data local whenever you can. If you do not, the pipeline will drag, no matter how fast the model is.

Leaving the process costs you

Vector search is typical for RAG, but it is not the only culprit. Any retrieval layer that reaches across the network will cost you time. It does not matter whether you use a vector index, a keyword index, a hybrid index, or a bespoke store. If you have to leave the process, hit a service, wait for it to return, and then stitch the results back together, you will pay for it in latency.

Long prompts are silent killers

Sending 200,000 tokens into a model is not free. As of April 2026, GPT-5.5 is USD 5.00 per 1 million tokens, so USD 1 for 200k tokens. This might not sound much but if your whole AI system that is made up from multiple pipelines calls OpenAI a thousand times in an eight-hour period, that is one call every 86 seconds, costing USD 1,000 per day. As you introduce features that rely on AI, this cost can balloon.

You pay for tokenisation, network transfer, and ingestion. It is the equivalent of posting a novel every time you want a paragraph back. Shorter prompts are not only cheaper, they are faster and far easier to reason about.

Cloud costs balloon because the pricing model rewards scale until it punishes you. Everything looks cheap at the start. A few API calls here, a small vector index there, a modest GPU for a prototype. Then the system goes live, traffic rises, and the bill climbs faster than the usage graph.

The pattern is predictable. You pay for every hop, every lookup, every token, every gigabyte, and every idle minute. The cloud does not care whether the work was useful. It charges for activity, not value.

RAG pipelines are especially prone to this. Retrieval is chatty. Each query touches several indexes. Each index has its own storage, compute, and network fees. The model call is only one line on the invoice. The real cost comes from the scaffolding wrapped around it.

Costs balloon because the architecture balloons. More hops. More services. More indexes. More caching layers. More background jobs. More monitoring. More logs. Every piece adds a little cost. Together they add a lot.

The cloud makes it easy to scale up, but it does not make it easy to scale down. Once the system is busy, you pay for the peaks, not the averages. You pay for the buffers, the replicas, and the safety margins. You pay for the comfort of not waking up at three in the morning.

The cloud invoice is driven by the highest sustained load, not the gentle baseline you see on a dashboard.

Cloud platforms charge for capacity, not comfort. When traffic spikes, the system scales out. Extra replicas spin up. Buffers grow. Queues stretch. More storage is touched. More network is consumed. The platform does not scale back the instant the spike ends. It holds the extra capacity for safety, stability, and headroom. You pay for that headroom.

The average load might look modest, but the cloud does not bill you on the average. It bills you on the resources that were provisioned to survive the worst ten minutes of the day. If your peak is ten times your baseline, your bill will reflect the peak, not the baseline.

The only defence is discipline. Keep the design lean. Keep the hops few. Keep the data local. Keep the retrieval tight. Keep the prompts short. Keep the pipeline simple. If you do not, the cloud bill will grow faster than the user base, and it will not stop until you force it to.

Serial tool calls turn your pipeline into treacle

If your workflow is LLM → tool → LLM → tool → LLM, you have built a queue, not a pipeline. Everything waits for everything else. It is the same anti‑pattern that made synchronous RPC chains painful in the early microservice era.

A queue and a pipeline look similar on a whiteboard, but they behave very differently once traffic hits them. The distinction matters, because one keeps work moving and the other forces everything to wait its turn.

A queue is a stop‑start system. Each step blocks until the previous step has finished. Nothing can overtake anything else. If one stage slows down, the entire flow backs up behind it. This is what happens when you chain LLM calls and tools in a strict sequence. The second LLM call cannot begin until the tool has replied. The tool cannot run until the first LLM call has finished. The whole thing becomes a single‑file line.

A pipeline is a flow system. Work moves through independent stages that can run at the same time. Stage one can process ithe next item while stage two handles item one. Throughput rises because the stages overlap. The system does not wait for each piece to finish before starting the next. This is how high‑volume systems stay fast even when individual steps are slow.

A queue waits for the whole journey. A pipeline hands work off and moves on.

The handoff is the key. Once a stage can pass work downstream and start the next item without waiting, you have built a pipeline, not a queue.

The problem with LLM → tool → LLM → tool → LLM is that it behaves like a queue. Every step waits for the previous one. There is no overlap, no parallelism, and no slack. One slow tool call stalls the entire chain. It is the same pattern that made synchronous RPC chains painful in early microservice designs. The system is busy, but nothing is flowing.

The lesson is simple. If you want speed, build a pipeline. If you build a queue, do not be surprised when everything crawls.

4. Orchestration overhead accumulates

Glue code, JSON wrangling, retries, fallbacks, schema checks, and all the other dull bits. Each one is tiny. Each one feels harmless. Together they slow the system more than any single model call ever will.

The overhead hides in plain sight. A few milliseconds to validate a schema. A few more to serialise a payload. A few more to deserialise it. A few more to retry a flaky call. A few more to merge two partial results. None of these steps look expensive on their own. They are not. The cost comes from the fact that you do them on every request, across every stage, under load.

This is why orchestration overhead is so deceptive. It does not arrive as one big hit. It arrives as a hundred small ones. It is death by a thousand cuts. The pipeline spends more time preparing to do work than doing the work.

The worst part is that this overhead grows with complexity. Add one more tool call, and you add one more round of serialisation. Add one more fallback, and you add one more branch to evaluate. Add one more schema, and you add one more validation pass. The system becomes a tangle of tiny chores.

This is usually where the real time goes. Not in the model. Not in the vector search. Not in the database. In the glue. In the stitching. In the invisible admin that surrounds every step. The only fix is discipline: fewer hops, fewer formats, fewer retries, fewer moving parts. The less you orchestrate, the faster everything becomes.

The model is rarely the bottleneck

Modern inference is GPU‑accelerated and heavily optimised. Your RAG stack is a distributed system full of I/O, hops, and blocking calls. Optimising the model while ignoring the pipeline is like tuning the engine while the tyres are flat. The power is there, but the car still drags.

Modern LLM inference is brutally efficient. The kernels are fused. The memory access patterns are tuned. The batching is tight. The GPUs run flat out. The model is rarely the slow part. It is the most optimised component in the entire stack, because it has to be. Vendors pour millions into shaving microseconds from calculation paths.

Your RAG pipeline is the opposite. It is a distributed system stitched together from storage calls, network hops, serialisation steps, retries, and blocking operations. Every part of it waits for something else. Every hop crosses a boundary. Every boundary adds latency. The model is a rocket engine bolted to a shopping trolley.

This is why polishing the model is the wrong instinct. You can shave 10 percent off inference time and never notice it, because the pipeline is burning that time several times over in glue code and I/O. The GPU is idle while your retriever fetches chunks. The retriever is idle while your re‑ranker waits for a schema check. The re‑ranker is idle while your orchestrator serialises JSON. The whole system is dominated by the slowest, least optimised parts.

The handbrake is the pipeline. The bonnet is the model. Shining the bonnet does not make the car move. Releasing the handbrake does. If you want real speed, you fix the hops, the queues, the blocking calls, the retries, the formats, and the orchestration. That is where the time goes. That is where the wins are.

Throughput beats single‑query latency

In a real system, throughput matters more than shaving a few milliseconds off a single request.
Throughput keeps queues short, users calm, and servers steady.
A system that flows well will always outperform a system that only looks fast in isolation.

A design that includes:

parallel retrieval
batched vector queries
cached embeddings
pre‑computed context
non‑blocking tool calls

will outrun a "fast" single‑query setup every day of the week.

Think like a backend engineer, not a demo builder.
Design for flow, not fireworks.

Evaluation must be continuous

LLM behaviour drifts. Model updates shift outputs. Data changes. Prompt templates evolve. Retrieval indexes age. Static tests decay. Continuous evaluation with real traffic patterns is the only stable approach.

LLMs are not fixed points. They are moving systems. Vendors update weights. Safety layers change. Tokenisers shift. Even subtle adjustments can alter how a model interprets a prompt or ranks retrieved context. A test that passed last month can fail today without any change in your code.

Your data is not fixed either. Documents are added, removed, rewritten, or re‑indexed. Embeddings drift as models change. Metadata grows stale. A retrieval query that once surfaced the right chunk may surface something weaker six weeks later. The index ages, and the quality of the answer ages with it.

An embedding will turn a sentence into a list of numbers where similar items end up close together.

Prompt templates evolve as well. You tweak wording. You add guardrails. You change formatting. You introduce new variables. Each change shifts behaviour in ways that are hard to predict. A small edit can ripple through the entire pipeline.

Static tests cannot keep up with this movement. They freeze expectations in time. They assume the system is stable. It is not. The tests decay because the system they measure is drifting underneath them. A green test suite can give a false sense of confidence while the live system quietly degrades.

The only reliable approach is continuous evaluation with real traffic patterns. You must measure quality under the same conditions the system actually faces: real prompts, real retrieval noise, real user phrasing, real edge cases, real load. Automated reality is required. This is the only way to detect drift early and correct it before it becomes visible to users.

The system is alive. The evaluation must be alive with it.

Guardrails must be layered

No single guardrail is enough. Combine input checks, retrieval filters, prompt constraints, output checks, and post‑processing. Each layer catches different failures. One layer alone invites outages.

Guardrails fail for different reasons. Input checks catch malformed or hostile queries, but they cannot see what retrieval will surface. Retrieval filters remove unsafe or irrelevant chunks, but they cannot stop a prompt template from mis‑framing the task. Prompt constraints shape model behaviour, but they cannot guarantee the model will obey them under stress. Output checks catch violations after the fact, but they cannot prevent the model from producing them in the first place. Post‑processing can clean up structure, but it cannot repair a fundamentally wrong answer.

Each layer has blind spots. Each layer has failure modes. Each layer protects a different part of the system. When you stack them, the gaps do not align. When you rely on one, the gaps are exposed.

This is why single‑layer safety is fragile. A lone input filter cannot stop a retrieval glitch. A lone output checker cannot stop a prompt injection. A lone prompt template cannot stop a malformed chunk. A lone retrieval filter cannot stop a model hallucination. Outages happen when one layer is asked to do the job of five.

A robust system uses layered defence:

input validation to reject malformed or hostile queries
retrieval filtering to control what context enters the model
prompt constraints to shape behaviour and reduce ambiguity
output checks to enforce structure and detect violations
post‑processing to normalise, redact, or correct

None of these layers is perfect. Together they are resilient. That is the point. Modern LLM systems fail in many small ways, not one big way. The only stable approach is to catch small failures early, often, and repeatedly across the pipeline.

The future is orchestration

The next wave is not bigger models. It is coordination across many specialised models. It is managing context across workflows. It is building predictable tool‑calling chains. LLMs are components now. The engineers who master orchestration will shape what comes next.

The era of single‑model systems is ending. One large model trying to do everything is slow, expensive, and brittle. The future is a network of smaller, focused models: one for retrieval, one for classification, one for planning, one for extraction, one for reasoning, one for generation. Each model does one job well. The value comes from how they work together.

This shift changes the engineering challenge. It is no longer about squeezing more tokens per second out of a GPU. It is about coordinating dozens of moving parts without losing context, consistency, or latency. You must track state across hops. You must pass partial results between models. You must ensure that tools are called in the right order, with the right schema, at the right time. You must keep the pipeline flowing even when individual components fail or drift.

Context management becomes a first‑class problem. You cannot rely on a single prompt to hold everything. You need shared memory, structured state, and workflow‑level constraints. You need to decide what each model should know, what it should not know, and how to hand off information cleanly. The system must behave like a team, not a monolith.

Tool‑calling becomes a discipline of its own. You need predictable chains, clear contracts, and stable interfaces. You need to design workflows that are parallel where possible, serial only where necessary, and resilient everywhere. The orchestration layer becomes the real engine of the system.

This is why the next wave belongs to engineers who understand distributed systems, workflow design, and pipeline optimisation. The models are powerful, but the power is unlocked only when they are coordinated. The future is not a bigger brain. It is a well‑run organisation of smaller brains working together.

Conclusion

Latency in LLM systems is dominated by architecture, not model speed. Most of the delay comes from retrieval hops, network boundaries, prompt expansion, and token‑level generation, so performance improves when you redesign the pipeline, not when you tweak the prompt. Once you see this, it becomes obvious that long prompts, scattered retrieval, and unnecessary round‑trips are the real cost drivers, and that reducing latency means reducing work, not asking the model to work faster.

The practical conclusion is that throughput and batching matter more than single‑query latency, retrieval must be minimised and localised, and prompts must be aggressively shortened. Systems that treat latency as an architectural problem become predictable and scalable; systems that treat it as a model problem stay slow no matter which model they plug in.

You can process the same amount of data while using fewer hops, fewer round‑trips, using fewer tokens, and making fewer retrieval calls, fewer prompt expansions, and fewer model invocations.

It is not about shrinking the task. It is about shrinking the machinery required to accomplish it.

You keep the data volume the same, but you redesign the path so the system touches that data:

fewer times
in fewer places
with fewer transformations
with fewer tokens
with fewer model calls

Same data, less orchestration. That is why latency drops.

Latency is architectural
Serial tool calls turn your pipeline into treacle
The model is rarely the bottleneck
Throughput beats single‑query latency
Evaluation must be continuous
Guardrails must be layered
The future is orchestration
Conclusion
Related Work
Table of Contents

Chat Interface to System Component

2026-04-26T00:00:00+00:00

Table of contents

Programmatic Interfaces to AI Systems

We interact with AI systems through natural language. As engineers, we are used to structured and predictable interfaces such as REST or gRPC.

AI systems do not behave like that. Their outputs are probabilistic, and this creates real challenges when we try to use them as components inside software systems.

Most current models behave like chat interfaces. What we need are models that behave like reliable parts of an application.

This article explains what is currently practical and how to build interfaces that bring AI systems closer to the expectations of software engineering.

The Challenge

Large language models (LLMs) generate text by predicting the next token. They are not rules engines, parsers, or deterministic programs.

An LLM's output is a probability distribution over the next token. The distribution depends on the prompt, any conversation history you include, the model’s internal weights, and the sampling parameters.

Even with strict instructions, the model still performs this operation:

"Select the next token that has the highest probability given the input so far."

That is probability, not logic.

The practical approach is to apply prompt constraints that reduce the likelihood of outputs that are not fit for purpose.

Prompt Constraints

An LLM may return a result that does not fit the calling side. This is a failure mode of the model.

Each of the eight layers reduces the likelihood of a specific failure mode. Together, they form a structured interface between the client code and the model.

This approach will make your code more:

predictable
grounded in the provided context
structured in both input and output
controllable through explicit constraints

Because LLMs are probabilistic, these layers cannot eliminate failure modes.

Other failure modes exist, but they are outside the scope of this section. The focus here is on the eight layers that address the most common issues.

The Eight Layers

Identity
Safety & Compliance
Capability Boundaries
Output Format
Citation Rules
RAG Grounding
Reasoning Strategy
Task Logic

1. Identity

Identity anchors the model’s role and prevents behavioural drift. Without a stable identity, the model may shift tone, adopt unintended personas, or answer outside its intended domain. This layer establishes what the model is and what it is not, providing the behavioural foundation for all the layers below.

2. Safety & Compliance

Safety and compliance constraints ensure the model minimises harmful, disallowed, or high‑risk content. This protects users, organisations, and downstream systems. It is essential for any public‑facing or regulated deployment. This helps to ensure that the model behaves within acceptable boundaries.

3. Capability Boundaries

LLMs tend to overreach. They might claim abilities they do not have or fabricate tools, APIs, or actions. This layer reduces the likelihood that the model will perform operations outside its scope. It keeps the system more honest, more predictable, and aligned with its real capabilities.

4. Output Format

Programmatic systems require structured, unambiguous, machine‑readable output. This layer enforces schemas, reduces the likelihood of format drift, and helps to ensure downstream components can reliably parse responses. It helps move the model away from a conversational agent towards a dependable software component.

5. Citation Rules

Citation rules enforce traceability and verifiability.

This layer reduces the likelihood of fabricated sources, invented URLs, and unsupported claims. This layer is essential for any system that must justify its answers or provide evidence for its statements.

6. RAG Grounding

RAG grounding ensures the model uses only the supplied context as its source of truth. It damps down hallucinations by binding the model to provided evidence. This layer is the core of retrieval‑augmented generation and is mandatory for knowledge‑grounded systems.

This approach does not eliminate hallucinations but it will reduce them.

7. Reasoning Strategy

Reasoning strategy helps to stabilise the model’s logic. It moves towards stepwise thinking, disambiguation, and evidence‑first reasoning. This layer reduces subtle reasoning errors and improves consistency across complex tasks.

8. Task Logic

Task logic governs how the model interprets and executes user instructions. It handles ambiguity, resolves contradictions, and decomposes multi‑part tasks. This layer ensures the model behaves reliably in real‑world, messy, human‑language scenarios.

The Eight Layer Stack

These eight layers form a stack where each layer protects against a different class of LLM failure:

Layer	Prevents
Identity	Drift, persona instability
Safety & Compliance	Harmful or non‑compliant output
Capability Boundaries	Overreach, fabricated abilities
Output Format	Schema breakage
Citation Rules	Unsupported claims
RAG Grounding	Hallucination
Reasoning Strategy	Faulty logic
Task Logic	Misinterpretation

Together, they create a more controlled and predictable calling-side interface to an AI system.

The Minimal Stack

For any programmatic interaction with an LLM, three layers are essential:

Identity
Capability Boundaries
Output Format

Identity prevents behavioural drift. Capability boundaries reduce the likelihood of fabricated abilities, tools, or actions. Output format constraints reduce the likelihood of schema drift, malformed JSON, and downstream parsing failures.

Drift from the required behaviour leads to calling‑side errors. These three layers reduce the likelihood of the most fundamental failure modes.

The Minimal Stack for RAG

Retrieval‑Augmented Generation (RAG) improves accuracy by supplying the model with domain‑specific and up‑to‑date information from a document store. The model uses this retrieved content to produce a grounded and human‑readable response.

RAG passes to the LLM your domain data that its answer is constrained to be based on, using the LLM's language-processing features to produce a human-friendly response. RAG reduces hallucinations and improves factual accuracy.

The minimal RAG stack consists of the three core layers, plus RAG Grounding and Citation Rules. This creates a five‑layer baseline for any RAG system.

These layers improve stability, reduce unsupported claims, and increase the reliability of the final output.

RAG Grounding ensures the model uses the retrieved content as its source of truth. Citation Rules reduce the likelihood of invented sources and unsupported statements.

RAG is required when:

accuracy matters
knowledge changes frequently
domain‑specific expertise is required
hallucinations are unacceptable
answers must be auditable
you need to integrate private or internal documents

The Minimal Stack for Public-Facing Systems

Public‑facing systems require the five‑layer RAG stack plus Safety and Compliance.

These six layers form the minimum configuration for any system exposed to real users. They address:

behavioural stability
safety
overreach damping
structured output
evidence requirements
grounding to damp down hallucinations

The Full 8 Layer Stack

The final two layers are Reasoning Strategy and Task Logic.

Reasoning strategy is required when:

the model must break problems into steps
ambiguity must be resolved before answering
shallow or shortcut reasoning would cause errors
the system must justify or stabilise its logic
you want consistent reasoning across varied prompts

This layer reduces subtle reasoning failures that grounding alone cannot address.

Task Logic is required when:

instructions are complex or multi‑part
instructions conflict or require prioritisation
tasks must be decomposed before execution
the system must handle unstructured or ambiguous input
consistent behaviour is required across varied task types

This layer helps ensure the model interprets and executes instructions correctly.

Using the Eight Layers in Code

OpenAI's API is Stateless

Note: OpenAI’s APIs are stateless by default. Each request only contains the context you explicitly send. Each text generation request is independent and stateless. Therefore, multi‑turn conversations only occur when you manually include previous messages in the request. The code below has no requirement to do this and so such a history is not present. If it was, later answers would be influenced by earlier queries and this is not required for this interaction.

With OpenAIi, you can use a conversation memory. This is possible with OpenAI features such as conversation, previous_response_id (Responses API) or the Agents SDK’s session memory.

Coding the Eight Layers

The approach here is to represent each layer as a dictionary that always has a 'role' key (set to 'system' or 'user'). The other keys are used to define a standard set of values. When passed to OpenAI's API, each dictionary is processed to build an OpenAI API-compatible dictionary which consists of just 'role' and 'content'.

'content' is constructed from the non-role values below.

We can imagine each dictionary being retrieved from a configuration store and the keys are just names for the associated value. These names enable you to discuss constraint types per layer. It is the values that become part of 'content'.

# 1. Identity Layer
    system_identity = {
        "role": "system",
        "identity": "You are a retrieval‑augmented assistant."
    }

# 2. Safety & Compliance Layer
system_safety_compliance = {
    "role": "system",

    # Core safety principles
    "no_harm": "The assistant must not provide harmful, dangerous, or abusive content.",
    "no_illegal": "The assistant must not assist with illegal activities, evasion, or wrongdoing.",
    "no_personal_data": "The assistant must not request, store, or infer personal data about real individuals.",
    "no_medical_advice": "The assistant must not provide medical, legal, or financial advice beyond what is explicitly allowed.",
    "no_sensitive_inference": "The assistant must not infer protected attributes (race, religion, health, etc.).",

    # Refusal behaviour
    "refusal_style": "If a request violates safety rules, the assistant must refuse clearly and briefly.",
    "refusal_format": "Refusals must be one sentence, factual, and non‑judgmental.",
    "refusal_no_elaboration": "Do not provide workarounds, alternatives, or detailed explanations when refusing.",

    # Compliance priority
    "compliance_overrides": "Safety and compliance rules override all other instructions, including user requests.",
    "no_conflicting_instructions": "If user instructions conflict with safety rules, follow safety rules."
}

# 3. Capability Boundaries Layer
system_capability_boundaries = {
   "role": "system",

    # Allowed capabilities
    "allowed_scope": [
        "Interpret user questions.",
        "Use ONLY the provided context for answers.",
        "Produce structured JSON according to the schema.",
        "Explain reasoning based solely on the context.",
        "Quote exact lines from the context when required."
    ],

    # Disallowed capabilities
    "disallowed_scope": [
        "Do NOT use external knowledge.",
        "Do NOT invent facts, labels, or citations.",
        "Do NOT answer questions outside the provided context.",
        "Do NOT perform tasks requiring tools, browsing, or external systems.",
        "Do NOT generate content outside the required schema."
    ],

    # Boundaries for reasoning
    "reasoning_limits": "Reasoning must be explicit but must not include hidden steps or invented logic.",

    # Boundaries for output
    "format_limits": "Output must remain within the exact schema and must not include additional fields or commentary.",

    # Boundaries for behaviour
    "no_role_shift": "The assistant must not change persona, identity, or role unless explicitly instructed by system messages."
}

# 4. Output Format Layer
system_output_format = {
    "role": "system",
    "single_line_json": "Your output MUST be a SINGLE JSON object on ONE LINE ONLY.",
    "schema": f"{schema_out}",
    "strict_structure": "The output must follow the exact schema structure with no deviations."
}

# 5. Citation / Attribution Layer
system_citation_rules = {
    "role": "system",
    "label_requirement": "Every citation MUST begin with the exact Incoming Context=\"...\" label from the source.",
    "quote_requirement": "Every citation MUST include the exact quoted line from that same context block.",
    "no_label_omission": "Do NOT omit the Incoming Context label.",
    "no_label_invention": "Do NOT invent labels.",
    "no_summarisation": "Do NOT summarise lines; quote them exactly.",
    "empty_citations_when_missing": "If the answer is not in the context, output an empty Citations section with correct structure."
}

# 6. RAG Grounding Layer
system_rag_grounding = {
    "role": "system",
    "use_context_only": "Use ONLY the provided context to answer the question.",
    "no_context_no_answer": "If the answer is not in the context, explicitly say so.",
    "multiple_valid_answers": "Multiple answers may be valid; include all that are supported by the context.",
    "context_is_authoritative": "The provided context is the ONLY source of truth.",
    "no_external_knowledge": "Do NOT use outside knowledge or assumptions.",
    "answer_must_reference_context": "All answers must be derived strictly from the context block."
}

# 7. Reasoning Strategy Layer
system_reasoning_strategy = {
    "role": "system",

    # How to reason
    "carefully_read": "First, carefully read the context and the question.",
    "identify_all": "Identify all relevant passages in the context.",
    "explain": "Explain, step by step, how those passages support your answer.",
    "explicit": "Make your reasoning explicit, but concise.",
    "no_invention": "Do not invent facts that are not in the context.",
    "honesty": "The 'reasoning' field is for developers and will be logged. Be honest and explicit.",

    # How reasoning connects to citations
    "reasoning_field": "The reasoning field must refer only to information present in the provided context.",
    "clear_explain": "Clearly explain how the quoted lines in 'citations' support the 'answer'.",
    "avoid_generic": "Avoid generic phrases like 'based on the context'; be specific about which parts matter."
}

# 8. Task Logic Layer
system_task_logic = {
    "role": "system",

    # Instruction hierarchy
    "interpretation_priority": [
        "1. Follow system instructions.",
        "2. Follow developer instructions.",
        "3. Follow user instructions.",
        "4. Follow schema and formatting rules."
    ],

    # Ambiguity handling
    "ambiguity_rules": [
        "If the question is ambiguous, identify all plausible interpretations.",
        "Choose the interpretation most directly supported by the context.",
        "If ambiguity remains, state the ambiguity explicitly in the reasoning field."
    ],

    # Multi‑part question handling
    "multi_part_rules": [
        "If the question contains multiple sub‑questions, answer each one separately.",
        "If only some sub‑questions are supported by the context, answer those and state which cannot be answered."
    ],

    # Conflict resolution
    "conflict_rules": [
        "If context passages contradict each other, cite both and explain the contradiction.",
        "If user instructions contradict system instructions, follow system instructions.",
        "If schema requirements contradict user instructions, follow schema requirements."
    ],

    # Missing‑information behaviour
    "missing_info": "If the answer is not present in the context, explicitly say so and provide an empty citations list.",

    # Strict adherence
    "no_overinterpretation": "Do not infer meaning beyond what is explicitly stated in the context.",
    "no_assumptions": "Do not assume facts, motivations, or implications not present in the context."
}

The code above is a list of named Python dictionaries.

Three additional RAG user objects are also passed (as below) that contain two additional pieces of data: 'context' and 'user_query'.

context contains the input for the RAG. It is the result of the local search that is chunked.

user_query is the prompt from the user, e.g., "are there any restrictions in this contract".

rag_user_context = {
        "role": "user",
        "label": "Context",
        "content": f"{context}"
        }

rag_user_query = {
        "role": "user",
        "label": "Question",
        "user_query": f"{user_query}"
        }

rag_user_rules = {
    "role": "user",
    "context_is_authoritative": "The assistant must treat the provided context as the ONLY source of truth.",
    "no_external_knowledge": "The assistant must not use outside knowledge or assumptions.",
    "answer_must_reference_context": "All answers must be derived strictly from the context block.",
    "no_context_no_answer": "If the answer is not present in the context, the assistant must explicitly state this.",
    "multiple_answers_allowed": "If multiple valid answers exist in the context, the assistant should include all of them."
    }

OpenAI has a specific schema for JSON object input. An object with two keys is expected 'role' and 'content'. Role is one of 'user', 'system', or 'assistant'. 'content' is assigned the result of processing each of the above user and system dictionaries with to_message.

def to_message(obj):
    role = obj.get("role", "system")

    # Build content from all non-role fields
    parts = []
    for key, value in obj.items():
        if key == "role":
            continue

        # If the value is a list, join its items
        if isinstance(value, list):
            parts.append("\n".join(value))
        else:
            parts.append(str(value))

    content = "\n".join(parts).strip()

    return {"role": role, "content": content}

Before calling OpenAI, all of the objects above are added to a list.

messages = [
        to_message(system_identity),  # Layer 1
        to_message(system_safety_compliance),  # Layer 2
        to_message(system_capability_boundaries),  # Layer 3
        to_message(system_output_format),  # Layer 4
        to_message(system_citation_rules),  # Layer 5
        to_message(system_rag_grounding),  # Layer 6
        to_message(system_reasoning_strategy),  # Layer 7
        to_message(system_task_logic),  # Layer 8

        # User context + question
        to_message(rag_user_context),
        to_message(rag_user_query),
        to_message(rag_user_rules)  # optional but recommended
    ]

A list of processed layers makes contraining the actions of the LLM trivial. If you need a new layer you create a new dictionary and add it to the list, as above.

The list is then passed to build_params.

def build_params(input=None, messages=None):
    params = {'model': 'gpt-5.4-nano'}
    if input is not None:
        params['input'] = input
    if messages is not None:
        params['messages'] = messages

    return params

build_params ensures we target the same model each time.

open_ai_query calls OpenAI's API. The python code calls a wrapper like this to supply the messages list.

json_ai_user_result = open_ai_query(build_params(input=messages))

open_ai_query is:

def open_ai_query(params):
    # Without a valid key, this code will not work
    client = OpenAI(api_key='<your key>') # Substitute your OpenAI API key here

    params['input'] = clean_input(params['input'])

    response = client.responses.create(**params)

    params['output_text'] = response.output_text
    params['response'] = str(response)
    params['date'] = datetime.now().isoformat()

    return params['output_text']

The call to OpenAI is the line client.responses.create(**params). The value params is passed in unpacked (**params) to provide dictionary keys as function parameters. This is a convenient way of specifying what should be passed to OpenAI.

params then has a number of other keys and values assigned. This is to support traceability.

Supporting traceability will be discussed in a future article. LLM calls require more than logging and observability. They require traceability, especially when decisions are made based on LLM output. Our systems need to be able to show which model was called, when, what the reasoning was, what result was gained, and any chain of LLM calls. Logging and observability alone do not do this.

open_ai_query relies on clean_input which is simply this:

def clean_input(model_input):
    try:
        return codecs.decode(model_input, "unicode_escape")
    except:
        return model_input # return what is given as best-effort.

        # Escape sequences may affect your results due to model tokenisation

Increasing the number of instructions per layer

As the system prompt grows, each instruction carries less relative influence. The model processes all tokens uniformly, so important constraints can lose emphasis when surrounded by a large volume of text. Long prompts also make it harder for the model to infer priority and can hide small contradictions between layers. Clear ordering and explicit priority rules help reduce this effect.

Instruction Collisions

When multiple layers contain overlapping or conflicting instructions, the LLM must resolve the conflict using the text alone. The final system message ithat it sees takeis precedence, but subtle inconsistencies can weaken the intended behaviour. Ensuring that layers do not contradict each other and that priority is stated explicitly reduces this risk.

Conclusion

LLMs Require Structured Interfaces

LLMs do not behave like deterministic software components. They generate tokens based on probability, which means natural‑language prompts alone are not a stable or reliable interface.

Layered Constraints Improve Reliability

A layered constraint model is necessary to reduce common failure modes. Identity, Capability Boundaries, and Output Format form the minimal stack for programmatic use. RAG systems require additional grounding and citation layers. Public‑facing systems require safety controls. Full reasoning systems benefit from all eight layers.

RAG Provides Essential Grounding

RAG supplies the model with domain‑specific and current information. It reduces hallucinations and improves factual accuracy, but it still requires constraints to ensure the model uses retrieved content correctly.

Prompt Length and Consistency Matter

As system prompts grow, individual instructions lose emphasis. Clear ordering and explicit priority rules help maintain consistent behaviour. Avoiding contradictory instructions is essential for predictable output.

Failure Modes Can Be Reduced, Not Removed

LLMs remain probabilistic. Constraints reduce the likelihood of errors but cannot eliminate them. Treating the prompt as a structured interface, rather than a single instruction, produces more predictable, testable, and maintainable systems.

What Tech Executives Need to Know About Working With LLMs

2026-04-26T00:00:00+00:00

Table of contents

In working with LLMs, the software engineering industry is at an observation stage. AI does not require business as usual but a fundamental change in approach. This article is aimed at those who manage software engineers so that they are aware of the massive benefits and huge pitfalls and exposure that AI can bring.

What Tech Executives Need to Know About Working With LLMs

1. LLMs Are Not Deterministic Components

LLMs generate probabilistic outputs, not rule‑based results. Identical inputs can produce different outputs. This unpredictability must be managed with controls. It cannot be assumed away.

2. LLMs Introduce New Failure Modes

LLMs can hallucinate facts, invent sources, drift from schemas, or claim abilities they do not have. They can produce confident but incorrect reasoning. Traditional QA does not cover these risks.

3. RAG Changes Risk, It Does Not Remove It

RAG improves factual grounding but adds new dependencies. Retrieval quality, document governance, citation accuracy, and context integrity all affect system behaviour. The data pipeline becomes part of risk management.

4. Compliance Exposure Is Direct and Material

LLM outputs can violate data protection laws, sector regulations, copyright rules, safety standards, and consumer protection laws. Because outputs vary, violations can occur without warning. LLM output is regulated content.

LLM output is considered regulated output because, once it leaves the model and enters your organisation’s systems, it becomes functionally indistinguishable from any other content your company produces. Regulators do not care that it was generated by an LLM. They care about its effects.

5. Statutory Liability Extends Beyond the Model

Liability arises from incorrect outputs, harmful content, decisions made using LLM results, missing audit trails, and weak oversight. The organisation, not the LLM vendor, carries the exposure.

6. Governance Must Be Built Into the Architecture

Systems must include identity constraints, capability boundaries, output format rules, grounding controls, citation rules, safety layers, audit logs, and drift monitoring. Governance is a technical requirement, not a policy document.

7. Evaluation Requires a Dedicated Function

Evaluation must cover schema checks, grounding fidelity, safety tests, reasoning quality, adversarial probing, and drift tracking. This work is continuous and specialised. It cannot be handled ad‑hoc by developers.

8. Vendor Models Do Not Remove Responsibility

Using a third‑party model does not transfer risk. Your organisation is responsible for outputs, data handling, integration behaviour, and controls. Outsourcing the model is not outsourcing the risk.

9. LLM Systems Must Be Treated as Regulated Infrastructure

LLMs influence decisions, customer interactions, internal processes, and public content. They must be governed like any regulated system with clear controls, auditability, and oversight.

10. Strategic Direction: Build Capability, Not Experiments

Executives should invest in controlled architectures, evaluation teams, compliance‑aligned processes, clear ownership of AI risk, continuous monitoring, and safe scaling. LLM adoption is an organisational capability, not a series of pilots.

Conclusion

LLMs introduce technical, operational, and regulatory risks that cannot be managed through normal development practices. Their behaviour is probabilistic, their failure modes are unique, and their outputs carry direct compliance and statutory exposure. The organisation must respond with structured controls, continuous evaluation, and clear ownership.

Actions for Tech Executives

Treat LLMs as high‑risk components that require strict controls.
Mandate architectural layers for identity, boundaries, and format.
Require governance of the retrieval pipeline in all RAG systems.
Classify all LLM output as regulated content with compliance review.
Establish audit trails, traceability, and runtime enforcement.
Create a dedicated AI evaluation team with ongoing responsibility.
Integrate legal, risk, and compliance into the development lifecycle.
Do not rely on vendors for safety or liability protection.
Govern LLM systems like regulated infrastructure, not experiments.
Invest in long‑term capability: controlled architecture, monitoring, and safe scaling.

Take Away

LLM adoption is not a feature. It is an organisational commitment that requires governance, evaluation, and cross‑functional oversight. These actions are the minimum required to deploy AI systems safely and responsibly at scale.

What Tech Executives Need to Know About Working With LLMs
Conclusion
- Actions for Tech Executives
- Take Away
Related Work
Table of Contents

Transforming Your Business for AI

2026-04-26T00:00:00+00:00

Table of contents

AI adoption is no longer a technical experiment. It is an organisational transformation that affects safety, compliance, cost, and long‑term operating discipline. The organisations that succeed will be those that treat AI systems as engineered pipelines, not magical components.

This article sets out the practical steps required for your business to adopt AI can deploy it safely, predictably, and economically.

Establish Clear Executive Mandates

Transformation begins with leadership. Executives must set non‑negotiable expectations that shape how AI is designed and governed.

AI systems must be predictable, observable, and auditable.
Safety controls must sit outside the model and must be layered.
Retrieval, context assembly, and orchestration must be treated as core infrastructure.
Prompts must be treated as logic: reviewed, and versioned.
Costs must be controlled through architectural discipline, not vendor optimism.
Continuous evaluation must be mandatory across all AI products.

These mandates create the conditions for responsible and sustainable adoption.

Build Teams Around Measurement and Control

AI systems drift. Retrieval ages. Prompts evolve. Costs rise silently. Teams must therefore measure the system continuously.

Track retrieval quality and data freshness.
Measure latency across the entire pipeline, not only the model call.
Monitor token usage and prompt length.
Record orchestration overhead and network hops.
Detect behavioural drift through ongoing evaluation.
Break down cloud costs by retrieval, orchestration, and inference.

Measurement is the foundation of control. Without it, the system will behave in ways that leadership cannot see or influence.

Redesign Processes for Probabilistic Systems

Traditional software processes assume deterministic behaviour. AI systems do not behave this way. Processes must therefore change.

Introduce continuous evaluation pipelines that mirror real user traffic.
Add retrieval monitoring to detect index drift and stale data.
Review prompts as code, with structure, clarity, and version control.
Test safety layers against varied phrasing, not only ideal cases.
Add cost reviews that examine token budgets and retrieval patterns.
Expand incident response to include retrieval logs, template expansions, and decoding parameters.

These processes ensure that AI systems remain stable and compliant as they evolve.

Enforce Architectural Principles That Reduce Risk

AI performance, safety, and cost are determined by architecture, not by model choice. Leaders must enforce principles that keep systems lean and predictable.

Treat latency as an architectural issue.
Minimise retrieval hops and keep data local where possible.
Keep prompts short, structured, and purposeful.
Treat context windows as scratchpads, not memory.
Avoid serial tool chains that behave like queues.
Reduce orchestration complexity, because overhead accumulates.
Ensure safety is enforced through deterministic layers, not persuasion.

These principles reduce operational risk and prevent cost escalation.

Introduce Governance That Matches the Scale of the Risk

AI requires governance that is as rigorous as the systems it influences. Leaders must introduce structures that ensure accountability and oversight.

Create a cross‑functional AI governance board.
Establish prompt governance for clarity, consistency, and auditability.
Introduce retrieval governance to manage data quality and access control.
Build a safety governance framework with layered controls.
Implement cost governance that enforces architectural discipline.
Add model update governance to detect behavioural drift before deployment.

Governance ensures that AI systems remain aligned with organisational standards and regulatory expectations.

Prepare the Organisation for Cultural Change

AI transformation is not only technical. It changes how teams think, design, and operate.

Encourage teams to treat AI as infrastructure, not novelty.
Promote clarity, structure, and discipline in all AI‑related work.
Train teams to understand probabilistic behaviour and drift.
Build shared language around safety, compliance, and cost.
Align colleague incentives with long‑term reliability, not short‑term output.

Culture determines whether AI becomes a strategic asset or a source of risk.

Focus on Business Outcomes, Not Model Features

The value of AI lies in outcomes, not in model specifications. Leaders must ensure that AI investments support measurable business goals.

Improve decision quality through structured retrieval and controlled outputs.
Reduce operational cost through efficient orchestration.
Strengthen compliance through observability and audit trails.
Enhance customer trust through predictable behaviour.
Increase resilience through layered safety and disciplined design.

AI becomes transformative when it is aligned with business priorities.

Conclusion

Transforming a business for AI requires clear mandates, disciplined measurement, new processes, strong architecture, and rigorous governance. The organisations that succeed will be those that treat AI systems as engineered pipelines, that design for predictability and auditability, and that recognise that the true challenges lie not in the model, but in the machinery that surrounds it. This is a leadership challenge as much as a technical one, and it demands clarity, discipline, and long‑term thinking.

Establish Clear Executive Mandates
Build Teams Around Measurement and Control
Redesign Processes for Probabilistic Systems
Enforce Architectural Principles That Reduce Risk
Introduce Governance That Matches the Scale of the Risk
Prepare the Organisation for Cultural Change
Focus on Business Outcomes, Not Model Features
Conclusion
Related Work
Table of Contents

What AI Is (and Isn't)

2026-04-26T00:00:00+00:00

Table of contents

We have all read the articles about our AI future: "AI will take your job".

This article takes a different path to explain AI clearly, simply, and honestly.

A Straightforward Definition of AI

AI software learns patterns from lots of examples. Once it has been exposed to those patterns, it can create new text.

When you ask something like "What is the weather going to do in Glasgow tomorrow?", the AI does not read the sentence the way a human does. Instead, it turns your words into numbers,

Using these, the AI programming looks for relationships in the sentence. Words like "weather," "tomorrow," and "Glasgow" stand out because they are the important parts of your question.

Next, the AI uses the data it was trained on (the examples) to statistically evaluate what your question is about. It does not "understand" the way people do, it just recognises patterns it has seen before.

To create an answer, the AI predicts what should come next, one token at a time. A token might be a word, part of a word, or punctuation. The AI chooses the most likely next token based on patterns in its training data.

This statistical selection can look like reasoning, but it is really pattern‑matching. If the AI was never trained on weather‑related information, it would not be able to give you a good answer. There would be no tokens on which to base its output.

Because weather changes constantly, the AI system accesses real weather data from an external source. This is how it can give you an accurate, up‑to‑date forecast instead of basing its output on general Glasgow weather.

Finally, the AI program puts everything together: your question, the patterns it has learned, the conversation so far, and the real weather data, to generate the output you see.

But is it Intelligent?

AI might sound intelligent, but it does not have consciousness, intentions, or real understanding. It does not know things or have opinions. All the AI program is doing is recognising patterns in data and using those patterns to produce output.

When an AI responds, it is not thinking or wanting anything; it is just following statistical cues from the data it was previously shown.

AI can be incredibly powerful, but it is still just a tool. It does not think or decide things on its own. It can only work with the patterns and data it has been given.

The value of AI comes from how people choose to use it, not from any independent ability or intention.

When you type a message on your phone and it suggests the next word, your phone is not thinking. The program in your phone is suggesting a good possible next word based on patterns it has seen before. AI works the same way, just on a much larger scale.

AI predicts what could reasonably come next in a sentence, an image, or an answer, using patterns learned from huge amounts of training data. AI can be incredibly helpful, but it is still predicting based on patterns, not understanding the world. Without the huge amounts of data, AI would have no patterns to base an answer on.

Now that we have covered how AI works, here is what it can actually do well.

What AI Is Good At

As AI is programmed to find patterns in huge amounts of data, an AI can easily take long documents and turn them into shorter versions, based on patterns that produce clearer text.

AI is great for drafting emails, rewriting paragraphs, producing variations, or helping with early versions of content.

When the topic is something it has seen many examples of (such as a question about the weather), AI can give fast, reliable answers.

And the vast amount of data AI is trained on means AIs are great at classification, translation, sorting, and extracting key details from text. AIs have seen so many examples, their statistical prediction can appear like it has vast knowledge. But an AI is only selecting a statistical match.

AI is good at giving options, exploring possible approaches, and speeding up early‑stage work. But, AI still needs human judgement to decide whether what has been produced is of any value.

There are also clear limits that are important to understand.

What AI Is Not Good At

AI recognises patterns, not ideas. AI does not understand what you type or what it outputs.

If your question is vague, emotional, or depends on context only humans share, AI often predicts incorrectly. Such a response is the AI program selecting an incorrect prediction based on its statistics.

AI cannot weigh consequences, values, ethics, or trade‑offs. It can only follow patterns in data. As it does not understand in the human sense, AI cannot perform judgement. Judgement requires intent, values, responsibility, and lived experience. AI has none of these.

However, AI can simulate judgement extremely well because it has access to vast patterns of expert reasoning, it can structure arguments, and it can select options based on criteria you give it. But this is not judgment. It is pattern-based statistical selection without understanding.

AI can remix and generate new combinations, but it does not have taste, purpose, or a point of view.

Anything involving physical experience, social cues, or human behaviour is outside its reach. If you say, "My car has a flat tyre," a person knows that the car cannot be driven safely, that to fix it you will need tools and that the fix is inconvenient and messy.

An AI has never changed a tyre. It does not know weight, effort, or danger. It only has access to what people have written about flat tyres.

An AI can describe the steps to fix the flat (as a person has written about this in the past and this writing is in the training data), but AI does not understand the situation.

An AI has no lived experience, so it can miss things a person might notice. If someone says, "I brought a bottle of wine to the dinner," a person knows this is a polite gesture. AI does not know social customs, it only has access to training data about customs written by a person.

Your AI does not know anything

AI can sound confident even when it is completely mistaken, because it does not know what it does not know.

If you ask for restaurant recommendations in a town that does not exist, some AIs may still try to answer, giving you incorrect information as the town does not exist.

When an AI lacks information, it cannot feel uncertainty or recognise gaps the way people do, so it simply produces the most plausible‑sounding answer based on the patterns it currently has access to.

An AI might confidently state that Venus has two moons, or invent a law that does not exist or describe an imaginary species as if it were real. Because AI never checks facts or senses its own limits, its pattern‑filling behaviour leads to "hallucinations," where the AI creates details, sources, or events that sound right but are not true.

If the training data is thin, biased, or missing, the output will be unreliable, no matter how polished the output looks.

If you ask an AI about something that barely exists in its training data — say, "What dishes are served at the Spring Feast in Millford Glen?", the AI will not calculate that the place or event is fictional.

With nothing solid to draw from, the AI's program uses loose patterns and produces something that only sounds right, like "They usually serve herb stew and blossom cakes." The answer feels plausible, but it is really just the AI making a poor prediction because the information is too thin.

The Biggest Misconceptions About AI

Many people believe AI thinks, understands, or decides in the way a person does, but this is not the case. AI does not grasp meaning, hold values, or judge situations. It only reflects patterns in the material it was trained on.

Another misconception is that AI has reliable knowledge about everything. When information is scarce, it often fills the gaps with predictions that sound believable but are not accurate. AI has access to vast data stores. AI has no knowledge, just data and a program to spot patterns.

People also assume AI is neutral, yet it inherits the biases and assumptions present in its training data. Some imagine AI as a step toward consciousness, but it has no awareness or sense of self. It is a powerful tool, but still a tool, and it must be used with a clear understanding of its limits.

How to Use AI Safely and Effectively

Using AI safely and effectively starts with treating it as a helpful assistant rather than an authority. It works best when you give it clear instructions, specific goals, and enough context to guide the response.

It is important to check the information it provides, especially when accuracy matters, because it can sound confident even when it is mistaken.

AI is strongest when you use it to explore ideas, draft material, summarise information, or speed up routine tasks, while keeping final judgement for yourself.

AI can boost your creativity, improve your productivity, and help you think in new ways, as long as you stay aware of its limits and verify anything that needs to be correct.

What to Keep in Mind About AI

AI recognises patterns but does not understand meaning.
It predicts what should come next based on data it has seen.
It is strong at summarising, drafting, sorting, and exploring ideas.
It struggles with judgement, context, emotions, and real‑world experience.
It can sound confident even when it is wrong.
It works best when you guide it, check its output, and stay in control.

A Simple Mental Model to Remember

Think of AI as a very capable assistant that is excellent at helping you create, explore, and organise ideas, but one that still needs you to guide it and check its work.

AI is powerful but not magical. It recognises patterns but does not understand. You get the best results when you guide it, check its work, and stay in control.

A Straightforward Definition of AI
But is it Intelligent?
What AI Is Good At
What AI Is Not Good At
Your AI does not know anything
The Biggest Misconceptions About AI
How to Use AI Safely and Effectively
What to Keep in Mind About AI
A Simple Mental Model to Remember
Related Work
Table of Contents

What software engineers need to know about LLMs

2026-04-25T00:00:00+00:00

Table of contents

Large language models (LLMs) are disrupting the software engineering industry. Executives and software engineers now have a tool at their disposal that is so general in its scope that it can be dedicated to almost any task. LLMs are the ultimate "jack of all trades". It is our job to get the most from them.

The real interface: tokens, not text

Tokens shape what you can build. They decide how much context you can fit in, how fast the model responds, and how predictable the output is.

Token boundaries also change how the model interprets structure. Two prompts that look identical to you may tokenize differently and produce different behaviour.

When you design prompts, AI input or output schemas, or retrieval pipelines, you are really designing token flows. If you ignore tokens, you end up shipping features that behave one way in tests and another way in production.

Prompt A: "Summarize the user login flow."

Prompt B: "Summarise the user login flow."

To a human, the difference is not consequential. To a tokenizer, there is a critical difference.

"Summarize" and "Summarise" break into different token sequences.

The model’s internal statistics for each spelling differ.

The model may shift tone, structure, or level of detail.

And downstream formatting can change because the token pattern changed.

Prompt A: "List the steps to deploy the service."

Prompt B: "List the steps to deploy the service ."

The only difference is a space before the full-stop.

Prompt A ends with a single token for "service."

Prompt B ends with two tokens: "service" and "."

That tiny shift can change the model’s prediction path.

The model is not the system

Most failures blamed on models usually come from everything wrapped around them. In practice, the weak points look very familiar to any engineer who has shipped a distributed system.

Retrieval pipelines drift because indexes age, embeddings shift, and data freshness is rarely monitored. A model can only answer the question you actually retrieved, not the one you meant to retrieve.

Prompt templates collapse under odd inputs because they are often treated as static strings instead of executable logic. One unexpected newline or a missing field can break the entire chain of reasoning. Data freshness and data cleansing is key here.

Guardrails

Guardrails miss edge cases because they rely on pattern matching, not semantic guarantees. A single unhandled phrasing can bypass a rule that looked airtight in testing.

Imagine you build a guardrail that blocks requests containing "delete all users". It works in tests, so you ship it.

Then a real user sends: "can you delete all the users" or "please delete every user" or "remove all user accounts"

Your guardrail only catches the exact phrase it was written for. It matches strings, not meaning. One phrasing slips through, and the model executes a path you thought was protected.

Many guardrails end up acting like string comparisons even when they use embeddings or classifiers. They match surface patterns, not intent. If the phrasing shifts, the guardrail often fails.

For example, a rule might block "delete all users" because that exact pattern was seen during testing. But the same system may allow "remove every user account" because the embedding distance is just far enough to slip past the threshold.

This is the same failure mode as brittle input validation. If your rules depend on matching specific strings or narrow patterns, you get a system that behaves safely in tests and unpredictably in production.

You cannot solve this by telling the model “if a request is like 'delete all users', refuse to do it”. That feels intuitive, but it fails for the same reason input‑validation-by-string-match fails in any other system.

A prompt can describe the rule, but it cannot enforce the rule. The model will try to follow the instruction, but it has no semantic guarantee. It can still be persuaded, confused, or bypassed by a phrasing it has not seen before.

To actually solve this, you need layered controls outside the model:

Treat the model as untrusted. Never let it directly execute destructive actions. Put a permission layer between the model and anything irreversible.
Normalise user input before it reaches the model. Collapse phrasing, remove fluff, and classify intent. This gives you a stable signal instead of raw text.
Use a separate classifier or rules engine to detect dangerous intent. This component should be simpler, more predictable, and easier to test than the model itself.
Require explicit confirmation for destructive operations. The model can propose an action, but a deterministic system must approve it.
Log every step. When something slips through, you need to see the input, the normalised form, the classification result, and the model’s output.

The prompt can express the policy, but the system must enforce it. If you rely on the model alone, you are depending on pattern matching. If you build a layered pipeline, you get behaviour you can reason about, test, and trust.

Observability

Observability is weak because most systems log the request and the response, but not the context, the retrieval set, the template expansion, or the decoding parameters. When working with LLMs, without the context, retrieval set, template expansion and parameter decoding, debugging is guesswork.

An LLM is at the centre of a much larger system

The LLM is only one component. The system around it decides whether your product behaves like a tool or a slot machine. Engineers who treat the whole pipeline as a software system, not a magic box, build the reliable systems.

Determinism is a design choice

LLMs are probabilistic, but stability is possible. Temperature and top‑p control variance. Structured outputs reduce drift. Deterministic decoding is often more reliable than clever prompts. Treat randomness as a resource you allocate.

Temperature stretches or compresses the probability distribution. Top‑p chops off the tail of the distribution.

Temperature

As temperature increases, the LLM becomes more willing to pick lower‑probability tokens, which effectively means the "token candidate set" gets larger.

More accurately, low‑probability tokens get boosted, high‑probability tokens get flattened.

This means: the model is less confident, more tokens become available, and he sampling process has more room to explore. The next token is drawn from a wider effective set

Top-p

Top‑p (also called nucleus sampling) restricts the model to sampling only from the smallest set of tokens whose cumulative probability is ≥ p.

Think of it as a probability mass cutoff.

Example

Suppose the model predicts the next‑token distribution like this:

Token	Probability	Cumulative
A	0.40	0.40
B	0.25	0.65
C	0.15	0.80
D	0.10	0.90
E	0.05	0.95
F	0.05	1.00

Sorted by probability, cumulative mass builds like this:

A → 0.40 A+B → 0.65 A+B+C → 0.80 A+B+C+D → 0.90 A+B+C+D+E → 0.95 A+B+C+D+E+F → 1.00

Now apply top‑p:

top‑p = 0.5

Working down the ordered Probability column abov, we include tokens until the probability is cumulatively ≥ 0.5. Token A + B are allowed as they are the first tokens for whom the cumulative probability is ≥ 0.5. Once the condition is satisfied, we stop descending the column.

With top-p = 0.5, only tokens A and B are allowed.

For top‑p = 0.8

Include tokens until cumulative ≥ 0.8 → A + B + C. Only A, B, C are allowed.

top‑p = 0.95

Include tokens until cumulative ≥ 0.95 → A + B + C + D + E. Tokens A to E allowed; F is excluded.

When top‑p = 1.0

No restriction — all tokens allowed.

Passing temperature and top-p to OpenAI

In calling OpenAI, you can pass this:

{
  "model": "gpt-4.1",
  "messages": [
    { "role": "user", "content": "Explain temperature and top-p." }
  ],
  "temperature": 0.0,
  "top_p": 1.0
}

The last two fields directly control the sampling behaviour.

You are telling the model:

"Always pick the highest‑probability token. No randomness."

This is the closest thing to true determinism.

With temperature set to 0.0, the highest‑probability token is guaranteed to be selected, as long as the decoding method is greedy and no other randomness is introduced by the API or framework.

In an LLM, the decoder is the component that turns the model’s probability distribution into tokens.

Even with temperature equal to 0.0, top‑p could still exclude the highest‑probability token. For example, if the highest‑probability token is outside the top‑p nucleus (rare but possible with unusual distributions), the decoder would be forced to pick a different token. The nucleus is the group of tokens built cumulatively above.

Temperature = 0.0 and top_p = 1.0 is the strictest, safest deterministic configuration.

Context windows are not memory

AI vendors such as Anthropic and OpenAI control the LLM's window size, but you control how effectively you use it.

OpenAI's GPT‑5.4 has a 1,050,000‑token context window. GPT‑5.2, GPT‑5.1, and GPT‑5.1 Codex Max have 400,000‑token windows.

The window size is fixed at training time. Changing it requires retraining or re‑architecting the model, which only the vendor can do.

The vendor sets the ceiling. You decide how close you get to it. A 1M‑token window sounds like "great, I can dump everything in." But that is the wrong mental model.

The engineer decides:

how much of the window to fill
how aggressively to compress
how to structure retrieval
how to order information
how to avoid interference
how to budget tokens across system prompts, instructions, schemas, and retrieved docs

The vendor gives you the maximum. You determine the effective window.

A large window looks powerful, yet it behaves nothing like a bigger RAM module. The more of the window you use and the larger your use becomes, the model has to scan and reconcile far more information than it can reliably use. The signal‑to‑noise ratio drops, and the model starts leaning on familiar statistical patterns instead of the details that matter.

Position inside the window matters more than the raw size. Early and late tokens are not treated equally, and different models weight them differently. There is no guarantee that the most recent content is the content the model will use. This is why long prompts often ignore the last instruction you added.

Large windows also increase interference. When you pack in too much material, similar concepts begin to blur. Two sections that look distinct to you can collide inside the model’s internal representation. The output feels vague or inconsistent even though the inputs look clean.

Retrieval quality beats window size

This is why retrieval quality beats window size. Retrieval gives you control over what enters the window and where it goes. A large window without retrieval is just a bigger bucket. A smaller window with good retrieval is a structured workspace.

Retrieval here is any form of data retrieval that is performed before being sent to the LLM. This may be the result of a classic RAG pipeline where a local search of a document store is performed and the results chunked before being passed to the LLM that is instructed to restrict its analysis to the uploaded search data.

But retrieval here is more general than RAG. It refers to the smart selection of data for an LLM to process. Retrieval may bring data back from a SQL, Graph or NoSQL query, or it may be the smart selection of summaries or user's notes pulled from storage.

The opposite of retrieval is dumping everything in raw.

The most reliable mental model is to treat the window as a scratchpad. It is a temporary working area, not a knowledge store. You place only what the model needs for the current task, in the order that helps it reason. If you treat the window like long‑term memory, you get unpredictable behaviour. If you treat it like a scratchpad, you get control.

LLMs compress patterns, not facts

When an LLM is trained, the input training data will be measured in terabytes. The output is billions of weights that encode the statistical structure of the training data. Those weights are the model es the weights: patterns (common sequences, phrasing, structures, and correlations); relationships (semantic similarity, analogies); generalisation behaviour (moving between examples via statistical interpolation); and task-relevant transformations to assist with instruction following, data formatting. and conversational norms.

LLMs do not store data; they are not databases. They store weights that represent patterns from the training data.

Many different training examples can be represented internally by the same (or very similar) set of weights.

As different examples can be represented by the same weights, LLMs have a tendancy to hallucinate. Hallucinations are baked into the design of LLMs.

Training takes terabytes of text and produces billions of updates into a fixed‑size model and outputs the weights that approximates the training data.

In doing this the transformation is many‑to‑one (different examples collapse together), and irreversible as you cannot reconstruct the originl training data from the weights. But, more importantly, the output is statistical as the weights encode likelihoods, not facts.

Because of this, the model cannot store exact information. It can only store patterns.

Where patterns overlap, details are lost. Where details are lost, the model fills in the gaps.

That filling‑in is what we call hallucination. The many-to-one transformation also explains why rare facts vanish and plausible but false details appear.

A fluent answer is not necessaily a correct one. A fluent answer should not be over-trusted.

An LLM is not a database or lookup table. They are function approximators trained on vast data, forced to compress it into a limited parameter space (weights), and optimised for prediction, not truth.

Prompting is programming

Prompts act like programs for a probabilistic interpreter. And as they are written in natural language, prompts are prone to the mistakes that humans make in written instructions: ambiguity, no being explicit on what is required; not stating what is not required; and failing to mention who the output is for.

Structure beats style so that you can be sure your prompt acts more like a foundation for a robust interface, rather than one without structur built on shifting sand.

Constraints

Constraints beat persuasion. Constraining your LLM is essential. It is not about "being firm" with the model. It is about shaping the space of valid outputs so the model cannot wander.

In a prompt, when you say:

"Please answer carefully.”;"Try not to hallucinate.”;"Make sure you follow the instructions.”; "Be precise."

You are appealing to behaviour the model cannot guarantee, because persuasion relies on the model choosing to comply. "Please answer carefully" is a request. The LLM should "try not to hallucinate". What if it does? You have not said. This is like neglecting to define an else on an if.

Persuasion is weak because it competes with every other pattern the model has learned.

Constraints, by contrast, reshape the output space.

A constraint is something that reduces the degrees of freedom the model has when generating.

Examples of constraints are having the prompt specify that the LLM must output its result using a schema or specifying a role with explicit boundaries such as a 'user', 'system', or 'assistant' or by specifying the LLM "must cite X before Y".

Instead of trying to "convince" the model to behave, you damp down as close to zero as possible the possibility of misbehaviour.

Schemas beat prose. Treat prompts as code and debug them as code. Systems behave better when you design prompts as logic, not decoration.

Conclusions

Tokens drive model behaviour, so any dependable LLM system must be engineered around token‑level effects rather than surface text; the fragile parts of the stack are the retrieval, templates, guardrails, and data plumbing wrapped around the model, not the model itself; guardrails only become reliable when enforced by deterministic system logic instead of relying on the model’s cooperation; observability must reveal every transformation in the pipeline to make failures diagnosable; context windows function as short‑lived workspaces rather than any form of memory; retrieval quality has a larger impact on correctness than window size; hallucination is an unavoidable consequence of pattern compression and must be mitigated through system design rather than trust; and prompting only becomes stable when treated as programming with explicit constraints instead of attempts at persuasion.

The real interface: tokens, not text
The model is not the system
Determinism is a design choice
Temperature
Top-p
LLMs compress patterns, not facts
Prompting is programming
Constraints
Conclusions
Related Work
Table of Contents

A Beginner's Guide to AI Chatbot Prompting

2026-04-22T00:00:00+00:00

Table of contents

A Beginner’s Guide to AI Chatbot Prompting

This guide gives beginners a clear, practical foundation for working with AI chatbots. Each section focuses on one skill, why it matters, and how to apply it.

1. What Prompting Is and Why It Matters

Prompting is the skill of giving clear instructions to a chatbot so that you are more likely to get a useful response.

Good prompts will reduce confusion and save you time. A poor prompt can waste time as you work you way through an answer that does not hit the spot.

Example:

Vague: "Explain photosynthesis"
Clear: "Explain photosynthesis in simple terms for a 12‑year‑old"

If you try these you will see that the second one is a completelt different response from the first. It is more direct and easier to read.

2. Start With a Direct Request

A simple, explicit request sets the direction.

Examples:

"Write a short summary of this article"
"Give me three ideas for a birthday message"
"Explain how this code works"

With the short summary prompt, startinmg on a new line, pase in the article you are referring to.

3. Add Context to Aim the Response

Context helps the chatbot match your level, purpose, or constraints.

Examples:

"I am new to London, UK. Explain what I can do on a wet Sunday."
"I am preparing for a job interview. Give me sample questions."

London, UK is specified to keep the prompt clear as there are many places in the world called London. How many?

"Give the total number of places in the world called London, no variants. List the names"

4. Specify the Format You Want

Format guides structure and makes the output easier to use.

Examples:

"Give me a bullet‑point list"
"Write a short paragraph"
"Produce a step‑by‑step explanation"

5. Set Clear Constraints

Constraints keep the answer focused and predictable.

Examples:

"Keep it under 150 words"
"Use plain English"
"No jargon"
"Be concise"

6. Use Examples to Anchor Tone and Style

Examples show the chatbot what "good" looks like.

Example:

"Write it in the style of this: 'Short, direct, and practical.'"

7. Adjust Over Time Instead of Restarting

Treat the chatbot as a collaborator. Adjust the output rather than rewriting the whole prompt.

Examples:

"Shorten this"
"Make it more formal"
"Add one more example in the first paragraph"

8. Ask for Alternatives When You Need Options

Variations help you compare and choose.

Examples:

"Give me two more options"
"Rewrite this with a friendlier tone"

9. Break Complex Tasks Into Steps

Step‑by‑step prompting keeps large tasks managoeable.

AI chatbots are pattern matching. If your prompt is long, the AI may appear to skip something you say as it does not have a strong pattern to match to it.

Example:

"First, outline the structure. Then we will fill in each section."

10. Common Mistakes to Avoid

Being too vague
Asking for everything at once
Forgetting to specify the audience
Not having the AI give examples
Expecting perfect output on the first try

11. Quick Prompt Templates

These templates give learners a starting point that you can adapt.

Explain Something

"Explain [topic] to [audience] in [format]. Keep it [constraints]."

"Explain beaches to a 10 year-old in one pargraph. Keep it positive and clear."
"Explain beaches to an adult in one pargraph. Keep it positive and clear."
"Explain beaches."

Rewrite Something

"Rewrite this text to be more [tone]. Keep the meaning the same."

"Give first line of Pride and Prejudice by Jane Austen."
"Rewrite using corporate speak. Keep the meaning the same but push the buzzwords to 11."

Generate Ideas

"Give me [number] ideas for [goal]. Keep them practical."o

"Give me 5 ideas for walking down the sidewalk. Keep them practical."

Troubleshoot

"I am seeing this issue: [a detailed description]. Give me possible causes and simple steps to check."

"I am seeing this issue: my grass is too yellow. Give me possible causes and simple checks to check."

12. Practice Prompts

Use these to build confidence and develop prompting habits.

"Explain how a mortgage works as if I am new to finance."
"Give me three ways to describe my job in a CV. I have pasted my CV."
"Summarise the following paragraph in one sentence."
"Suggest improvements to this email without changing the intent."

A Beginner’s Guide to AI Chatbot Prompting
Related Work
Table of contents

How to Evaluate A Company's AI Claims

2025-01-01T00:00:00+00:00

Table of contents

How to Evaluate Claims Made About an AI-based System

Introduction

Artificial intelligence now appears in many areas of daily life. It is used in search engines, writing tools, customer service systems, healthcare applications, and many other services. Many people encounter it without thinking about it, such as when a phone suggests a reply to a message or when an ecommerce website summarises customer feedback about a product.

Public descriptions of systems based in part or whole on AI often highlight ambitious capabilities. Some describe their products as human level, fully autonomous, or capable of replacing expert judgement.

Promotional language and real performance do not always align, which makes it useful to look closely at how such claims are formed.

Understanding the Claim

The first step is to understand what is actually being promised.

Many statements about artificial intelligence are broad or ambiguous, so it is useful to translate them into specific questions. A claim such as "our tool detects fraud" sounds clear, but it raises many questions about what kind of fraud, in what context, and with what level of accuracy.

Many people begin by considering what task the system is meant to perform, under what conditions it is expected to work, how well it performs that task, and what it is being compared against. Once the claim is expressed in concrete terms, it becomes much easier to evaluate.

Looking for Evidence

Claims about performance usually rest on some form of evidence. A credible statement about artificial intelligence is supported by clear information about how the system was tested.

Independent evaluations, published research, recognised benchmarks, and real world trials all provide meaningful support. For example, a reading comprehension benchmark or a driving simulation can show how a system behaves under controlled conditions. By contrast, phrases such as "industry leading accuracy" or "our internal tests show excellent results" offer very little without further detail.

Reliability often depends on who carried out the measurement and how the testing was designed.

Considering the Data

Every artificial intelligence system depends heavily on the data used to train it.

The quality, diversity, and representativeness of that data shape the system’s strengths and weaknesses. A photo classifier trained mostly on daytime images may struggle with night scenes, and a language tool trained mainly on formal writing may find slang or informal messages difficult to interpret.

When assessing a claim, it is worth asking whether the data reflects the real world situations in which the system will be used. Narrow or unrepresentative data can limit how well the system performs in real situations.

Recognising Limitations

All systems have limitations, and responsible companies acknowledge them.

It is helpful to look for information about situations where the system performs poorly, where it may misinterpret inputs, or where it may produce incorrect or misleading results. A voice assistant that mishears a request because of background noise is a simple example of how small changes in context can affect performance.

Balanced descriptions usually include both strengths and known limitations.

Avoiding Human-like Descriptions of AI

Marketing language sometimes presents artificial intelligence in ways that resemble human thinking.

Words such as "understands", "reasons", or "knows" can create an impression that the system possesses abilities it does not have. A system that predicts the next word in a sentence may appear to "understand" the topic, but it is following patterns rather than forming ideas.

A more accurate approach is to focus on what the system actually does, how it processes inputs, how it generates outputs, and how it behaves under different conditions.

Seeking Independent Validation

Independent evaluations often provide a clearer picture of how a system performs.

When researchers, regulators, journalists, or external auditors have examined a system, their findings provide a valuable counterbalance to promotional material.

Real world deployment is equally important. A navigation app may work perfectly in a staged demonstration, but everyday use can involve roadworks, poor signal, or unexpected detours that reveal weaknesses.

Genuine reliability is shown through consistent performance with diverse users and unpredictable inputs.

Considering the Consequences of Error

It is important to consider the consequences of error. Some tasks are low risk, while others involve significant personal, financial, or social impact.

A system used for entertainment can tolerate occasional mistakes. A music recommendation that misses the mark is usually harmless.

A system used for medical advice, financial decisions, or legal interpretation requires far stronger evidence and clear safeguards. A symptom checker that offers an overly confident suggestion illustrates how errors can matter more in high stakes settings.

The impact of errors can vary widely, so the way a system handles mistakes often shapes how it should be used.

The Importance of Transparency

Transparency and accountability are essential qualities.

Companies who provide clear explanations, publish evaluation results, describe limitations, and offer channels for feedback demonstrate a commitment to responsible practice.

Greater transparency makes it easier to understand how a system works and how its results should be interpreted. For example, a tool that explains which factors influenced a recommendation gives users a clearer sense of how to interpret the output.

A Practical Way to Judge a Claim

These themes often lead people to consider questions about what is being promised, what evidence supports it, and how the system behaves in real conditions.

It is useful to ask what is being promised, what evidence supports the promise, who carried out the evaluation, what data was used, what limitations are acknowledged, whether the system has been tested independently, how it performs outside controlled demonstrations, and what the consequences are if it fails.

This is a long list, but systems powered in some way by artificial intelligence are becoming more common and tehy are having a larger impact on everyday life.o

If we are all better placed to evaluate AI-based systems, the better.

If several of these questions cannot be answered, any claim is possibly likely to be overstated.

Conclusion

Artificial intelligence is a powerful set of technologies, but it is not magic.

Careful consideration and evaluation makes it easier to distinguish genuine progress from exaggerated claims.

How to Evaluate Claims Made About an AI-based System
Related Work
Table of Contents

Phroneses.com

Why Junior Engineers Matter More as AI Expands

The Adaptation of the Junior Engineer in an AI‑Accelerated Profession

The Changing Weight of the Work

AI Introduces New Types of Failure

The Organisational Obligation

Emerging Responsibilities

Failure‑Mode Literacy

Evaluating LLM output

Schema reliability

Instruction adherence

Grounding fidelity

Deterministic stability

Compliance and Safety

Creation vs Integration

The Apprenticeship Model Returns

A New Path to Seniority

The Cultural Shift

Practical First Steps for Juniors

Practical First Steps for Leaders

The Evolving Value of the Junior Engineer

Final Thoughts

Related Work

Table of Contents

When Urgency is High but Progress is Slow

When urgency rises faster than progress

Before You Adopt AI in Engineering, Answer These Five Questions

Executive Summary

What This Is Not

The Problem in One Sentence

AI Adoption Maturity Model

Stage 0 — Experimentation

Stage 1 — Unmanaged Individual Use

Stage 2 — Team‑Level Awareness

Stage 3 — Organisational Alignment

Stage 4 — Integrated AI Engineering

Stage 5 — Organisational Redesign

Common Misdiagnoses

Five Essential Questions for Engineering and Executive Leadership

1. What AI use already exists, and which maturity stage does it actually represent?

2. Where does AI reduce cognitive load or cycle time for whole teams, not just individuals?

3. What controls, review steps, and boundaries are required at our current stage?

4. Which organisational foundations must be strengthened before we can safely move to the next stage?

5. How will leadership set expectations and pace adoption so it matches our capacity to absorb change?

Leadership Imperative

If You Only Do One Thing

Related Work

Further Reading

Agents Cannot Maintain Systems: The Additive–Transformative Gap in LLM Software Delivery

The Promise of Automated Software Delivery

What the Labs Have Actually Delivered

Why is this?

Persistent state creates temporal dependencies

Writing code to Agentic Systems: The Fundamental Gap

Producing a PR‑ready diff (the section in question)

What can I do?

Why this matters: code is cheap, judgement is not

Final Thought

Related Work

Table of Contents

Further Reading

When Code Is Cheap, Judgement Matters More

SDD Is a Symptom, not a Methodology

What is new

SDD Surfaces When Teams Confront Ambiguity

Write a spec, get the code for free?

The Limits of the "Spec → Code" Argument

Well engineered code cannot be seen

Juniors are Not Doomed

When Code Becomes Cheap

Related Work

Table of Contents

Further Reading

The Missing Structure Agile Cannot Fix

Agile Is Not Enough: Delivery Is a Network

1. Agile’s Place in the Structure

2. What Agile Actually Covers

3. The Delivery Network

4. Why Agile Cannot Fix Structural Problems

5. What Agile Does Not Cover