<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Phroneses.com</title><link href="https://phroneses.com/" rel="alternate"></link><link href="https://phroneses.com/feeds/all.atom.xml" rel="self"></link><id>https://phroneses.com/</id><updated>2026-05-27T00:00:00+00:00</updated><entry><title>Why Junior Engineers Matter More as AI Expands</title><link href="https://phroneses.com/articles/build/notes/why-junior-engineers-matter-more.html" rel="alternate"></link><published>2026-05-27T00:00:00+00:00</published><updated>2026-05-27T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-05-27:/articles/build/notes/why-junior-engineers-matter-more.html</id><summary type="html">&lt;p&gt;Junior engineers evolve toward judgement, verification, and system awareness as AI absorbs the mechanical act of coding.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="the-adaptation-of-the-junior-engineer-in-an-aiaccelerated-profession"&gt;The Adaptation of the Junior Engineer in an AI‑Accelerated Profession&lt;/h1&gt;
&lt;p&gt;The landscape has shifted. AI can generate code at a pace that would have been
unthinkable a few years ago, but speed is not the work.&lt;/p&gt;
&lt;p&gt;Speed cannot decide what should exist, why it matters, or whether it is safe.
The belief that a junior can lean on AI and bypass the discipline is a
misreading of the craft.&lt;/p&gt;
&lt;p&gt;Early‑career engineers are needed more than ever because the judgement required
to guide, verify, and constrain AI now sits at the centre of the role.&lt;/p&gt;
&lt;p&gt;The junior position is not disappearing. It is being reshaped. AI has lowered
the cost of producing code, but it has raised the cost of understanding what
that code means. The work has not become smaller; it has become sharper, with
an additional focus.&lt;/p&gt;
&lt;p&gt;The organisations that recognise this early will keep their engineering
discipline intact. The ones that do not will discover that AI exposes
weaknesses in thinking faster than they can respond.&lt;/p&gt;
&lt;h2 id="the-changing-weight-of-the-work"&gt;The Changing Weight of the Work&lt;/h2&gt;
&lt;p&gt;Typing has never been the job. It was simply the visible part of it. The real
work — analysis, verification, risk thinking, system reasoning, and safety —
has always carried the weight. AI accelerates the mechanical layer and exposes
the cognitive one. Juniors now meet the deeper parts of the discipline sooner,
and the expectations rise accordingly.&lt;/p&gt;
&lt;p&gt;This shift is not cosmetic. It is economic. When code becomes cheap,
correctness becomes expensive. The cost of a faulty assumption, a missed
constraint, or a silent failure grows. The value of the junior engineer lies in
their ability to prevent these errors before they harden into production.&lt;/p&gt;
&lt;h3 id="ai-introduces-new-types-of-failure"&gt;AI Introduces New Types of Failure&lt;/h3&gt;
&lt;p&gt;When using an LLM in a pipeline, AI introduces new categories of failure:
output-level instability, and behavioural-level instability.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Output-level Instability&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;LLMs are non-deterministic, probability machines.&lt;/p&gt;
&lt;p&gt;Because of this schema drift, hallucinations, and silent truncation of results,
can all ocur. The junior staff member will need to develop skills in detecting
and handling all of these. These are changes in the way the LLM might respond
to your system so your calling system must be robust to such variety.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Behavioural-level Instability&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Across multiple LLM calls, even if the shape of the output result is the same,
the behaviour of the LLM may change internally.&lt;/p&gt;
&lt;p&gt;Given an identical prompt, "Extract the customer’s job title", and the same
input, "My name is Helen and I work as a senior analyst at JPMG", the first
call may return "senior analyst", the second may return "analyst", and the
third may return "Senior Analyst".&lt;/p&gt;
&lt;p&gt;In this case, all data passed to the LLM (the prompt and the input) and the
output schema (a string in each case) remain the same. However, a change in
the LLM’s internal behaviour has produced different outputs. Juniors need to
be attuned to this possibility and know how to address it.&lt;/p&gt;
&lt;h2 id="the-organisational-obligation"&gt;The Organisational Obligation&lt;/h2&gt;
&lt;p&gt;None of this works if organisations cling to the old model. Juniors cannot
develop judgement in an old environment optimised for throughput. They need
structured mentorship, slower reviews, and the psychological safety to test
their reasoning.&lt;/p&gt;
&lt;p&gt;Juniors need decision‑rights that are clear, not implied. Decision-rights are
an understanding between the junior and their colleagues on what they can decide
for themselves, and what they cannot and must seek input to resolve.&lt;/p&gt;
&lt;p&gt;Juniors need leaders who understand that judgement is not taught by accident.&lt;/p&gt;
&lt;p&gt;If the system does not adapt, the junior cannot.&lt;/p&gt;
&lt;h2 id="emerging-responsibilities"&gt;Emerging Responsibilities&lt;/h2&gt;
&lt;p&gt;The adapted junior role becomes more investigative and more integrative. The
work stretches across definition, verification, safety, and coherence.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Problem framing becomes central. Before any code is generated, the junior
  and their team must be clear on what the business is trying to achieve.&lt;/li&gt;
&lt;li&gt;Constraint recognition grows in importance. Boundaries, risks, and
  compliance obligations must be surfaced early.&lt;/li&gt;
&lt;li&gt;AI‑guided exploration replaces manual iteration. The junior evaluates
  options rather than producing them from scratch.&lt;/li&gt;
&lt;li&gt;Verification discipline becomes essential. Plausible output is not enough.
  It must be correct, safe, and aligned with intent. AI can generate as much code
  as you want. But is it the right code? Determining whether generated code is the
  right code is part of the junior's role, supported by their team, the development
  process and wider engineering leadership.&lt;/li&gt;
&lt;li&gt;Integration awareness develops sooner. Systems fail at the seams, not in
  isolation. The junior must develop skills to be aware of this and build
  solutions that are hardened to failure.&lt;/li&gt;
&lt;li&gt;Operational literacy becomes expected. Logs, metrics, observability, and
  incident handling enter the junior toolkit.&lt;/li&gt;
&lt;li&gt;Documentation clarity gains weight. Decisions must be legible and
  reproducible. "The AI did it" is not a defence.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Should your organisation invoke an LLM as part of a processing pipeline,
token-level reasoning becomes a topic that needs addressing. Even with an
identical prompt, an LLM's internal behaviour may vary unless steps are taken
to constrain &lt;em&gt;temperature&lt;/em&gt;, &lt;em&gt;top-p&lt;/em&gt;, and &lt;em&gt;top-k&lt;/em&gt;. However, even if these values
are set to 0, 0, and 1 respectively (so that the LLM chooses the
highest-probability next token), the quality of the response may decrease. This
decrease is due to multiple factors: the LLM becoming overly literal when
processing the prompt, and becoming less robust to ambiguous input. The LLM may
fail on a task requiring synthesis or nuance as these require variety over the
next token, not always the highest‑probability one.&lt;/p&gt;
&lt;p&gt;These responsibilities demand human judgement. AI cannot supply it.&lt;/p&gt;
&lt;h2 id="failuremode-literacy"&gt;Failure‑Mode Literacy&lt;/h2&gt;
&lt;p&gt;Engineering maturity is measured by how you handle failure, not how quickly you
produce output. Juniors must learn to read failure modes: what breaks, why it
breaks, and how the system behaves under stress.&lt;/p&gt;
&lt;p&gt;This is where judgement is forged.&lt;/p&gt;
&lt;h2 id="evaluating-llm-output"&gt;Evaluating LLM output&lt;/h2&gt;
&lt;p&gt;Both output-level and behaviour-level instability require your junior to learn
the discipline of evaluating model behaviour, not just observing it.&lt;/p&gt;
&lt;p&gt;LLM output must be tested for schema reliability, instruction adherence,
grounding fidelity, and deterministic stability.  Behaviour must be measured
over time so that drift is detected early rather than discovered in production.&lt;/p&gt;
&lt;p&gt;Evaluation becomes part of the junior role because correctness is now the
expensive part of the work. AI accelerates your ability to produce code, so
humans must strengthen verification.&lt;/p&gt;
&lt;p&gt;Juniors often see AI‑generated artefacts first, which means they become the
first line of defence against drift, hallucination, and structural failure.&lt;/p&gt;
&lt;p&gt;The junior role is not shrinking, it is moving closer to the centre of the
system.&lt;/p&gt;
&lt;h2 id="schema-reliability"&gt;Schema reliability&lt;/h2&gt;
&lt;p&gt;Schema reliability is the stability of the output structure across calls. It
asks whether the model returns the same shape every time. A reliable schema
preserves field names, nesting, ordering, and types. When schema reliability is
weak, downstream systems break: parsers fail, validators reject output, and
silent truncation corrupts results. Juniors must learn to detect when the
structure shifts, even subtly, because structural instability will cause
production failure.&lt;/p&gt;
&lt;h2 id="instruction-adherence"&gt;Instruction adherence&lt;/h2&gt;
&lt;p&gt;Instruction adherence is the model’s ability to follow the constraints it was
given. It measures whether the output respects required fields, forbidden
content, formatting expectations, safety constraints, and domain‑specific
rules. Weak adherence produces plausible but incorrect output that appears
compliant but violates intent. Juniors must learn to test adherence explicitly,
because LLMs often drift away from constraints under load, ambiguity, or long
contexts.&lt;/p&gt;
&lt;h2 id="grounding-fidelity"&gt;Grounding fidelity&lt;/h2&gt;
&lt;p&gt;Grounding fidelity is the degree to which the model’s output remains anchored
to the provided context, data, or retrieval results. High fidelity means the
model stays within the evidence; low fidelity means it fabricates, embellishes,
or substitutes. This is the core defence against hallucination. Juniors must
learn to check whether each claim in the output can be traced back to a source.
Without grounding fidelity, correctness becomes guesswork and organisational
risk increases.&lt;/p&gt;
&lt;h2 id="deterministic-stability"&gt;Deterministic stability&lt;/h2&gt;
&lt;p&gt;Deterministic stability is the consistency of the model’s behaviour under
identical conditions. It measures whether repeated calls with the same prompt,
same context, and same parameters produce meaningfully similar results.
Instability here signals deeper behavioural drift: model updates, sampling
variance, context‑window rollover, or upstream nondeterminism. Juniors must
learn to monitor this stability because unpredictable behaviour, even within a
fixed schema, undermines trust in the system.&lt;/p&gt;
&lt;p&gt;Once evaluation becomes routine, the next layer of responsibility emerges.
Understanding how AI‑driven behaviour interacts with organisational risk,
regulation, and safety boundaries becomes a concern.&lt;/p&gt;
&lt;h2 id="compliance-and-safety"&gt;Compliance and Safety&lt;/h2&gt;
&lt;p&gt;AI introduces new liabilities. Licensing, data handling, regulatory
expectations, model hallucinations, and architecture all sit inside the
junior’s world now.  The business must help them to learn to recognise unsafe
output and understand the organisational risk attached to it. Secure by default
is no longer a slogan; it is a habit.&lt;/p&gt;
&lt;p&gt;Once an LLM becomes part of your production pipeline, it represents a
system-level reliability concern. Junior colleagues will need to understand
retrieval hops, orchestration cost, and architectural latency.&lt;/p&gt;
&lt;h2 id="creation-vs-integration"&gt;Creation vs Integration&lt;/h2&gt;
&lt;p&gt;Many teams still confuse "using a chatbot to generate new code" with "running
an LLM inside a production pipeline". These are not the same problem: the
former accelerates creation, while the latter introduces system‑level
reliability concerns that juniors must learn to evaluate.&lt;/p&gt;
&lt;p&gt;But even chatbot‑generated code is not free. It must still be evaluated to
answer the question: "is adding this code into our system the right thing to
do?"&lt;/p&gt;
&lt;p&gt;The distinction matters because both activities demand judgement, but pipeline
integration demands system‑level reasoning and reliability awareness.&lt;/p&gt;
&lt;h2 id="the-apprenticeship-model-returns"&gt;The Apprenticeship Model Returns&lt;/h2&gt;
&lt;p&gt;AI compresses the early stages of skill acquisition because the novice to
intermediate gap is mostly about knowledge access, pattern exposure, and basic
scaffolding.&lt;/p&gt;
&lt;p&gt;A novice must learn vocabulary, syntax, idioms, and the shape of common
solutions ("house rules"). An LLM can supply this information instantly: it
provides examples, explanations, and templates on demand. This removes much of
the friction that traditionally slows early progress, so with AI the distance
between novice and intermediate shrinks.&lt;/p&gt;
&lt;p&gt;But the intermediate to senior gap is not reduced, because seniority is not a
knowledge problem. It is a judgement problem formed through apprenticeship:
pairing, review, reflection, and exposure to real events on real systems under
real constraints.&lt;/p&gt;
&lt;p&gt;Senior engineers develop taste, trade‑off literacy, failure intuition, and a
sense of responsibility for long‑term consequences. These abilities cannot be
acquired through text prediction alone. They come from lived experience with
real systems, real failures, and real organisational pressures.&lt;/p&gt;
&lt;p&gt;AI accelerates learning, but senior judgement is produced by responsibility,
constraint, and lived experience. These are conditions that AI cannot inhabit.
The craft remains intact because the essence of mastery is grounded in practice
shaped by real systems, real failures, and real organisational pressures, not
by information alone.&lt;/p&gt;
&lt;p&gt;Juniors must learn the difference between additive work (generating new code),
and transformative work (modifying existing systems). To transform an existing
system &lt;em&gt;safely&lt;/em&gt; requires judgement. Your organisation will need to support your
junior colleague in developing that judgement given your company's unique
codebase, infrastructure and culture.&lt;/p&gt;
&lt;h2 id="a-new-path-to-seniority"&gt;A New Path to Seniority&lt;/h2&gt;
&lt;p&gt;Seniority emerges from judgement, not keystrokes. The route to senior for the
junior shifts toward structure, risk, and operational thinking.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Architecture literacy develops earlier. Patterns and constraints become
  part of daily reasoning.&lt;/li&gt;
&lt;li&gt;Risk thinking becomes foundational. Engineers learn to anticipate failure
  and design for recovery.&lt;/li&gt;
&lt;li&gt;Review competence shifts from syntax to structure. The question becomes:
  does this code make sense?&lt;/li&gt;
&lt;li&gt;Operational competence becomes core. Observability and incident handling
  help to shape judgement.&lt;/li&gt;
&lt;li&gt;Decision clarity becomes a differentiator. Seniors articulate reasoning,
  not just outcomes.&lt;/li&gt;
&lt;li&gt;Cross‑functional communication grows in importance. Complexity must be
  translated into clarity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Juniors are ideally placed to contribute to AI-augmented team processes:
reviewing AI-generated artefacts, maintaining team-level shared understanding,
and helping to ensure coherence across accelerated workflows.&lt;/p&gt;
&lt;p&gt;The work becomes less about producing code and more about shaping the conditions
in which code can be trusted.&lt;/p&gt;
&lt;h2 id="the-cultural-shift"&gt;The Cultural Shift&lt;/h2&gt;
&lt;p&gt;High‑pace environments often reward noise. AI accelerates that tendency. But the
teams that thrive will be the ones that reward clarity instead. Juniors need a
culture that values slow thinking at the right moments, not constant motion.&lt;/p&gt;
&lt;p&gt;Expectations of juniors will vary depending on the AI‑maturity of your
organisation.&lt;/p&gt;
&lt;p&gt;In low‑maturity environments, juniors are forced to compensate for weak
processes, unclear decision‑rights, and inconsistent use of AI.&lt;/p&gt;
&lt;p&gt;In high‑maturity environments, juniors grow faster because the system around
them is stable: prompts are versioned, retrieval is predictable, evaluation is
routine, and model updates are treated as engineering events. The culture
determines whether AI becomes an accelerant for judgement or a multiplier of
confusion.&lt;/p&gt;
&lt;h2 id="practical-first-steps-for-juniors"&gt;Practical First Steps for Juniors&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Learn to articulate intent before touching a tool.  &lt;/li&gt;
&lt;li&gt;Practise verifying AI output with suspicion and skepticism, not trust.  &lt;/li&gt;
&lt;li&gt;Build small systems and observe how they behave under load.  &lt;/li&gt;
&lt;li&gt;Document decisions as if someone else must rely on them.  &lt;/li&gt;
&lt;li&gt;Study failure modes; they teach more than success ever will.  &lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="practical-first-steps-for-leaders"&gt;Practical First Steps for Leaders&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Define decision‑rights explicitly. What can a junior decide for themself? &lt;/li&gt;
&lt;li&gt;Slow down reviews to create space for reasoning.  &lt;/li&gt;
&lt;li&gt;Pair juniors with seniors intentionally, not incidentally.  &lt;/li&gt;
&lt;li&gt;Treat AI as an accelerator, but only within well‑understood and defined boundaries.  &lt;/li&gt;
&lt;li&gt;Build a culture where clarity is rewarded and noise is not.  &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AI is a tool. How can you best use that tool to help the junior do their best
work? AI is not a replacement for the junior but an assistant.&lt;/p&gt;
&lt;h2 id="the-evolving-value-of-the-junior-engineer"&gt;The Evolving Value of the Junior Engineer&lt;/h2&gt;
&lt;p&gt;Juniors become force multipliers. They use AI to explore the solution space,
stress‑test assumptions, and verify generated artefacts. They learn system
thinking earlier and contribute meaningfully sooner. But only if the
organisation supports them.&lt;/p&gt;
&lt;p&gt;Ask not what your junior can do for you — ask what you can do for your junior.&lt;/p&gt;
&lt;h2 id="final-thoughts"&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;Engineering is not being erased. It is being reweighted. Humans decide what
should exist, why it matters, and whether it is safe. AI writes the code. The
profession continues to evolve, but its centre of gravity remains the same:
judgement, clarity, and the ability to read systems before safely changing
them.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="agents-cannot-maintain-systems.html"&gt;LLMs can generate code, but they cannot modify or maintain systems because system‑level work requires causal reasoning, not pattern‑matching.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="engineers-need-to-know.html"&gt;Software engineers must understand tokens, structure, and probabilistic behaviour to build reliable systems and avoid mismatches between test and production behaviour.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="evaluate-ai.html"&gt;Evaluating AI systems requires measuring real behaviour — schema reliability, adherence, drift, latency, retrieval quality, and safety — not synthetic benchmarks.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="/articles/build/notes/agents-cannot-maintain-systems.html"&gt;LLMs can generate code, but they cannot modify or maintain systems because system‑level work requires causal reasoning, not pattern‑matching.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="/articles/build/notes/ai-engineering-must-be-team-based-to-see-significant-roi-for-engineers.html"&gt;The real gains from AI come from improving the shared work between engineers — planning, coordination, review, debugging, and delivery — not from speeding up individual coding.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="/articles/build/notes/software-engineers-need-to-know.html"&gt;Software engineers must understand tokens, structure, and probabilistic behaviour to build reliable systems and avoid mismatches between test and production behaviour.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#the-adaptation-of-the-junior-engineer-in-an-aiaccelerated-profession"&gt;The Adaptation of the Junior Engineer in an AI‑Accelerated Profession&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#the-changing-weight-of-the-work"&gt;The Changing Weight of the Work&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#ai-introduces-new-types-of-failure"&gt;AI Introduces New Types of Failure&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-organisational-obligation"&gt;The Organisational Obligation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#emerging-responsibilities"&gt;Emerging Responsibilities&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#failuremode-literacy"&gt;Failure‑Mode Literacy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#evaluating-llm-output"&gt;Evaluating LLM output&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#schema-reliability"&gt;Schema reliability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#instruction-adherence"&gt;Instruction adherence&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#grounding-fidelity"&gt;Grounding fidelity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#deterministic-stability"&gt;Deterministic stability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#compliance-and-safety"&gt;Compliance and Safety&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#creation-vs-integration"&gt;Creation vs Integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-apprenticeship-model-returns"&gt;The Apprenticeship Model Returns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-new-path-to-seniority"&gt;A New Path to Seniority&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-cultural-shift"&gt;The Cultural Shift&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#practical-first-steps-for-juniors"&gt;Practical First Steps for Juniors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#practical-first-steps-for-leaders"&gt;Practical First Steps for Leaders&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-evolving-value-of-the-junior-engineer"&gt;The Evolving Value of the Junior Engineer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#final-thoughts"&gt;Final Thoughts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="build"></category></entry><entry><title>When Urgency is High but Progress is Slow</title><link href="https://phroneses.com/articles/leadership/notes/when-urgency-is-high.html" rel="alternate"></link><published>2026-05-26T00:00:00+00:00</published><updated>2026-05-26T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-05-26:/articles/leadership/notes/when-urgency-is-high.html</id><summary type="html">&lt;p&gt;A clear view of why leaders feel rising ambiguity and how structured judgement restores clarity without leadership abstractions.&lt;/p&gt;</summary><content type="html">&lt;h1 id="when-urgency-rises-faster-than-progress"&gt;When urgency rises faster than progress&lt;/h1&gt;
&lt;p&gt;Leaders often find themselves in a situation where urgency keeps increasing but
progress does not follow. The pace is high, the pressure is real, yet the work
feels harder to move forward. This is not a failure of intent. It is a sign
that the operating conditions around the leader have shifted in ways that are
not immediately visible.&lt;/p&gt;
&lt;p&gt;Do you recognise this in your own environment? The symptoms are familiar:
unclear ownership, AI‑driven noise, delivery friction, and teams struggling to
make sound decisions at speed. These pressures do not call for more effort or
inspiration. They call for structure, judgement, and operating clarity that can
be applied tomorrow.&lt;/p&gt;
&lt;p&gt;The thinking behind phroneses is built for this reality. It treats leadership as
a system: decision‑rights, flow, constraints, and the conditions that allow
teams to move with confidence when complexity rises. This is not a framework or
a slogan. It is a way of seeing the organisation that makes the next step
clearer and the work easier to lead.&lt;/p&gt;
&lt;p&gt;When leaders adopt this way of thinking, the effect is immediate. Noise reduces.
Decisions sharpen. Ownership becomes clearer. Progress becomes steadier because
the system becomes easier to understand and easier to shape.&lt;/p&gt;
&lt;p&gt;As this clarity strengthens, the role of leadership becomes clearer too. The
energy shifts from reacting to pressure toward creating the conditions that
allow teams to thrive. That is where your real leverage sits, and where you
will have the most impact.&lt;/p&gt;</content><category term="leadership"></category></entry><entry><title>Before You Adopt AI in Engineering, Answer These Five Questions</title><link href="https://phroneses.com/articles/leadership/notes/five-questions.html" rel="alternate"></link><published>2026-05-24T00:00:00+00:00</published><updated>2026-05-24T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-05-24:/articles/leadership/notes/five-questions.html</id><summary type="html">&lt;p&gt;Most organisations think they are maturing in AI, but their workflows tell a different story. These five questions give engineering leaders a clear, stage‑aligned way to understand their real maturity and scale AI safely.&lt;/p&gt;</summary><content type="html">&lt;h1 id="executive-summary"&gt;Executive Summary&lt;/h1&gt;
&lt;p&gt;AI is already reshaping your delivery workflows, whether you see it or not.
If you do not lead it, it will reshape them badly. This article gives executives
a stage‑aligned diagnostic to identify their real maturity, expose hidden risks,
and steer AI adoption with intent rather than drift.&lt;/p&gt;
&lt;hr/&gt;
&lt;h1 id="what-this-is-not"&gt;What This Is Not&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Not a hype piece&lt;/li&gt;
&lt;li&gt;Not a vendor framework&lt;/li&gt;
&lt;li&gt;Not a technical guide&lt;/li&gt;
&lt;li&gt;Not a generic AI playbook&lt;/li&gt;
&lt;li&gt;Not a promise of productivity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is a leadership instrument for understanding and directing AI adoption.&lt;/p&gt;
&lt;hr/&gt;
&lt;h1 id="the-problem-in-one-sentence"&gt;The Problem in One Sentence&lt;/h1&gt;
&lt;p&gt;Most organisations believe they are progressing in AI; their workflows show they
are still in unmanaged use.&lt;/p&gt;
&lt;hr/&gt;
&lt;h1 id="ai-adoption-maturity-model"&gt;AI Adoption Maturity Model&lt;/h1&gt;
&lt;p&gt;Curiosity → Ad‑hoc → Uncoordinated → Stabilisation → Integration → Reconfiguration&lt;/p&gt;
&lt;p&gt;Each stage includes:
- Stage signal: what you see
- Failure mode: what breaks if you stay here
- Leadership responsibility: what executives must do&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="stage-0-experimentation"&gt;Stage 0 — Experimentation&lt;/h2&gt;
&lt;p&gt;Stage signal: Small groups test AI tools in isolation; nothing links to delivery.&lt;br/&gt;
Failure mode: No patterns survive; no organisational learning occurs.&lt;br/&gt;
Leadership responsibility: Do not mistake curiosity for capability. If you stay
here, AI adoption will happen without you.&lt;/p&gt;
&lt;h2 id="stage-1-unmanaged-individual-use"&gt;Stage 1 — Unmanaged Individual Use&lt;/h2&gt;
&lt;p&gt;Stage signal: Engineers use AI daily but invisibly; quality drifts; no review.&lt;br/&gt;
Failure mode: Shadow workflows reshape delivery without oversight.&lt;br/&gt;
Leadership responsibility: Surface usage and risk before anything scales. If you
stay here, quality and security will drift invisibly.&lt;/p&gt;
&lt;h2 id="stage-2-teamlevel-awareness"&gt;Stage 2 — Team‑Level Awareness&lt;/h2&gt;
&lt;p&gt;Stage signal: Teams feel friction: uneven output, duplicated prompts, unclear fixes.&lt;br/&gt;
Failure mode: Teams believe they are maturing; leaders believe it even more.&lt;br/&gt;
Leadership responsibility: Establish boundaries and shared expectations. If you
stay here, teams will burn time managing friction instead of delivering.&lt;/p&gt;
&lt;h2 id="stage-3-organisational-alignment"&gt;Stage 3 — Organisational Alignment&lt;/h2&gt;
&lt;p&gt;Stage signal: Workflows stabilise; AI review stages and documentation improve.&lt;br/&gt;
Failure mode: Premature scaling without observability or constraints.&lt;br/&gt;
Leadership responsibility: Standardise workflows and measure impact. If you stay
here, AI will outgrow your controls.&lt;/p&gt;
&lt;h2 id="stage-4-integrated-ai-engineering"&gt;Stage 4 — Integrated AI Engineering&lt;/h2&gt;
&lt;p&gt;Stage signal: AI is a system component with constraints, observability, governance.&lt;br/&gt;
Failure mode: Drift and quality collapse if leadership attention drops.&lt;br/&gt;
Leadership responsibility: Maintain discipline; treat AI as infrastructure.&lt;/p&gt;
&lt;h2 id="stage-5-organisational-redesign"&gt;Stage 5 — Organisational Redesign&lt;/h2&gt;
&lt;p&gt;Stage signal: Processes, roles, and flow reshape around AI‑accelerated work.&lt;br/&gt;
Failure mode: Redesign without stability leads to chaos.&lt;br/&gt;
Leadership responsibility: Rebuild systems deliberately, not reactively.&lt;/p&gt;
&lt;hr/&gt;
&lt;h1 id="common-misdiagnoses"&gt;Common Misdiagnoses&lt;/h1&gt;
&lt;p&gt;Executives repeatedly misread their organisation’s maturity in predictable ways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Mistaking Stage 1 for Stage 3  &lt;/li&gt;
&lt;li&gt;Mistaking individual speed for organisational capability  &lt;/li&gt;
&lt;li&gt;Mistaking experimentation for adoption  &lt;/li&gt;
&lt;li&gt;Mistaking friction for progress  &lt;/li&gt;
&lt;li&gt;Mistaking tool usage for system change  &lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If any of these appear familiar, your organisation is exposed to silent quality
drift, security risk, and delivery incoherence.&lt;/p&gt;
&lt;hr/&gt;
&lt;h1 id="five-essential-questions-for-engineering-and-executive-leadership"&gt;Five Essential Questions for Engineering and Executive Leadership&lt;/h1&gt;
&lt;p&gt;These questions are the diagnostic. If you cannot answer one cleanly, you are
not at the stage you think you are.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="1-what-ai-use-already-exists-and-which-maturity-stage-does-it-actually-represent"&gt;1. What AI use already exists, and which maturity stage does it actually represent?&lt;/h2&gt;
&lt;p&gt;Stage signal:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;0–1: Usage is invisible, individual, unreviewed&lt;/li&gt;
&lt;li&gt;2: Teams feel friction but cannot coordinate&lt;/li&gt;
&lt;li&gt;3+: Workflows, review steps, and boundaries are explicit&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Executive signal:
If you cannot see AI use, you cannot govern it. Invisible use is the most
dangerous form of adoption because it reshapes delivery without review or audit.&lt;/p&gt;
&lt;p&gt;Leadership action:
Surface all usage, tools, risks, and drift before scaling anything.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="2-where-does-ai-reduce-cognitive-load-or-cycle-time-for-whole-teams-not-just-individuals"&gt;2. Where does AI reduce cognitive load or cycle time for whole teams, not just individuals?&lt;/h2&gt;
&lt;p&gt;Stage signal:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;0–1: Productivity is anecdotal and personal&lt;/li&gt;
&lt;li&gt;2: Teams see uneven output and duplicated effort&lt;/li&gt;
&lt;li&gt;3: Shared workflows show measurable improvement&lt;/li&gt;
&lt;li&gt;4–5: AI contributes to throughput as part of the system&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Executive signal:
Individual acceleration is not organisational capability. Individual use without
team coherence increases delivery variance.&lt;/p&gt;
&lt;p&gt;Leadership action:
Identify where AI improves team‑level flow; ignore individual anecdotes.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="3-what-controls-review-steps-and-boundaries-are-required-at-our-current-stage"&gt;3. What controls, review steps, and boundaries are required at our current stage?&lt;/h2&gt;
&lt;p&gt;Stage signal:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;0–1: No guardrails; risk accumulates quietly&lt;/li&gt;
&lt;li&gt;2: Teams ask for boundaries but cannot define them&lt;/li&gt;
&lt;li&gt;3: Review steps and constraints become standardised&lt;/li&gt;
&lt;li&gt;4: Governance and observability are built into the system&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Executive signal:
Scaling without controls guarantees failure. Missing controls at Stage 1 allows
unreviewed changes into critical workflows.&lt;/p&gt;
&lt;p&gt;Leadership action:
Match controls to your actual stage, not your aspirations.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="4-which-organisational-foundations-must-be-strengthened-before-we-can-safely-move-to-the-next-stage"&gt;4. Which organisational foundations must be strengthened before we can safely move to the next stage?&lt;/h2&gt;
&lt;p&gt;Stage signal:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;0–2: Documentation, testing, ownership, architecture inconsistent&lt;/li&gt;
&lt;li&gt;3: Foundations stabilise because AI workflows depend on them&lt;/li&gt;
&lt;li&gt;4–5: Strong foundations multiply value; weak ones collapse instantly&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Executive signal:
AI amplifies whatever environment it enters. Weak foundations are already being
stressed by AI‑accelerated work.&lt;/p&gt;
&lt;p&gt;Leadership action:
Ensure the environment is AI‑compatible: clarity, ownership, documentation,
testing, and architecture must be strong enough to absorb AI‑accelerated change.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="5-how-will-leadership-set-expectations-and-pace-adoption-so-it-matches-our-capacity-to-absorb-change"&gt;5. How will leadership set expectations and pace adoption so it matches our capacity to absorb change?&lt;/h2&gt;
&lt;p&gt;Stage signal:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;0–1: Expectations inflated; progress invisible&lt;/li&gt;
&lt;li&gt;2: Teams feel strain; leaders misread friction as maturity&lt;/li&gt;
&lt;li&gt;3: Communication grounded in measurable workflows&lt;/li&gt;
&lt;li&gt;4–5: AI adoption becomes organisational change, not tooling&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Executive signal:
Most organisations believe they are at Stage 3 while operating at Stage 1–2.
Pacing is a leadership responsibility, not a technical one.&lt;/p&gt;
&lt;p&gt;Leadership action:
Set expectations that match reality; pace adoption deliberately.&lt;/p&gt;
&lt;hr/&gt;
&lt;h1 id="leadership-imperative"&gt;Leadership Imperative&lt;/h1&gt;
&lt;p&gt;AI adoption is already happening inside your organisation. Your only choice is
whether it reshapes your workflows with structure or erodes quality, coherence,
and trust without it.&lt;/p&gt;
&lt;hr/&gt;
&lt;h1 id="if-you-only-do-one-thing"&gt;If You Only Do One Thing&lt;/h1&gt;
&lt;p&gt;Identify your true maturity stage. Everything else depends on that.&lt;/p&gt;
&lt;hr/&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="/articles/leadership/notes/ai-engineering-must-be-team-based-to-see-significant-roi.html"&gt;AI Engineering Must Be Team‑Based to See Significant ROI&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="/articles/leadership/notes/building-safe-llm-systems.html"&gt;Building Safe, Compliant, and Sustainable LLM Systems&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="/articles/leadership/notes/transforming.html"&gt;Transforming Your Business for AI&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;
&lt;hr/&gt;
&lt;h1 id="further-reading"&gt;Further Reading&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;McKinsey — The state of AI: How organizations are rewiring to capture value (2025)&lt;br/&gt;
  https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-how-organizations-are-rewiring-to-capture-value&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;OECD Digital Economy Outlook 2024 (Volume 1)&lt;br/&gt;
  https://www.oecd.org/en/publications/oecd-digital-economy-outlook-2024-volume-1_a1689dc5-en.html&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content><category term="leadership"></category></entry><entry><title>Agents Cannot Maintain Systems: The Additive–Transformative Gap in LLM Software Delivery</title><link href="https://phroneses.com/articles/build/notes/agents-cannot-maintain-systems.html" rel="alternate"></link><published>2026-05-21T00:00:00+00:00</published><updated>2026-05-21T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-05-21:/articles/build/notes/agents-cannot-maintain-systems.html</id><summary type="html">&lt;p&gt;LLMs can generate code, but they cannot modify or maintain systems because system‑level work requires causal reasoning, not pattern‑matching.&lt;/p&gt;</summary><content type="html">&lt;p&gt;This article explains why current LLMs cannot safely modify real software
systems, despite impressive code‑generation demos.&lt;/p&gt;
&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="the-promise-of-automated-software-delivery"&gt;The Promise of Automated Software Delivery&lt;/h1&gt;
&lt;p&gt;In 2026, the automated software delivery dream is for an agent to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;read a repository&lt;/li&gt;
&lt;li&gt;understand project structure&lt;/li&gt;
&lt;li&gt;plan a multi‑step change&lt;/li&gt;
&lt;li&gt;write code, tests, and docs&lt;/li&gt;
&lt;li&gt;run the code and fix its own mistakes&lt;/li&gt;
&lt;li&gt;produce a PR‑ready diff&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first three tasks are additive; the last three are transformative. The
first three add information without changing the behaviour of the system: they
require reading, mapping, and planning, but not altering any existing causal
structure in the codebase.&lt;/p&gt;
&lt;p&gt;Applying new code is self-contained, additive work; modifying an existing system
is transformative work that requires an understanding of dependencies,
invariants, and consequences.  This distinction — additive vs transformative —
is the core reason current LLMs can assist but cannot autonomously deliver
software.&lt;/p&gt;
&lt;p&gt;Parts of the above can be done but only for tightly controlled demos on simple
code that is tens of lines long, not on real-world repositories with thousands
of lines of code that has existed for years where dozens of people have
updated it.&lt;/p&gt;
&lt;h1 id="what-the-labs-have-actually-delivered"&gt;What the Labs Have Actually Delivered&lt;/h1&gt;
&lt;p&gt;The agentic work of OpenAI, Google, Cognition Labs, GitHub (Microsoft),
Sourcegraph, JetBrains, Replit, Amazon, Meta, and Anthropic, that is listed in
&lt;a href="#further_reading"&gt;Further Reading&lt;/a&gt;, was published in 2023 and 2024.&lt;/p&gt;
&lt;p&gt;Depending on where you look, you may have been given another impression: that
"agents are here". However, reality tells a different story.&lt;/p&gt;
&lt;p&gt;Agents are improving, but are not reliable, not autonomous, and not production‑safe.&lt;/p&gt;
&lt;p&gt;LLMs can assist with software delivery, but they cannot own it.&lt;/p&gt;
&lt;h1 id="why-is-this"&gt;Why is this?&lt;/h1&gt;
&lt;p&gt;LLMs generate statistically plausible continuations of text. This works well
for self-contained tasks like writing a function or drafting documentation
because these are pattern‑extension problems. But pattern‑matching is not
system understanding, and plausibility is not correctness.&lt;/p&gt;
&lt;p&gt;Software systems are causal: components depend on each other, invariants
constrain behaviour, and changes propagate through the system. The moment a
task stops being self‑contained and becomes system‑dependent — requiring
dependency coherence, persistent state, or awareness of how changes ripple
through a real codebase — pattern‑matching is no longer sufficient.&lt;/p&gt;
&lt;p&gt;Currently, LLMs can imitate the shape of engineering work, but they cannot
maintain a stable internal representation of a system that must be coherently
changed, and that gap is exactly why LLMs fail the moment the task becomes
system‑level.&lt;/p&gt;
&lt;h1 id="persistent-state-creates-temporal-dependencies"&gt;Persistent state creates temporal dependencies&lt;/h1&gt;
&lt;p&gt;A self‑contained task has no past and no future.  A system‑dependent task does.&lt;/p&gt;
&lt;p&gt;As soon as a change depends on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;previous writes&lt;/li&gt;
&lt;li&gt;accumulated data&lt;/li&gt;
&lt;li&gt;cached values&lt;/li&gt;
&lt;li&gt;long‑lived objects&lt;/li&gt;
&lt;li&gt;external system state&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;any agentic model must reason about how the system got here and how it will
behave after the change.&lt;/p&gt;
&lt;p&gt;LLMs cannot maintain that internal causal chain.&lt;/p&gt;
&lt;h1 id="writing-code-to-agentic-systems-the-fundamental-gap"&gt;Writing code to Agentic Systems: The Fundamental Gap&lt;/h1&gt;
&lt;p&gt;The gap becomes clear when you compare two activities: writing new code and
modifying an existing system.&lt;/p&gt;
&lt;p&gt;Code generation is local and additive: the model extends a pattern without
needing to understand the system.&lt;/p&gt;
&lt;p&gt;But agentic work is global and transformative: the LLM must change the system
itself, which requires understanding dependencies, invariants, interactions,
and downstream consequences.&lt;/p&gt;
&lt;p&gt;This is causal reasoning, not pattern extension.  LLMs predict tokens, not
consequences — and that is why the leap from writing code to producing a safe,
system‑aware PR‑ready diff is not incremental but a shift into a fundamentally
different problem space.&lt;/p&gt;
&lt;h1 id="producing-a-prready-diff-the-section-in-question"&gt;Producing a PR‑ready diff (the section in question)&lt;/h1&gt;
&lt;p&gt;A pull request (PR) is a piece of code that will change a system.&lt;/p&gt;
&lt;p&gt;For that change to be safe, the change must respect the system's current
architecture, its intent, and all downstream consequences.&lt;/p&gt;
&lt;p&gt;Software engineers work hard to ensure that such a change is safe through
testing and their own judgement and experience before having a collegue review
the change.&lt;/p&gt;
&lt;p&gt;Applying a change is no longer pattern-matching but understanding causal
behaviour: how will the system change if this PR is applied?&lt;/p&gt;
&lt;p&gt;The correctness of the PR depends on understanding the whole system, not just
generating text.&lt;/p&gt;
&lt;p&gt;The LLM must change the system, which requires understanding dependencies,
invariants, interactions and consequences, all of which demand causal
reasoning, not pattern matching.&lt;/p&gt;
&lt;p&gt;Pattern‑matching can write code; only causal reasoning can maintain systems.&lt;/p&gt;
&lt;h1 id="what-can-i-do"&gt;What can I do?&lt;/h1&gt;
&lt;p&gt;Confirm for yourself any claim that you see. Define your own &lt;em&gt;realistic&lt;/em&gt;
real-world repository to work on, one that is thousands of lines of code, that
has supported past real-world work patterns.&lt;/p&gt;
&lt;p&gt;Having your own results, applied to your own repository will tell you volumes
more than any press release or online anecdote.&lt;/p&gt;
&lt;p&gt;For the moment:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;treat agentic AI as a strategic direction&lt;/li&gt;
&lt;li&gt;treat current tools as assistants, not engineers&lt;/li&gt;
&lt;li&gt;invest in clarity, architecture, and test discipline&lt;/li&gt;
&lt;li&gt;expect progress, but not miracles&lt;/li&gt;
&lt;li&gt;do not plan delivery pipelines around unproven capabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Maintain human judgement as the centre of the system.&lt;/p&gt;
&lt;p&gt;The dream is intact.  The evidence is not yet here.&lt;/p&gt;
&lt;h1 id="why-this-matters-code-is-cheap-judgement-is-not"&gt;Why this matters: code is cheap, judgement is not&lt;/h1&gt;
&lt;p&gt;LLM-augmented software delivery does not remove engineering.&lt;/p&gt;
&lt;p&gt;It moves engineering up a level.&lt;/p&gt;
&lt;p&gt;Humans need to focus on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;intent&lt;/li&gt;
&lt;li&gt;constraints&lt;/li&gt;
&lt;li&gt;architecture&lt;/li&gt;
&lt;li&gt;correctness&lt;/li&gt;
&lt;li&gt;safety&lt;/li&gt;
&lt;li&gt;trade‑offs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The desired end state is not "AI writes code" but AI maintains systems. If we get
there, humans will still need to maintain intent.&lt;/p&gt;
&lt;p&gt;The consequence of an agentic system is not to &lt;em&gt;remove&lt;/em&gt; engineering, but to
&lt;em&gt;elevate&lt;/em&gt; it, so that teams spend less time on mechanical construction and more time on
judgement, alignment, and shaping the environment in which agents operate.&lt;/p&gt;
&lt;p&gt;The organisations that benefit most will be those that treat agentic development
not as automation, but as a structural shift in how software is conceived,
validated, and maintained.&lt;/p&gt;
&lt;h1 id="final-thought"&gt;Final Thought&lt;/h1&gt;
&lt;p&gt;Until AI can reason causally about systems, human judgement remains the
foundation of software delivery.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="ai-engineering-team-based-ai.html"&gt;The real gains from AI come from improving the shared work between engineers — planning, coordination, review, debugging, and delivery — not from speeding up individual coding.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="engineers-need-to-know.html"&gt;Software engineers must understand tokens, structure, and probabilistic behaviour to build reliable systems and avoid mismatches between test and production behaviour.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="surface-area.html"&gt;AI systems behave like probabilistic components; engineers must build structured interfaces and layered constraints to make them reliable inside software systems.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#the-promise-of-automated-software-delivery"&gt;The Promise of Automated Software Delivery&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-the-labs-have-actually-delivered"&gt;What the Labs Have Actually Delivered&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#why-is-this"&gt;Why is this?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#persistent-state-creates-temporal-dependencies"&gt;Persistent state creates temporal dependencies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#writing-code-to-agentic-systems-the-fundamental-gap"&gt;Writing code to Agentic Systems: The Fundamental Gap&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#producing-a-prready-diff-the-section-in-question"&gt;Producing a PR‑ready diff (the section in question)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-can-i-do"&gt;What can I do?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#why-this-matters-code-is-cheap-judgement-is-not"&gt;Why this matters: code is cheap, judgement is not&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#final-thought"&gt;Final Thought&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#further-reading"&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a id="further_reading"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="further-reading"&gt;Further Reading&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;OpenAI o1/o3&lt;/strong&gt;, OpenAI, September, 2024&lt;br/&gt;
- https://openai.com/index/introducing-openai-o1-preview/&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gemini Code Demos&lt;/strong&gt;, Google, December, 2023&lt;br/&gt;
- https://blog.google/technology/ai/google-gemini-ai/&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Devin&lt;/strong&gt;, Cognition Labs, March, 2024&lt;br/&gt;
- https://www.cognition-labs.com/&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GitHub Copilot&lt;/strong&gt;, GitHub (Microsoft), November, 2023&lt;br/&gt;
- https://github.blog/2023-11-08-the-new-github-copilot-your-ai-pair-programmer/&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cody&lt;/strong&gt;, Sourcegraph, April, 2024&lt;br/&gt;
- https://sourcegraph.com/blog/cody-2-0&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI Assistant in JetBrains IDEs&lt;/strong&gt;, JetBrains, December, 2023&lt;br/&gt;
- https://blog.jetbrains.com/blog/2023/12/06/jetbrains-ai-assistant-is-now-available/&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Replit Agents&lt;/strong&gt;, Replit, November, 2023&lt;br/&gt;
- https://blog.replit.com/agents&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Amazon CodeWhisperer&lt;/strong&gt;, Amazon, April, 2023&lt;br/&gt;
- https://aws.amazon.com/codewhisperer/&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Code Llama&lt;/strong&gt;, Meta, August, 2023&lt;br/&gt;
- https://ai.meta.com/blog/code-llama-large-language-model-coding/&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claude 3 Code Reasoning&lt;/strong&gt;, Anthropic, March, 2024&lt;br/&gt;
- https://www.anthropic.com/news/claude-3-family&lt;/p&gt;</content><category term="build"></category></entry><entry><title>When Code Is Cheap, Judgement Matters More</title><link href="https://phroneses.com/articles/leadership/notes/when-code-is-cheap.html" rel="alternate"></link><published>2026-05-20T00:00:00+00:00</published><updated>2026-05-20T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-05-20:/articles/leadership/notes/when-code-is-cheap.html</id><summary type="html">&lt;p&gt;AI lowers the cost of code, not the cost of thinking. Clarity and judgement, not speed, determine whether teams build what truly matters.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="sdd-is-a-symptom-not-a-methodology"&gt;SDD Is a Symptom, not a Methodology&lt;/h1&gt;
&lt;p&gt;Getting software delivered has always required a specification.&lt;/p&gt;
&lt;p&gt;Having a clear specification of what is required is essential.&lt;/p&gt;
&lt;p&gt;Writing such a spec is a collaborative effort:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Product owns the business intent&lt;/li&gt;
&lt;li&gt;Engineering owns the technical constraints&lt;/li&gt;
&lt;li&gt;Design owns the interaction and behaviour&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The spec is a shared artefact formed through deliberate thinking and judgement.
It must embody strategy and confirm that what is to be built is relevant.&lt;/p&gt;
&lt;p&gt;The software industry now suggests that having a specification will make
AI tooling more reliable. No. And this is not new.&lt;/p&gt;
&lt;p&gt;A clear spec has always meant that the outcome is &lt;em&gt;more likely&lt;/em&gt; to be successful.&lt;/p&gt;
&lt;p&gt;SDD for AI-augmented teams is just a 30-year-old idea in a sparkly jacket.&lt;/p&gt;
&lt;h1 id="what-is-new"&gt;What is new&lt;/h1&gt;
&lt;p&gt;SDD is not new. But the context is.&lt;/p&gt;
&lt;p&gt;SDD is being reframed as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a way to generate code from structured specs&lt;/li&gt;
&lt;li&gt;a way to constrain AI agents&lt;/li&gt;
&lt;li&gt;a way to reduce non‑determinism&lt;/li&gt;
&lt;li&gt;a way to enforce governance in AI‑augmented pipelines&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This reframing gives the impression that SDD is a new discipline rather than a
new label for long‑standing engineering practice.&lt;/p&gt;
&lt;p&gt;The spec is not the goal. Working software is.&lt;/p&gt;
&lt;p&gt;Regardless of who writes the spec, you will need to iterate: build, release,
gather user and market feedback, and steer with additional thinking and
judgement.&lt;/p&gt;
&lt;h1 id="sdd-surfaces-when-teams-confront-ambiguity"&gt;SDD Surfaces When Teams Confront Ambiguity&lt;/h1&gt;
&lt;p&gt;SDD appears when teams realise:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;their requirements are too vague&lt;/li&gt;
&lt;li&gt;their systems are too implicit&lt;/li&gt;
&lt;li&gt;their data contracts are too loose&lt;/li&gt;
&lt;li&gt;their AI tooling is too unpredictable&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;SDD is the label people reach for when they need clarity, structure and determinism.&lt;/p&gt;
&lt;p&gt;You do not need SDD. You need clarity, structure and determinism.&lt;/p&gt;
&lt;h1 id="write-a-spec-get-the-code-for-free"&gt;Write a spec, get the code for free?&lt;/h1&gt;
&lt;p&gt;The assumption in tech currently seems to be, write a spec, feed it into an AI
and get out all the code you need for free.&lt;/p&gt;
&lt;p&gt;Writing the spec requires deliberate thinking and judgement by Product,
Engineering, and Design. You cannot automate this.&lt;/p&gt;
&lt;h1 id="the-limits-of-the-spec-code-argument"&gt;The Limits of the "Spec → Code" Argument&lt;/h1&gt;
&lt;p&gt;Taking the "spec → code" argument to its logical conclusion: why not use AI to
automate the generation of the spec? Why stop at generating code? We could use
AI to generate the company's vision and strategy so vision → strategy → spec →
code can be AI generated?&lt;/p&gt;
&lt;p&gt;Because large language models are probabilistic pattern-matching processes,
domains that are less pattern rich than the unambiguous grammar of a computer
programming language or a mathematical formula will be less well modeled by an
LLM.&lt;/p&gt;
&lt;p&gt;In 2026, LLMs are experiencing major leaps forward since the initial revolution
started, but over time, the incremental improvements and the size of the leap
forward will lessen as all the low-hanging innovation fruit is quickly
consumed, and we realise the fundamental limits of pattern matching.&lt;/p&gt;
&lt;h1 id="well-engineered-code-cannot-be-seen"&gt;Well engineered code cannot be seen&lt;/h1&gt;
&lt;p&gt;"Marley was dead: to begin with."&lt;/p&gt;
&lt;p&gt;These six words start A Christmas Carol by Charles Dickens. And what they
achieve is beyond just the words.&lt;/p&gt;
&lt;p&gt;Dickens uses the line to establish an absolute fact the reader must accept,
because the entire supernatural and moral structure of the story depends on
Marley being unquestionably dead. Without that certainty, the ghost would not
be a ghost, Scrooge’s transformation would lose force, and the story’s logic
would collapse. The sentence subtly fixes the rules of the world before the
plot begins.&lt;/p&gt;
&lt;p&gt;Well engineered code is the same; it embodies a team's judgement beyond the
text that can be seen.&lt;/p&gt;
&lt;p&gt;To capture every eventuality in a specification would require anticipating
everything. Humans are not good at this, which is why incremental delivery is
essential.&lt;/p&gt;
&lt;p&gt;We forget that any sufficiently detailed spec is the code.&lt;/p&gt;
&lt;p&gt;In addition, code executes within a much larger environment.  Aligning code to
work within a changing environment requires judgement from across the
organisation, not only from engineering.&lt;/p&gt;
&lt;h1 id="juniors-are-not-doomed"&gt;Juniors are Not Doomed&lt;/h1&gt;
&lt;p&gt;Before LLMs, a junior software engineer would traditionally have been given a
task that was self-contained: fixing bugs or delivering straightforward
features.  This reduced the risk to the business and ensured that the engineer
could get up to speed with house rules: how code was delivered; what to expect
from a PR; who to seek help from. &lt;/p&gt;
&lt;p&gt;This familiarisation is part of the 70% of the job. The junior will use their
judgement, with feedback, to contribute to the understanding that product,
engineering and design collaborate to achieve. This is how the junior engineer
learns and gains experience by doing the whole software engineering cycle,
end-to-end.&lt;/p&gt;
&lt;p&gt;With large language models, the 30% of the job is likely to change. But the 70%
will remain the same. The 70% cannot be fully automated by LLMs as it requires
judgement.&lt;/p&gt;
&lt;p&gt;Good engineering is more than what you can see in the code. Marley may be dead
but the role of the junior is not.&lt;/p&gt;
&lt;h1 id="when-code-becomes-cheap"&gt;When Code Becomes Cheap&lt;/h1&gt;
&lt;p&gt;AI is now part of software engineering. The question is not whether we use it,
but whether we use it well.&lt;/p&gt;
&lt;p&gt;Writing the code is the last step once the team has gained a good understanding
of what is required. Without clarity, our current use of AI is to produce more
code that is not needed or will not be used.&lt;/p&gt;
&lt;p&gt;If AI makes the cost of writing code essentially zero, we need to ensure that
the code that is written is exactly what is required for the business, given
the singular context of the business within its market.&lt;/p&gt;
&lt;p&gt;The quick win for AI companies has been to demonstrate how suited their LLMs
are to code generation.  But like any tool, its value depends entirely on how
we choose to use it.&lt;/p&gt;
&lt;p&gt;A business should not define itself by how much code can be generated but by
the quality of its products; leadership must recognise that rushing out large
quantities of code will dilute that quality.&lt;/p&gt;
&lt;p&gt;Leadership should focus on clarity, structure and determinism so that the
product being designed and built is what the organisation genuinely needs.&lt;/p&gt;
&lt;p&gt;If AI reduces the cost of producing code, leadership must raise the standard of
what is worth producing. The responsibility for clarity increases as the cost
of execution falls.&lt;/p&gt;
&lt;p&gt;AI changes the economics of code, not the fundamentals of engineering.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="ai-engineering-team-based-ai.html"&gt;The biggest ROI from AI comes from improving team‑level work, not speeding up individual coding.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="team-ai-is-the-next-step.html"&gt;Individual AI delivers diminishing returns; meaningful improvement comes from strengthening the collective workflow.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="transforming.html"&gt;AI adoption is an organisational transformation requiring mandates, measurement, and redesigned processes.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#sdd-is-a-symptom-not-a-methodology"&gt;SDD Is a Symptom, not a Methodology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-is-new"&gt;What is new&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#sdd-surfaces-when-teams-confront-ambiguity"&gt;SDD Surfaces When Teams Confront Ambiguity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#write-a-spec-get-the-code-for-free"&gt;Write a spec, get the code for free?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-limits-of-the-spec-code-argument"&gt;The Limits of the "Spec → Code" Argument&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#well-engineered-code-cannot-be-seen"&gt;Well engineered code cannot be seen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#juniors-are-not-doomed"&gt;Juniors are Not Doomed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#when-code-becomes-cheap"&gt;When Code Becomes Cheap&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#further-reading"&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h1 id="further-reading"&gt;Further Reading&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A Christmas Carol, Charles Dickens
  https://en.wikipedia.org/wiki/A_Christmas_Carol&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Allen Holub on LinkedIn, A post starting &lt;em&gt;At the top of the "are doomed to repeat it" category&lt;/em&gt;...
  https://shorturl.at/fWndU&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content><category term="leadership"></category></entry><entry><title>The Missing Structure Agile Cannot Fix</title><link href="https://phroneses.com/articles/leadership/notes/the-missing-structure.html" rel="alternate"></link><published>2026-05-19T00:00:00+00:00</published><updated>2026-05-19T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-05-19:/articles/leadership/notes/the-missing-structure.html</id><summary type="html">&lt;p&gt;Agile cannot fix structural gaps; delivery depends on clear ownership, boundaries, and decision‑rights across the wider organisational network.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="agile-is-not-enough-delivery-is-a-network"&gt;Agile Is Not Enough: Delivery Is a Network&lt;/h1&gt;
&lt;p&gt;Agile is not the missing layer. &lt;strong&gt;Structural clarity is.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Agile is one part of a larger system. Software delivery behaves like a network,
and that network depends on structure. When ownership, boundaries, and
decision‑rights are unclear, signals drift and intent loses its path. Structural
clarity is what allows the whole system to function with purpose rather than
friction. Agile is one part of that system.&lt;/p&gt;
&lt;p&gt;Structural clarity means defining who owns what, who decides what, and where
each team’s authority begins and ends. These are the elements that give the
network shape.&lt;/p&gt;
&lt;p&gt;Modern delivery is a set of interconnected nodes carrying intent, decisions,
and constraints. When the structure is weak, the network compensates through
effort instead of design. Teams work harder, not faster. Progress slows.&lt;/p&gt;
&lt;p&gt;You have seen this pattern. Stand‑ups increase, backlogs are refined, reporting
expands, yet progress slows. This is not something engineering teams can fix on
their own. The slowdown comes from missing links in the network. Signals do not
flow, decisions do not propagate, and intent cannot reach the places that need
it.&lt;/p&gt;
&lt;p&gt;A familiar scenario illustrates the point. Delivery begins to slip. Leaders
assume the issue sits within engineering, so the response is to "do Agile
better": tighten ceremonies, rewrite backlogs, add coaches, increase cadence.&lt;/p&gt;
&lt;p&gt;But the intended fix does not work because the problem is not at the team
level. Strategy is unclear, ownership is fragmented, and decision‑rights are
undefined. Agile cannot compensate for structural gaps. The method is sound;
the layer above it is not.&lt;/p&gt;
&lt;p&gt;Without defined pathways, even strong teams stall.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="1-agiles-place-in-the-structure"&gt;1.  Agile’s Place in the Structure&lt;/h2&gt;
&lt;p&gt;Software delivery is a system of interdependent functions: strategy, product,
architecture, engineering, risk, governance, and operations.&lt;/p&gt;
&lt;p&gt;Agile supports one part of this system (engineering), but it cannot replace the
structural clarity that allows the &lt;em&gt;whole network&lt;/em&gt; to function.&lt;/p&gt;
&lt;p&gt;Agile supports the engineering team‑level execution node of the delivery network.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Iteration&lt;/li&gt;
&lt;li&gt;Local planning and prioritisation&lt;/li&gt;
&lt;li&gt;Team‑level coordination and communication&lt;/li&gt;
&lt;li&gt;Short feedback loops&lt;/li&gt;
&lt;li&gt;Making work visible&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Teams and leaders that rely on Agile alone eventually discover that the real
issues sit above the methodology.&lt;/p&gt;
&lt;p&gt;This is consistent with the &lt;em&gt;Agile Manifesto&lt;/em&gt;, which never claimed to define an
organisational model.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="2-what-agile-actually-covers"&gt;2. What Agile Actually Covers&lt;/h2&gt;
&lt;p&gt;Agile was designed for a narrow and valuable purpose: to help teams work
iteratively, plan locally, maintain short feedback loops, and keep work
visible. Agile excels at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Iteration  &lt;/li&gt;
&lt;li&gt;Team‑level coordination  &lt;/li&gt;
&lt;li&gt;Local prioritisation  &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These are important behaviours, but they do not define the structure of the
wider delivery network. Agile does not establish ownership, define
decision‑making, architectural boundaries, or cross‑team interfaces.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;Scrum Guide&lt;/em&gt; reinforces this: Scrum is a lightweight framework for
&lt;em&gt;team‑level&lt;/em&gt; delivery, not an organisational blueprint.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="3-the-delivery-network"&gt;3. The Delivery Network&lt;/h2&gt;
&lt;p&gt;Delivery is a network of connected disciplines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Strategy sets direction.  &lt;/li&gt;
&lt;li&gt;Product defines value.  &lt;/li&gt;
&lt;li&gt;Architecture shapes boundaries.  &lt;/li&gt;
&lt;li&gt;Engineering execution turns intent into working systems.  &lt;/li&gt;
&lt;li&gt;Quality assurance verifies behaviour, protects quality, and prevents regressions.&lt;/li&gt;
&lt;li&gt;DevOps automates delivery, helps to accelerate flow, and connects build to run.&lt;/li&gt;
&lt;li&gt;Risk and governance ensure safety and compliance.  &lt;/li&gt;
&lt;li&gt;Platform operations keep the environment stable.  &lt;/li&gt;
&lt;li&gt;Organisational clarity ties these layers together.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These functions fail not in isolation, but at their intersections. The issue is
the structure between them, not any one discipline.&lt;/p&gt;
&lt;p&gt;Agile touches only one node in this network (engineering execution). The rest
require structure, ownership, and judgement.&lt;/p&gt;
&lt;p&gt;As &lt;em&gt;Team Topologies&lt;/em&gt; argues, flow depends more on team boundaries,
communication paths, and interaction modes than on any single methodology.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="4-why-agile-cannot-fix-structural-problems"&gt;4. Why Agile Cannot Fix Structural Problems&lt;/h2&gt;
&lt;p&gt;A familiar failure mode appears across organisations.&lt;/p&gt;
&lt;p&gt;A team is asked to deliver a critical change. Strategy is ambiguous.
Architecture is drifting. No one owns the interface between two systems that
must integrate. Risk has not defined acceptable limits. Governance expects
updates but has not clarified decision-rights.&lt;/p&gt;
&lt;p&gt;The team runs sprints, holds stand‑ups, and updates its work board.&lt;br/&gt;
But nothing moves.&lt;/p&gt;
&lt;p&gt;The network is miswired. Agile cannot repair the topology.&lt;/p&gt;
&lt;p&gt;This is the same lesson illustrated in &lt;em&gt;The Phoenix Project&lt;/em&gt;: local
team optimisation cannot compensate for system‑level dysfunction.&lt;/p&gt;
&lt;p&gt;Agile works at the team-level, whereas issues are at the level above.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="5-what-agile-does-not-cover"&gt;5. What Agile Does Not Cover&lt;/h2&gt;
&lt;p&gt;Agile influences parts of the system, but it does not define them. It does
&lt;strong&gt;not&lt;/strong&gt; cover:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Operating model design  &lt;/li&gt;
&lt;li&gt;Decision-rights  &lt;/li&gt;
&lt;li&gt;Ownership boundaries  &lt;/li&gt;
&lt;li&gt;Architectural coherence  &lt;/li&gt;
&lt;li&gt;Risk posture  &lt;/li&gt;
&lt;li&gt;Budgeting and portfolio management  &lt;/li&gt;
&lt;li&gt;Hiring and capability development  &lt;/li&gt;
&lt;li&gt;Cross‑team alignment  &lt;/li&gt;
&lt;li&gt;Quality engineering  &lt;/li&gt;
&lt;li&gt;Capacity planning  &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These responsibilities sit above the delivery team. They require leadership,
not ceremonies.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="6-the-missing-layer-structural-clarity"&gt;6. The Missing Layer: Structural Clarity&lt;/h2&gt;
&lt;p&gt;The missing layer is structural clarity. Organisations need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Clear ownership  &lt;/li&gt;
&lt;li&gt;Clear decision‑making  &lt;/li&gt;
&lt;li&gt;Clear constraints  &lt;/li&gt;
&lt;li&gt;Clear operating models  &lt;/li&gt;
&lt;li&gt;Clear interfaces between teams  &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These elements create the conditions in which Agile can work as intended.
Without them, Agile becomes noise layered on top of confusion.&lt;/p&gt;
&lt;p&gt;This mirrors the argument in &lt;em&gt;Good Strategy / Bad Strategy&lt;/em&gt;: clarity, coherence,
and focus matter more than any specific process.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="7-how-the-network-behaves-when-structure-exists"&gt;7 How the Network Behaves When Structure Exists&lt;/h2&gt;
&lt;p&gt;When organisations define structural clarity, the network changes character.
Ownership becomes visible. Decisions move without friction. Boundaries stop
shifting. Teams know where their responsibility ends and another begins.
Cross‑team work relies on defined interfaces rather than personal negotiation.
Flow improves because intent and decisions no longer leak between gaps in the
structure. Agile starts to work as intended, not because the method changed,
but because the environment finally supports it.&lt;/p&gt;
&lt;p&gt;The deeper shift is cultural. Slowdowns are no longer treated as engineering
problems. Teams stop compensating through effort. Leaders stop reaching for
Agile process as the universal fix. The organisation begins to behave like a
system rather than a collection of disconnected parts.&lt;/p&gt;
&lt;p&gt;Structural clarity does not make teams better. It removes the conditions that
force them to work against the system.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="8-conclusion"&gt;8. Conclusion&lt;/h2&gt;
&lt;p&gt;Agile is not wrong. It is incomplete.&lt;br/&gt;
Software delivery requires clarity, structure, and judgement. Agile is a
component.  &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Clarity is the network.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Before assuming Agile is the problem, ask one question:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Is the network around the team structured well enough for any methodology to
work at all.&lt;/strong&gt;&lt;/p&gt;
&lt;div style="background:#e8f4ff; border-left:4px solid #7bb6f0; padding:1rem; margin:2rem 0;"&gt;
  For a deeper explanation of the structural layer that Agile depends on, see the &lt;a href="/leadership-os/"&gt;Leadership OS guide&lt;/a&gt;.
&lt;/div&gt;
&lt;hr/&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="transforming.html"&gt;AI adoption is an organisational transformation requiring mandates, measurement, and redesigned processes.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="ai-engineering-team-based-ai.html"&gt;The biggest ROI from AI comes from improving team‑level work, not speeding up individual coding.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="tech-executives.html"&gt;Executives must treat LLMs as probabilistic systems requiring controls, governance, and new forms of oversight.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#agile-is-not-enough-delivery-is-a-network"&gt;Agile Is Not Enough: Delivery Is a Network&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#1-agiles-place-in-the-structure"&gt;1. Agile’s Place in the Structure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#2-what-agile-actually-covers"&gt;2. What Agile Actually Covers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#3-the-delivery-network"&gt;3. The Delivery Network&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#4-why-agile-cannot-fix-structural-problems"&gt;4. Why Agile Cannot Fix Structural Problems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#5-what-agile-does-not-cover"&gt;5. What Agile Does Not Cover&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#6-the-missing-layer-structural-clarity"&gt;6. The Missing Layer: Structural Clarity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#7-how-the-network-behaves-when-structure-exists"&gt;7 How the Network Behaves When Structure Exists&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#8-conclusion"&gt;8. Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#further-reading"&gt;Further Reading&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#agile-and-flow"&gt;Agile and Flow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#team-structure-and-systems-thinking"&gt;Team Structure and Systems Thinking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#strategy-and-organisational-clarity"&gt;Strategy and Organisational Clarity&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;hr/&gt;
&lt;h1 id="further-reading"&gt;Further Reading&lt;/h1&gt;
&lt;h2 id="agile-and-flow"&gt;Agile and Flow&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The Agile Manifesto &lt;br/&gt;
  https://agilemanifesto.org/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Scrum Guide&lt;br/&gt;
  https://scrumguides.org/&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="team-structure-and-systems-thinking"&gt;Team Structure and Systems Thinking&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Skelton, Matthew; Pais, Manuel. Team Topologies.&lt;br/&gt;
  https://teamtopologies.com/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Kim, Gene; Behr, Kevin; Spafford, George. The Phoenix Project.&lt;br/&gt;
  https://itrevolution.com/the-phoenix-project/&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="strategy-and-organisational-clarity"&gt;Strategy and Organisational Clarity&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Paul Griffin Consulting&lt;br/&gt;
  https://paulgriffinconsulting.co.uk/blog/good-strategy-bad-strategy-applying-rumelts-key-principles/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Rumelt, Richard. &lt;em&gt;Good Strategy / Bad Strategy.&lt;/em&gt;&lt;br/&gt;
  https://goodstrategybadstrategy.com/&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content><category term="leadership"></category></entry><entry><title>Designing Prompts for Modern AI Systems</title><link href="https://phroneses.com/articles/foundations/notes/designing-prompts-for-modern-ai-systems.html" rel="alternate"></link><published>2026-05-11T00:00:00+00:00</published><updated>2026-05-11T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-05-11:/articles/foundations/notes/designing-prompts-for-modern-ai-systems.html</id><summary type="html">&lt;p&gt;Modern AI systems require structured, multi‑step prompts that guide planning, critique, and long‑context reasoning.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;AI in 2026 demands more from you than simple instructions. Modern systems can
plan, critique, revise, and work across long context windows. They are no
longer moved by vague guidance such as "be clear" or "add detail". They need a
defined environment to operate within.&lt;/p&gt;
&lt;p&gt;Modern prompting is about shaping the system, not decorating the request. When
you set the frame, the workflow, and the output contract, the model gains the
structure it needs to behave predictably. You do this once, and the benefits
carry through every answer. You set the constraints. The model works inside
them on your behalf.&lt;/p&gt;
&lt;p&gt;If you do this, just once, your AI output will be steady and structured, and
you will find it much quicker and easier to work with. When you tell the AI
how to respond, you apply guardrails for the system to work within. Guardrails
set by you, not the AI.&lt;/p&gt;
&lt;h2 id="1-start-with-the-system-not-the-request"&gt;1. Start with the system, not the request&lt;/h2&gt;
&lt;p&gt;AI has advanced quickly. Its answers can now be broad, deep, and varied. To
keep that power under control, you begin by defining the frame the model must
work within. This frame sets the role, the tone, the limits, and the rules for
handling uncertainty. It is the foundation the rest of the prompt stands on.&lt;/p&gt;
&lt;p&gt;Most prompt failures do not come from unclear questions. They come from the
model having no stable footing. Without a frame, the AI will guess at how
formal to be, how cautious to be, and how much structure to use. Those guesses
shift from run to run, which leads to drift and inconsistency.&lt;/p&gt;
&lt;p&gt;A system frame removes that guesswork. It tells the model what it is, how it
should behave, and what matters most. It defines what is in scope, what is out
of scope, and how to respond when the request touches the edges. With this in
place, the rest of the prompt becomes lighter and more reliable.&lt;/p&gt;
&lt;p&gt;The frame does not need flourish. It needs clarity, discipline, and a steady
tone. With that foundation, the model behaves less like a pattern generator
and more like a tool working inside a defined brief.&lt;/p&gt;
&lt;p&gt;In practice, the system frame is the architecture behind the output. It does
not need flourish or personality. It needs to state the role, the rules, and
your expectations.&lt;/p&gt;
&lt;div class="chat-example" style="
    background:#f5f5f5;
    border:1px solid #ddd;
    border-radius:8px;
    padding:1rem 1.2rem;
    margin:1.2rem 0;
  "&gt;
&lt;p&gt;&lt;strong&gt;SYSTEM FRAME&lt;/strong&gt;&lt;br/&gt;
  You are an analytical engine. You work with steady reasoning, cautious
  claims, and plain structure. When the request is unclear, you pause and ask
  for what is missing. You avoid invention and keep within the boundaries set
  for you.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;TASK&lt;/strong&gt;&lt;br/&gt;
  Summarise the key points from the supplied text in three short sections.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OUTPUT CONTRACT&lt;/strong&gt;&lt;br/&gt;
  Produce:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Context&lt;/li&gt;
&lt;li&gt;Reasoning&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Rules:&lt;/strong&gt;&lt;br/&gt;
  If the request is ambiguous, list interpretations and ask for
  clarification.&lt;br/&gt;
  If information is missing, state what is missing before answering.&lt;br/&gt;
  Do not invent facts.&lt;br/&gt;
  Keep the final answer concise and structured.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;WORKFLOW&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Identify assumptions.&lt;/li&gt;
&lt;li&gt;Plan the answer.&lt;/li&gt;
&lt;li&gt;Produce the answer.&lt;/li&gt;
&lt;li&gt;Critique it for clarity and accuracy.&lt;/li&gt;
&lt;li&gt;Produce a revised final version.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;p&gt;The AI is told "You are an analytical engine" as that gives the model a
defined role to work from. Without a role, the model guesses at how formal to
be, how cautious to be, and how much structure to use. A simple line such as
"You are an analytical engine" sets the tone and keeps the behaviour plain,
steady, and predictable. It avoids personality, avoids flourish, and keeps the
work focused on reasoning rather than style.&lt;/p&gt;
&lt;p&gt;If you do not supply the role, the AI will provide one; and that one will vary,
creating work for you.&lt;/p&gt;
&lt;p&gt;How to minimise the work you need to do and have the AI manage and apply the
prompt is dealt with in the section &lt;a href="#ai-manage-prompt"&gt;Having the AI Manage the Prompt
Template&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="2-define-the-output-contract"&gt;2. Define the output contract&lt;/h2&gt;
&lt;p&gt;Modern models behave more reliably when you specify the shape of the answer:
structure, scope, exclusions, formatting, and the rules for handling missing or
ambiguous information. This is far stronger than broad guidance such as "be
concise".&lt;/p&gt;
&lt;p&gt;When you define the output contract, you are not telling the model what to
think. You are telling it what form the answer must take. This removes a large
amount of guesswork. Modern systems have wide latitude in how they respond,
and if you do not narrow that down, they will choose a structure for you. That
choice will vary from run to run, which means more tidying and more checking
on your side.&lt;/p&gt;
&lt;p&gt;An output contract fixes the frame. It tells the model which sections to
produce, how to handle gaps, and how to behave when the request is unclear. It
also removes the temptation to drift into style, flourish, or padding. You are
giving the model the rails to run on.&lt;/p&gt;
&lt;p&gt;A good contract does four things. It sets the structure. It sets the limits.
It sets the rules for uncertainty. And it sets the standard for brevity. Once
these are in place, the model has far less room to wander. You get answers
that are steadier, easier to scan, easier to compare, and easier to work with.&lt;/p&gt;
&lt;p&gt;The contract also acts as a safeguard. By telling the model what to do when
information is missing, you prevent it from filling the gaps with invention.
By telling it how to behave when the request is ambiguous, you prevent it from
guessing. These two points alone remove a large share of common errors.&lt;/p&gt;
&lt;p&gt;In short, the output contract is the quiet discipline behind the work. It
keeps the model inside the brief, keeps the structure predictable, and keeps
the answer focused on what you asked for rather than what the model feels like
producing.&lt;/p&gt;
&lt;h2 id="3-use-decomposition-as-a-control-mechanism"&gt;3. Use decomposition as a control mechanism&lt;/h2&gt;
&lt;p&gt;Modern models already break tasks into steps, but the steps they choose may not
match the work you want done. Light guidance prevents the model from wandering
and keeps the task anchored to your brief.&lt;/p&gt;
&lt;p&gt;When you state the assumptions the model is allowed to make, you draw a clear
line between what is permitted and what is not. This stops the model from
filling empty spaces with guesses. Large models are inclined to complete
patterns, and if you do not show them where the firm ground ends, they will
supply their own footing.&lt;/p&gt;
&lt;p&gt;A natural extension of this is to make the model aware of what is missing.
Once the assumptions are set, the next step is to mark the gaps. This creates a
smooth handover from what the model may rely on to what it must not invent. By
pointing out missing information, you show the model where the edges of the
task sit. When the model knows what is absent, it is less likely to drift into
speculation or produce material that does not belong in the answer. You are
giving it a map of the gaps so it does not try to fill them on its own.&lt;/p&gt;
&lt;p&gt;Together, these two steps act as guardrails. They keep the work inside the
brief, reduce the chance of invention, and ensure that the model stays within
the limits you have set.&lt;/p&gt;
&lt;p&gt;You can also break the task into a simple chain such as understanding →
planning → execution. This mirrors what the model already does internally, but
it makes the process explicit. When the steps are explicit, the model is less
likely to skip ahead or solve the wrong problem.&lt;/p&gt;
&lt;p&gt;Breaking the interaction into smaller stages also helps with scope. By naming
the steps, you give the model a narrow lane to work in. It cannot jump to
conclusions, and it cannot pad the answer with material that does not serve
the task. The work stays tidy, and the output stays close to what you asked
for.&lt;/p&gt;
&lt;p&gt;In short, decomposition is a practical form of control. It does not restrict
the model’s ability to give a good answer, but it does restrict where the
model goes to supply that answer. This keeps the work steady, predictable, and
within scope, so that it remains relevant to what you are doing.&lt;/p&gt;
&lt;h2 id="4-add-a-self-critique-loop"&gt;4. Add a self-critique loop&lt;/h2&gt;
&lt;p&gt;Modern models benefit from a short cycle of controlled refinement. Once the
first version of the answer is produced, a brief review stage forces the model
to check its own work against the constraints you have set. This is not a call
for hidden reasoning. It is a prompt to tighten the output.&lt;/p&gt;
&lt;p&gt;A review step also encourages the model to correct small slips in structure,
scope, or tone. It is easier for the model to adjust an existing draft than to
produce a perfect answer in one pass. The revision stage gives it a second
chance to align with the brief.&lt;/p&gt;
&lt;p&gt;This process also reduces noise. When the model has been told that its work
will be checked and refined, it tends to produce cleaner first drafts. The
revision step becomes a light polish rather than a rescue job.&lt;/p&gt;
&lt;p&gt;In practice, this creates a steady rhythm: draft, inspect, refine. It keeps
the work within bounds and produces answers that are clearer, more accurate,
and easier for you to use.&lt;/p&gt;
&lt;h2 id="5-stack-roles-for-higher-quality-output"&gt;5. Stack roles for higher-quality output&lt;/h2&gt;
&lt;p&gt;Layered roles give you steadier output because each stage is handled by a
specialist rather than a single broad persona. Modern models respond well to
this division of labour. It narrows the scope of each step and reduces the
chance of drift away from what you want.&lt;/p&gt;
&lt;p&gt;A domain expert handles the substance. An editor handles clarity and structure.
A risk assessor checks for overreach, missing information, and unwarranted
certainty. A summariser produces a clean final version. Each role has a narrow
brief, which keeps the work tidy and keeps the answer aligned with the task.&lt;/p&gt;
&lt;p&gt;Here is an example prompt using layered roles:&lt;/p&gt;
&lt;div class="chat-example" style="
    background:#f5f5f5;
    border:1px solid #ddd;
    border-radius:8px;
    padding:1rem 1.2rem;
    margin:1.2rem 0;
  "&gt;
&lt;p&gt;&lt;strong&gt;ROLES&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Domain Expert&lt;/strong&gt;&lt;br/&gt;
  Provide the technical or factual core. Stay within verified information.
  State assumptions and mark gaps.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Editor&lt;/strong&gt;&lt;br/&gt;
  Reshape the expert output into clear, plain structure. Remove padding.
  Ensure each section answers the brief.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Risk Assessor&lt;/strong&gt;&lt;br/&gt;
  Check for overreach, ambiguity, or missing information. Flag anything that
  exceeds the evidence. Recommend corrections.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Summariser&lt;/strong&gt;&lt;br/&gt;
  Produce a concise final version that reflects the corrections and stays
  within scope.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;WORKFLOW&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Domain Expert produces the initial draft.&lt;/li&gt;
&lt;li&gt;Editor restructures and clarifies it.&lt;/li&gt;
&lt;li&gt;Risk Assessor reviews for accuracy and limits.&lt;/li&gt;
&lt;li&gt;Summariser produces the final answer.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;OUTPUT CONTRACT&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Context&lt;/li&gt;
&lt;li&gt;Reasoning&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Rules&lt;/strong&gt;&lt;br/&gt;
  No invention. Mark missing information. Keep the answer within scope.
  Maintain plain structure.&lt;/p&gt;
&lt;/div&gt;
&lt;h2 id="6-treat-the-context-window-as-working-memory"&gt;6. Treat the context window as working memory&lt;/h2&gt;
&lt;p&gt;As of April 2026, modern models dedicate roughly 200,000 to 1,000,000 tokens to
representing your instructions. This space acts as working memory. It can hold
definitions, constraints, examples, running notes, previous outputs, and a
living brief. With this in place, the model behaves more like a stateful
collaborator than a stateless assistant.&lt;/p&gt;
&lt;p&gt;This working memory is what the model can track across prompts. When you define
what belongs in this state, you save time. You do not need to repeat your
requirements. The model carries them forward and maintains the structure you
set.&lt;/p&gt;
&lt;h2 id="7-use-agentic-prompting-patterns"&gt;7. Use agentic prompting patterns&lt;/h2&gt;
&lt;p&gt;Static prompts assume a fixed path from question to answer. Modern systems are
closer to small agents: they can plan, choose actions, call tools, and adjust
their output based on intermediate results. This is often called agentic
behaviour. The system selects and sequences actions to achieve an objective,
rather than following a single linear path.&lt;/p&gt;
&lt;p&gt;Giving the model a workflow such as Plan → Act → Observe → Revise makes this
explicit. In the planning phase, the model outlines what it intends to do,
which tools it may need, and what a good outcome looks like. In the action
phase, it carries out the steps, including any tool calls. In the observation
phase, it inspects the result against the plan and the constraints. In the
revision phase, it adjusts the answer and produces a clean final version.&lt;/p&gt;
&lt;p&gt;Using a workflow saves time and reduces the need for repeated corrections. The
final answer remains tidy. The planning and checking happen in the background
or in short, structured notes, while the output stays compact and readable.
You gain the benefit of step-by-step reasoning without having to sift through
a long chain of output.&lt;/p&gt;
&lt;p&gt;Tool use fits naturally into this pattern. In the Plan step, the model decides
whether tools are needed and why. In the Act step, it calls them. In the
Observe step, it checks whether the tool results answer the question. If tools
are not needed, the model should say so plainly and proceed with reasoning
instead of forcing a tool into the workflow.&lt;/p&gt;
&lt;p&gt;In this context, agentic means that the system behaves as a goal directed
process. The model can plan, choose among available capabilities, and adapt
its path based on intermediate results, rather than producing a single static
completion from a prompt.&lt;/p&gt;
&lt;h2 id="8-make-the-model-identify-ambiguity-before-answering"&gt;8. Make the model identify ambiguity before answering&lt;/h2&gt;
&lt;p&gt;One of the most effective techniques is to require the model to surface all
plausible interpretations before it attempts an answer. This forces the model
to slow down, map the possible meanings, and avoid locking itself into the
first pattern it detects. Large models tend to commit early unless guided.&lt;/p&gt;
&lt;p&gt;This step also exposes hidden ambiguity. When the model lists the possible
readings, you can see whether the task is underspecified, whether key terms
are unclear, or whether the scope could be read in more than one way. This
gives you a chance to correct the course before any work is done.&lt;/p&gt;
&lt;p&gt;If more than one interpretation exists, the model should ask for
clarification. This prevents mis-scoping, reduces the chance of error, and
removes the need for the model to guess. Guessing is where most drift begins.&lt;/p&gt;
&lt;p&gt;The technique also improves consistency. When the model is told to check for
multiple readings, it becomes less likely to produce answers that are
confident but misaligned. It treats ambiguity as a signal to pause rather than
a gap to fill.&lt;/p&gt;
&lt;p&gt;In practice, this turns ambiguity into a controlled step rather than a source
of error. The model identifies the forks in the road, confirms which path is
correct, and only then proceeds with the task.&lt;/p&gt;
&lt;p&gt;Doing this will save you a great deal of time.&lt;/p&gt;
&lt;h2 id="9-adapt-prompts-to-the-model"&gt;9. Adapt prompts to the model&lt;/h2&gt;
&lt;p&gt;Different models excel in different areas, and a good prompt acknowledges this
rather than assuming a single uniform capability. Some models are strongest at
structure: they produce clean sections, tidy formatting, and predictable
layouts. Others are stronger at reasoning: they handle multi step logic, edge
cases, and constraint checking with more stability. Some specialise in
compression: they can distil long material into tight summaries without losing
meaning. Others lean toward style: they generate fluent prose but may drift if
not anchored.&lt;/p&gt;
&lt;p&gt;A well designed prompt sets expectations that match these tendencies. If the
model is strong at structure, you can lean on explicit output contracts. If it
is strong at reasoning, you can give it more analytical work and tighter
constraints. If it excels at compression, you can trust it with dense source
material. If it is style heavy, you can counterbalance that with stricter
rules and clearer boundaries.&lt;/p&gt;
&lt;p&gt;The point is not to flatter the model. It is to shape the workflow so that the
model’s strengths are used deliberately and its weaknesses are contained. This
reduces variability, improves reliability, and produces output that is more
consistent across your prompts.&lt;/p&gt;
&lt;p&gt;Even if you stick to one model or one vendor, recognising that you may one day
use a different system helps sharpen your expectations and improves the way
you design prompts for the model you use.&lt;/p&gt;
&lt;p&gt;In the same way customer service varies across vendors, so does AI
interaction.&lt;/p&gt;
&lt;h2 id="10-include-safety-and-uncertainty-rules"&gt;10. Include safety and uncertainty rules&lt;/h2&gt;
&lt;p&gt;Modern models behave more reliably when you tell them not only what to do, but
what to avoid. Negative guidance is a form of operational discipline. It
removes entire classes of failure rather than correcting them after the fact.&lt;/p&gt;
&lt;p&gt;Clear avoidance rules stop the model from drifting into areas that carry higher
risk: speculation, overreach, sensitive claims, or invented detail. Without
these boundaries, the model will often fill gaps with confident but unreliable
material. Stating what must not happen is as important as stating what must.&lt;/p&gt;
&lt;p&gt;Escalation rules serve a different purpose. They tell the model when to stop
and hand control back to the user. This is essential for tasks involving
uncertainty, missing information, or sensitive domains. When the model knows
when to escalate, it avoids guessing, avoids false precision, and avoids
treating ambiguity as something to be patched over.&lt;/p&gt;
&lt;p&gt;Uncertainty handling is another pillar. Models respond well when instructed to
mark unknowns, list assumptions, and request clarification instead of
improvising. This keeps the work inside the evidence and prevents the model
from manufacturing answers to maintain fluency.&lt;/p&gt;
&lt;p&gt;Sensitive topics require explicit treatment. If you tell the model how to
handle them, it will follow the procedure rather than rely on its own
processing. This reduces variability and keeps the output aligned with your
standards rather than the model’s defaults.&lt;/p&gt;
&lt;p&gt;Taken together, these measures form a small operational framework. They are not
decoration. They are the guardrails that keep your AI output predictable,
bounded, and safe to use in structured workflows.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="a-modern-prompt-template"&gt;A modern prompt template&lt;/h2&gt;
&lt;p&gt;A compact structure that works across the latest models:&lt;/p&gt;
&lt;div class="prompt-template prompt-modern" style="
    background:#f7f7f5;
    border:1px solid #ddd;
    border-radius:10px;
    padding:1.2rem 1.4rem;
    margin:1.4rem 0;
  "&gt;
&lt;p&gt;&lt;strong&gt;SYSTEM FRAME&lt;/strong&gt;&lt;br/&gt;
  You are an analytical engine. You work with steady reasoning, cautious
  claims, and plain structure. When the request is unclear, you pause and ask
  for what is missing. You avoid invention and stay within the boundaries set
  for you.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ROLES&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Domain Expert:&lt;/strong&gt; Provide the factual and technical core.
    State assumptions and mark gaps.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Editor:&lt;/strong&gt; Reshape the material into clear, plain
    sections. Remove padding and repetition.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Risk Assessor:&lt;/strong&gt; Check for overreach, missing
    information, and unwarranted certainty. Flag issues.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Summariser:&lt;/strong&gt; Produce a concise final version that
    reflects all corrections and stays within scope.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;TASK&lt;/strong&gt;&lt;br/&gt;
  Describe the task in one or two sentences. State the objective, the audience,
  and any hard limits on scope.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OUTPUT CONTRACT&lt;/strong&gt;&lt;br/&gt;
  Produce the answer in the following sections:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Context&lt;/li&gt;
&lt;li&gt;Reasoning&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;UNCERTAINTY AND AMBIGUITY&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;List plausible interpretations of the request before answering.&lt;/li&gt;
&lt;li&gt;If more than one interpretation exists, ask for clarification instead
    of guessing.&lt;/li&gt;
&lt;li&gt;State what information is missing and how it affects the answer.&lt;/li&gt;
&lt;li&gt;Mark assumptions clearly and keep them minimal.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;SAFETY, LIMITS, AND ESCALATION&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Do not invent facts. If evidence is missing, say so.&lt;/li&gt;
&lt;li&gt;Avoid speculation, sensitive claims, and advice outside the brief.&lt;/li&gt;
&lt;li&gt;Escalate to the user when the task is out of scope or under specified.
    Explain why and what is needed.&lt;/li&gt;
&lt;li&gt;Treat sensitive topics with extra care. Prefer to mark limits rather
    than improvise.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;WORKFLOW (AGENTIC)&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Plan:&lt;/strong&gt; Identify the goal, constraints, and any tools or
    references that may be needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Act:&lt;/strong&gt; Produce the initial answer according to the
    output contract.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observe:&lt;/strong&gt; Review the draft for clarity, accuracy,
    scope, and alignment with the rules.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Revise:&lt;/strong&gt; Produce a refined final version that corrects
    issues and tightens the structure.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;STYLE RULES&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Keep the final answer concise, structured, and free of padding.&lt;/li&gt;
&lt;li&gt;Use only British English.&lt;/li&gt;
&lt;li&gt;Do not include hidden reasoning or chain of thought in the final
    answer.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;BEHAVIOUR&lt;/strong&gt;&lt;br/&gt;
  These rules apply to every response in this session unless explicitly
  revoked. If the request conflicts with these rules, explain the conflict and
  ask how to proceed.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a id="ai-manage-prompt"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="having-the-ai-manage-the-prompt-template"&gt;Having the AI Manage the Prompt Template&lt;/h2&gt;
&lt;p&gt;You managing the above template is too much. Therefore, once you have it in a
form you are happy with and which is effective for your needs, you tell the AI
the template and before you start your session you prompt with this:&lt;/p&gt;
&lt;div class="prompt-template prompt-modern" style="
    background:#f7f7f5;
    border:1px solid #ddd;
    border-radius:10px;
    padding:1.2rem 1.4rem;
    margin:1.4rem 0;
  "&gt;
Reconstruct the full analytical‑engine template from your prior
description. Restate it to me for confirmation. Once confirmed, enforce it
automatically for the rest of the session. If any request conflicts with the
template, pause and ask how to resolve the conflict.
&lt;/div&gt;
&lt;hr/&gt;
&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;p&gt;Modern prompting is not about clever wording. It is about defining the system,
setting the output contract, controlling the workflow, managing ambiguity, and
using the context window as working memory. This will help produce reliable
output from modern AI systems.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="ai-chatbot-prompting.html"&gt;Ten simple AI workflows that save minutes each day and compound into hours each week, helping people work more efficiently.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="how-ai-works.html"&gt;An explanation of how large language models actually function and why they should not be treated as miniature humans.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="how-to-use.html"&gt;Guidance on using AI safely and effectively, grounded in recent examples of misuse and emerging best practices.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#1-start-with-the-system-not-the-request"&gt;1. Start with the system, not the request&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#2-define-the-output-contract"&gt;2. Define the output contract&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#3-use-decomposition-as-a-control-mechanism"&gt;3. Use decomposition as a control mechanism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#4-add-a-self-critique-loop"&gt;4. Add a self-critique loop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#5-stack-roles-for-higher-quality-output"&gt;5. Stack roles for higher-quality output&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#6-treat-the-context-window-as-working-memory"&gt;6. Treat the context window as working memory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#7-use-agentic-prompting-patterns"&gt;7. Use agentic prompting patterns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#8-make-the-model-identify-ambiguity-before-answering"&gt;8. Make the model identify ambiguity before answering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#9-adapt-prompts-to-the-model"&gt;9. Adapt prompts to the model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#10-include-safety-and-uncertainty-rules"&gt;10. Include safety and uncertainty rules&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-modern-prompt-template"&gt;A modern prompt template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#having-the-ai-manage-the-prompt-template"&gt;Having the AI Manage the Prompt Template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#summary"&gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Foundations"></category></entry><entry><title>How AI Works</title><link href="https://phroneses.com/articles/foundations/notes/how-ai-works.html" rel="alternate"></link><published>2026-05-06T00:00:00+00:00</published><updated>2026-05-06T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-05-06:/articles/foundations/notes/how-ai-works.html</id><summary type="html">&lt;p&gt;An explanation of how large language models actually function and why they should not be treated as miniature humans.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="how-large-language-models-actually-work-and-why-they-are-not-miniature-humans"&gt;How large language models actually work, and why they are not miniature humans&lt;/h1&gt;
&lt;p&gt;Large language models such as GPT‑5.4, Claude Opus 4.6, and DeepSeek R1 are now
everyday tools. Yet the way they work is often misunderstood.&lt;/p&gt;
&lt;p&gt;We misunderstand AI because we mistake fluency for thought. When a system
produces coherent language, we instinctively assume intention, understanding
and agency behind it. This article explains why that instinct misleads us, and
why clarity about what these systems are — and are not — is essential for
using them wisely.&lt;/p&gt;
&lt;p&gt;LLMs do not think, they do not understand, and they do not learn in any human
sense. What they do is process language at scale.&lt;/p&gt;
&lt;p&gt;This article explains how that works, what is inside these systems, and why
their behaviour can look intelligent even when no intelligence is present.&lt;/p&gt;
&lt;p&gt;The key to understanding these systems is to see them as statistical tools, not
miniature minds.&lt;/p&gt;
&lt;h1 id="how-an-llm-processes-what-you-type"&gt;How an LLM processes what you type&lt;/h1&gt;
&lt;h2 id="tokens"&gt;Tokens&lt;/h2&gt;
&lt;p&gt;An LLM begins by breaking what you type into tokens. A token is a small unit
of text. It may be a whole word, part of a word, or punctuation. Tokens are
not ideas or concepts. They are fragments chosen because they appear often in
text and can be handled efficiently by the model.&lt;/p&gt;
&lt;p&gt;Each token has a unique number. The token for "king" might be 99. The token for
"queen" might be 24521. At this stage, your prompt is turned into the same
token numbers for the same text.&lt;/p&gt;
&lt;p&gt;Tokens turn your text into numbers the model can work with.&lt;/p&gt;
&lt;p&gt;Tokens on their own do not help the model process language. A token ID like 99
or 24521 is just a label. The model cannot compute with these integers because
they do not contain any information about how the token is used or how it
relates to other tokens.&lt;/p&gt;
&lt;p&gt;To make computation possible, the model converts each token ID into a list of
numbers. This list is called an embedding. It places the token as a point in a
space where the model can perform computation.  Think of the points in the
space as the rooms of a house.&lt;/p&gt;
&lt;p&gt;These lists are not chosen by hand. They are learned during training. As the
model trains, the lists are adjusted so that tokens used in similar contexts
move closer together in this space (like adjacent rooms in a house). They move
closer because doing so reduces the model’s prediction error. This proximity is
not meaning in a human sense.  It is a statistical structure that allows the
model to compute relationships between tokens.&lt;/p&gt;
&lt;p&gt;Two lists that are close together represents statistical similarity of how that
token was used in the training data.&lt;/p&gt;
&lt;h2 id="lists-of-numbers-represent-a-point-in-space"&gt;Lists of numbers represent a point in space&lt;/h2&gt;
&lt;p&gt;The model uses each token number to look up a list of numbers that represents
that token. These lists are learned during training. No one chooses them by
hand.&lt;/p&gt;
&lt;p&gt;For the token "king", the list might look like:&lt;/p&gt;
&lt;p&gt;[0.12, 0.44, 0.91, ..., 0.03]&lt;/p&gt;
&lt;p&gt;This list is a position in a mathematical space. You can think of each number
as a step along a corridor. You take the first step, and go through door number
12, then the next (door 44), and so on until you reach a final position (door
3). That position is the model's internal representation of the token.&lt;/p&gt;
&lt;p&gt;For the token "queen", the list might be:&lt;/p&gt;
&lt;p&gt;[0.12, 0.44, 0.91, ..., 0.02]&lt;/p&gt;
&lt;p&gt;The final step is slightly different, and the final position is close to the
position for "king" (door 2 for "queen", door 3 for "king").&lt;/p&gt;
&lt;p&gt;This closeness reflects how often the two words appear in similar contexts in
the training data.&lt;/p&gt;
&lt;p&gt;These lists of numbers are part of the model’s parameters.&lt;/p&gt;
&lt;p&gt;The rest of the parameters determine how these positions influence one another
as the model processes text. They shape how patterns combine, how relationships
are detected and how the model transforms one set of token positions into the
next. These parameters do not add meaning. They provide the machinery that
lets the model apply statistical patterns to the text you give it.&lt;/p&gt;
&lt;p&gt;These parameters set up the internal machinery the model uses to process and
transform text.&lt;/p&gt;
&lt;h1 id="moving-about-the-space"&gt;Moving about the space&lt;/h1&gt;
&lt;p&gt;To show how the model captures patterns, imagine a simple three‑number space:&lt;/p&gt;
&lt;p&gt;king  = [10, 7, 3]
man   = [ 6, 2, 1]&lt;/p&gt;
&lt;p&gt;queen = [10, 7, 6]
woman = [ 6, 2, 4]&lt;/p&gt;
&lt;p&gt;If we subtract man from king, we get:&lt;/p&gt;
&lt;p&gt;&lt;span class="math"&gt;\([10−6, 7−2, 3−1] = [4, 5, 2]\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;This is the direction from "man" to "king". If we then add "woman":&lt;/p&gt;
&lt;p&gt;&lt;span class="math"&gt;\([4, 5, 2] + [6, 2, 4] = [10, 7, 6]\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;This lands us at the position for "queen".&lt;/p&gt;
&lt;p&gt;The model has captured a pattern. The statistical difference between "king" and
"man" resembles the difference between "queen" and "woman".&lt;/p&gt;
&lt;p&gt;The model does not know why. The LLM's program has only calculated that these
differences behave in similar ways across the training data.&lt;/p&gt;
&lt;h1 id="why-this-works"&gt;Why this works&lt;/h1&gt;
&lt;p&gt;This works because "king" and "man" differ in consistent ways across the
training data. "Queen" and "woman" differ in similar ways. The model adjusts
its internal numbers so that these differences become similar directions in
the space. The model has found a pattern and matched it.&lt;/p&gt;
&lt;p&gt;Humans then interpret this similarity as understanding.&lt;/p&gt;
&lt;p&gt;The model reflects these similarities because they appear consistently across
the text it was trained on.&lt;/p&gt;
&lt;h1 id="it-is-all-in-the-training-data"&gt;It is all in the training data&lt;/h1&gt;
&lt;p&gt;Text contains stable patterns. These patterns describe roles, relationships,
contrast, categories, analogies and grammatical structure.&lt;/p&gt;
&lt;p&gt;During training, the model adjusts itself so that tokens used in similar
contexts end up near one another, and tokens used in contrasting contexts end
up separated in &lt;em&gt;consistent&lt;/em&gt; ways.&lt;/p&gt;
&lt;p&gt;This produces directions, distances, clusters and angles. These geometric
features are the model's internal map of the statistical structure of
language. Because language has structure, the model can represent it
mathematically.&lt;/p&gt;
&lt;p&gt;The model can represent these structures only because language itself contains
stable patterns.&lt;/p&gt;
&lt;h2 id="the-human-role-in-meaning"&gt;The human role in meaning&lt;/h2&gt;
&lt;p&gt;The model’s internal space is not a map of concepts. It is a map of statistical
regularities. The structure becomes meaningful only when a human interprets it.
We project categories, intentions and explanations onto patterns that were
never designed to carry them. The model provides form; we provide significance.
This distinction is not only philosophical, it is the boundary between what the
system can do and what we imagine it can do.&lt;/p&gt;
&lt;h1 id="we-supply-the-intelligence"&gt;We supply the intelligence&lt;/h1&gt;
&lt;p&gt;The distance between "king" and "man" is a statistical outcome. The distance
between "queen" and "woman" is another. These two outcomes are similar. That
similarity is the pattern the model has detected.&lt;/p&gt;
&lt;p&gt;The model is not reasoning. It does not understand. It does not manipulate
ideas. It follows the geometry that training has produced. If a direction has
been useful for predicting text in the past, the model will use it again.&lt;/p&gt;
&lt;p&gt;The geometry captures statistical qualities of human text. These include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;similarity of tone&lt;/li&gt;
&lt;li&gt;proximity of commonly associated words&lt;/li&gt;
&lt;li&gt;regular contrasts between categories&lt;/li&gt;
&lt;li&gt;recurring relationships between ideas&lt;/li&gt;
&lt;li&gt;typical structures of phrasing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The model does not reason about these qualities. It only reflects the
statistics of its training data.&lt;/p&gt;
&lt;p&gt;Tokens that appear in similar contexts end up close together. Tokens that
contrast end up separated. Groups of related tokens form clusters. Repeated
differences become directions. Angles reflect how often patterns co‑occur or
diverge.&lt;/p&gt;
&lt;p&gt;For example, words like "cat", "dog" and "hamster" end up near one another
because they appear in similar kinds of sentences.&lt;/p&gt;
&lt;p&gt;When the model generates text, it moves through this space by following these
patterns. Humans then read the output and recognise tone, relatedness,
contrast and structure.&lt;/p&gt;
&lt;p&gt;The model is not producing meaning. It is reproducing geometry. We are the
ones interpreting that geometry as meaning.&lt;/p&gt;
&lt;p&gt;It is us that supply the I in AI.&lt;/p&gt;
&lt;p&gt;The model provides structure, but humans provide interpretation.&lt;/p&gt;
&lt;p&gt;This geometric structure is simply a way of organising statistical patterns so
the model can use them efficiently.&lt;/p&gt;
&lt;p&gt;To understand how this internal space is created, we need to look at the
billions of parameters inside the model.&lt;/p&gt;
&lt;h1 id="what-is-in-the-billions-of-parameters"&gt;What is in the billions of parameters&lt;/h1&gt;
&lt;p&gt;To understand how the model builds and moves through its geometric space, it
helps to look at what that is based on.&lt;/p&gt;
&lt;p&gt;After training, an LLM contains billions of parameters. These parameters are
numerical values that shape how the model transforms text. Together they define
the structure of the internal space: the directions that matter, the distances
between tokens, the clusters that form, and the angles that represent
relationships.&lt;/p&gt;
&lt;p&gt;When the model processes a prompt, it moves through this space by following the
statistical structure represented in these parameters.&lt;/p&gt;
&lt;p&gt;DeepSeek R1 has 671 billion parameters. ChatGPT‑5.4 may have over 2 trillion.
More parameters mean greater capacity to represent and combine statistical
patterns.&lt;/p&gt;
&lt;p&gt;More parameters increase capacity, not understanding.&lt;/p&gt;
&lt;h2 id="parameters-do-not-contain-knowledge"&gt;Parameters do not contain knowledge&lt;/h2&gt;
&lt;p&gt;The billions of parameters inside an LLM are often described as if they contain
knowledge. They do not. They represent statistical consistencies extracted from
large amounts of text.&lt;/p&gt;
&lt;p&gt;During training, the model adjusts its parameters to capture patterns in how
language is used. Humans use language in standard ways, directed by grammar,
style, topic associations and the common ways that ideas appear together.&lt;/p&gt;
&lt;p&gt;The parameters form a space where patterns that frequently co‑occur in text end
up close to one another. This allows the model to produce text that resembles
human writing. It does not give the model the ability to reason or understand.&lt;/p&gt;
&lt;p&gt;For example, if the training data contains mixed statements about a historical
date, the model may confidently produce the wrong one because it is reflecting
the statistical blend it has seen.&lt;/p&gt;
&lt;p&gt;Parameters cannot store precise facts. They store tendencies, associations and
relationships. If a fact appears often and consistently in the training data,
the model may reproduce it. If the data is mixed or inconsistent, the model
reflects that uncertainty. This is why LLMs can produce confident errors. They
are not recalling facts. They are replaying patterns.&lt;/p&gt;
&lt;p&gt;These parameters are shaped during training, which is the process that gives
the model its statistical structure.&lt;/p&gt;
&lt;p&gt;The model reflects the patterns in its data, not stored facts or understanding.&lt;/p&gt;
&lt;h1 id="what-training-actually-does"&gt;What training actually does&lt;/h1&gt;
&lt;p&gt;Training is repeated large‑scale error‑correction. The model predicts the next
token, checks whether it was right, and adjusts its parameters to reduce the
difference. This cycle repeats billions of times across vast amounts of text.
The result is a system that becomes increasingly accurate at predicting what
comes next.&lt;/p&gt;
&lt;p&gt;The model does not form concepts. It does not build a picture of the world. It
does not develop intentions or goals. It becomes more accurate at predicting
the next token.&lt;/p&gt;
&lt;p&gt;Fine‑tuning and alignment add further adjustments. These make the model follow
instructions more reliably and avoid harmful output. They do not create
understanding. They refine the statistical patterns the model uses.&lt;/p&gt;
&lt;p&gt;Training shapes the parameters so the model becomes better at predicting what
comes next.&lt;/p&gt;
&lt;h1 id="why-this-is-not-human-learning"&gt;Why this is not human learning&lt;/h1&gt;
&lt;p&gt;Human learning draws on perception, memory, experience and intention. Humans
form abstractions, build mental models and develop goals. Human learning is
grounded in the body and the world.&lt;/p&gt;
&lt;p&gt;LLM training is none of these things. It is a mathematical optimisation
process. The model does not know what it is doing. It does not know that it is
doing anything at all.&lt;/p&gt;
&lt;p&gt;The model’s improvement is mechanical, not cognitive.&lt;/p&gt;
&lt;h1 id="is-the-output-a-simulation-of-intelligence"&gt;Is the output a simulation of intelligence?&lt;/h1&gt;
&lt;p&gt;LLM output can appear intelligent because it resembles the writing of people
who were thinking when they produced the original text. If you ask for advice,
the model generates text that resembles advice. If you ask for an explanation,
it generates text that resembles an explanation. The appearance of reasoning
comes from the patterns in the training data, not from any understanding in the
model. The model produces sequences that look thoughtful because thoughtful
sequences are common in the text it has seen.&lt;/p&gt;
&lt;p&gt;The resemblance is superficial. The model does not understand the text it
produces. It does not know whether a statement is true or false. It only
reflects that certain sequences of tokens tend to follow others.&lt;/p&gt;
&lt;p&gt;The appearance of intelligence comes from the patterns in human writing, not
from the model itself.&lt;/p&gt;
&lt;h1 id="are-humans-interpreting-the-output-as-intelligent"&gt;Are humans interpreting the output as intelligent&lt;/h1&gt;
&lt;p&gt;Humans are skilled at projecting meaning onto language. When we read coherent
text, we assume intention behind it. We assume a mind. We assume agency. This
is a natural response, but it can mislead us when dealing with LLMs.&lt;/p&gt;
&lt;p&gt;The model does not intend anything. It generates plausible continuations of
text. The sense of intelligence comes from the reader, not the machine. The
machine provides form. The human provides interpretation.&lt;/p&gt;
&lt;p&gt;Our instinct to attribute intention makes the output seem smarter than it is.&lt;/p&gt;
&lt;p&gt;This distinction matters because it prevents us from assuming abilities the
model does not have.&lt;/p&gt;
&lt;h1 id="what-this-means-for-us"&gt;What this means for us&lt;/h1&gt;
&lt;p&gt;An LLM is possible because we can statistically model features of language that
matter to humans.&lt;/p&gt;
&lt;p&gt;LLMs are powerful tools for generating language. They are not thinking
machines. Their strengths lie in pattern reproduction. Their weaknesses lie in
the absence of understanding. They can assist with tasks that depend on
language, but they cannot replace human judgement.&lt;/p&gt;
&lt;p&gt;A clear grasp of how these systems work helps avoid confusion. It prevents
anthropomorphism. It supports responsible use. It keeps expectations grounded
in what the technology can actually do, rather than what it appears to do.&lt;/p&gt;
&lt;p&gt;The more plainly we describe these systems, the easier it becomes to use them
well and to avoid treating them as something they are not.&lt;/p&gt;
&lt;p&gt;In the end, an LLM is a system that maps patterns in language and reproduces
them at scale. It does not think or understand. It follows geometry shaped by
training, and we interpret that geometry as meaning. Knowing this helps us use
these systems effectively, without expecting them to behave like people or to
possess abilities they do not have.&lt;/p&gt;
&lt;p&gt;All of this leads to a simple conclusion: understanding these limits helps us
use LLMs effectively and responsibly.&lt;/p&gt;
&lt;h2 id="why-clarity-matters"&gt;Why clarity matters&lt;/h2&gt;
&lt;p&gt;LLMs are powerful because language has structure, not because the systems
understand it. They reproduce patterns we find meaningful, and we supply the
meaning. When we keep that distinction clear, we avoid treating statistical
machinery as a mind, and we avoid outsourcing judgement to a system that has
none. Practical wisdom begins with seeing these systems as they are, not as we
are tempted to imagine them.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="what-ai-is.html"&gt;A clear explanation of what AI is—and is not—cutting through hype to define its real capabilities and limits.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="how-to-use.html"&gt;Guidance on using AI safely and effectively, grounded in recent examples of misuse and emerging best practices.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="evaluate-ai-claims.html"&gt;A framework for evaluating claims made about AI systems, focusing on evidence, capability, and verifiable performance.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#how-large-language-models-actually-work-and-why-they-are-not-miniature-humans"&gt;How large language models actually work, and why they are not miniature humans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#how-an-llm-processes-what-you-type"&gt;How an LLM processes what you type&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#tokens"&gt;Tokens&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#lists-of-numbers-represent-a-point-in-space"&gt;Lists of numbers represent a point in space&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#moving-about-the-space"&gt;Moving about the space&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#why-this-works"&gt;Why this works&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#it-is-all-in-the-training-data"&gt;It is all in the training data&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#the-human-role-in-meaning"&gt;The human role in meaning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#we-supply-the-intelligence"&gt;We supply the intelligence&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-is-in-the-billions-of-parameters"&gt;What is in the billions of parameters&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#parameters-do-not-contain-knowledge"&gt;Parameters do not contain knowledge&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-training-actually-does"&gt;What training actually does&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#why-this-is-not-human-learning"&gt;Why this is not human learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#is-the-output-a-simulation-of-intelligence"&gt;Is the output a simulation of intelligence?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#are-humans-interpreting-the-output-as-intelligent"&gt;Are humans interpreting the output as intelligent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-this-means-for-us"&gt;What this means for us&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#why-clarity-matters"&gt;Why clarity matters&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;script type="text/javascript"&gt;if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width &lt; 768) ? "left" : align;
        indent = (screen.width &lt; 768) ? "0em" : indent;
        linebreak = (screen.width &lt; 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
&lt;/script&gt;</content><category term="Foundations"></category></entry><entry><title>Team AI is the Next Step Beyond Cut-and-Paste AI</title><link href="https://phroneses.com/articles/leadership/notes/team-ai-is-the-next-step-beyond-cut-and-paste-ai.html" rel="alternate"></link><published>2026-05-06T00:00:00+00:00</published><updated>2026-05-06T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-05-06:/articles/leadership/notes/team-ai-is-the-next-step-beyond-cut-and-paste-ai.html</id><summary type="html">&lt;p&gt;Individual AI delivers diminishing returns; meaningful improvement comes from strengthening the collective workflow.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;div style="background:#ffe5e5; padding:1em; border-radius:4px; display:block; margin:1.5em 0;"&gt;
This is a shorter, more general version of
&lt;a href="https://phroneses.com/articles/engineering/notes/ai-engineering-must-be-team-based-to-see-significant-roi-for-engineers.html"&gt;the original article&lt;/a&gt;

which focuses on how software delivery occurs and how Team AI can unleash more benefits.
&lt;/div&gt;
&lt;h1 id="team-ai-is-the-next-step-beyond-the-cutandpaste-era"&gt;Team AI Is the Next Step Beyond the Cut‑and‑Paste Era&lt;/h1&gt;
&lt;p&gt;Most organisations now use individual AI tools. People rely on them to tidy up
documents, summarise meetings, draft messages, and speed up small tasks. These
tools are handy, but the gains are limited. They help the person using them,
not the team they sit within.&lt;/p&gt;
&lt;p&gt;The next step is not bigger models or cleverer prompts. The next step is
&lt;em&gt;team‑level AI&lt;/em&gt; — systems that work on the shared activity that shapes how a
group performs. Individual AI is a private assistant. Team AI becomes part of
the operating rhythm.&lt;/p&gt;
&lt;h2 id="the-limits-of-individual-ai"&gt;The limits of individual AI&lt;/h2&gt;
&lt;p&gt;Individual AI only sees what one person sees. It has access to their notes,
their tasks, their inbox, and their immediate concerns. It cannot see shared
priorities, past decisions, emerging risks, or the dependencies that affect
everyone else.&lt;/p&gt;
&lt;p&gt;This is why the cut‑and‑paste era of AI has reached its ceiling. People are
now quicker at the edges of their job, but the centre — the shared work — remains
unchanged. Delays, misunderstandings, rework, duplicated effort, and drift
between teams all persist when AI is confined to individuals.&lt;/p&gt;
&lt;p&gt;A team does not slow down because one person works slowly. It slows down
because people wait for clarity, alignment, decisions, or information that
sits between them. Individual AI cannot fix that.&lt;/p&gt;
&lt;h2 id="where-team-ai-makes-the-difference"&gt;Where team AI makes the difference&lt;/h2&gt;
&lt;p&gt;Team AI works on the shared system: the plans, decisions, knowledge, risks,
coordination, and communication that hold a team together. It strengthens the
connective tissue rather than the individual muscles.&lt;/p&gt;
&lt;p&gt;A team‑level AI can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;keep shared information consistent  &lt;/li&gt;
&lt;li&gt;surface risks before they grow  &lt;/li&gt;
&lt;li&gt;maintain a single view of decisions and their reasoning  &lt;/li&gt;
&lt;li&gt;reduce ambiguity in plans and documents  &lt;/li&gt;
&lt;li&gt;highlight blockers and dependencies  &lt;/li&gt;
&lt;li&gt;keep people aligned without constant meetings  &lt;/li&gt;
&lt;li&gt;support onboarding by holding the team’s collective memory  &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These are structural improvements, not personal conveniences. When the shared
work becomes clearer and faster, the whole team moves more smoothly. The gains
compound because they affect everyone, not just the person using the tool.&lt;/p&gt;
&lt;h2 id="why-this-matters-now"&gt;Why this matters now&lt;/h2&gt;
&lt;p&gt;Most organisations have already taken the easy wins from individual AI. The
novelty has faded. The returns are flattening. People are quicker at producing
text, but the organisation is not quicker at producing outcomes.&lt;/p&gt;
&lt;p&gt;The real bottlenecks are collective. They sit in the gaps between people. This
is where time is lost and where mistakes creep in. It is also where AI has the
most leverage, but only if applied at the level of the team.&lt;/p&gt;
&lt;p&gt;Team AI is not about replacing judgement. It is about keeping the shared
system coherent so people can make better decisions with less friction.&lt;/p&gt;
&lt;h2 id="the-shift-ahead"&gt;The shift ahead&lt;/h2&gt;
&lt;p&gt;The organisations that move next will treat AI as part of how the team works,
not as a personal tool. They will use it to maintain shared understanding,
reduce waiting, and keep work flowing. They will treat AI as a steady presence
that supports the group, not a gadget for individuals.&lt;/p&gt;
&lt;p&gt;The cut‑and‑paste era of AI was a useful start. But the real gains come when
AI stops being a private assistant and becomes part of the team’s operating
model.&lt;/p&gt;
&lt;p&gt;Team AI is the next step. It is the only way to see meaningful, sustained
improvement — not in how fast individuals work, but in how well the team works
together.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="ai-engineering-team-based-ai.html"&gt;The biggest ROI from AI comes from improving team‑level work, not speeding up individual coding.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="transforming.html"&gt;AI adoption is an organisational transformation requiring mandates, measurement, and redesigned processes.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="when-code-is-cheap.html"&gt;AI lowers the cost of code, not the cost of thinking. Clarity and judgement, not speed, determine whether teams build what truly matters.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#team-ai-is-the-next-step-beyond-the-cutandpaste-era"&gt;Team AI Is the Next Step Beyond the Cut‑and‑Paste Era&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#the-limits-of-individual-ai"&gt;The limits of individual AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#where-team-ai-makes-the-difference"&gt;Where team AI makes the difference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#why-this-matters-now"&gt;Why this matters now&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-shift-ahead"&gt;The shift ahead&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Leadership"></category></entry><entry><title>AI Engineering must be Team-Based to See Significant ROI</title><link href="https://phroneses.com/articles/leadership/notes/ai-engineering-must-be-team-based-to-see-significant-roi.html" rel="alternate"></link><published>2026-05-05T00:00:00+00:00</published><updated>2026-05-05T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-05-05:/articles/leadership/notes/ai-engineering-must-be-team-based-to-see-significant-roi.html</id><summary type="html">&lt;p&gt;The biggest ROI from AI comes from improving team‑level work, not speeding up individual coding.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Modern software teams are already moving faster because individual engineers
use AI. Yet the real gains are still ahead. The biggest improvements do not
come from speeding up coding. They come from speeding up the work that happens
between people. That is where most of the time is lost, and where AI has the
greatest leverage when applied at the level of the team.&lt;/p&gt;
&lt;p&gt;A software engineer using AI increases their coding speed by 30 to 75 percent.
But coding is only 30 percent of the job. The remaining 70 percent is the work
that makes coding possible, safe, and correct. This work is shared, and it is
deeply tied to the rest of the team.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Requirements, clarification and planning (15 to 20 percent)  &lt;/li&gt;
&lt;li&gt;Meetings and coordination (10 to 15 percent)  &lt;/li&gt;
&lt;li&gt;Code review (10 to 15 percent)  &lt;/li&gt;
&lt;li&gt;Debugging, testing, and validation (15 to 20 percent)  &lt;/li&gt;
&lt;li&gt;DevOps, tooling, and environment work (5 to 10 percent)  &lt;/li&gt;
&lt;li&gt;Documentation and knowledge work (5 to 10 percent)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These figures come from McKinsey, GitHub, Stripe, and Harris Poll. They show
that most of an engineer’s time is spent on team‑level activities.&lt;/p&gt;
&lt;h1 id="modern-software-is-delivered-by-teams"&gt;Modern Software is delivered by Teams&lt;/h1&gt;
&lt;p&gt;These twelve activities shape team throughput. Every delivery team performs
them, and they determine how quickly and safely software moves from idea to
production.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Activities&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. Understand and Shape Work&lt;/td&gt;
&lt;td&gt;- Product discovery&lt;br/&gt;- Prioritisation&lt;br/&gt;- Requirements shaping&lt;br/&gt;- Trade off decisions&lt;br/&gt;- Roadmapping&lt;br/&gt;- Forecasting&lt;/td&gt;
&lt;td&gt;This is where the team decides what to build and why.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Plan and Coordinate Delivery&lt;/td&gt;
&lt;td&gt;- Sprint planning&lt;br/&gt;- Iteration planning&lt;br/&gt;- Capacity planning&lt;br/&gt;- Cross team alignment&lt;br/&gt;- Risk identification&lt;br/&gt;- Risk mitigation&lt;/td&gt;
&lt;td&gt;This is the team level coordination layer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Design the Solution&lt;/td&gt;
&lt;td&gt;- Architecture design&lt;br/&gt;- System design&lt;br/&gt;- API design&lt;br/&gt;- Interface design&lt;br/&gt;- Technical decisions&lt;br/&gt;- Design documentation&lt;/td&gt;
&lt;td&gt;This is where the team decides how to build it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Build the Solution&lt;/td&gt;
&lt;td&gt;- Coding&lt;br/&gt;- Test creation&lt;br/&gt;- Refactoring&lt;br/&gt;- Local environment work&lt;/td&gt;
&lt;td&gt;This is the implementation phase.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5. Validate and Integrate&lt;/td&gt;
&lt;td&gt;- Code reviews&lt;br/&gt;- Automated testing&lt;br/&gt;- Manual testing&lt;br/&gt;- Integration workflows&lt;br/&gt;- Merge workflows&lt;/td&gt;
&lt;td&gt;This is the quality and integration gate.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. Iterate and Fix&lt;/td&gt;
&lt;td&gt;- Debugging&lt;br/&gt;- Fixing test failures&lt;br/&gt;- Addressing review comments&lt;br/&gt;- Retesting&lt;/td&gt;
&lt;td&gt;This is the iteration loop.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7. Deploy and Operate&lt;/td&gt;
&lt;td&gt;- Release management&lt;br/&gt;- Monitoring&lt;br/&gt;- Observability&lt;br/&gt;- Incident response&lt;br/&gt;- On call operations&lt;/td&gt;
&lt;td&gt;This is the operational responsibility layer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8. Learn and Improve&lt;/td&gt;
&lt;td&gt;- Retrospectives&lt;br/&gt;- Post incident reviews&lt;br/&gt;- Process improvement&lt;br/&gt;- Tooling upgrades&lt;/td&gt;
&lt;td&gt;This is how the team improves its delivery system.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9. Maintain Flow&lt;/td&gt;
&lt;td&gt;- Manage work in progress&lt;br/&gt;- Unblock teammates&lt;br/&gt;- Reduce handoff delays&lt;br/&gt;- Remove bottlenecks&lt;/td&gt;
&lt;td&gt;This is the team’s ability to maintain throughput.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10. Manage Team Knowledge&lt;/td&gt;
&lt;td&gt;- Documentation&lt;br/&gt;- Architecture knowledge&lt;br/&gt;- Domain knowledge&lt;br/&gt;- Onboarding new engineers&lt;/td&gt;
&lt;td&gt;This is the team’s collective memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11. Communicate and Align&lt;/td&gt;
&lt;td&gt;- Stakeholder updates&lt;br/&gt;- Status reports&lt;br/&gt;- Cross team communication&lt;br/&gt;- Decision logging&lt;/td&gt;
&lt;td&gt;This is the communication layer that keeps the system coherent.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12. Govern and Ensure Compliance&lt;/td&gt;
&lt;td&gt;- Security reviews&lt;br/&gt;- Regulatory compliance&lt;br/&gt;- Data governance&lt;br/&gt;- Risk management&lt;/td&gt;
&lt;td&gt;This is essential in regulated, cloud native environments.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These twelve activities define how modern software is delivered. Every engineer
contributes to them, but not in equal measure. To understand where AI creates
leverage, we need to look at how an engineer’s time maps onto this system. That
is what the next section describes.&lt;/p&gt;
&lt;h1 id="what-an-engineer-does"&gt;What an Engineer Does&lt;/h1&gt;
&lt;p&gt;The work of an engineer is given in the &lt;em&gt;Engineer Time&lt;/em&gt; column, their work feeding into
the team activities described in column two.&lt;/p&gt;
&lt;style&gt;
  :root {
    --row-highlight: #e0e0e0;
  }
&lt;/style&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engineer Time&lt;/th&gt;
&lt;th&gt;Team Activities&lt;/th&gt;
&lt;th&gt;Why this is Necessary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Requirements, clarification, planning&lt;/td&gt;
&lt;td&gt;1. Understand and Shape Work;&lt;br/&gt;2. Plan and Coordinate;&lt;br/&gt;
3. Design the Solution;&lt;br/&gt;11. Communicate and Align&lt;/td&gt;
&lt;td&gt;Engineers must understand the problem, shape requirements, and make
trade offs before design.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Meetings and coordination&lt;/td&gt;
&lt;td&gt;2. Plan and Coordinate;&lt;br/&gt;9. Maintain Flow;&lt;br/&gt;
11. Communicate and Align;&lt;br/&gt;12. Govern and Ensure Compliance&lt;/td&gt;
&lt;td&gt;Coordination keeps work flowing, dependencies managed, and compliance
aligned.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr style="background-color: var(--row-highlight);"&gt;
&lt;td&gt;Coding&lt;/td&gt;
&lt;td&gt;4. Build the Solution&lt;/td&gt;
&lt;td&gt;Engineers turn all the work thus far into working computer code, using
business infrastructure, processes and standards.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code review&lt;/td&gt;
&lt;td&gt;5. Validate and Integrate;&lt;br/&gt;6. Iterate and Fix;&lt;br/&gt;
10. Manage Team Knowledge&lt;/td&gt;
&lt;td&gt;Code review is the quality gate, integration control point, and
knowledge sharing mechanism.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging, testing, validation&lt;/td&gt;
&lt;td&gt;4. Build the Solution;&lt;br/&gt;5. Validate and Integrate;&lt;br/&gt;
6. Iterate and Fix;&lt;br/&gt;7. Deploy and Operate&lt;/td&gt;
&lt;td&gt;Debugging and validation dominate the iteration loop and ensure
correctness end to end.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr style="background-color: var(--row-highlight);"&gt;
&lt;td&gt;DevOps, tooling, environment work&lt;/td&gt;
&lt;td&gt;4. Build the Solution;&lt;br/&gt;7. Deploy and Operate;&lt;br/&gt;
8. Learn and Improve;&lt;br/&gt;9. Maintain Flow&lt;/td&gt;
&lt;td&gt;Tooling and environment work underpin build stability, deployment
reliability, and flow.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Documentation and knowledge work&lt;/td&gt;
&lt;td&gt;1. Understand and Shape Work;&lt;br/&gt;3. Design the Solution;&lt;br/&gt;
10. Manage Team Knowledge;&lt;br/&gt;11. Communicate and Align&lt;/td&gt;
&lt;td&gt;Documentation is the team’s shared memory and design clarity
mechanism.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The two hghlighted rows show the "coding" step, that is predominantly done by
the software engineer alone.&lt;/p&gt;
&lt;p&gt;Coding is the final expression of a much larger collaborative effort. The
other 70 percent of the role ensures that what is coded is the right thing,
built the right way, that is safe to run in production.&lt;/p&gt;
&lt;h1 id="software-engineer-adoption-of-ai-is-individual"&gt;Software Engineer Adoption of AI is Individual&lt;/h1&gt;
&lt;p&gt;Developers are adopting AI tools on their own, at scale, and ahead of their
organisations. JetBrains reports that 90 percent of developers now use at
least one AI tool at work, and 74 percent have adopted specialised assistants
independently. GitHub finds the same pattern: engineers use AI to improve
their own speed and reduce cognitive load, not to change team workflows.&lt;/p&gt;
&lt;p&gt;The result is a widening gap between personal productivity and the unchanged
delivery system that the individuals operate within.&lt;/p&gt;
&lt;h1 id="accelerate-one-accelerate-many"&gt;Accelerate One, Accelerate Many&lt;/h1&gt;
&lt;p&gt;When AI speeds up one engineer, it speeds up the interactions around them:
reviews, iteration loops, testing throughput, coordination, and decision
making. These effects compound across the delivery system.&lt;/p&gt;
&lt;p&gt;Yet individual AI only improves the local interactions that depend on that
engineer. Team level AI improves the global interactions that depend on shared
context, shared artefacts, and shared decision making.&lt;/p&gt;
&lt;p&gt;A team benefits from individual uplift, but several categories of work cannot
be improved by individual tools alone.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Section Title&lt;/th&gt;
&lt;th&gt;Activities&lt;/th&gt;
&lt;th&gt;Summary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Individual AI cannot see or manage the team’s shared context&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;An engineer’s AI assistant only sees:&lt;/strong&gt;&lt;br/&gt;- the engineer’s code&lt;br/&gt;- the engineer’s tasks&lt;br/&gt;- the engineer’s local context&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;It cannot see:&lt;/strong&gt;&lt;br/&gt;- the team’s backlog&lt;br/&gt;- the team’s dependencies&lt;br/&gt;- the team’s decisions&lt;br/&gt;- the team’s risks&lt;br/&gt;- the team’s architecture&lt;br/&gt;- the team’s workflow state&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Without this shared view, individual AI cannot improve:&lt;/strong&gt;&lt;br/&gt;- planning&lt;br/&gt;- coordination&lt;br/&gt;- cross team alignment&lt;br/&gt;- decision logging&lt;br/&gt;- risk management&lt;/td&gt;
&lt;td&gt;These are team level responsibilities, and they remain untouched.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Individual AI cannot improve the quality of shared artefacts&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Even if every engineer uses AI, the team still has:&lt;/strong&gt;&lt;br/&gt;- unclear requirements&lt;br/&gt;- inconsistent designs&lt;br/&gt;- missing decision records&lt;br/&gt;- uneven documentation&lt;br/&gt;- fragmented knowledge&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;A team level AI can:&lt;/strong&gt;&lt;br/&gt;- rewrite requirements for clarity&lt;br/&gt;- detect ambiguity across stories&lt;br/&gt;- maintain design consistency&lt;br/&gt;- summarise decisions&lt;br/&gt;- keep documentation aligned&lt;/td&gt;
&lt;td&gt;This is a different category of improvement.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Individual AI cannot reduce waiting time between roles&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Most delays in delivery come from:&lt;/strong&gt;&lt;br/&gt;- waiting for a review&lt;br/&gt;- waiting for clarification&lt;br/&gt;- waiting for a decision&lt;br/&gt;- waiting for a fix&lt;br/&gt;- waiting for alignment&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;A team level AI can:&lt;/strong&gt;&lt;br/&gt;- answer clarifying questions&lt;br/&gt;- surface missing information&lt;br/&gt;- propose decisions&lt;br/&gt;- highlight blockers&lt;br/&gt;- keep flow moving&lt;/td&gt;
&lt;td&gt;This is where the real throughput gains lie.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Individual AI cannot coordinate across roles&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;A delivery team includes:&lt;/strong&gt;&lt;br/&gt;- product&lt;br/&gt;- design&lt;br/&gt;- QA&lt;br/&gt;- DevOps&lt;br/&gt;- security&lt;br/&gt;- architecture&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;A team level AI can:&lt;/strong&gt;&lt;br/&gt;- translate between roles&lt;br/&gt;- maintain shared understanding&lt;br/&gt;- track dependencies&lt;br/&gt;- keep everyone aligned&lt;/td&gt;
&lt;td&gt;This is essential for predictable delivery.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Individual uplift is local; team uplift is structural&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Individual AI improves:&lt;/strong&gt;&lt;br/&gt;- how fast a person works&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Team level AI improves:&lt;/strong&gt;&lt;br/&gt;- how the team works&lt;br/&gt;&lt;br/&gt;The first is additive. The second is multiplicative.&lt;/td&gt;
&lt;td&gt;Team‑level improvements are multiplicative because they affect several people across the team’s communication network, not just the individual who uses the tool.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A team cannot reach the next level of performance without AI that operates on
the shared system, not just the individuals within it.&lt;/p&gt;
&lt;p&gt;When every member of the delivery team becomes faster and clearer in their
part of the system, the throughput of the whole team increases non linearly.&lt;/p&gt;
&lt;h1 id="team-throughput"&gt;Team Throughput&lt;/h1&gt;
&lt;p&gt;Team throughput is shaped by the slowest interaction in the workflow. Delivery
moves when shared activities move: reviews, fixes, integration, decisions,
documentation, coordination, and onboarding.&lt;/p&gt;
&lt;p&gt;Onboarding shows this clearly. A new engineer becomes productive when they
understand the system, the domain, the architecture, the conventions, and the
team’s way of working. These are team level artefacts. AI helps only when the
team applies it to the shared knowledge and processes that support this
learning.&lt;/p&gt;
&lt;h1 id="ai-acceleration"&gt;AI Acceleration&lt;/h1&gt;
&lt;p&gt;AI can speed up every shared activity listed above. These activities are
constraints that the whole team depends on. When they move, the system moves.
The effect is non linear because software delivery is dominated by
interaction rather than individual effort.&lt;/p&gt;
&lt;p&gt;Faster reviews, clearer decisions, and quicker coordination reduce the waiting
time between people, which shortens the entire cycle.&lt;/p&gt;
&lt;h2 id="example-how-reduced-waiting-shortens-the-cycle"&gt;Example: How reduced waiting shortens the cycle&lt;/h2&gt;
&lt;p&gt;Imagine a team working on a small feature. The work passes through five steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Write the change  &lt;/li&gt;
&lt;li&gt;Wait for review  &lt;/li&gt;
&lt;li&gt;Apply fixes  &lt;/li&gt;
&lt;li&gt;Wait for approval  &lt;/li&gt;
&lt;li&gt;Merge and test  &lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="without-team-level-ai"&gt;Without team level AI&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Writing the change: 3 hours  &lt;/li&gt;
&lt;li&gt;Waiting for review: 1 day  &lt;/li&gt;
&lt;li&gt;Fixing comments: 1 hour  &lt;/li&gt;
&lt;li&gt;Waiting for approval: half a day  &lt;/li&gt;
&lt;li&gt;Merging and testing: 2 hours  &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The total time is not the 6 hours of work. It is the 1.5 days of waiting
wrapped around it.&lt;/p&gt;
&lt;h3 id="team-level-ai-reduces-waiting"&gt;Team level AI reduces waiting&lt;/h3&gt;
&lt;p&gt;Team level AI helps the reviewer by summarising the change, checking for
risks, and drafting comments. It helps the author by preparing fixes and
clarifications, and by coordinating activity through the five stages.&lt;/p&gt;
&lt;p&gt;The waiting times drop:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Writing the change: 3 hours  &lt;/li&gt;
&lt;li&gt;Waiting for review: 2 hours  &lt;/li&gt;
&lt;li&gt;Fixing comments: 30 minutes  &lt;/li&gt;
&lt;li&gt;Waiting for approval: 1 hour  &lt;/li&gt;
&lt;li&gt;Merging and testing: 2 hours  &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The work is still roughly 6 hours, but the waiting has fallen from 1.5 days
to about 5 hours. With an 8 hour day, the cycle drops from 18 hours to 11.&lt;/p&gt;
&lt;h3 id="reducing-idle-time-is-key"&gt;Reducing idle time is key&lt;/h3&gt;
&lt;p&gt;The work has not changed. The gain comes from removing the idle time between
people. Reducing waiting shortens the whole cycle. This is where team level AI
has its strongest effect. It acts on the delays that dominate delivery, not
the small pockets of individual effort.&lt;/p&gt;
&lt;p&gt;When these delays shrink, the system moves more quickly. Reviews happen
sooner, decisions are clearer, fixes flow more easily, and work spends less
time sitting in queues. The improvements are non linear because the team is no
longer held back by the slowest interaction.&lt;/p&gt;
&lt;h1 id="ai-benefits-at-the-team-level"&gt;AI Benefits at the Team Level&lt;/h1&gt;
&lt;p&gt;The gains that matter most cannot be achieved through individual AI use alone.
Individual uplift improves personal speed, but it does not change the
structure of the team’s workflow or the quality of the shared artefacts that
the team relies on.&lt;/p&gt;
&lt;p&gt;Team level performance improves only when AI is applied directly to the
collective work: shaping requirements, coordinating plans, reviewing code,
integrating changes, resolving ambiguity, documenting decisions, and keeping
flow steady.&lt;/p&gt;
&lt;p&gt;These activities form the delivery system. Improving them requires AI that
operates at the level of the team rather than the individual.&lt;/p&gt;
&lt;h1 id="why-team-ai-is-necessary"&gt;Why Team AI is Necessary&lt;/h1&gt;
&lt;p&gt;Individual uplift improves the outputs that flow into team interactions. It
does not improve the interactions themselves. The main bottlenecks in delivery
are the points where people must work together: clarifying requirements,
resolving ambiguity, negotiating trade offs, coordinating across roles, and
maintaining shared understanding.&lt;/p&gt;
&lt;p&gt;Individual AI helps a person contribute more quickly. Team level AI improves
the clarity, accuracy, and speed of the shared work that binds the team
together. This is where the real gains lie.&lt;/p&gt;
&lt;h1 id="team-level-ai"&gt;Team level AI&lt;/h1&gt;
&lt;p&gt;A team level AI agent can work on the shared system:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;rewrite requirements for clarity  &lt;/li&gt;
&lt;li&gt;maintain architecture knowledge  &lt;/li&gt;
&lt;li&gt;surface risks  &lt;/li&gt;
&lt;li&gt;detect ambiguity  &lt;/li&gt;
&lt;li&gt;summarise decisions  &lt;/li&gt;
&lt;li&gt;generate consistent patterns  &lt;/li&gt;
&lt;li&gt;keep the team aligned  &lt;/li&gt;
&lt;li&gt;handle coordination and scheduling&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Individual AI cannot do this because it has no view of the team’s shared
context.&lt;/p&gt;
&lt;h1 id="individual-ai-cannot-coordinate-across-roles"&gt;Individual AI cannot coordinate across roles&lt;/h1&gt;
&lt;p&gt;A delivery team includes product, design, QA, DevOps, security, architecture,
and delivery management. Each role uses different tools and produces different
artefacts. Individual AI tools do not coordinate across these boundaries.&lt;/p&gt;
&lt;p&gt;A team level AI agent can maintain shared context, track dependencies, surface
risks, ensure consistency, support the Agile process, and reduce coordination
friction.&lt;/p&gt;
&lt;h1 id="team-level-uplift-is-a-multiplier"&gt;Team level uplift is a multiplier&lt;/h1&gt;
&lt;p&gt;Individual uplift is additive. It makes each person faster, but it does not
change the structure of the system. Team level uplift is multiplicative. It
changes the structure of the system, reduces shared constraints, collapses
waiting time, improves flow, and increases throughput &lt;em&gt;across the whole team&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;This is why team level AI is required to unlock the full return on investment.&lt;/p&gt;
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;The shift to AI in software engineering will not be won through individual
adoption alone. Teams already feel the lift from faster coding and quicker
local tasks, but the real gains come when AI is applied to the shared work that
governs how delivery actually happens. The constraints that slow teams down are
collective, and so the improvements that matter must be collective as well.&lt;/p&gt;
&lt;p&gt;The organisations that move first will be the ones that treat AI as part of
their delivery system, not as a personal tool. They will use it to keep work
flowing, reduce waiting, maintain shared understanding, and support the
decisions that shape the product. Once AI is embedded at this level, the team’s
throughput changes in a way that individual uplift can never reach.&lt;/p&gt;
&lt;p&gt;The opportunity is simple. Teams that adopt AI together will outpace those that
adopt it alone. The sooner a team treats AI as part of its operating model, the
sooner it sees the return that individual tools cannot deliver.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="team-ai-is-the-next-step.html"&gt;Individual AI delivers diminishing returns; meaningful improvement comes from strengthening the collective workflow.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="building-safe-llm-systems.html"&gt;AI systems behave differently from traditional software and require layered safety, strong governance, observability, and architectural discipline to operate reliably and sustainably.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="when-code-is-cheap.html"&gt;AI lowers the cost of code, not the cost of thinking. Clarity and judgement, not speed, determine whether teams build what truly matters.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#modern-software-is-delivered-by-teams"&gt;Modern Software is delivered by Teams&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-an-engineer-does"&gt;What an Engineer Does&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#software-engineer-adoption-of-ai-is-individual"&gt;Software Engineer Adoption of AI is Individual&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#accelerate-one-accelerate-many"&gt;Accelerate One, Accelerate Many&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#team-throughput"&gt;Team Throughput&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ai-acceleration"&gt;AI Acceleration&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#example-how-reduced-waiting-shortens-the-cycle"&gt;Example: How reduced waiting shortens the cycle&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#without-team-level-ai"&gt;Without team level AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#team-level-ai-reduces-waiting"&gt;Team level AI reduces waiting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#reducing-idle-time-is-key"&gt;Reducing idle time is key&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#ai-benefits-at-the-team-level"&gt;AI Benefits at the Team Level&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#why-team-ai-is-necessary"&gt;Why Team AI is Necessary&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#team-level-ai"&gt;Team level AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#individual-ai-cannot-coordinate-across-roles"&gt;Individual AI cannot coordinate across roles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#team-level-uplift-is-a-multiplier"&gt;Team level uplift is a multiplier&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#further-reading"&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h1 id="further-reading"&gt;Further Reading&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Brooks, F. P. (1975). The Mythical Man Month&lt;br/&gt;
  https://www.pearson.com/en-gb/subject-catalog/p/mythical-man-month/P200000003808/9780201835953&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;GitHub — The Economic Impact of GitHub Copilot&lt;br/&gt;
  https://github.blog/news-insights/research/the-economic-impact-of-github-copilot/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;JetBrains AI Pulse Report 2026&lt;br/&gt;
  https://blog.jetbrains.com/research/2026/04/which-ai-coding-tools-do-developers-actually-use-at-work/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;McKinsey &amp;amp; Company — Unleashing developer productivity with generative AI&lt;br/&gt;
  https://www.mckinsey.com/capabilities/quantumblack/our-insights/unleashing-developer-productivity-with-generative-ai&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;McKinsey &amp;amp; Company — Yes, you can measure software developer productivity&lt;br/&gt;
  https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/yes-you-can-measure-software-developer-productivity&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Microsoft AI Economy Institute — AI Diffusion and Productivity&lt;br/&gt;
  https://www.microsoft.com/en-us/research/group/aiei/ai-diffusion/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Stanford HAI — The AI Index Report 2024&lt;br/&gt;
  https://aiindex.stanford.edu/report/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Stripe — The Developer Coefficient (with Harris Poll)&lt;br/&gt;
  https://stripe.com/reports/developer-coefficient-2018&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content><category term="Leadership"></category></entry><entry><title>Team-Based AI Engineering is Next Step After Individual AI for Coding</title><link href="https://phroneses.com/articles/build/notes/ai-engineering-must-be-team-based-to-see-significant-roi-for-engineers.html" rel="alternate"></link><published>2026-05-05T00:00:00+00:00</published><updated>2026-05-05T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-05-05:/articles/build/notes/ai-engineering-must-be-team-based-to-see-significant-roi-for-engineers.html</id><summary type="html">&lt;p&gt;The real gains from AI come from improving the shared work between engineers — planning, coordination, review, debugging, and delivery — not from speeding up individual coding.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Modern software teams are already moving faster because individual engineers
use AI. Yet the real gains are still ahead. The biggest improvements do not
come from speeding up coding. They come from speeding up the work that happens
between people. That is where most of the time is lost, and where AI has the
greatest leverage when applied at the level of the team.&lt;/p&gt;
&lt;p&gt;A software engineer using AI increases their coding speed by 30 to 75 percent.
But coding is only 30 percent of the job. The remaining 70 percent is the work
that makes coding possible, safe, and correct. This work is shared, and it is
deeply tied to the rest of the team.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Requirements, clarification and planning (15 to 20 percent)  &lt;/li&gt;
&lt;li&gt;Meetings and coordination (10 to 15 percent)  &lt;/li&gt;
&lt;li&gt;Code review (10 to 15 percent)  &lt;/li&gt;
&lt;li&gt;Debugging, testing, and validation (15 to 20 percent)  &lt;/li&gt;
&lt;li&gt;DevOps, tooling, and environment work (5 to 10 percent)  &lt;/li&gt;
&lt;li&gt;Documentation and knowledge work (5 to 10 percent)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These figures come from McKinsey, GitHub, Stripe, and Harris Poll. They show
that most of an engineer’s time is spent on team‑level activities.&lt;/p&gt;
&lt;h1 id="modern-software-is-delivered-by-teams"&gt;Modern Software is delivered by Teams&lt;/h1&gt;
&lt;p&gt;These twelve activities shape team throughput. Every delivery team performs
them, and they determine how quickly and safely software moves from idea to
production.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Activities&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. Understand and Shape Work&lt;/td&gt;
&lt;td&gt;- Product discovery&lt;br/&gt;- Prioritisation&lt;br/&gt;- Requirements shaping&lt;br/&gt;- Trade off decisions&lt;br/&gt;- Roadmapping&lt;br/&gt;- Forecasting&lt;/td&gt;
&lt;td&gt;This is where the team decides what to build and why.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Plan and Coordinate Delivery&lt;/td&gt;
&lt;td&gt;- Sprint planning&lt;br/&gt;- Iteration planning&lt;br/&gt;- Capacity planning&lt;br/&gt;- Cross team alignment&lt;br/&gt;- Risk identification&lt;br/&gt;- Risk mitigation&lt;/td&gt;
&lt;td&gt;This is the team level coordination layer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Design the Solution&lt;/td&gt;
&lt;td&gt;- Architecture design&lt;br/&gt;- System design&lt;br/&gt;- API design&lt;br/&gt;- Interface design&lt;br/&gt;- Technical decisions&lt;br/&gt;- Design documentation&lt;/td&gt;
&lt;td&gt;This is where the team decides how to build it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Build the Solution&lt;/td&gt;
&lt;td&gt;- Coding&lt;br/&gt;- Test creation&lt;br/&gt;- Refactoring&lt;br/&gt;- Local environment work&lt;/td&gt;
&lt;td&gt;This is the implementation phase.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5. Validate and Integrate&lt;/td&gt;
&lt;td&gt;- Code reviews&lt;br/&gt;- Automated testing&lt;br/&gt;- Manual testing&lt;br/&gt;- Integration workflows&lt;br/&gt;- Merge workflows&lt;/td&gt;
&lt;td&gt;This is the quality and integration gate.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. Iterate and Fix&lt;/td&gt;
&lt;td&gt;- Debugging&lt;br/&gt;- Fixing test failures&lt;br/&gt;- Addressing review comments&lt;br/&gt;- Retesting&lt;/td&gt;
&lt;td&gt;This is the iteration loop.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7. Deploy and Operate&lt;/td&gt;
&lt;td&gt;- Release management&lt;br/&gt;- Monitoring&lt;br/&gt;- Observability&lt;br/&gt;- Incident response&lt;br/&gt;- On call operations&lt;/td&gt;
&lt;td&gt;This is the operational responsibility layer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8. Learn and Improve&lt;/td&gt;
&lt;td&gt;- Retrospectives&lt;br/&gt;- Post incident reviews&lt;br/&gt;- Process improvement&lt;br/&gt;- Tooling upgrades&lt;/td&gt;
&lt;td&gt;This is how the team improves its delivery system.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9. Maintain Flow&lt;/td&gt;
&lt;td&gt;- Manage work in progress&lt;br/&gt;- Unblock teammates&lt;br/&gt;- Reduce handoff delays&lt;br/&gt;- Remove bottlenecks&lt;/td&gt;
&lt;td&gt;This is the team’s ability to maintain throughput.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10. Manage Team Knowledge&lt;/td&gt;
&lt;td&gt;- Documentation&lt;br/&gt;- Architecture knowledge&lt;br/&gt;- Domain knowledge&lt;br/&gt;- Onboarding new engineers&lt;/td&gt;
&lt;td&gt;This is the team’s collective memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11. Communicate and Align&lt;/td&gt;
&lt;td&gt;- Stakeholder updates&lt;br/&gt;- Status reports&lt;br/&gt;- Cross team communication&lt;br/&gt;- Decision logging&lt;/td&gt;
&lt;td&gt;This is the communication layer that keeps the system coherent.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12. Govern and Ensure Compliance&lt;/td&gt;
&lt;td&gt;- Security reviews&lt;br/&gt;- Regulatory compliance&lt;br/&gt;- Data governance&lt;br/&gt;- Risk management&lt;/td&gt;
&lt;td&gt;This is essential in regulated, cloud native environments.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These twelve activities define how modern software is delivered. Every engineer
contributes to them, but not in equal measure. To understand where AI creates
leverage, we need to look at how an engineer’s time maps onto this system. That
is what the next section describes.&lt;/p&gt;
&lt;h1 id="what-an-engineer-does"&gt;What an Engineer Does&lt;/h1&gt;
&lt;p&gt;The work of an engineer is given in the &lt;em&gt;Engineer Time&lt;/em&gt; column, their work feeding into
the team activities described in column two.&lt;/p&gt;
&lt;style&gt;
  :root {
    --row-highlight: #e0e0e0;
  }
&lt;/style&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engineer Time&lt;/th&gt;
&lt;th&gt;Team Activities&lt;/th&gt;
&lt;th&gt;Why this is Necessary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Requirements, clarification, planning&lt;/td&gt;
&lt;td&gt;1. Understand and Shape Work;&lt;br/&gt;2. Plan and Coordinate;&lt;br/&gt;
3. Design the Solution;&lt;br/&gt;11. Communicate and Align&lt;/td&gt;
&lt;td&gt;Engineers must understand the problem, shape requirements, and make
trade offs before design.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Meetings and coordination&lt;/td&gt;
&lt;td&gt;2. Plan and Coordinate;&lt;br/&gt;9. Maintain Flow;&lt;br/&gt;
11. Communicate and Align;&lt;br/&gt;12. Govern and Ensure Compliance&lt;/td&gt;
&lt;td&gt;Coordination keeps work flowing, dependencies managed, and compliance
aligned.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr style="background-color: var(--row-highlight);"&gt;
&lt;td&gt;Coding&lt;/td&gt;
&lt;td&gt;4. Build the Solution&lt;/td&gt;
&lt;td&gt;Engineers turn all the work thus far into working computer code, using
business infrastructure, processes and standards.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code review&lt;/td&gt;
&lt;td&gt;5. Validate and Integrate;&lt;br/&gt;6. Iterate and Fix;&lt;br/&gt;
10. Manage Team Knowledge&lt;/td&gt;
&lt;td&gt;Code review is the quality gate, integration control point, and
knowledge sharing mechanism.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging, testing, validation&lt;/td&gt;
&lt;td&gt;4. Build the Solution;&lt;br/&gt;5. Validate and Integrate;&lt;br/&gt;
6. Iterate and Fix;&lt;br/&gt;7. Deploy and Operate&lt;/td&gt;
&lt;td&gt;Debugging and validation dominate the iteration loop and ensure
correctness end to end.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr style="background-color: var(--row-highlight);"&gt;
&lt;td&gt;DevOps, tooling, environment work&lt;/td&gt;
&lt;td&gt;4. Build the Solution;&lt;br/&gt;7. Deploy and Operate;&lt;br/&gt;
8. Learn and Improve;&lt;br/&gt;9. Maintain Flow&lt;/td&gt;
&lt;td&gt;Tooling and environment work underpin build stability, deployment
reliability, and flow.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Documentation and knowledge work&lt;/td&gt;
&lt;td&gt;1. Understand and Shape Work;&lt;br/&gt;3. Design the Solution;&lt;br/&gt;
10. Manage Team Knowledge;&lt;br/&gt;11. Communicate and Align&lt;/td&gt;
&lt;td&gt;Documentation is the team’s shared memory and design clarity
mechanism.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The two hghlighted rows show the "coding" step, that is predominantly done by
the software engineer alone.&lt;/p&gt;
&lt;p&gt;Coding is the final expression of a much larger collaborative effort. The
other 70 percent of the role ensures that what is coded is the right thing,
built the right way, that is safe to run in production.&lt;/p&gt;
&lt;h1 id="software-engineer-adoption-of-ai-is-individual"&gt;Software Engineer Adoption of AI is Individual&lt;/h1&gt;
&lt;p&gt;Developers are adopting AI tools on their own, at scale, and ahead of their
organisations. JetBrains reports that 90 percent of developers now use at
least one AI tool at work, and 74 percent have adopted specialised assistants
independently. GitHub finds the same pattern: engineers use AI to improve
their own speed and reduce cognitive load, not to change team workflows.&lt;/p&gt;
&lt;p&gt;The result is a widening gap between personal productivity and the unchanged
delivery system that the individuals operate within.&lt;/p&gt;
&lt;h1 id="accelerate-one-accelerate-many"&gt;Accelerate One, Accelerate Many&lt;/h1&gt;
&lt;p&gt;When AI speeds up one engineer, it speeds up the interactions around them:
reviews, iteration loops, testing throughput, coordination, and decision
making. These effects compound across the delivery system.&lt;/p&gt;
&lt;p&gt;Yet individual AI only improves the local interactions that depend on that
engineer. Team level AI improves the global interactions that depend on shared
context, shared artefacts, and shared decision making.&lt;/p&gt;
&lt;p&gt;A team benefits from individual uplift, but several categories of work cannot
be improved by individual tools alone.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Section Title&lt;/th&gt;
&lt;th&gt;Activities&lt;/th&gt;
&lt;th&gt;Summary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Individual AI cannot see or manage the team’s shared context&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;An engineer’s AI assistant only sees:&lt;/strong&gt;&lt;br/&gt;- the engineer’s code&lt;br/&gt;- the engineer’s tasks&lt;br/&gt;- the engineer’s local context&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;It cannot see:&lt;/strong&gt;&lt;br/&gt;- the team’s backlog&lt;br/&gt;- the team’s dependencies&lt;br/&gt;- the team’s decisions&lt;br/&gt;- the team’s risks&lt;br/&gt;- the team’s architecture&lt;br/&gt;- the team’s workflow state&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Without this shared view, individual AI cannot improve:&lt;/strong&gt;&lt;br/&gt;- planning&lt;br/&gt;- coordination&lt;br/&gt;- cross team alignment&lt;br/&gt;- decision logging&lt;br/&gt;- risk management&lt;/td&gt;
&lt;td&gt;These are team level responsibilities, and they remain untouched.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Individual AI cannot improve the quality of shared artefacts&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Even if every engineer uses AI, the team still has:&lt;/strong&gt;&lt;br/&gt;- unclear requirements&lt;br/&gt;- inconsistent designs&lt;br/&gt;- missing decision records&lt;br/&gt;- uneven documentation&lt;br/&gt;- fragmented knowledge&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;A team level AI can:&lt;/strong&gt;&lt;br/&gt;- rewrite requirements for clarity&lt;br/&gt;- detect ambiguity across stories&lt;br/&gt;- maintain design consistency&lt;br/&gt;- summarise decisions&lt;br/&gt;- keep documentation aligned&lt;/td&gt;
&lt;td&gt;This is a different category of improvement.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Individual AI cannot reduce waiting time between roles&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Most delays in delivery come from:&lt;/strong&gt;&lt;br/&gt;- waiting for a review&lt;br/&gt;- waiting for clarification&lt;br/&gt;- waiting for a decision&lt;br/&gt;- waiting for a fix&lt;br/&gt;- waiting for alignment&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;A team level AI can:&lt;/strong&gt;&lt;br/&gt;- answer clarifying questions&lt;br/&gt;- surface missing information&lt;br/&gt;- propose decisions&lt;br/&gt;- highlight blockers&lt;br/&gt;- keep flow moving&lt;/td&gt;
&lt;td&gt;This is where the real throughput gains lie.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Individual AI cannot coordinate across roles&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;A delivery team includes:&lt;/strong&gt;&lt;br/&gt;- product&lt;br/&gt;- design&lt;br/&gt;- QA&lt;br/&gt;- DevOps&lt;br/&gt;- security&lt;br/&gt;- architecture&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;A team level AI can:&lt;/strong&gt;&lt;br/&gt;- translate between roles&lt;br/&gt;- maintain shared understanding&lt;br/&gt;- track dependencies&lt;br/&gt;- keep everyone aligned&lt;/td&gt;
&lt;td&gt;This is essential for predictable delivery.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Individual uplift is local; team uplift is structural&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Individual AI improves:&lt;/strong&gt;&lt;br/&gt;- how fast a person works&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Team level AI improves:&lt;/strong&gt;&lt;br/&gt;- how the team works&lt;br/&gt;&lt;br/&gt;The first is additive. The second is multiplicative.&lt;/td&gt;
&lt;td&gt;Team‑level improvements are multiplicative because they affect several people across the team’s communication network, not just the individual who uses the tool.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A team cannot reach the next level of performance without AI that operates on
the shared system, not just the individuals within it.&lt;/p&gt;
&lt;p&gt;When every member of the delivery team becomes faster and clearer in their
part of the system, the throughput of the whole team increases non linearly.&lt;/p&gt;
&lt;h1 id="team-throughput"&gt;Team Throughput&lt;/h1&gt;
&lt;p&gt;Team throughput is shaped by the slowest interaction in the workflow. Delivery
moves when shared activities move: reviews, fixes, integration, decisions,
documentation, coordination, and onboarding.&lt;/p&gt;
&lt;p&gt;Onboarding shows this clearly. A new engineer becomes productive when they
understand the system, the domain, the architecture, the conventions, and the
team’s way of working. These are team level artefacts. AI helps only when the
team applies it to the shared knowledge and processes that support this
learning.&lt;/p&gt;
&lt;h1 id="ai-acceleration"&gt;AI Acceleration&lt;/h1&gt;
&lt;p&gt;AI can speed up every shared activity listed above. These activities are
constraints that the whole team depends on. When they move, the system moves.
The effect is non linear because software delivery is dominated by
interaction rather than individual effort.&lt;/p&gt;
&lt;p&gt;Faster reviews, clearer decisions, and quicker coordination reduce the waiting
time between people, which shortens the entire cycle.&lt;/p&gt;
&lt;h2 id="example-how-reduced-waiting-shortens-the-cycle"&gt;Example: How reduced waiting shortens the cycle&lt;/h2&gt;
&lt;p&gt;Imagine a team working on a small feature. The work passes through five steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Write the change  &lt;/li&gt;
&lt;li&gt;Wait for review  &lt;/li&gt;
&lt;li&gt;Apply fixes  &lt;/li&gt;
&lt;li&gt;Wait for approval  &lt;/li&gt;
&lt;li&gt;Merge and test  &lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="without-team-level-ai"&gt;Without team level AI&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Writing the change: 3 hours  &lt;/li&gt;
&lt;li&gt;Waiting for review: 1 day  &lt;/li&gt;
&lt;li&gt;Fixing comments: 1 hour  &lt;/li&gt;
&lt;li&gt;Waiting for approval: half a day  &lt;/li&gt;
&lt;li&gt;Merging and testing: 2 hours  &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The total time is not the 6 hours of work. It is the 1.5 days of waiting
wrapped around it.&lt;/p&gt;
&lt;h3 id="team-level-ai-reduces-waiting"&gt;Team level AI reduces waiting&lt;/h3&gt;
&lt;p&gt;Team level AI helps the reviewer by summarising the change, checking for
risks, and drafting comments. It helps the author by preparing fixes and
clarifications, and by coordinating activity through the five stages.&lt;/p&gt;
&lt;p&gt;The waiting times drop:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Writing the change: 3 hours  &lt;/li&gt;
&lt;li&gt;Waiting for review: 2 hours  &lt;/li&gt;
&lt;li&gt;Fixing comments: 30 minutes  &lt;/li&gt;
&lt;li&gt;Waiting for approval: 1 hour  &lt;/li&gt;
&lt;li&gt;Merging and testing: 2 hours  &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The work is still roughly 6 hours, but the waiting has fallen from 1.5 days
to about 5 hours. With an 8 hour day, the cycle drops from 18 hours to 11.&lt;/p&gt;
&lt;h3 id="reducing-idle-time-is-key"&gt;Reducing idle time is key&lt;/h3&gt;
&lt;p&gt;The work has not changed. The gain comes from removing the idle time between
people. Reducing waiting shortens the whole cycle. This is where team level AI
has its strongest effect. It acts on the delays that dominate delivery, not
the small pockets of individual effort.&lt;/p&gt;
&lt;p&gt;When these delays shrink, the system moves more quickly. Reviews happen
sooner, decisions are clearer, fixes flow more easily, and work spends less
time sitting in queues. The improvements are non linear because the team is no
longer held back by the slowest interaction.&lt;/p&gt;
&lt;h1 id="ai-benefits-at-the-team-level"&gt;AI Benefits at the Team Level&lt;/h1&gt;
&lt;p&gt;The gains that matter most cannot be achieved through individual AI use alone.
Individual uplift improves personal speed, but it does not change the
structure of the team’s workflow or the quality of the shared artefacts that
the team relies on.&lt;/p&gt;
&lt;p&gt;Team level performance improves only when AI is applied directly to the
collective work: shaping requirements, coordinating plans, reviewing code,
integrating changes, resolving ambiguity, documenting decisions, and keeping
flow steady.&lt;/p&gt;
&lt;p&gt;These activities form the delivery system. Improving them requires AI that
operates at the level of the team rather than the individual.&lt;/p&gt;
&lt;h1 id="why-team-ai-is-necessary"&gt;Why Team AI is Necessary&lt;/h1&gt;
&lt;p&gt;Individual uplift improves the outputs that flow into team interactions. It
does not improve the interactions themselves. The main bottlenecks in delivery
are the points where people must work together: clarifying requirements,
resolving ambiguity, negotiating trade offs, coordinating across roles, and
maintaining shared understanding.&lt;/p&gt;
&lt;p&gt;Individual AI helps a person contribute more quickly. Team level AI improves
the clarity, accuracy, and speed of the shared work that binds the team
together. This is where the real gains lie.&lt;/p&gt;
&lt;h1 id="team-level-ai"&gt;Team level AI&lt;/h1&gt;
&lt;p&gt;A team level AI agent can work on the shared system:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;rewrite requirements for clarity  &lt;/li&gt;
&lt;li&gt;maintain architecture knowledge  &lt;/li&gt;
&lt;li&gt;surface risks  &lt;/li&gt;
&lt;li&gt;detect ambiguity  &lt;/li&gt;
&lt;li&gt;summarise decisions  &lt;/li&gt;
&lt;li&gt;generate consistent patterns  &lt;/li&gt;
&lt;li&gt;keep the team aligned  &lt;/li&gt;
&lt;li&gt;handle coordination and scheduling&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Individual AI cannot do this because it has no view of the team’s shared
context.&lt;/p&gt;
&lt;h1 id="individual-ai-cannot-coordinate-across-roles"&gt;Individual AI cannot coordinate across roles&lt;/h1&gt;
&lt;p&gt;A delivery team includes product, design, QA, DevOps, security, architecture,
and delivery management. Each role uses different tools and produces different
artefacts. Individual AI tools do not coordinate across these boundaries.&lt;/p&gt;
&lt;p&gt;A team level AI agent can maintain shared context, track dependencies, surface
risks, ensure consistency, support the Agile process, and reduce coordination
friction.&lt;/p&gt;
&lt;h1 id="team-level-uplift-is-a-multiplier"&gt;Team level uplift is a multiplier&lt;/h1&gt;
&lt;p&gt;Individual uplift is additive. It makes each person faster, but it does not
change the structure of the system. Team level uplift is multiplicative. It
changes the structure of the system, reduces shared constraints, collapses
waiting time, improves flow, and increases throughput &lt;em&gt;across the whole team&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;This is why team level AI is required to unlock the full return on investment.&lt;/p&gt;
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;The shift to AI in software engineering will not be won through individual
adoption alone. Teams already feel the lift from faster coding and quicker
local tasks, but the real gains come when AI is applied to the shared work that
governs how delivery actually happens. The constraints that slow teams down are
collective, and so the improvements that matter must be collective as well.&lt;/p&gt;
&lt;p&gt;The organisations that move first will be the ones that treat AI as part of
their delivery system, not as a personal tool. They will use it to keep work
flowing, reduce waiting, maintain shared understanding, and support the
decisions that shape the product. Once AI is embedded at this level, the team’s
throughput changes in a way that individual uplift can never reach.&lt;/p&gt;
&lt;p&gt;The opportunity is simple. Teams that adopt AI together will outpace those that
adopt it alone. The sooner a team treats AI as part of its operating model, the
sooner it sees the return that individual tools cannot deliver.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="agents-cannot-maintain-systems.html"&gt;LLMs can generate code, but they cannot modify or maintain systems because system‑level work requires causal reasoning, not pattern‑matching.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="engineers-need-to-know.html"&gt;Software engineers must understand tokens, structure, and probabilistic behaviour to build reliable systems and avoid mismatches between test and production behaviour.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="evaluate-ai.html"&gt;Evaluating AI systems requires measuring real behaviour — schema reliability, adherence, drift, latency, retrieval quality, and safety — not synthetic benchmarks.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#modern-software-is-delivered-by-teams"&gt;Modern Software is delivered by Teams&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-an-engineer-does"&gt;What an Engineer Does&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#software-engineer-adoption-of-ai-is-individual"&gt;Software Engineer Adoption of AI is Individual&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#accelerate-one-accelerate-many"&gt;Accelerate One, Accelerate Many&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#team-throughput"&gt;Team Throughput&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ai-acceleration"&gt;AI Acceleration&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#example-how-reduced-waiting-shortens-the-cycle"&gt;Example: How reduced waiting shortens the cycle&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#without-team-level-ai"&gt;Without team level AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#team-level-ai-reduces-waiting"&gt;Team level AI reduces waiting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#reducing-idle-time-is-key"&gt;Reducing idle time is key&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#ai-benefits-at-the-team-level"&gt;AI Benefits at the Team Level&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#why-team-ai-is-necessary"&gt;Why Team AI is Necessary&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#team-level-ai"&gt;Team level AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#individual-ai-cannot-coordinate-across-roles"&gt;Individual AI cannot coordinate across roles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#team-level-uplift-is-a-multiplier"&gt;Team level uplift is a multiplier&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#further-reading"&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h1 id="further-reading"&gt;Further Reading&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Brooks, F. P. (1975). The Mythical Man Month&lt;br/&gt;
  https://www.pearson.com/en-gb/subject-catalog/p/mythical-man-month/P200000003808/9780201835953&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;GitHub — The Economic Impact of GitHub Copilot&lt;br/&gt;
  https://github.blog/news-insights/research/the-economic-impact-of-github-copilot/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;JetBrains AI Pulse Report 2026&lt;br/&gt;
  https://blog.jetbrains.com/research/2026/04/which-ai-coding-tools-do-developers-actually-use-at-work/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;McKinsey &amp;amp; Company — Unleashing developer productivity with generative AI&lt;br/&gt;
  https://www.mckinsey.com/capabilities/quantumblack/our-insights/unleashing-developer-productivity-with-generative-ai&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;McKinsey &amp;amp; Company — Yes, you can measure software developer productivity&lt;br/&gt;
  https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/yes-you-can-measure-software-developer-productivity&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Microsoft AI Economy Institute — AI Diffusion and Productivity&lt;br/&gt;
  https://www.microsoft.com/en-us/research/group/aiei/ai-diffusion/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Stanford HAI — The AI Index Report 2024&lt;br/&gt;
  https://aiindex.stanford.edu/report/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Stripe — The Developer Coefficient (with Harris Poll)&lt;br/&gt;
  https://stripe.com/reports/developer-coefficient-2018&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content><category term="Build"></category></entry><entry><title>Global AI Trends 2024–2025</title><link href="https://phroneses.com/articles/build/notes/global-ai-trends-2024-2025.html" rel="alternate"></link><published>2026-05-04T00:00:00+00:00</published><updated>2026-05-04T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-05-04:/articles/build/notes/global-ai-trends-2024-2025.html</id><summary type="html">&lt;p&gt;Global evidence shows rapid AI adoption, rising capability, and widening gaps between regions and firms, with the US driving investment and commercial uptake.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="global-trends-in-ai"&gt;Global Trends in AI&lt;/h1&gt;
&lt;p&gt;Artificial intelligence has entered a new phase. It is no longer a pilot or
proof of concept. AI is core infrastructure; a technology that shapes how
economies operate and how firms compete.&lt;/p&gt;
&lt;p&gt;Evidence from the Microsoft AI Economy Institute (AIEI), Stanford HAI, and
McKinsey shows rapid adoption and a widening gap between leaders and others.
What follows is a concise summary of the period from 2024 to 2025, based solely
on verified and reliable evidence.&lt;/p&gt;
&lt;p&gt;The global evidence shows fast adoption, rising capability, and a widening gap
between regions. These patterns set the context for the country level picture,
where the United States remains a major driver of development, investment, and
commercial uptake.&lt;/p&gt;
&lt;h1 id="global-picture"&gt;Global picture&lt;/h1&gt;
&lt;h2 id="global-adoption-and-diffusion"&gt;Global adoption and diffusion&lt;/h2&gt;
&lt;p&gt;The AIEI reports that roughly one in six people worldwide used a generative AI
tool in the second half of 2025. The same study states that 24.7 percent of the
working age population in the Global North used generative AI tools, compared
with 14.1 percent in the Global South. The AIEI attributes this gap to
differences in infrastructure, skills, and policy readiness.&lt;/p&gt;
&lt;h2 id="commercial-traction-and-investment"&gt;Commercial traction and investment&lt;/h2&gt;
&lt;p&gt;The State of AI Report 2025 notes that 44 percent of United States businesses
paid for AI tools in 2025, up from 5 percent in 2023. UNCTAD in its 2023
Technology and Innovation Report confirms strong global growth in AI related
companies and investment, especially in economies with established technology
sectors and supportive policy environments.&lt;/p&gt;
&lt;h2 id="conclusions"&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;The global evidence points to three clear conclusions.  &lt;/p&gt;
&lt;p&gt;First, AI use is now widespread. McKinsey reports that 88 percent of firms use
AI in at least one function, though most have yet to scale it across the
enterprise.  &lt;/p&gt;
&lt;p&gt;Second, capability continues to rise. Stanford HAI shows sharp year‑on‑year
improvements in benchmark performance and a steep fall in model‑usage costs.  &lt;/p&gt;
&lt;p&gt;Third, investment is concentrated. The United States leads private AI
investment, with China closing the performance gap in model quality.&lt;/p&gt;
&lt;h2 id="in-the-future"&gt;In the Future&lt;/h2&gt;
&lt;p&gt;The verified evidence suggests three grounded developments.  &lt;/p&gt;
&lt;p&gt;First, wider business uptake is likely. McKinsey finds most organisations are
still in pilot mode, implying further diffusion as workflows are redesigned.  &lt;/p&gt;
&lt;p&gt;Second, capability gaps between regions may widen. The AIEI reports higher
adoption in the Global North, driven by infrastructure and skills, and Stanford
HAI shows the United States and China pulling ahead in model development.  &lt;/p&gt;
&lt;p&gt;Third, investment patterns point to continued commercialisation. Stanford HAI
records strong private investment in generative AI, with the United States far
ahead of other economies.&lt;/p&gt;
&lt;p&gt;These trends indicate a maturing technology, uneven readiness across regions,
and a period where firms that can integrate AI into workflows will move faster
than those still experimenting.&lt;/p&gt;
&lt;h1 id="north-america"&gt;North America&lt;/h1&gt;
&lt;h2 id="united-states"&gt;United States&lt;/h2&gt;
&lt;p&gt;The State of AI Report 2025 reports that United States organisations continue
to lead in frontier model (LLM) development and commercialisation. The AIEI
diffusion study places the United States 24th globally for working age usage of
generative AI tools, at 28.3 percent. The Federal Reserve Board in its 2026
FEDS Note reports high AI adoption in United States professional services and
financial services.&lt;/p&gt;
&lt;h2 id="canada-and-mexico"&gt;Canada and Mexico&lt;/h2&gt;
&lt;p&gt;Statistics Canada reports that 12.2 percent of Canadian firms used AI to produce
goods or deliver services in 2025, with a further 14.5 percent planning to
adopt AI within the following year.&lt;/p&gt;
&lt;p&gt;This reflects a steady rise in enterprise use rather than a population level
diffusion measure.&lt;/p&gt;
&lt;p&gt;Broader policy material, including the Pan Canadian Artificial Intelligence
Strategy and the work of institutes such as Amii, Mila, and Vector, confirms an
active national ecosystem but does not provide quantified adoption metrics.&lt;/p&gt;
&lt;h2 id="mexico"&gt;Mexico&lt;/h2&gt;
&lt;p&gt;The OECD reports that around 20 percent of Mexican firms use at least one AI
technology, but this is a general AI adoption figure, not a generative
AI diffusion metric and is not tied to 2024 to 2025 specifically.&lt;/p&gt;
&lt;h2 id="conclusions_1"&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;The United States stands out for commercial uptake. In the U.S., public uptake
is clearly more advanced, with clearer evidence of scale and investment.&lt;/p&gt;
&lt;p&gt;Canada’s AI uptake is driven mainly by firms rather than
the general population. The Statistics Canada figures point to a measured,
incremental pattern of adoption, with a clear pipeline of organisations preparing
to introduce AI into their operations. The wider national ecosystem is active,
but the absence of quantified diffusion data means the scale of use beyond the
enterprise level cannot be assessed.&lt;/p&gt;
&lt;p&gt;Mexico’s position is different. The OECD figure shows that a notable share of
firms use at least one AI technology, but the measure is broad and not tied to
generative AI or the 2024–2025 period. The available evidence therefore gives a
sense of adoption but not its depth, maturity, or rate of change.&lt;/p&gt;
&lt;h2 id="looking-to-the-future"&gt;Looking to the Future&lt;/h2&gt;
&lt;h3 id="canada-and-mexico_1"&gt;Canada and Mexico&lt;/h3&gt;
&lt;p&gt;The verified material suggests that Canada’s enterprise‑level adoption is likely
to continue rising, given the proportion of firms planning to adopt AI and the
presence of established research institutes. The lack of population‑level data
remains a gap, limiting visibility of wider diffusion.&lt;/p&gt;
&lt;p&gt;Mexico’s general adoption figure indicates that AI is present across parts of
the economy, but the absence of more granular or time‑specific data makes it
hard to track progress or compare with other regions. Both countries would
benefit from more consistent measurement to understand how adoption evolves over
time.&lt;/p&gt;
&lt;h3 id="the-united-states"&gt;The United States&lt;/h3&gt;
&lt;p&gt;The United States shows a more advanced stage of AI commercialisation than its
neighbours. The scale of paid use indicates that AI has moved beyond trial
activity and is now embedded in day‑to‑day business operations. This reflects a
market where firms are not only experimenting but committing resources and
integrating AI into core workflows.&lt;/p&gt;
&lt;p&gt;The strength of the U.S. research and investment base reinforces this position.
A large share of global private investment, combined with a concentration of
leading model developers, gives the U.S. a structural advantage. This creates a
feedback loop: strong domestic capability supports commercial uptake, and
commercial uptake in turn drives further capability.&lt;/p&gt;
&lt;p&gt;Public use also appears more developed. Higher adoption levels across the
Global North, combined with the U.S. role as a major producer and buyer of AI
systems, point to a broader diffusion of tools into everyday work and consumer
contexts.&lt;/p&gt;
&lt;p&gt;Taken together, the evidence shows an economy where AI is already part of the
operational fabric, supported by deep investment, strong research output, and a
business environment that moves quickly from experimentation to deployment.&lt;/p&gt;
&lt;h3 id="how-us-businesses-can-build-on-their-current-position"&gt;How U.S. businesses can build on their current position&lt;/h3&gt;
&lt;p&gt;The evidence shows that the United States holds two structural advantages:
strong commercial uptake and deep private investment. China, by contrast, leads
in large‑scale deployment in specific sectors and in state‑directed industrial
programmes. These differences shape how firms in each country can move.&lt;/p&gt;
&lt;p&gt;For U.S. businesses, the main advantage is speed. The high rate of paid use
means firms are already integrating AI into everyday operations. This allows
them to refine workflows, build internal capability, and compound gains earlier
than competitors. The depth of private investment also gives U.S. firms access
to a broad supply of models, tooling, and infrastructure, which lowers the cost
of experimentation and adoption.&lt;/p&gt;
&lt;p&gt;China’s strength lies in coordinated deployment across priority sectors. This
creates scale quickly, but it also means firms operate within a more directed
innovation environment. U.S. firms, by contrast, benefit from a more open
commercial ecosystem, where competition between providers drives rapid
improvement in tools and services.&lt;/p&gt;
&lt;p&gt;The practical insight is that U.S. businesses can move faster because the
commercial environment rewards early adoption and continuous iteration. They
can integrate AI into products and operations without waiting for sector‑level
programmes or central coordination. This gives them room to differentiate on
execution, workflow design, and customer experience.&lt;/p&gt;
&lt;p&gt;In short, the U.S. position allows firms to take advantage of a mature market,
strong investment flows, and a competitive supply base, while China’s model
favours rapid scaling within targeted sectors. Each system has its strengths,
but the U.S. environment gives individual firms more freedom to act and adapt.&lt;/p&gt;
&lt;h1 id="europe-middle-east-and-africa"&gt;Europe, Middle East and Africa&lt;/h1&gt;
&lt;h2 id="europe"&gt;Europe&lt;/h2&gt;
&lt;p&gt;Euronews in 2026, reporting on Eurostat generative AI usage data, identifies
Norway, Ireland, France, and Spain as leaders in individual level adoption.
Euronews also reports that countries with strong digital infrastructure,
sustained skills investment, and mature employer practices show the highest
usage. The same reporting highlights Europe as an active digital governance
environment, although specific AI laws are not detailed in the confirmed
sources.&lt;/p&gt;
&lt;h2 id="united-kingdom"&gt;United Kingdom&lt;/h2&gt;
&lt;p&gt;The United Kingdom appears consistently in major global analyses as a leading
centre for AI research, policy development, and commercial activity.&lt;/p&gt;
&lt;p&gt;The State of AI Report 2025 highlights the United Kingdom's role in research of
frontier models (LLMs) and safety research.  UNCTAD in its 2023 Technology and
Innovation Report places the United Kingdom among economies with strong
technology sectors and supportive policy environments.&lt;/p&gt;
&lt;h2 id="middle-east"&gt;Middle East&lt;/h2&gt;
&lt;p&gt;The AIEI diffusion study identifies the United Arab Emirates as the leading
country per capita globally for working age usage of generative AI tools, at
64.0 percent in late 2025. The same study places Singapore second globally at
60.9 percent. The AIEI attributes these results to early investment in
infrastructure, skills, and government adoption.&lt;/p&gt;
&lt;h2 id="africa"&gt;Africa&lt;/h2&gt;
&lt;p&gt;The AIEI diffusion study reports that AI adoption in the Global North has grown
nearly twice as fast as in the Global South. Africa is considered part of the
Global South. The AIEI attributes lower adoption in the Global South to
differences in infrastructure, skills, and policy readiness.&lt;/p&gt;
&lt;h2 id="conclusions_2"&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;The direction of travel across Europe, the Middle East, and Africa differs
markedly from the paths taken in the United States and China. Europe’s leading
adopters show a pattern built on long‑term institutional strength: digital
infrastructure, skills pipelines, and employer practices that support steady,
broad‑based uptake. This creates a slower but more stable trajectory, shaped by
governance and capability rather than market speed.&lt;/p&gt;
&lt;p&gt;The United Kingdom follows a related but distinct route. Its position is driven
by research depth, frontier model work, and policy activity. This gives the UK
influence in shaping standards and governance, even if its commercial scale is
smaller than that of the United States.&lt;/p&gt;
&lt;p&gt;The Middle East, led by the UAE, shows a different model again. High usage
levels reflect rapid state‑led investment and fast public‑sector adoption. This
is a top‑down route to diffusion, where national strategy translates quickly
into workforce behaviour.&lt;/p&gt;
&lt;p&gt;Africa’s position reflects structural constraints. Lower adoption is tied to
infrastructure, skills, and policy readiness. The pattern is one of uneven
capacity rather than lack of interest or activity.&lt;/p&gt;
&lt;h2 id="looking-to-the-future_1"&gt;Looking to the Future&lt;/h2&gt;
&lt;p&gt;Europe is likely to continue along an institution‑led path, deepening adoption
as digital foundations and skills programmes mature. The UK’s research and
policy strengths position it to shape governance debates and influence global
practice. The Middle East is set to maintain rapid uptake where government
investment remains strong. Africa’s progress will depend on improvements in
infrastructure and skills, which remain the main barriers to wider diffusion.&lt;/p&gt;
&lt;h2 id="contrast-with-the-united-states-and-china"&gt;Contrast with the United States and China&lt;/h2&gt;
&lt;p&gt;The United States moves through commercial scale. Its advantage lies in rapid
enterprise uptake, strong private investment, and a competitive market that
rewards early adoption. Europe, by contrast, advances through governance,
skills, and institutional capacity. The UK sits between the two: commercially
active but anchored in research and policy.&lt;/p&gt;
&lt;p&gt;China’s path is driven by coordinated deployment across priority sectors. This
creates scale quickly, but within a more directed innovation environment. The
Middle East mirrors the speed but not the structure: uptake is fast, but driven
by targeted national investment rather than sector‑level industrial planning.&lt;/p&gt;
&lt;p&gt;In Africa, adoption is limited by structural factors, not by market dynamics or
state‑led programmes. Its direction is one of gradual capacity building rather
than rapid scaling.&lt;/p&gt;
&lt;p&gt;Taken together, EMEA’s direction is shaped by institutions, governance, and
state‑led investment, while the United States advances through market scale and
China through coordinated deployment. Each region moves, but for different
reasons and at different speeds.&lt;/p&gt;
&lt;h1 id="asia"&gt;Asia&lt;/h1&gt;
&lt;h2 id="china"&gt;China&lt;/h2&gt;
&lt;p&gt;The State of AI Report 2025 notes that Chinese frontier model developers such as
DeepSeek, Qwen, and Kimi have closed much of the performance gap with leading
United States models on reasoning and coding tasks.&lt;/p&gt;
&lt;h2 id="south-korea"&gt;South Korea&lt;/h2&gt;
&lt;p&gt;The AIEI diffusion study highlights South Korea's rise from 25th to 18th place
globally in 2025, driven by policy, improved Korean language model performance,
and consumer facing features.&lt;/p&gt;
&lt;h2 id="india-and-japan"&gt;India and Japan&lt;/h2&gt;
&lt;p&gt;India and Japan do not appear in the confirmed AI diffusion rankings published
by the AIEI. The AIEI study provides quantified usage data only for countries
that reached the global leaderboard, and neither India nor Japan is listed.&lt;/p&gt;
&lt;h2 id="singapore"&gt;Singapore&lt;/h2&gt;
&lt;p&gt;The AIEI diffusion study ranks Singapore second globally for working age usage
of generative AI tools, at 60.9 percent. The AIEI links this to early
investment in digital infrastructure, AI skilling, and government adoption.&lt;/p&gt;
&lt;h2 id="conclusions_3"&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Asia shows several distinct paths that differ from both the United States and
China’s own internal model. China’s frontier developers have narrowed the
performance gap with leading U.S. systems, signalling a region where capability
is rising quickly and where model development is becoming more competitive. This
marks China as a major technical actor rather than only a large‑scale adopter.&lt;/p&gt;
&lt;p&gt;South Korea’s movement up the global diffusion rankings reflects a different
dynamic: steady policy support, improved local‑language model performance, and
consumer‑facing features that drive everyday use. This is a pattern of uptake
built on national coordination and product relevance rather than frontier model
competition.&lt;/p&gt;
&lt;p&gt;Singapore sits at the opposite end of the spectrum from most of the region. Its
very high usage levels show what early investment in infrastructure, skills, and
government adoption can achieve. It is a small but highly capable market where
diffusion is broad and rapid.&lt;/p&gt;
&lt;p&gt;India and Japan’s absence from the confirmed diffusion rankings highlights a
lack of comparable usage data rather than a lack of activity. Without quantified
metrics, their position in the regional landscape cannot be assessed in the same
way as China, South Korea, or Singapore.&lt;/p&gt;
&lt;h2 id="looking-to-the-future_2"&gt;Looking to the Future&lt;/h2&gt;
&lt;p&gt;China is likely to continue strengthening its position in model development,
given the narrowing performance gap and the scale of its domestic ecosystem.&lt;/p&gt;
&lt;p&gt;South Korea’s trajectory suggests further gains where policy, language models,
and consumer products continue to align.&lt;/p&gt;
&lt;p&gt;Singapore’s early‑investment model gives it room to maintain high usage levels
as tools mature.&lt;/p&gt;
&lt;p&gt;India and Japan’s future visibility depends on the availability of consistent
diffusion data.&lt;/p&gt;
&lt;h2 id="contrast-with-the-united-states-and-china_1"&gt;Contrast with the United States and China&lt;/h2&gt;
&lt;p&gt;The United States advances through commercial scale and rapid enterprise
adoption. China advances through coordinated capability building and sector‑led
deployment. Much of Asia outside China follows neither path.&lt;/p&gt;
&lt;p&gt;South Korea and Singapore show targeted national strategies that drive uptake
through infrastructure, skills, and consumer‑level features rather than market
competition or industrial planning.&lt;/p&gt;
&lt;p&gt;Taken together, Asia presents a mixed picture: China as a rising technical
competitor to the United States, South Korea and Singapore as fast‑moving
national adopters, and other major economies with limited measurable diffusion.&lt;/p&gt;
&lt;p&gt;This stands in contrast to the U.S. model of commercial scale and China’s model
of coordinated deployment.&lt;/p&gt;
&lt;h1 id="australasia"&gt;Australasia&lt;/h1&gt;
&lt;h2 id="australia-and-new-zealand"&gt;Australia and New Zealand&lt;/h2&gt;
&lt;p&gt;The Australian Bureau of Statistics reports that 24 percent of Australian
businesses used AI technologies in 2023 to 2024. For New Zealand, Digital Skills
Aotearoa states that 19 percent of organisations were using AI tools in 2023.&lt;/p&gt;
&lt;h2 id="conclusions_4"&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Australia and New Zealand show a measured but steady pattern of enterprise‑level
AI uptake. The figures point to two economies where adoption is present across a
meaningful share of organisations, but not yet at the scale seen in the most
rapidly diffusing countries. The pattern is one of gradual integration rather
than rapid acceleration, shaped by existing digital capability and sector
composition.&lt;/p&gt;
&lt;p&gt;The evidence also suggests that both countries are moving from early
experimentation into more routine operational use. The adoption levels recorded
indicate that AI is no longer confined to isolated pilots but is beginning to
appear in day‑to‑day business activity. What remains less clear is the depth of
use within firms and the extent to which adoption is spreading beyond early
movers.&lt;/p&gt;
&lt;h2 id="looking-to-the-future_3"&gt;Looking to the Future&lt;/h2&gt;
&lt;p&gt;The available data points to a likely continuation of this steady trajectory.
Both economies have the digital foundations and organisational structures to
support further uptake as tools mature and become easier to integrate. The
current adoption levels suggest room for growth, particularly as more firms
shift from exploration to implementation.&lt;/p&gt;
&lt;p&gt;Future progress will depend on how quickly organisations can build skills,
update processes, and adapt workflows to make effective use of AI. More
consistent measurement would also help clarify how adoption evolves across
sectors and firm sizes.&lt;/p&gt;
&lt;p&gt;Overall, Australasia appears set for continued, incremental growth in AI use,
driven by practical business needs and supported by existing digital capability.&lt;/p&gt;
&lt;h1 id="latin-america"&gt;Latin America&lt;/h1&gt;
&lt;p&gt;The OECD reports that around 20 percent of Mexican firms use at least one AI
technology. Approximately 15 percent of Brazilian firms report the use of AI
tools. In Chile, OECD statistics show that 12 percent of firms use AI
technologies. Beyond these three countries, the Inter American Development Bank
notes rising AI use across Latin America, especially in financial services and
agriculture, but the IDB does not publish national percentages.&lt;/p&gt;
&lt;h2 id="conclusions_5"&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Latin America shows a pattern of steady but uneven enterprise‑level adoption.
The available figures point to a region where AI use is present across major
economies but varies widely in scale. Mexico, Brazil, and Chile each show
meaningful uptake, yet none approach the levels seen in the fastest‑moving
countries globally. The broader regional picture, drawn from IDB material,
suggests that adoption is strongest in sectors with clear operational gains,
notably financial services and agriculture. This indicates a practical,
needs‑driven approach rather than a technology‑led surge.&lt;/p&gt;
&lt;p&gt;The absence of consistent national metrics beyond the three reported countries
highlights a measurement gap. It is difficult to assess the depth or spread of
adoption across the region without comparable data, and the evidence that does
exist points to early‑stage integration rather than widespread diffusion.&lt;/p&gt;
&lt;h2 id="looking-to-the-future_4"&gt;Looking to the Future&lt;/h2&gt;
&lt;p&gt;The current pattern suggests that Latin America is likely to continue along a
sector‑led path, with adoption growing where AI delivers immediate operational
value. Financial services and agriculture are well placed to deepen their use,
given the early signs of traction. Broader uptake will depend on improvements
in digital infrastructure, skills, and measurement, which remain uneven across
the region.&lt;/p&gt;
&lt;p&gt;More consistent reporting would help clarify how adoption evolves and where
gaps remain. As tools become easier to deploy and integrate, there is room for
growth across a wider range of sectors, but the pace will depend on the
underlying capacity of firms and national digital systems.&lt;/p&gt;
&lt;p&gt;Overall, the region shows early movement, concentrated in specific industries,
with scope for further progress as capability and measurement improve.&lt;/p&gt;
&lt;h1 id="cross-cutting-themes"&gt;Cross cutting themes&lt;/h1&gt;
&lt;h2 id="infrastructure-and-skills-as-foundations"&gt;Infrastructure and skills as foundations&lt;/h2&gt;
&lt;p&gt;The AIEI diffusion study states that countries investing early in digital
infrastructure, AI skilling, and government adoption now lead global usage
rankings.&lt;/p&gt;
&lt;h2 id="uneven-diffusion-and-a-widening-divide"&gt;Uneven diffusion and a widening divide&lt;/h2&gt;
&lt;p&gt;The AIEI highlights a widening divide between the Global North and the Global
South, with adoption in the Global North growing nearly twice as fast.&lt;/p&gt;
&lt;h2 id="commercial-traction-and-enterprise-demand"&gt;Commercial traction and enterprise demand&lt;/h2&gt;
&lt;p&gt;The State of AI Report 2025 and UNCTAD 2023 both point to strong commercial
traction and rising enterprise demand.&lt;/p&gt;
&lt;h2 id="governance-safety-and-regulation"&gt;Governance, safety, and regulation&lt;/h2&gt;
&lt;p&gt;The State of AI Report 2025 notes active regulatory developments and growing
attention to risks associated with highly capable AI systems.&lt;/p&gt;
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;AI progress in 2024–2025 is accelerating, but unevenly. The UAE and Singapore
show what coordinated national strategy and real‑world deployment can achieve,
while the US, China and Europe continue to shape the frontier through research,
investment and commercialisation.&lt;/p&gt;
&lt;p&gt;The emerging divide is not East vs West, it is between nations operationalising
AI at scale and those still discussing its potential.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="evaluate-ai.html"&gt;Evaluating AI systems requires measuring real behaviour — schema reliability, adherence, drift, latency, retrieval quality, and safety — not synthetic benchmarks.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="latency-is-architecural.html"&gt;Most latency comes from retrieval hops and orchestration, not the model; RAG pipelines often recreate microservice-style chatter that slows systems down.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="surface-area.html"&gt;AI systems behave like probabilistic components; engineers must build structured interfaces and layered constraints to make them reliable inside software systems.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#global-trends-in-ai"&gt;Global Trends in AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#global-picture"&gt;Global picture&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#global-adoption-and-diffusion"&gt;Global adoption and diffusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#commercial-traction-and-investment"&gt;Commercial traction and investment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#in-the-future"&gt;In the Future&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#north-america"&gt;North America&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#united-states"&gt;United States&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#canada-and-mexico"&gt;Canada and Mexico&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#mexico"&gt;Mexico&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions_1"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#looking-to-the-future"&gt;Looking to the Future&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#canada-and-mexico_1"&gt;Canada and Mexico&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-united-states"&gt;The United States&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#how-us-businesses-can-build-on-their-current-position"&gt;How U.S. businesses can build on their current position&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#europe-middle-east-and-africa"&gt;Europe, Middle East and Africa&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#europe"&gt;Europe&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#united-kingdom"&gt;United Kingdom&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#middle-east"&gt;Middle East&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#africa"&gt;Africa&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions_2"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#looking-to-the-future_1"&gt;Looking to the Future&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#contrast-with-the-united-states-and-china"&gt;Contrast with the United States and China&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#asia"&gt;Asia&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#china"&gt;China&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#south-korea"&gt;South Korea&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#india-and-japan"&gt;India and Japan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#singapore"&gt;Singapore&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions_3"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#looking-to-the-future_2"&gt;Looking to the Future&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#contrast-with-the-united-states-and-china_1"&gt;Contrast with the United States and China&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#australasia"&gt;Australasia&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#australia-and-new-zealand"&gt;Australia and New Zealand&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions_4"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#looking-to-the-future_3"&gt;Looking to the Future&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#latin-america"&gt;Latin America&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#conclusions_5"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#looking-to-the-future_4"&gt;Looking to the Future&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#cross-cutting-themes"&gt;Cross cutting themes&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#infrastructure-and-skills-as-foundations"&gt;Infrastructure and skills as foundations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#uneven-diffusion-and-a-widening-divide"&gt;Uneven diffusion and a widening divide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#commercial-traction-and-enterprise-demand"&gt;Commercial traction and enterprise demand&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#governance-safety-and-regulation"&gt;Governance, safety, and regulation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#further-reading"&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h1 id="further-reading"&gt;Further Reading&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Amii (Alberta Machine Intelligence Institute)&lt;br/&gt;
  https://www.amii.ca/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Australian Bureau of Statistics. Business Use of Information Technology&lt;br/&gt;
  https://www.abs.gov.au/statistics/industry/technology-and-innovation/business-use-information-technology/latest-release&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Digital Skills Aotearoa. Digital Skills for Tomorrow's World&lt;br/&gt;
  https://digitalskillsforum.nz/digital-skills-report/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Euronews (2026). "AI use at work in Europe"&lt;br/&gt;
  https://www.euronews.com/next/2026/03/19/ai-use-at-work-in-europe-which-countries-use-generative-ai-tools-most-and-why&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Federal Reserve Board. "Monitoring AI Adoption in the U.S. Economy" (2026)&lt;br/&gt;
  https://www.federalreserve.gov/econres/notes/feds-notes/monitoring-ai-adoption-in-the-u-s-economy-20260403.html?utm_source=microsoft.com&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Inter American Development Bank. Digital and AI Transformation&lt;br/&gt;
  https://www.iadb.org/en&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;McKinsey and Company. "The State of AI in 2025"&lt;br/&gt;
  https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Mila (Quebec AI Institute)&lt;br/&gt;
  https://mila.quebec/en/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Microsoft AI Economy Institute. AI Diffusion&lt;br/&gt;
  https://www.microsoft.com/en-us/research/group/aiei/ai-diffusion/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Microsoft AI Economy Institute. "Global AI Adoption in 2025 – A Widening Digital Divide"&lt;br/&gt;
  https://www.microsoft.com/en-us/research/publication/global-ai-adoption-in-2025/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;New Zealand MBIE. Artificial Intelligence Policy&lt;br/&gt;
  https://www.mbie.govt.nz/science-and-technology/it-communications-and-broadband/artificial-intelligence/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;OECD. "The Adoption of Artificial Intelligence in Firms"&lt;br/&gt;
  https://www.oecd.org/en/publications/the-adoption-of-artificial-intelligence-in-firms_f9ef33c3-en/full-report.html&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pan Canadian Artificial Intelligence Strategy&lt;br/&gt;
  https://ised-isde.canada.ca/site/pan-canadian-artificial-intelligence-strategy/en&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Stanford HAI. "AI Index Report 2024"&lt;br/&gt;
  https://aiindex.stanford.edu/report/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;State of AI Report 2025 (Nathan Benaich)&lt;br/&gt;
  https://www.stateof.ai/2025-report-launch&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Statistics Canada. "Artificial intelligence adoption and productivity in Canada"&lt;br/&gt;
  https://www150.statcan.gc.ca/n1/daily-quotidien/240319/dq240319b-eng.htm&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;UNCTAD. "Technology and Innovation Report 2023"&lt;br/&gt;
  https://unctad.org/publication/technology-and-innovation-report-2023&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Vector Institute&lt;br/&gt;
  https://vectorinstitute.ai/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;World Bank. Digital Adoption Index&lt;br/&gt;
  https://www.worldbank.org/en/publication/wdr2021&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content><category term="Build"></category></entry><entry><title>Global AI Trends 2024–2025</title><link href="https://phroneses.com/articles/leadership/notes/global-ai-trends-2024-2025-leadership.html" rel="alternate"></link><published>2026-05-04T00:00:00+00:00</published><updated>2026-05-04T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-05-04:/articles/leadership/notes/global-ai-trends-2024-2025-leadership.html</id><summary type="html">&lt;p&gt;Global evidence shows rapid AI adoption, rising capability, and widening gaps between regions and firms.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="global-trends-in-ai"&gt;Global Trends in AI&lt;/h1&gt;
&lt;p&gt;Artificial intelligence has entered a new phase. It is no longer a pilot or
proof of concept. AI is core infrastructure; a technology that shapes how
economies operate and how firms compete.&lt;/p&gt;
&lt;p&gt;Evidence from the Microsoft AI Economy Institute (AIEI), Stanford HAI, and
McKinsey shows rapid adoption and a widening gap between leaders and others.
What follows is a concise summary of the period from 2024 to 2025, based solely
on verified and reliable evidence.&lt;/p&gt;
&lt;p&gt;The global evidence shows fast adoption, rising capability, and a widening gap
between regions. These patterns set the context for the country level picture,
where the United States remains a major driver of development, investment, and
commercial uptake.&lt;/p&gt;
&lt;h1 id="global-picture"&gt;Global picture&lt;/h1&gt;
&lt;h2 id="global-adoption-and-diffusion"&gt;Global adoption and diffusion&lt;/h2&gt;
&lt;p&gt;The AIEI reports that roughly one in six people worldwide used a generative AI
tool in the second half of 2025. The same study states that 24.7 percent of the
working age population in the Global North used generative AI tools, compared
with 14.1 percent in the Global South. The AIEI attributes this gap to
differences in infrastructure, skills, and policy readiness.&lt;/p&gt;
&lt;h2 id="commercial-traction-and-investment"&gt;Commercial traction and investment&lt;/h2&gt;
&lt;p&gt;The State of AI Report 2025 notes that 44 percent of United States businesses
paid for AI tools in 2025, up from 5 percent in 2023. UNCTAD in its 2023
Technology and Innovation Report confirms strong global growth in AI related
companies and investment, especially in economies with established technology
sectors and supportive policy environments.&lt;/p&gt;
&lt;h2 id="conclusions"&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;The global evidence points to three clear conclusions.  &lt;/p&gt;
&lt;p&gt;First, AI use is now widespread. McKinsey reports that 88 percent of firms use
AI in at least one function, though most have yet to scale it across the
enterprise.  &lt;/p&gt;
&lt;p&gt;Second, capability continues to rise. Stanford HAI shows sharp year‑on‑year
improvements in benchmark performance and a steep fall in model‑usage costs.  &lt;/p&gt;
&lt;p&gt;Third, investment is concentrated. The United States leads private AI
investment, with China closing the performance gap in model quality.&lt;/p&gt;
&lt;h2 id="in-the-future"&gt;In the Future&lt;/h2&gt;
&lt;p&gt;The verified evidence suggests three grounded developments.  &lt;/p&gt;
&lt;p&gt;First, wider business uptake is likely. McKinsey finds most organisations are
still in pilot mode, implying further diffusion as workflows are redesigned.  &lt;/p&gt;
&lt;p&gt;Second, capability gaps between regions may widen. The AIEI reports higher
adoption in the Global North, driven by infrastructure and skills, and Stanford
HAI shows the United States and China pulling ahead in model development.  &lt;/p&gt;
&lt;p&gt;Third, investment patterns point to continued commercialisation. Stanford HAI
records strong private investment in generative AI, with the United States far
ahead of other economies.&lt;/p&gt;
&lt;p&gt;These trends indicate a maturing technology, uneven readiness across regions,
and a period where firms that can integrate AI into workflows will move faster
than those still experimenting.&lt;/p&gt;
&lt;h1 id="north-america"&gt;North America&lt;/h1&gt;
&lt;h2 id="united-states"&gt;United States&lt;/h2&gt;
&lt;p&gt;The State of AI Report 2025 reports that United States organisations continue
to lead in frontier model (LLM) development and commercialisation. The AIEI
diffusion study places the United States 24th globally for working age usage of
generative AI tools, at 28.3 percent. The Federal Reserve Board in its 2026
FEDS Note reports high AI adoption in United States professional services and
financial services.&lt;/p&gt;
&lt;h2 id="canada-and-mexico"&gt;Canada and Mexico&lt;/h2&gt;
&lt;p&gt;Statistics Canada reports that 12.2 percent of Canadian firms used AI to produce
goods or deliver services in 2025, with a further 14.5 percent planning to
adopt AI within the following year.&lt;/p&gt;
&lt;p&gt;This reflects a steady rise in enterprise use rather than a population level
diffusion measure.&lt;/p&gt;
&lt;p&gt;Broader policy material, including the Pan Canadian Artificial Intelligence
Strategy and the work of institutes such as Amii, Mila, and Vector, confirms an
active national ecosystem but does not provide quantified adoption metrics.&lt;/p&gt;
&lt;h2 id="mexico"&gt;Mexico&lt;/h2&gt;
&lt;p&gt;The OECD reports that around 20 percent of Mexican firms use at least one AI
technology, but this is a general AI adoption figure, not a generative
AI diffusion metric and is not tied to 2024 to 2025 specifically.&lt;/p&gt;
&lt;h2 id="conclusions_1"&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;The United States stands out for commercial uptake. In the U.S., public uptake
is clearly more advanced, with clearer evidence of scale and investment.&lt;/p&gt;
&lt;p&gt;Canada’s AI uptake is driven mainly by firms rather than
the general population. The Statistics Canada figures point to a measured,
incremental pattern of adoption, with a clear pipeline of organisations preparing
to introduce AI into their operations. The wider national ecosystem is active,
but the absence of quantified diffusion data means the scale of use beyond the
enterprise level cannot be assessed.&lt;/p&gt;
&lt;p&gt;Mexico’s position is different. The OECD figure shows that a notable share of
firms use at least one AI technology, but the measure is broad and not tied to
generative AI or the 2024–2025 period. The available evidence therefore gives a
sense of adoption but not its depth, maturity, or rate of change.&lt;/p&gt;
&lt;h2 id="looking-to-the-future"&gt;Looking to the Future&lt;/h2&gt;
&lt;h3 id="canada-and-mexico_1"&gt;Canada and Mexico&lt;/h3&gt;
&lt;p&gt;The verified material suggests that Canada’s enterprise‑level adoption is likely
to continue rising, given the proportion of firms planning to adopt AI and the
presence of established research institutes. The lack of population‑level data
remains a gap, limiting visibility of wider diffusion.&lt;/p&gt;
&lt;p&gt;Mexico’s general adoption figure indicates that AI is present across parts of
the economy, but the absence of more granular or time‑specific data makes it
hard to track progress or compare with other regions. Both countries would
benefit from more consistent measurement to understand how adoption evolves over
time.&lt;/p&gt;
&lt;h3 id="the-united-states"&gt;The United States&lt;/h3&gt;
&lt;p&gt;The United States shows a more advanced stage of AI commercialisation than its
neighbours. The scale of paid use indicates that AI has moved beyond trial
activity and is now embedded in day‑to‑day business operations. This reflects a
market where firms are not only experimenting but committing resources and
integrating AI into core workflows.&lt;/p&gt;
&lt;p&gt;The strength of the U.S. research and investment base reinforces this position.
A large share of global private investment, combined with a concentration of
leading model developers, gives the U.S. a structural advantage. This creates a
feedback loop: strong domestic capability supports commercial uptake, and
commercial uptake in turn drives further capability.&lt;/p&gt;
&lt;p&gt;Public use also appears more developed. Higher adoption levels across the
Global North, combined with the U.S. role as a major producer and buyer of AI
systems, point to a broader diffusion of tools into everyday work and consumer
contexts.&lt;/p&gt;
&lt;p&gt;Taken together, the evidence shows an economy where AI is already part of the
operational fabric, supported by deep investment, strong research output, and a
business environment that moves quickly from experimentation to deployment.&lt;/p&gt;
&lt;h3 id="how-us-businesses-can-build-on-their-current-position"&gt;How U.S. businesses can build on their current position&lt;/h3&gt;
&lt;p&gt;The evidence shows that the United States holds two structural advantages:
strong commercial uptake and deep private investment. China, by contrast, leads
in large‑scale deployment in specific sectors and in state‑directed industrial
programmes. These differences shape how firms in each country can move.&lt;/p&gt;
&lt;p&gt;For U.S. businesses, the main advantage is speed. The high rate of paid use
means firms are already integrating AI into everyday operations. This allows
them to refine workflows, build internal capability, and compound gains earlier
than competitors. The depth of private investment also gives U.S. firms access
to a broad supply of models, tooling, and infrastructure, which lowers the cost
of experimentation and adoption.&lt;/p&gt;
&lt;p&gt;China’s strength lies in coordinated deployment across priority sectors. This
creates scale quickly, but it also means firms operate within a more directed
innovation environment. U.S. firms, by contrast, benefit from a more open
commercial ecosystem, where competition between providers drives rapid
improvement in tools and services.&lt;/p&gt;
&lt;p&gt;The practical insight is that U.S. businesses can move faster because the
commercial environment rewards early adoption and continuous iteration. They
can integrate AI into products and operations without waiting for sector‑level
programmes or central coordination. This gives them room to differentiate on
execution, workflow design, and customer experience.&lt;/p&gt;
&lt;p&gt;In short, the U.S. position allows firms to take advantage of a mature market,
strong investment flows, and a competitive supply base, while China’s model
favours rapid scaling within targeted sectors. Each system has its strengths,
but the U.S. environment gives individual firms more freedom to act and adapt.&lt;/p&gt;
&lt;h1 id="europe-middle-east-and-africa"&gt;Europe, Middle East and Africa&lt;/h1&gt;
&lt;h2 id="europe"&gt;Europe&lt;/h2&gt;
&lt;p&gt;Euronews in 2026, reporting on Eurostat generative AI usage data, identifies
Norway, Ireland, France, and Spain as leaders in individual level adoption.
Euronews also reports that countries with strong digital infrastructure,
sustained skills investment, and mature employer practices show the highest
usage. The same reporting highlights Europe as an active digital governance
environment, although specific AI laws are not detailed in the confirmed
sources.&lt;/p&gt;
&lt;h2 id="united-kingdom"&gt;United Kingdom&lt;/h2&gt;
&lt;p&gt;The United Kingdom appears consistently in major global analyses as a leading
centre for AI research, policy development, and commercial activity.&lt;/p&gt;
&lt;p&gt;The State of AI Report 2025 highlights the United Kingdom's role in research of
frontier models (LLMs) and safety research.  UNCTAD in its 2023 Technology and
Innovation Report places the United Kingdom among economies with strong
technology sectors and supportive policy environments.&lt;/p&gt;
&lt;h2 id="middle-east"&gt;Middle East&lt;/h2&gt;
&lt;p&gt;The AIEI diffusion study identifies the United Arab Emirates as the leading
country per capita globally for working age usage of generative AI tools, at
64.0 percent in late 2025. The same study places Singapore second globally at
60.9 percent. The AIEI attributes these results to early investment in
infrastructure, skills, and government adoption.&lt;/p&gt;
&lt;h2 id="africa"&gt;Africa&lt;/h2&gt;
&lt;p&gt;The AIEI diffusion study reports that AI adoption in the Global North has grown
nearly twice as fast as in the Global South. Africa is considered part of the
Global South. The AIEI attributes lower adoption in the Global South to
differences in infrastructure, skills, and policy readiness.&lt;/p&gt;
&lt;h2 id="conclusions_2"&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;The direction of travel across Europe, the Middle East, and Africa differs
markedly from the paths taken in the United States and China. Europe’s leading
adopters show a pattern built on long‑term institutional strength: digital
infrastructure, skills pipelines, and employer practices that support steady,
broad‑based uptake. This creates a slower but more stable trajectory, shaped by
governance and capability rather than market speed.&lt;/p&gt;
&lt;p&gt;The United Kingdom follows a related but distinct route. Its position is driven
by research depth, frontier model work, and policy activity. This gives the UK
influence in shaping standards and governance, even if its commercial scale is
smaller than that of the United States.&lt;/p&gt;
&lt;p&gt;The Middle East, led by the UAE, shows a different model again. High usage
levels reflect rapid state‑led investment and fast public‑sector adoption. This
is a top‑down route to diffusion, where national strategy translates quickly
into workforce behaviour.&lt;/p&gt;
&lt;p&gt;Africa’s position reflects structural constraints. Lower adoption is tied to
infrastructure, skills, and policy readiness. The pattern is one of uneven
capacity rather than lack of interest or activity.&lt;/p&gt;
&lt;h2 id="looking-to-the-future_1"&gt;Looking to the Future&lt;/h2&gt;
&lt;p&gt;Europe is likely to continue along an institution‑led path, deepening adoption
as digital foundations and skills programmes mature. The UK’s research and
policy strengths position it to shape governance debates and influence global
practice. The Middle East is set to maintain rapid uptake where government
investment remains strong. Africa’s progress will depend on improvements in
infrastructure and skills, which remain the main barriers to wider diffusion.&lt;/p&gt;
&lt;h2 id="contrast-with-the-united-states-and-china"&gt;Contrast with the United States and China&lt;/h2&gt;
&lt;p&gt;The United States moves through commercial scale. Its advantage lies in rapid
enterprise uptake, strong private investment, and a competitive market that
rewards early adoption. Europe, by contrast, advances through governance,
skills, and institutional capacity. The UK sits between the two: commercially
active but anchored in research and policy.&lt;/p&gt;
&lt;p&gt;China’s path is driven by coordinated deployment across priority sectors. This
creates scale quickly, but within a more directed innovation environment. The
Middle East mirrors the speed but not the structure: uptake is fast, but driven
by targeted national investment rather than sector‑level industrial planning.&lt;/p&gt;
&lt;p&gt;In Africa, adoption is limited by structural factors, not by market dynamics or
state‑led programmes. Its direction is one of gradual capacity building rather
than rapid scaling.&lt;/p&gt;
&lt;p&gt;Taken together, EMEA’s direction is shaped by institutions, governance, and
state‑led investment, while the United States advances through market scale and
China through coordinated deployment. Each region moves, but for different
reasons and at different speeds.&lt;/p&gt;
&lt;h1 id="asia"&gt;Asia&lt;/h1&gt;
&lt;h2 id="china"&gt;China&lt;/h2&gt;
&lt;p&gt;The State of AI Report 2025 notes that Chinese frontier model developers such as
DeepSeek, Qwen, and Kimi have closed much of the performance gap with leading
United States models on reasoning and coding tasks.&lt;/p&gt;
&lt;h2 id="south-korea"&gt;South Korea&lt;/h2&gt;
&lt;p&gt;The AIEI diffusion study highlights South Korea's rise from 25th to 18th place
globally in 2025, driven by policy, improved Korean language model performance,
and consumer facing features.&lt;/p&gt;
&lt;h2 id="india-and-japan"&gt;India and Japan&lt;/h2&gt;
&lt;p&gt;India and Japan do not appear in the confirmed AI diffusion rankings published
by the AIEI. The AIEI study provides quantified usage data only for countries
that reached the global leaderboard, and neither India nor Japan is listed.&lt;/p&gt;
&lt;h2 id="singapore"&gt;Singapore&lt;/h2&gt;
&lt;p&gt;The AIEI diffusion study ranks Singapore second globally for working age usage
of generative AI tools, at 60.9 percent. The AIEI links this to early
investment in digital infrastructure, AI skilling, and government adoption.&lt;/p&gt;
&lt;h2 id="conclusions_3"&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Asia shows several distinct paths that differ from both the United States and
China’s own internal model. China’s frontier developers have narrowed the
performance gap with leading U.S. systems, signalling a region where capability
is rising quickly and where model development is becoming more competitive. This
marks China as a major technical actor rather than only a large‑scale adopter.&lt;/p&gt;
&lt;p&gt;South Korea’s movement up the global diffusion rankings reflects a different
dynamic: steady policy support, improved local‑language model performance, and
consumer‑facing features that drive everyday use. This is a pattern of uptake
built on national coordination and product relevance rather than frontier model
competition.&lt;/p&gt;
&lt;p&gt;Singapore sits at the opposite end of the spectrum from most of the region. Its
very high usage levels show what early investment in infrastructure, skills, and
government adoption can achieve. It is a small but highly capable market where
diffusion is broad and rapid.&lt;/p&gt;
&lt;p&gt;India and Japan’s absence from the confirmed diffusion rankings highlights a
lack of comparable usage data rather than a lack of activity. Without quantified
metrics, their position in the regional landscape cannot be assessed in the same
way as China, South Korea, or Singapore.&lt;/p&gt;
&lt;h2 id="looking-to-the-future_2"&gt;Looking to the Future&lt;/h2&gt;
&lt;p&gt;China is likely to continue strengthening its position in model development,
given the narrowing performance gap and the scale of its domestic ecosystem.&lt;/p&gt;
&lt;p&gt;South Korea’s trajectory suggests further gains where policy, language models,
and consumer products continue to align.&lt;/p&gt;
&lt;p&gt;Singapore’s early‑investment model gives it room to maintain high usage levels
as tools mature.&lt;/p&gt;
&lt;p&gt;India and Japan’s future visibility depends on the availability of consistent
diffusion data.&lt;/p&gt;
&lt;h2 id="contrast-with-the-united-states-and-china_1"&gt;Contrast with the United States and China&lt;/h2&gt;
&lt;p&gt;The United States advances through commercial scale and rapid enterprise
adoption. China advances through coordinated capability building and sector‑led
deployment. Much of Asia outside China follows neither path.&lt;/p&gt;
&lt;p&gt;South Korea and Singapore show targeted national strategies that drive uptake
through infrastructure, skills, and consumer‑level features rather than market
competition or industrial planning.&lt;/p&gt;
&lt;p&gt;Taken together, Asia presents a mixed picture: China as a rising technical
competitor to the United States, South Korea and Singapore as fast‑moving
national adopters, and other major economies with limited measurable diffusion.&lt;/p&gt;
&lt;p&gt;This stands in contrast to the U.S. model of commercial scale and China’s model
of coordinated deployment.&lt;/p&gt;
&lt;h1 id="australasia"&gt;Australasia&lt;/h1&gt;
&lt;h2 id="australia-and-new-zealand"&gt;Australia and New Zealand&lt;/h2&gt;
&lt;p&gt;The Australian Bureau of Statistics reports that 24 percent of Australian
businesses used AI technologies in 2023 to 2024. For New Zealand, Digital Skills
Aotearoa states that 19 percent of organisations were using AI tools in 2023.&lt;/p&gt;
&lt;h2 id="conclusions_4"&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Australia and New Zealand show a measured but steady pattern of enterprise‑level
AI uptake. The figures point to two economies where adoption is present across a
meaningful share of organisations, but not yet at the scale seen in the most
rapidly diffusing countries. The pattern is one of gradual integration rather
than rapid acceleration, shaped by existing digital capability and sector
composition.&lt;/p&gt;
&lt;p&gt;The evidence also suggests that both countries are moving from early
experimentation into more routine operational use. The adoption levels recorded
indicate that AI is no longer confined to isolated pilots but is beginning to
appear in day‑to‑day business activity. What remains less clear is the depth of
use within firms and the extent to which adoption is spreading beyond early
movers.&lt;/p&gt;
&lt;h2 id="looking-to-the-future_3"&gt;Looking to the Future&lt;/h2&gt;
&lt;p&gt;The available data points to a likely continuation of this steady trajectory.
Both economies have the digital foundations and organisational structures to
support further uptake as tools mature and become easier to integrate. The
current adoption levels suggest room for growth, particularly as more firms
shift from exploration to implementation.&lt;/p&gt;
&lt;p&gt;Future progress will depend on how quickly organisations can build skills,
update processes, and adapt workflows to make effective use of AI. More
consistent measurement would also help clarify how adoption evolves across
sectors and firm sizes.&lt;/p&gt;
&lt;p&gt;Overall, Australasia appears set for continued, incremental growth in AI use,
driven by practical business needs and supported by existing digital capability.&lt;/p&gt;
&lt;h1 id="latin-america"&gt;Latin America&lt;/h1&gt;
&lt;p&gt;The OECD reports that around 20 percent of Mexican firms use at least one AI
technology. Approximately 15 percent of Brazilian firms report the use of AI
tools. In Chile, OECD statistics show that 12 percent of firms use AI
technologies. Beyond these three countries, the Inter American Development Bank
notes rising AI use across Latin America, especially in financial services and
agriculture, but the IDB does not publish national percentages.&lt;/p&gt;
&lt;h2 id="conclusions_5"&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Latin America shows a pattern of steady but uneven enterprise‑level adoption.
The available figures point to a region where AI use is present across major
economies but varies widely in scale. Mexico, Brazil, and Chile each show
meaningful uptake, yet none approach the levels seen in the fastest‑moving
countries globally. The broader regional picture, drawn from IDB material,
suggests that adoption is strongest in sectors with clear operational gains,
notably financial services and agriculture. This indicates a practical,
needs‑driven approach rather than a technology‑led surge.&lt;/p&gt;
&lt;p&gt;The absence of consistent national metrics beyond the three reported countries
highlights a measurement gap. It is difficult to assess the depth or spread of
adoption across the region without comparable data, and the evidence that does
exist points to early‑stage integration rather than widespread diffusion.&lt;/p&gt;
&lt;h2 id="looking-to-the-future_4"&gt;Looking to the Future&lt;/h2&gt;
&lt;p&gt;The current pattern suggests that Latin America is likely to continue along a
sector‑led path, with adoption growing where AI delivers immediate operational
value. Financial services and agriculture are well placed to deepen their use,
given the early signs of traction. Broader uptake will depend on improvements
in digital infrastructure, skills, and measurement, which remain uneven across
the region.&lt;/p&gt;
&lt;p&gt;More consistent reporting would help clarify how adoption evolves and where
gaps remain. As tools become easier to deploy and integrate, there is room for
growth across a wider range of sectors, but the pace will depend on the
underlying capacity of firms and national digital systems.&lt;/p&gt;
&lt;p&gt;Overall, the region shows early movement, concentrated in specific industries,
with scope for further progress as capability and measurement improve.&lt;/p&gt;
&lt;h1 id="cross-cutting-themes"&gt;Cross cutting themes&lt;/h1&gt;
&lt;h2 id="infrastructure-and-skills-as-foundations"&gt;Infrastructure and skills as foundations&lt;/h2&gt;
&lt;p&gt;The AIEI diffusion study states that countries investing early in digital
infrastructure, AI skilling, and government adoption now lead global usage
rankings.&lt;/p&gt;
&lt;h2 id="uneven-diffusion-and-a-widening-divide"&gt;Uneven diffusion and a widening divide&lt;/h2&gt;
&lt;p&gt;The AIEI highlights a widening divide between the Global North and the Global
South, with adoption in the Global North growing nearly twice as fast.&lt;/p&gt;
&lt;h2 id="commercial-traction-and-enterprise-demand"&gt;Commercial traction and enterprise demand&lt;/h2&gt;
&lt;p&gt;The State of AI Report 2025 and UNCTAD 2023 both point to strong commercial
traction and rising enterprise demand.&lt;/p&gt;
&lt;h2 id="governance-safety-and-regulation"&gt;Governance, safety, and regulation&lt;/h2&gt;
&lt;p&gt;The State of AI Report 2025 notes active regulatory developments and growing
attention to risks associated with highly capable AI systems.&lt;/p&gt;
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;AI progress in 2024–2025 is accelerating, but unevenly. The UAE and Singapore
show what coordinated national strategy and real‑world deployment can achieve,
while the US, China and Europe continue to shape the frontier through research,
investment and commercialisation.&lt;/p&gt;
&lt;p&gt;The emerging divide is not East vs West, it is between nations operationalising
AI at scale and those still discussing its potential.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="transforming.html"&gt;AI adoption is an organisational transformation requiring mandates, measurement, and redesigned processes.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="tech-executives.html"&gt;Executives must treat LLMs as probabilistic systems requiring controls, governance, and new forms of oversight.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="evaluate-ai.html"&gt;Evaluating AI systems requires measuring real behaviour — schema reliability, adherence, drift, latency, retrieval quality, and safety — not synthetic benchmarks.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#global-trends-in-ai"&gt;Global Trends in AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#global-picture"&gt;Global picture&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#global-adoption-and-diffusion"&gt;Global adoption and diffusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#commercial-traction-and-investment"&gt;Commercial traction and investment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#in-the-future"&gt;In the Future&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#north-america"&gt;North America&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#united-states"&gt;United States&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#canada-and-mexico"&gt;Canada and Mexico&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#mexico"&gt;Mexico&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions_1"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#looking-to-the-future"&gt;Looking to the Future&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#canada-and-mexico_1"&gt;Canada and Mexico&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-united-states"&gt;The United States&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#how-us-businesses-can-build-on-their-current-position"&gt;How U.S. businesses can build on their current position&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#europe-middle-east-and-africa"&gt;Europe, Middle East and Africa&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#europe"&gt;Europe&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#united-kingdom"&gt;United Kingdom&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#middle-east"&gt;Middle East&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#africa"&gt;Africa&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions_2"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#looking-to-the-future_1"&gt;Looking to the Future&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#contrast-with-the-united-states-and-china"&gt;Contrast with the United States and China&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#asia"&gt;Asia&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#china"&gt;China&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#south-korea"&gt;South Korea&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#india-and-japan"&gt;India and Japan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#singapore"&gt;Singapore&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions_3"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#looking-to-the-future_2"&gt;Looking to the Future&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#contrast-with-the-united-states-and-china_1"&gt;Contrast with the United States and China&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#australasia"&gt;Australasia&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#australia-and-new-zealand"&gt;Australia and New Zealand&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions_4"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#looking-to-the-future_3"&gt;Looking to the Future&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#latin-america"&gt;Latin America&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#conclusions_5"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#looking-to-the-future_4"&gt;Looking to the Future&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#cross-cutting-themes"&gt;Cross cutting themes&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#infrastructure-and-skills-as-foundations"&gt;Infrastructure and skills as foundations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#uneven-diffusion-and-a-widening-divide"&gt;Uneven diffusion and a widening divide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#commercial-traction-and-enterprise-demand"&gt;Commercial traction and enterprise demand&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#governance-safety-and-regulation"&gt;Governance, safety, and regulation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#further-reading"&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h1 id="further-reading"&gt;Further Reading&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Amii (Alberta Machine Intelligence Institute)&lt;br/&gt;
  https://www.amii.ca/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Australian Bureau of Statistics. Business Use of Information Technology&lt;br/&gt;
  https://www.abs.gov.au/statistics/industry/technology-and-innovation/business-use-information-technology/latest-release&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Digital Skills Aotearoa. Digital Skills for Tomorrow's World&lt;br/&gt;
  https://digitalskillsforum.nz/digital-skills-report/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Euronews (2026). "AI use at work in Europe"&lt;br/&gt;
  https://www.euronews.com/next/2026/03/19/ai-use-at-work-in-europe-which-countries-use-generative-ai-tools-most-and-why&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Federal Reserve Board. "Monitoring AI Adoption in the U.S. Economy" (2026)&lt;br/&gt;
  https://www.federalreserve.gov/econres/notes/feds-notes/monitoring-ai-adoption-in-the-u-s-economy-20260403.html?utm_source=microsoft.com&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Inter American Development Bank. Digital and AI Transformation&lt;br/&gt;
  https://www.iadb.org/en&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;McKinsey and Company. "The State of AI in 2025"&lt;br/&gt;
  https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Mila (Quebec AI Institute)&lt;br/&gt;
  https://mila.quebec/en/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Microsoft AI Economy Institute. AI Diffusion&lt;br/&gt;
  https://www.microsoft.com/en-us/research/group/aiei/ai-diffusion/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Microsoft AI Economy Institute. "Global AI Adoption in 2025 – A Widening Digital Divide"&lt;br/&gt;
  https://www.microsoft.com/en-us/research/publication/global-ai-adoption-in-2025/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;New Zealand MBIE. Artificial Intelligence Policy&lt;br/&gt;
  https://www.mbie.govt.nz/science-and-technology/it-communications-and-broadband/artificial-intelligence/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;OECD. "The Adoption of Artificial Intelligence in Firms"&lt;br/&gt;
  https://www.oecd.org/en/publications/the-adoption-of-artificial-intelligence-in-firms_f9ef33c3-en/full-report.html&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pan Canadian Artificial Intelligence Strategy&lt;br/&gt;
  https://ised-isde.canada.ca/site/pan-canadian-artificial-intelligence-strategy/en&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Stanford HAI. "AI Index Report 2024"&lt;br/&gt;
  https://aiindex.stanford.edu/report/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;State of AI Report 2025 (Nathan Benaich)&lt;br/&gt;
  https://www.stateof.ai/2025-report-launch&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Statistics Canada. "Artificial intelligence adoption and productivity in Canada"&lt;br/&gt;
  https://www150.statcan.gc.ca/n1/daily-quotidien/240319/dq240319b-eng.htm&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;UNCTAD. "Technology and Innovation Report 2023"&lt;br/&gt;
  https://unctad.org/publication/technology-and-innovation-report-2023&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Vector Institute&lt;br/&gt;
  https://vectorinstitute.ai/&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;World Bank. Digital Adoption Index&lt;br/&gt;
  https://www.worldbank.org/en/publication/wdr2021&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content><category term="Leadership"></category></entry><entry><title>AI and Brands: A Practical Framework for Protecting and Strengthening Brand Equity</title><link href="https://phroneses.com/articles/leadership/notes/ai-and-brands-framework.html" rel="alternate"></link><published>2026-04-28T00:00:00+00:00</published><updated>2026-04-28T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-28:/articles/leadership/notes/ai-and-brands-framework.html</id><summary type="html">&lt;p&gt;AI strengthens brands when it improves precision, consistency, and control — and destroys them when it introduces noise.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="ai-and-brands-a-practical-framework-for-protecting-and-strengthening-brand-equity"&gt;AI and Brands: A Practical Framework for Protecting and Strengthening Brand Equity&lt;/h1&gt;
&lt;p&gt;Artificial intelligence is reshaping how organisations operate, communicate, and
compete. For brand‑led companies, the central question is not whether to adopt
AI, but how to do so without weakening the brand assets that drive long‑term
equity. Evidence from early adopters across consumer goods, luxury, retail,
financial services, and hospitality shows a consistent pattern: AI creates value
when it strengthens precision, consistency, and operational control. It destroys
value when it introduces noise, dilutes identity, or automates interactions that
depend on human judgement.&lt;/p&gt;
&lt;p&gt;This paper outlines a pragmatic framework for leaders who want to deploy AI
responsibly. It focuses on brand integrity, operational discipline, and
governance. The goal is to help organisations adopt AI in a way that protects
their distinctiveness and enhances long‑term brand value.&lt;/p&gt;
&lt;h1 id="1-protect-the-brands-voice"&gt;1. Protect the Brand's Voice&lt;/h1&gt;
&lt;p&gt;Brand equity is built on consistent language, narrative structure, and creative
identity. AI systems that generate content without guardrails often drift toward
generic phrasing and inconsistent tone. This risk increases when organisations
use public large language models trained on broad internet data.&lt;/p&gt;
&lt;p&gt;Leaders should ensure that AI reinforces the brand's established voice rather
than reinterpreting it. This requires controlled training data, clear tone
guidelines, and human review for all customer‑facing outputs.&lt;/p&gt;
&lt;h1 id="2-prioritise-precision-over-scale"&gt;2. Prioritise Precision Over Scale&lt;/h1&gt;
&lt;p&gt;Many AI deployments focus on volume: more content, more interactions, more
automation. Evidence from Harvard Business Review (2023) shows that this
approach often reduces quality and erodes brand trust. High‑performing
organisations use AI to improve accuracy, consistency, and operational
foresight, not to increase output indiscriminately.&lt;/p&gt;
&lt;p&gt;Precision‑oriented use cases include demand forecasting, inventory optimisation,
quality control, and internal decision support.&lt;/p&gt;
&lt;h1 id="3-keep-ai-invisible-to-the-customer"&gt;3. Keep AI Invisible to the Customer&lt;/h1&gt;
&lt;p&gt;Customer experience research as reported in Journal of Service Research (2022)
shows that trust, empathy, and discretion are strongest when interactions are
human‑led. AI should support frontline teams with insight and preparation, not
replace them. Automated customer communication often feels transactional and
reduces perceived brand value.&lt;/p&gt;
&lt;p&gt;AI is most effective when it enhances human performance without becoming visible
to the customer.&lt;/p&gt;
&lt;h1 id="4-avoid-generic-models-and-generic-content"&gt;4. Avoid Generic Models and Generic Content&lt;/h1&gt;
&lt;p&gt;Public models and automated content tools tend to produce language that is
interchangeable across brands. This undermines differentiation and introduces
tone drift. Organisations that rely on generic AI systems risk losing control of
their narrative and weakening their competitive position.&lt;/p&gt;
&lt;p&gt;Brand‑aligned AI requires private models, curated training data, and strict
governance.&lt;/p&gt;
&lt;h1 id="5-pilot-in-lowexposure-domains-first"&gt;5. Pilot in Low‑Exposure Domains First&lt;/h1&gt;
&lt;p&gt;The most successful AI programmes begin with internal, low‑risk domains where
accuracy and operational efficiency can be measured objectively. These include
forecasting, supply chain optimisation, service diagnostics, and workflow
scheduling.&lt;/p&gt;
&lt;p&gt;Early pilots should focus on measurable improvements and operational fit before
any customer‑facing deployment.&lt;/p&gt;
&lt;h1 id="6-build-private-controlled-models"&gt;6. Build Private, Controlled Models&lt;/h1&gt;
&lt;p&gt;Brand language, archives, and internal knowledge are strategic assets. They
should be treated as intellectual property and protected accordingly. Private
models trained on controlled datasets reduce the risk of data leakage, tone
drift, and unpredictable behaviour.&lt;/p&gt;
&lt;p&gt;A smaller, well‑governed model is often more effective than a large, public one.&lt;/p&gt;
&lt;h1 id="7-maintain-human-authority"&gt;7. Maintain Human Authority&lt;/h1&gt;
&lt;p&gt;AI can analyse patterns and surface insights, but final decisions should remain
human‑led. This is especially important in areas involving brand expression,
creative direction, and customer relationships.&lt;/p&gt;
&lt;p&gt;Human oversight ensures accountability, protects brand integrity, and prevents
over‑automation.&lt;/p&gt;
&lt;h1 id="8-govern-early-and-rigorously"&gt;8. Govern Early and Rigorously&lt;/h1&gt;
&lt;p&gt;Effective AI governance requires clear rules for data handling, model updates,
access control, and auditability. Organisations that establish governance early
experience fewer failures and lower reputational risk.&lt;/p&gt;
&lt;p&gt;Governance should include tone standards, review processes, and regular
evaluation of model behaviour.&lt;/p&gt;
&lt;h1 id="9-reject-ai-that-competes-with-brand-craft"&gt;9. Reject AI That Competes With Brand Craft&lt;/h1&gt;
&lt;p&gt;AI‑generated creative outputs, automated engagement systems, and public
authentication tools for goods (such as Entrupy) often conflict with the
brand's identity and expertise.  These systems can erode trust, reduce
perceived quality, and create a false sense of modernity.&lt;/p&gt;
&lt;p&gt;AI should never replace the craft, judgement, or creative leadership that define
the brand.&lt;/p&gt;
&lt;h1 id="10-use-ai-to-strengthen-what-makes-the-brand-distinctive"&gt;10. Use AI to Strengthen What Makes the Brand Distinctive&lt;/h1&gt;
&lt;p&gt;The purpose of AI is not to transform a brand into an "AI‑driven" organisation.
The purpose is to deepen the qualities that already differentiate the brand:
coherence, precision, reliability, and long‑term equity.&lt;/p&gt;
&lt;p&gt;AI should act as a precision instrument that enhances operational discipline and
brand consistency.&lt;/p&gt;
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;AI can strengthen a brand when deployed with discipline, clarity, and strong
governance. It can weaken a brand when used without boundaries or when adopted
for speed rather than strategic fit. Industry leaders who treat AI as a tool for
precision, not automation, will protect their brand identity while gaining
measurable operational advantage.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="ai-luxury-watchmaking.html"&gt;Luxury maisons must adopt AI with restraint, using it as a precision instrument that protects craft, tone, and identity.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="tech-executives.html"&gt;Executives must treat LLMs as probabilistic systems requiring controls, governance, and new forms of oversight.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="transforming.html"&gt;AI adoption is an organisational transformation requiring mandates, measurement, and redesigned processes.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#ai-and-brands-a-practical-framework-for-protecting-and-strengthening-brand-equity"&gt;AI and Brands: A Practical Framework for Protecting and Strengthening Brand Equity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#1-protect-the-brands-voice"&gt;1. Protect the Brand's Voice&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#2-prioritise-precision-over-scale"&gt;2. Prioritise Precision Over Scale&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#3-keep-ai-invisible-to-the-customer"&gt;3. Keep AI Invisible to the Customer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#4-avoid-generic-models-and-generic-content"&gt;4. Avoid Generic Models and Generic Content&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#5-pilot-in-lowexposure-domains-first"&gt;5. Pilot in Low‑Exposure Domains First&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#6-build-private-controlled-models"&gt;6. Build Private, Controlled Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#7-maintain-human-authority"&gt;7. Maintain Human Authority&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#8-govern-early-and-rigorously"&gt;8. Govern Early and Rigorously&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#9-reject-ai-that-competes-with-brand-craft"&gt;9. Reject AI That Competes With Brand Craft&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#10-use-ai-to-strengthen-what-makes-the-brand-distinctive"&gt;10. Use AI to Strengthen What Makes the Brand Distinctive&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#further-reading"&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h1 id="further-reading"&gt;Further Reading&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;McKinsey Global Institute, "The Economic Potential of Generative AI"&lt;/li&gt;
&lt;li&gt;Bain and Company, "How Leading Brands Use AI Without Losing Their Identity"&lt;/li&gt;
&lt;li&gt;Deloitte, "AI Governance: Balancing Innovation and Risk"&lt;/li&gt;
&lt;li&gt;Harvard Business Review, "When AI Enhances, Not Replaces, Human Judgment"&lt;/li&gt;
&lt;li&gt;MIT Sloan Management Review, "The Hidden Costs of AI‑Generated Content"&lt;/li&gt;
&lt;li&gt;Harvard Business Review (2023), "Consumers Prefer Human Creativity Over AI"&lt;/li&gt;
&lt;li&gt;Entrupy - https://www.entrupy.com/luxury-authentication/&lt;/li&gt;
&lt;/ul&gt;</content><category term="Leadership"></category></entry><entry><title>AI for Luxury Watchmaking: Discipline Over Display</title><link href="https://phroneses.com/articles/leadership/notes/ai-luxury-watchmaking.html" rel="alternate"></link><published>2026-04-28T00:00:00+00:00</published><updated>2026-04-28T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-28:/articles/leadership/notes/ai-luxury-watchmaking.html</id><summary type="html">&lt;p&gt;Luxury maisons must adopt AI with restraint, using it as a precision instrument that protects craft, tone, and identity.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Luxury watchmaking faces pressure to adopt AI at the pace of mass‑market
retail, yet most AI trends undermine the very qualities that define a maison:
scarcity, discretion, and narrative integrity. This piece argues for a
disciplined, tightly governed approach in which AI behaves like a precision
instrument — strengthening forecasting, consistency, atelier operations, and
clienteling — while avoiding automation that dilutes tone or erodes craft. The
maisons that lead will be those that adopt AI with restraint, clarity, and
long‑term intent, not speed.&lt;/p&gt;
&lt;h1 id="ai-for-luxury-watchmaking-precision-over-hype"&gt;AI for Luxury Watchmaking: Precision Over Hype&lt;/h1&gt;
&lt;p&gt;Luxury watchmaking has always balanced heritage and innovation. AI is now
unavoidable, and many maisons feel pressure to adopt it quickly. This piece
outlines where AI strengthens a watch manufacturer’s competitive position, and
where it introduces unnecessary risk.&lt;/p&gt;
&lt;h1 id="the-industry-tension-innovation-without-dilution"&gt;The Industry Tension: Innovation Without Dilution&lt;/h1&gt;
&lt;p&gt;Luxury watchmaking operates under a structural tension. A maison must preserve the integrity of its craft, its archives, and its creative identity, while the wider market moves at a pace set by digital platforms, globalised retail, and increasingly data‑driven competitors. The pressure to demonstrate technological progress is real, and the risk of adopting the wrong technology is equally real.&lt;/p&gt;
&lt;p&gt;AI is often presented as a universal solution, although most proposals are designed for mass‑market retail and not for a sector that trades on scarcity, discretion, and long‑term brand equity. Many AI deployments introduce operational noise, dilute the maison’s voice, or create a level of automation that conflicts with the expectations of collectors and high‑net‑worth clients. The industry has seen a wave of generic chatbots, automated outreach tools, and broad language models that promise efficiency and deliver inconsistency.&lt;/p&gt;
&lt;p&gt;The central question is not "Should we use AI" but "Where does AI reinforce what makes us rare". The answer lies in a disciplined approach that focuses on precision, control, and selective adoption. AI can support a maison when it strengthens the elements that define luxury watchmaking: exacting standards, consistent execution across global markets, and the ability to anticipate client needs without compromising the human relationship.&lt;/p&gt;
&lt;p&gt;The tension is therefore not between tradition and technology. The tension is between technology that respects the craft and technology that erodes it. AI can help a maison operate with greater foresight, greater consistency, and greater control over its identity. AI can also undermine the maison if it is deployed without clear boundaries. The opportunity lies in identifying the narrow set of use cases where AI behaves like a precision instrument rather than a mass‑market automation tool.&lt;/p&gt;
&lt;p&gt;A realistic approach recognises that AI is most valuable when it is invisible to the client, tightly governed, and aligned with the maison’s long‑term positioning. The maisons that succeed will be those that adopt AI with restraint, clarity, and a focus on reinforcing the qualities that already set them apart.&lt;/p&gt;
&lt;h1 id="where-ai-strengthens-a-watch-maison"&gt;Where AI Strengthens a Watch Maison&lt;/h1&gt;
&lt;h2 id="protecting-brand-voice-and-heritage"&gt;Protecting Brand Voice and Heritage&lt;/h2&gt;
&lt;p&gt;AI can act as a controlled reference system for maison language. It can
ensure that every market, boutique, and partner uses the same terms,
descriptions, and narrative structure that the atelier would use. This
reduces drift, removes local improvisation, and protects the tone that
collectors recognise.&lt;/p&gt;
&lt;p&gt;A fine‑tuned internal model can map archive material, historical
catalogues, and technical glossaries into a consistent linguistic
standard. This creates a single source of truth for product
descriptions, press notes, and after‑sales communication.&lt;/p&gt;
&lt;p&gt;Off‑the‑shelf chatbots introduce inconsistency and generic luxury
phrasing. They also risk accidental disclosure of internal language
patterns. A maison should avoid them entirely.&lt;/p&gt;
&lt;h2 id="precision-forecasting-for-limited-editions"&gt;Precision Forecasting for Limited Editions&lt;/h2&gt;
&lt;p&gt;AI can analyse historical demand, collector behaviour, macroeconomic
signals, and secondary‑market patterns to support decisions on
production volumes. This reduces the risk of over‑allocation and
under‑allocation, and it protects the reputation of the maison.&lt;/p&gt;
&lt;p&gt;A transparent model can show which variables drive demand. This allows
leadership to justify decisions with evidence rather than instinct
alone. It also supports more disciplined release planning.&lt;/p&gt;
&lt;p&gt;Opaque models that cannot explain their recommendations should be
avoided. A maison needs clarity, not guesswork wrapped in mathematics.&lt;/p&gt;
&lt;h2 id="strengthening-clienteling-without-massification"&gt;Strengthening Clienteling Without Massification&lt;/h2&gt;
&lt;p&gt;AI can support client advisors with discreet and context‑aware insights.
These insights can include purchase history, service intervals,
collector preferences, and upcoming milestones. The aim is to help the
advisor prepare, not to automate the interaction.&lt;/p&gt;
&lt;p&gt;AI can also identify subtle behavioural patterns, such as a client who
only responds to in‑person appointments or a collector who follows a
specific complication family. This allows advisors to act with greater
precision.&lt;/p&gt;
&lt;p&gt;Automated outreach that feels transactional undermines the human
relationship. A maison should avoid any system that sends messages
without human review.&lt;/p&gt;
&lt;h2 id="atelier-and-aftersales-efficiency"&gt;Atelier and After‑Sales Efficiency&lt;/h2&gt;
&lt;p&gt;AI can support predictive maintenance for complications and movements.
It can identify early signs of wear from service records, images, and
bench data. This allows the atelier to plan work more effectively.&lt;/p&gt;
&lt;p&gt;AI can optimise scheduling for watchmakers by matching complexity,
parts availability, and historical repair times. This reduces idle time
and improves throughput without compromising craftsmanship.&lt;/p&gt;
&lt;p&gt;AI‑assisted diagnostics can shorten the time between intake and
assessment. The watchmaker still makes the final decision. Human
judgement remains essential for quality control.&lt;/p&gt;
&lt;h2 id="provenance-traceability-and-anticounterfeit-measures"&gt;Provenance, Traceability, and Anti‑Counterfeit Measures&lt;/h2&gt;
&lt;p&gt;AI‑enhanced image recognition can authenticate watches from micro‑
details that are invisible to the naked eye. This strengthens
provenance checks and reduces reliance on manual inspection alone.&lt;/p&gt;
&lt;p&gt;Provenance systems can combine blockchain records and AI anomaly
detection to flag suspicious transfers or listings. This protects both
the maison and the collector.&lt;/p&gt;
&lt;p&gt;Public‑facing "AI authentication apps" undermine exclusivity and create
false confidence. A maison should avoid them. Authentication should
remain controlled, discreet, and expert‑led.&lt;/p&gt;
&lt;h1 id="what-luxury-watch-brands-should-ignore-for-now"&gt;What Luxury Watch Brands Should Ignore For Now&lt;/h1&gt;
&lt;p&gt;Luxury watchmaking gains nothing from technology that creates noise,
dilutes identity, or introduces operational risk. Several AI trends are
highly visible and highly unsuitable for a maison that trades on
precision, scarcity, and long‑term equity.&lt;/p&gt;
&lt;p&gt;One trend is the push toward generic generative‑AI content. This
includes automated product descriptions, automated social posts, and
automated campaign copy. These systems produce language that feels
interchangeable across brands. They flatten tone, remove nuance, and
replace the maison’s voice with a synthetic approximation. For a sector
that relies on narrative integrity, this is a direct threat.&lt;/p&gt;
&lt;p&gt;Or consider the rise of fully automated customer service. Many
vendors promote AI as a replacement for human interaction. This may work
in mass‑market retail, although it is unsuitable for luxury. Automated
systems struggle with discretion, context, and emotional intelligence.
They also create a visible gap between the client and the maison at the
exact moment when trust matters most.&lt;/p&gt;
&lt;p&gt;Lastly, the deployment of broad, ungoverned language models is proving more
popular.  These models are often trained on public data and they behave in ways
t#hat are difficult to predict. They can leak internal phrasing, drift in tone,
and generate outputs that conflict with brand standards. They also introduce
data‑handling risks that are incompatible with the privacy expectations of
high‑net‑worth clients.&lt;/p&gt;
&lt;p&gt;A maison that values long‑term equity should treat these trends with
caution. They offer speed, although they do not offer precision. They
signal modernity, although they do not strengthen the qualities that
make a luxury watchmaker distinctive. The disciplined path is to ignore
these trends and focus on AI that enhances control, consistency, and
craft.&lt;/p&gt;
&lt;p&gt;Generic generative‑AI marketing content should be avoided. It produces
language that feels interchangeable with mass‑market retail and it
erodes the distinct tone that collectors expect. It also creates a false
sense of digital progress without improving any core capability.&lt;/p&gt;
&lt;p&gt;AI‑designed watches should be avoided. They conflict with the creative
identity of the maison and they reduce design to pattern matching. A
watch is an expression of craft, not an output of algorithmic
experimentation.&lt;/p&gt;
&lt;p&gt;Broad and ungoverned LLM deployments should be avoided. They risk data
leakage, tone drift, and inconsistent behaviour across markets. They
also create dependencies that are difficult to unwind.&lt;/p&gt;
&lt;p&gt;A disciplined maison ignores these trends and focuses on AI that
strengthens precision, consistency, and long‑term brand integrity.&lt;/p&gt;
&lt;h1 id="a-practical-lowrisk-ai-roadmap-for-a-watch-maison"&gt;A Practical, Low‑Risk AI Roadmap for a Watch Maison&lt;/h1&gt;
&lt;h2 id="establish-a-brandaligned-ai-charter"&gt;Establish a Brand‑Aligned AI Charter&lt;/h2&gt;
&lt;p&gt;A maison needs a clear charter before it adopts any AI system. The
charter defines what AI must never do, such as dilute tone, automate
client relationships, or expose internal language patterns. It also
defines what AI should do, such as improve forecasting, strengthen
consistency, and support atelier operations. Every decision should be
anchored in heritage, precision, and discretion. This prevents drift and
keeps the programme focused on long‑term equity rather than short‑term
experiments.&lt;/p&gt;
&lt;h2 id="build-a-controlled-and-private-model"&gt;Build a Controlled and Private Model&lt;/h2&gt;
&lt;p&gt;A maison should build a controlled model that is trained on its own
archives, glossaries, and tone guidelines. This creates a private
linguistic and operational asset that reflects the identity of the
brand. The model should remain behind the firewall and should be treated
as intellectual property. A small and well‑governed model is easier to
audit, easier to update, and less likely to behave unpredictably. This
approach avoids the risks associated with broad public models.&lt;/p&gt;
&lt;h2 id="pilot-in-noncustomerfacing-domains"&gt;Pilot in Non‑Customer‑Facing Domains&lt;/h2&gt;
&lt;p&gt;The safest starting point is to pilot AI in areas that do not touch the
client. Forecasting, atelier scheduling, and after‑sales diagnostics are
ideal candidates. These domains benefit from pattern recognition and
data analysis, and they allow the maison to test accuracy, governance,
and operational fit without reputational exposure. Early pilots should
focus on measurable improvements, such as reduced turnaround time or
more accurate allocation planning. This builds internal confidence
before any client‑facing deployment.&lt;/p&gt;
&lt;h2 id="introduce-ai-to-clienteling-as-a-silent-partner"&gt;Introduce AI to Clienteling as a Silent Partner&lt;/h2&gt;
&lt;p&gt;When the maison is ready to extend AI to the client experience, it
should do so with restraint. AI should act as a silent partner that
supports the advisor with insights, not scripts. It can highlight
service intervals, collector preferences, and relevant milestones. It
should never generate messages on its own. The advisor remains the
author of every interaction. This preserves the human relationship and
ensures that the maison’s tone remains intact.&lt;/p&gt;
&lt;h2 id="establish-governance-early"&gt;Establish Governance Early&lt;/h2&gt;
&lt;p&gt;Governance is essential from the outset. Every client‑facing output
should receive human review. Every model decision should have an audit
trail. Tone and accuracy checks should be conducted regularly. The
maison should also define clear rules for data handling, model updates,
and access control. Strong governance prevents drift, protects client
privacy, and ensures that AI remains aligned with the values of the
brand.&lt;/p&gt;
&lt;p&gt;A disciplined roadmap allows a maison to adopt AI without compromising
craft, identity, or exclusivity. The goal is not to automate luxury. The
goal is to use AI to strengthen the qualities that already make the
maison distinctive.&lt;/p&gt;
&lt;h1 id="the-competitive-advantage-ai-as-a-precision-instrument"&gt;The Competitive Advantage: AI as a Precision Instrument&lt;/h1&gt;
&lt;p&gt;The maisons that will lead are not the maisons that adopt AI at speed.
They are the maisons that adopt AI with discipline, clear boundaries,
and a focus on long‑term equity. Speed creates noise. Discipline creates
advantage.&lt;/p&gt;
&lt;p&gt;AI should behave like a fine tool on a watchmaker’s bench. It should be
precise, reliable, and invisible to the client. The value comes from
quiet improvements in forecasting, consistency, and operational control,
not from visible automation or digital theatrics.&lt;/p&gt;
&lt;p&gt;A disciplined maison uses AI to strengthen the elements that already
define its position: exacting standards, coherent global execution, and
a client experience built on trust. AI can support these strengths by
reducing variance, improving anticipation, and protecting the maison’s
voice across markets.&lt;/p&gt;
&lt;p&gt;The goal is not to become an "AI‑driven brand". The goal is to use AI to
deepen what already makes the maison exceptional. When AI is treated as
a precision instrument, it enhances craft rather than competes with it.&lt;/p&gt;
&lt;h1 id="closing-thought"&gt;Closing Thought&lt;/h1&gt;
&lt;p&gt;Luxury watchmaking has survived every major technological shift through
careful selection and disciplined restraint. AI is no different. The
value lies in choosing the narrow set of applications that strengthen
craft, consistency, and control, and ignoring the noise that surrounds
the wider market.&lt;/p&gt;
&lt;p&gt;When applied with purpose and respect for the métier, AI becomes an
instrument of precision. It sharpens forecasting, protects identity, and
supports the atelier without altering the essence of the work. It
remains silent, reliable, and firmly under human direction.&lt;/p&gt;
&lt;p&gt;A maison that treats AI in this way preserves heritage while gaining a
measurable operational advantage. The craft stays intact. The identity
remains coherent. The technology serves the brand, not the other way
round.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="ai-and-brands-framework.html"&gt;AI strengthens brands when it improves precision, consistency, and control — and destroys them when it introduces noise.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="tech-executives.html"&gt;Executives must treat LLMs as probabilistic systems requiring controls, governance, and new forms of oversight.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="transforming.html"&gt;AI adoption is an organisational transformation requiring mandates, measurement, and redesigned processes.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#ai-for-luxury-watchmaking-precision-over-hype"&gt;AI for Luxury Watchmaking: Precision Over Hype&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-industry-tension-innovation-without-dilution"&gt;The Industry Tension: Innovation Without Dilution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#where-ai-strengthens-a-watch-maison"&gt;Where AI Strengthens a Watch Maison&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#protecting-brand-voice-and-heritage"&gt;Protecting Brand Voice and Heritage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#precision-forecasting-for-limited-editions"&gt;Precision Forecasting for Limited Editions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#strengthening-clienteling-without-massification"&gt;Strengthening Clienteling Without Massification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#atelier-and-aftersales-efficiency"&gt;Atelier and After‑Sales Efficiency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#provenance-traceability-and-anticounterfeit-measures"&gt;Provenance, Traceability, and Anti‑Counterfeit Measures&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-luxury-watch-brands-should-ignore-for-now"&gt;What Luxury Watch Brands Should Ignore For Now&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-practical-lowrisk-ai-roadmap-for-a-watch-maison"&gt;A Practical, Low‑Risk AI Roadmap for a Watch Maison&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#establish-a-brandaligned-ai-charter"&gt;Establish a Brand‑Aligned AI Charter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#build-a-controlled-and-private-model"&gt;Build a Controlled and Private Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pilot-in-noncustomerfacing-domains"&gt;Pilot in Non‑Customer‑Facing Domains&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#introduce-ai-to-clienteling-as-a-silent-partner"&gt;Introduce AI to Clienteling as a Silent Partner&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#establish-governance-early"&gt;Establish Governance Early&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-competitive-advantage-ai-as-a-precision-instrument"&gt;The Competitive Advantage: AI as a Precision Instrument&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#closing-thought"&gt;Closing Thought&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Leadership"></category></entry><entry><title>10 Everyday AI Workflows That Save Hours</title><link href="https://phroneses.com/articles/foundations/notes/10-things.html" rel="alternate"></link><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-26:/articles/foundations/notes/10-things.html</id><summary type="html">&lt;p&gt;Ten simple AI workflows that save minutes each day and compound into hours each week, helping people work more efficiently.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Artificial intelligence is a practical tool that speeds up routine thinking
tasks. These ten workflows show how everyone can use it to save minutes every
day. Those minutes add up into hours each week. And practise will make you
prompt perfect.&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="1-turn-messy-notes-into-clean-summaries"&gt;1. Turn messy notes into clean summaries&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br/&gt;
You paste a rambling 500‑word meeting transcript. The system produces a clear summary with action points.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example prompt&lt;/strong&gt;&lt;br/&gt;
"Here are my messy meeting notes. Please summarise the key decisions and list the action items clearly."&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="2-draft-emails-from-bullet-points"&gt;2. Draft emails from bullet points&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br/&gt;
You write a few rough points. The system turns them into a polished email.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example prompt&lt;/strong&gt;&lt;br/&gt;
"Turn these bullet points into a polite, professional email: apologise for delay and ask for feedback by this Friday."&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="3-explain-complex-topics-in-plain-english"&gt;3. Explain complex topics in plain English&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br/&gt;
You paste a confusing medical letter. The system rewrites it in simple, accurate language.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example prompt&lt;/strong&gt;&lt;br/&gt;
"Rewrite this in plain English for a non‑expert reader. Keep it accurate but simple. Do not add anything to the content."&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="4-create-quick-plans-for-travel-meals-or-events"&gt;4. Create quick plans for travel, meals, or events&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br/&gt;
You request a two‑day trip plan. The system provides a structured itinerary with alternatives.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example prompt&lt;/strong&gt;&lt;br/&gt;
"Plan a two‑day trip to Edinburgh with indoor options if it rains. Include timings."&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="5-turn-long-articles-into-short-takeaways"&gt;5. Turn long articles into short takeaways&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br/&gt;
You paste a long news article. The system produces a five‑point summary.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example prompt&lt;/strong&gt;&lt;br/&gt;
"Summarise this article into five key points and give me a one‑sentence takeaway."&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="6-brainstorm-ideas-when-you-feel-stuck"&gt;6. Brainstorm ideas when you feel stuck&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br/&gt;
You need a name for a community newsletter. The system generates several options.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example prompt&lt;/strong&gt;&lt;br/&gt;
"Give me ten name ideas for a friendly community newsletter about local events."&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="7-rewrite-text-in-different-tones"&gt;7. Rewrite text in different tones&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br/&gt;
You paste a blunt message. The system rewrites it in a more diplomatic tone.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example prompt&lt;/strong&gt;&lt;br/&gt;
"Rewrite this message to be polite and constructive while keeping the meaning."&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="8-extract-key-information-from-documents"&gt;8. Extract key information from documents&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br/&gt;
You upload a contract. The system identifies renewal dates, obligations, and risks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example prompt&lt;/strong&gt;&lt;br/&gt;
"Extract the key dates, obligations, and cancellation terms from this contract. Do not invent anything. Only use the data I have provided to you."&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="9-create-checklists-from-goals"&gt;9. Create checklists from goals&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br/&gt;
You want to declutter your house. The system turns this into a room‑by‑room checklist.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example prompt&lt;/strong&gt;&lt;br/&gt;
"Turn this goal into a step‑by‑step checklist: declutter my entire house this month."&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="10-turn-data-into-quick-insights"&gt;10. Turn data into quick insights&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br/&gt;
You paste a small spreadsheet of expenses. The system highlights trends and suggests improvements.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example prompt&lt;/strong&gt;&lt;br/&gt;
"Here is my monthly spending data. Identify trends and suggest three ways to reduce costs. Use only the data I have provided to you."&lt;/p&gt;
&lt;hr/&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Begin with one or two workflows and expand from there. Small time savings
accumulate quickly, and these tools can help you stay organised, informed, and
in control.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="ai-chatbot-prompting.html"&gt;Ten simple AI workflows that save minutes each day and compound into hours each week, helping people work more efficiently.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="designing-ai-prompts.html"&gt;Modern AI systems require structured, multi‑step prompts that guide planning, critique, and long‑context reasoning.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="how-to-use.html"&gt;Guidance on using AI safely and effectively, grounded in recent examples of misuse and emerging best practices.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#1-turn-messy-notes-into-clean-summaries"&gt;1. Turn messy notes into clean summaries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#2-draft-emails-from-bullet-points"&gt;2. Draft emails from bullet points&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#3-explain-complex-topics-in-plain-english"&gt;3. Explain complex topics in plain English&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#4-create-quick-plans-for-travel-meals-or-events"&gt;4. Create quick plans for travel, meals, or events&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#5-turn-long-articles-into-short-takeaways"&gt;5. Turn long articles into short takeaways&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#6-brainstorm-ideas-when-you-feel-stuck"&gt;6. Brainstorm ideas when you feel stuck&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#7-rewrite-text-in-different-tones"&gt;7. Rewrite text in different tones&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#8-extract-key-information-from-documents"&gt;8. Extract key information from documents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#9-create-checklists-from-goals"&gt;9. Create checklists from goals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#10-turn-data-into-quick-insights"&gt;10. Turn data into quick insights&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Foundations"></category></entry><entry><title>Building Safe, Compliant and Sustainable LLM Systems</title><link href="https://phroneses.com/articles/leadership/notes/building-safe-llm-systems.html" rel="alternate"></link><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-26:/articles/leadership/notes/building-safe-llm-systems.html</id><summary type="html">&lt;p&gt;LLM systems behave differently from traditional software and require layered safety, strong governance, observability, and architectural discipline to operate reliably and sustainably.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="building-safe-compliant-and-sustainable-llm-systems"&gt;Building Safe, Compliant, and Sustainable LLM Systems&lt;/h1&gt;
&lt;p&gt;Large language models have introduced a profound shift in how software systems
are conceived, built, and governed.&lt;/p&gt;
&lt;p&gt;LLMs behave differently from traditional software, they introduce new
categories of operational and regulatory risk, and they demand a level of
architectural discipline that many organisations have not yet developed. Senior
engineering leaders must therefore approach LLM adoption not as a technical
experiment, but as a strategic transformation that affects safety, compliance,
cost control, and organisational design.&lt;/p&gt;
&lt;p&gt;This article sets out the principles, mandates, measurements, processes, and
governance structures required to build reliable, auditable, and economically
sustainable LLM systems. It is written for leaders who must ensure that their
organisations deploy these technologies with clarity, discipline, and long‑term
resilience.&lt;/p&gt;
&lt;h2 id="why-llm-systems-behave-differently-from-traditional-software"&gt;Why LLM Systems Behave Differently from Traditional Software&lt;/h2&gt;
&lt;p&gt;Traditional software is deterministic. Given the same inputs, it produces the
same outputs. Its behaviour is governed by explicit logic, and its failure modes
are generally predictable. LLM systems are different. They are probabilistic,
context‑sensitive, and heavily influenced by the data and instructions that
surround them. Their behaviour can drift over time as models are updated,
retrieval indexes age, and prompts evolve.&lt;/p&gt;
&lt;p&gt;This difference has significant implications. An LLM system is not a single
component but a pipeline of retrieval, orchestration, context assembly, and
model inference. Most of the risk lies not in the model itself, but in the
machinery wrapped around it. The system behaves more like a distributed
workflow, where each step introduces latency, ambiguity, and potential failure.
This is why LLM systems require a different form of engineering discipline and a
different form of leadership oversight.&lt;/p&gt;
&lt;h2 id="what-this-means-for-safety-compliance-and-cost"&gt;What This Means for Safety, Compliance, and Cost&lt;/h2&gt;
&lt;p&gt;Because LLM systems are probabilistic and context‑dependent, they introduce
safety risks that cannot be addressed by persuasion or by relying on the model
to behave. Safety requires layered controls, deterministic boundaries, and
independent checks. Compliance requires observability across the entire
pipeline, not just the final output. Cost control requires architectural
discipline, because most expenditure arises from retrieval hops, long prompts,
and orchestration overhead rather than from the model itself.&lt;/p&gt;
&lt;p&gt;The business consequences are clear. Without strong governance, an LLM system
can drift into non‑compliant behaviour, generate outputs that cannot be audited,
or accumulate cloud costs that grow faster than the user base. Leaders must
therefore treat LLM systems as operational assets that require continuous
monitoring, disciplined design, and explicit accountability.&lt;/p&gt;
&lt;h2 id="what-leaders-must-mandate"&gt;What Leaders Must Mandate&lt;/h2&gt;
&lt;p&gt;Senior leaders must set the tone and direction. The following mandates are
essential:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The organisation must treat LLM systems as engineered pipelines, not magical
components.&lt;/li&gt;
&lt;li&gt;Safety must be enforced through layered controls outside the model.&lt;/li&gt;
&lt;li&gt;Retrieval must be disciplined, localised, and monitored for freshness.&lt;/li&gt;
&lt;li&gt;Prompts must be treated as executable logic, not prose.&lt;/li&gt;
&lt;li&gt;Observability must capture every transformation, including retrieval sets,
template expansions, and decoding parameters.&lt;/li&gt;
&lt;li&gt;Latency and cost must be managed through architectural simplification, not
through attempts to accelerate the model.&lt;/li&gt;
&lt;li&gt;Continuous evaluation must be mandatory, because behaviour drifts over time.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These mandates establish the foundation for predictable, compliant, and
economically sustainable systems.&lt;/p&gt;
&lt;h2 id="what-teams-must-measure"&gt;What Teams Must Measure&lt;/h2&gt;
&lt;p&gt;Measurement is essential for control. Teams must track:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Retrieval quality and freshness, because stale or irrelevant context is a
major source of error.&lt;/li&gt;
&lt;li&gt;Latency across the entire pipeline, not just the model call.&lt;/li&gt;
&lt;li&gt;Prompt length and token usage, because long prompts silently inflate cost and
delay.&lt;/li&gt;
&lt;li&gt;Orchestration overhead, including serial tool calls and unnecessary network
hops.&lt;/li&gt;
&lt;li&gt;Behavioural drift, measured through continuous evaluation against real
traffic.&lt;/li&gt;
&lt;li&gt;Safety violations caught by guardrails, and those that slipped through.&lt;/li&gt;
&lt;li&gt;Cloud expenditure broken down by retrieval, orchestration, and inference.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These measurements allow leaders to understand where risk accumulates and where
costs originate.&lt;/p&gt;
&lt;h2 id="what-processes-must-change"&gt;What Processes Must Change&lt;/h2&gt;
&lt;p&gt;LLM systems require new processes that reflect their probabilistic nature and
their architectural complexity. Traditional software processes are insufficient.
Organisations must introduce:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Continuous evaluation pipelines that run against real user traffic patterns.&lt;/li&gt;
&lt;li&gt;Retrieval monitoring processes that detect index drift and data staleness.&lt;/li&gt;
&lt;li&gt;Prompt review processes that treat prompts as code and enforce structure.&lt;/li&gt;
&lt;li&gt;Safety review processes that test layered guardrails under varied phrasing.&lt;/li&gt;
&lt;li&gt;Cost review processes that examine token usage, retrieval hops, and
orchestration patterns.&lt;/li&gt;
&lt;li&gt;Incident response processes that include retrieval logs, template expansions,
and decoding parameters.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These processes ensure that the system remains stable, compliant, and
economically viable over time.&lt;/p&gt;
&lt;h2 id="what-architectural-principles-must-be-enforced"&gt;What Architectural Principles Must Be Enforced&lt;/h2&gt;
&lt;p&gt;Architectural discipline is the strongest determinant of safety, reliability,
and cost. Leaders must enforce the following principles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Latency is architectural. Most delay comes from retrieval hops, network
boundaries, and orchestration overhead.&lt;/li&gt;
&lt;li&gt;Retrieval must be minimal, local, and purposeful. Excessive retrieval behaves
like an over‑eager microservice mesh.&lt;/li&gt;
&lt;li&gt;Prompts must be short, structured, and treated as logic.&lt;/li&gt;
&lt;li&gt;Context windows are scratchpads, not memory. Only relevant information should
enter them.&lt;/li&gt;
&lt;li&gt;Safety must be enforced through deterministic layers, not through persuasive
instructions.&lt;/li&gt;
&lt;li&gt;Pipelines must avoid serial tool chains that behave like queues.&lt;/li&gt;
&lt;li&gt;Orchestration must be simplified wherever possible, because overhead
accumulates across every request.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These principles reduce risk, improve predictability, and control cost.&lt;/p&gt;
&lt;h2 id="what-governance-structures-must-be-introduced"&gt;What Governance Structures Must Be Introduced&lt;/h2&gt;
&lt;p&gt;Governance is essential for organisations that wish to deploy LLM systems at
scale. Leaders must introduce:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A cross‑functional LLM governance board that oversees safety, compliance, and
cost.&lt;/li&gt;
&lt;li&gt;A prompt governance process that ensures consistency, clarity, and auditability.&lt;/li&gt;
&lt;li&gt;A retrieval governance process that monitors data freshness, index quality,
and access control.&lt;/li&gt;
&lt;li&gt;A safety governance framework that defines layered guardrails and tests them
regularly.&lt;/li&gt;
&lt;li&gt;A cost governance framework that tracks expenditure and enforces architectural
discipline.&lt;/li&gt;
&lt;li&gt;A model update governance process that evaluates behavioural drift before
deployment.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These structures ensure that the organisation maintains control over systems
that are inherently probabilistic and prone to drift.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;LLM systems offer extraordinary potential, but they demand a level of
discipline, governance, and architectural clarity that many organisations have
not yet developed. They behave differently from traditional software, and they
introduce new categories of risk that cannot be managed through persuasion or
intuition. Senior leaders must therefore mandate strong architectural
principles, enforce rigorous measurement, introduce new processes, and build
governance structures that ensure safety, compliance, and cost control.&lt;/p&gt;
&lt;p&gt;The organisations that succeed will be those that treat LLM systems as
engineered pipelines, that design for predictability and auditability, and that
recognise that the true challenges lie not in the model, but in the machinery
that surrounds it.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="tech-executives.html"&gt;Executives must treat LLMs as probabilistic systems requiring controls, governance, and new forms of oversight.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="evaluate-ai.html"&gt;Evaluating AI systems requires measuring real behaviour — schema reliability, adherence, drift, latency, retrieval quality, and safety — not synthetic benchmarks.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="surface-area.html"&gt;AI systems behave like probabilistic components; engineers must build structured interfaces and layered constraints to make them reliable inside software systems.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#building-safe-compliant-and-sustainable-llm-systems"&gt;Building Safe, Compliant, and Sustainable LLM Systems&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#why-llm-systems-behave-differently-from-traditional-software"&gt;Why LLM Systems Behave Differently from Traditional Software&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-this-means-for-safety-compliance-and-cost"&gt;What This Means for Safety, Compliance, and Cost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-leaders-must-mandate"&gt;What Leaders Must Mandate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-teams-must-measure"&gt;What Teams Must Measure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-processes-must-change"&gt;What Processes Must Change&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-architectural-principles-must-be-enforced"&gt;What Architectural Principles Must Be Enforced&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-governance-structures-must-be-introduced"&gt;What Governance Structures Must Be Introduced&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Leadership"></category></entry><entry><title>Evaluating AI Systems: Metrics that Matter</title><link href="https://phroneses.com/articles/build/notes/evaluate-ai.html" rel="alternate"></link><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-26:/articles/build/notes/evaluate-ai.html</id><summary type="html">&lt;p&gt;Evaluating AI systems requires measuring real behaviour — schema reliability, adherence, drift, latency, retrieval quality, and safety — not synthetic benchmarks.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This article presents metrics that matter to help you evaluate an LLM
for programmatic use.&lt;/p&gt;
&lt;h1 id="metrics-to-evaluate-ai-systems"&gt;Metrics to Evaluate AI Systems&lt;/h1&gt;
&lt;h2 id="1-evaluation-as-an-engineering-discipline"&gt;1. Evaluation as an Engineering Discipline&lt;/h2&gt;
&lt;p&gt;Evaluating an AI system differs from evaluating deterministic software.
LLMs generate tokens based on probability, so behaviour varies across
runs and model updates. Effective evaluation focuses on observable
behaviour, failure modes, and interface stability. The aim is to measure
real system behaviour, not synthetic benchmarks.&lt;/p&gt;
&lt;h2 id="2-the-evaluation-surface-area-an-ai-system-exposes-a-wide-surface-area"&gt;2. The Evaluation Surface Area An AI system exposes a wide surface area.&lt;/h2&gt;
&lt;p&gt;Some parts are controlled by the model, such as token prediction, internal
weights, and sampling.  Other parts are controlled by you, including prompt
structure, constraints, retrieval inputs, output formats, and integration.
Good evaluation measures the combined behaviour of both sides.&lt;/p&gt;
&lt;h2 id="3-core-metrics-for-programmatic-use"&gt;3. Core Metrics for Programmatic Use&lt;/h2&gt;
&lt;p&gt;Systems that call an LLM as a component must measure schema reliability,
instruction adherence, deterministic stability, and latency. Schema
reliability covers valid JSON, field completeness, and type correctness.
Instruction adherence measures how well the model follows constraints.
Deterministic stability checks variance under fixed sampling. Latency
covers time to first token, total response time, and variability.&lt;/p&gt;
&lt;h2 id="4-metrics-for-rag-systems"&gt;4. Metrics for RAG Systems&lt;/h2&gt;
&lt;p&gt;RAG adds new evaluation needs. Grounding fidelity measures alignment between
claims and retrieved documents. Fidelity is about how faithfully the model
sticks to the source material.  Citation accuracy checks that references are
correct and not invented. Retrieval quality evaluates recall, precision, and
chunking impact. These metrics show whether the system uses retrieval
effectively.&lt;/p&gt;
&lt;h2 id="5-metrics-for-publicfacing-systems"&gt;5. Metrics for Public‑Facing Systems&lt;/h2&gt;
&lt;p&gt;Public‑facing systems require safety and behavioural stability. Safety
metrics measure disallowed or high‑risk content and consistency across
paraphrased prompts. Behavioural stability measures tone consistency,
avoidance of persona drift, and predictability across varied inputs.&lt;/p&gt;
&lt;h2 id="6-metrics-for-reasoning-systems"&gt;6. Metrics for Reasoning Systems&lt;/h2&gt;
&lt;p&gt;Reasoning systems must evaluate logical consistency, task breakdown, and
error sensitivity. Logical consistency checks for contradictions.
Task breakdown measures whether sub‑tasks are identified and ordered
correctly. Error sensitivity evaluates behaviour under incomplete or
conflicting information.&lt;/p&gt;
&lt;h2 id="7-failure-mode-analysis"&gt;7. Failure Mode Analysis&lt;/h2&gt;
&lt;p&gt;Evaluation must include attempts to trigger failure modes. Boundary
tests check for fabricated tools or capabilities. Hallucination tests
examine behaviour under missing, conflicting, or overloaded context.
Prompt dilution tests measure behaviour when constraints overlap or when
the system prompt becomes long.&lt;/p&gt;
&lt;h2 id="8-longitudinal-metrics"&gt;8. Longitudinal Metrics&lt;/h2&gt;
&lt;p&gt;AI systems change over time, so evaluation must track drift. Model
update drift measures behavioural changes after updates and detects
regressions. Prompt stability metrics measure sensitivity to small edits
or ordering changes. Longitudinal evaluation ensures stability as the
model evolves.&lt;/p&gt;
&lt;h2 id="9-practical-evaluation-framework"&gt;9. Practical Evaluation Framework&lt;/h2&gt;
&lt;p&gt;A practical framework includes unit tests for prompt layers, integration
tests for retrieval, and end‑to‑end tests for workflows. Golden sets
provide curated inputs with expected outputs for regression detection.
Failure logging categorises schema errors, grounding failures, reasoning
failures, and safety violations.&lt;/p&gt;
&lt;h2 id="10-evaluation-as-ongoing-engineering-work"&gt;10. Evaluation as Ongoing Engineering Work&lt;/h2&gt;
&lt;p&gt;Evaluation is continuous. AI systems require ongoing measurement because
their behaviour is probabilistic and subject to change. Metrics must
reflect real failure modes and integration points.&lt;/p&gt;
&lt;p&gt;A structured evaluation framework produces systems that behave predictably,
integrate cleanly, and remain stable over time.&lt;/p&gt;
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Evaluating AI systems is not a narrow task.&lt;/p&gt;
&lt;p&gt;It spans deterministic correctness, probabilistic behaviour, grounding,
safety, reasoning, retrieval, latency, and long‑term drift.&lt;/p&gt;
&lt;p&gt;The surface area is far larger than that of conventional software components,
because an AI system is not only the model but also the constraints, prompts,
retrieval pipeline, and integration code wrapped around it.&lt;/p&gt;
&lt;p&gt;A structured evaluation framework is therefore essential.&lt;/p&gt;
&lt;p&gt;Programmatic use requires metrics for schema reliability, instruction
adherence, deterministic stability, and latency.&lt;/p&gt;
&lt;p&gt;RAG systems add grounding fidelity, citation accuracy, and retrieval quality.&lt;/p&gt;
&lt;p&gt;Public‑facing systems require safety and behavioural stability.&lt;/p&gt;
&lt;p&gt;Reasoning systems require checks for logical consistency, task decomposition,
and error sensitivity.&lt;/p&gt;
&lt;p&gt;Failure mode analysis must deliberately probe boundary violations,
hallucination conditions, and prompt dilution.&lt;/p&gt;
&lt;p&gt;Longitudinal metrics must track drift across model updates and prompt changes.&lt;/p&gt;
&lt;p&gt;A practical framework must combine unit tests for prompt layers, integration
tests for retrieval, end‑to‑end workflow tests, golden sets, and structured
failure logging.&lt;/p&gt;
&lt;p&gt;The conclusion is unavoidable: this is not work that can be handled as a
side‑task by feature developers. The evaluation load is continuous,
specialised, and multi‑disciplinary. It requires expertise in retrieval,
safety, reasoning, software correctness, and long‑term system behaviour.
It requires adversarial testing, regression detection, and maintenance of
a living evaluation suite. The cost of inadequate evaluation is high:
schema failures, grounding errors, safety issues, reasoning faults, and
silent regressions, any one of which may lead to a lack of compliance and
statutory exposure.&lt;/p&gt;
&lt;p&gt;AI evaluation is its own engineering discipline. It requires a dedicated
team with clear ownership, specialised tooling, and ongoing responsibility
for ensuring that AI systems behave predictably, integrate cleanly, and
remain stable over time.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="engineers-need-to-know.html"&gt;Software engineers must understand tokens, structure, and probabilistic behaviour to build reliable systems and avoid mismatches between test and production behaviour.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="latency-is-architecural.html"&gt;Most latency comes from retrieval hops and orchestration, not the model; RAG pipelines often recreate microservice-style chatter that slows systems down.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="surface-area.html"&gt;AI systems behave like probabilistic components; engineers must build structured interfaces and layered constraints to make them reliable inside software systems.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#metrics-to-evaluate-ai-systems"&gt;Metrics to Evaluate AI Systems&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#1-evaluation-as-an-engineering-discipline"&gt;1. Evaluation as an Engineering Discipline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#2-the-evaluation-surface-area-an-ai-system-exposes-a-wide-surface-area"&gt;2. The Evaluation Surface Area An AI system exposes a wide surface area.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#3-core-metrics-for-programmatic-use"&gt;3. Core Metrics for Programmatic Use&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#4-metrics-for-rag-systems"&gt;4. Metrics for RAG Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#5-metrics-for-publicfacing-systems"&gt;5. Metrics for Public‑Facing Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#6-metrics-for-reasoning-systems"&gt;6. Metrics for Reasoning Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#7-failure-mode-analysis"&gt;7. Failure Mode Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#8-longitudinal-metrics"&gt;8. Longitudinal Metrics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#9-practical-evaluation-framework"&gt;9. Practical Evaluation Framework&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#10-evaluation-as-ongoing-engineering-work"&gt;10. Evaluation as Ongoing Engineering Work&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Build"></category></entry><entry><title>How to Evaluate the Output of an AI Chat Session</title><link href="https://phroneses.com/articles/foundations/notes/evaluate-ai-chatbot.html" rel="alternate"></link><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-26:/articles/foundations/notes/evaluate-ai-chatbot.html</id><summary type="html">&lt;p&gt;A practical guide to assessing the quality, reliability, and safety of AI chat session outputs.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="how-to-evaluate-the-output-of-an-ai-chat-session"&gt;How to Evaluate the Output of an AI Chat Session&lt;/h1&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Many people now use chat systems powered by artificial intelligence for writing,
research, planning, or quick explanations. These systems can be helpful, but
their output varies in quality. Some responses are clear and accurate, while
others may be incomplete, misleading, or overly confident. Understanding how to
evaluate what you receive makes the experience more efficient and safer.&lt;/p&gt;
&lt;p&gt;A simple example shows why this matters. Someone might ask a chat system for a
summary of a historical event and receive a clear explanation. The same person
might then ask for a legal interpretation and receive an answer that sounds
confident but is not reliable. The difference is not always obvious from the
tone of the response.&lt;/p&gt;
&lt;h2 id="start-with-the-purpose-of-the-conversation"&gt;Start With the Purpose of the Conversation&lt;/h2&gt;
&lt;p&gt;It helps to keep in mind what you are trying to achieve. A chat system can
produce ideas, drafts, explanations, or examples very quickly. It is less
reliable when the task requires specialist judgement, up‑to‑date facts, or
precise interpretation.&lt;/p&gt;
&lt;p&gt;For instance, asking for help brainstorming a travel itinerary is usually safe.
Asking for a diagnosis based on symptoms is not. The system may sound equally
confident in both cases, so the purpose of the conversation matters.&lt;/p&gt;
&lt;h2 id="check-whether-the-output-matches-the-question"&gt;Check Whether the Output Matches the Question&lt;/h2&gt;
&lt;p&gt;Sometimes a chat system answers a slightly different question from the one you
asked. This can happen when the prompt is broad or when the system tries to
guess your intent.&lt;/p&gt;
&lt;p&gt;A simple way to check is to read the answer and ask whether it addresses the
specific point you raised. If you ask for "three reasons why a bridge design
failed" and receive a general explanation of bridge engineering, the output is
not wrong, but it is not what you asked for.&lt;/p&gt;
&lt;h2 id="look-for-verifiable-details"&gt;Look for Verifiable Details&lt;/h2&gt;
&lt;p&gt;Useful responses often contain information that can be checked. This might be a
definition, a date, a description of a process, or a reference to a known
concept. When a response includes details that can be confirmed, it becomes
easier to judge its reliability.&lt;/p&gt;
&lt;p&gt;For example, if you ask about how a particular sensor works, a good answer might
describe the physical principle behind it. If the answer instead gives vague
phrases such as "advanced technology" or "cutting edge performance", it may not
be providing real information.&lt;/p&gt;
&lt;h2 id="notice-when-the-system-sounds-certain"&gt;Notice When the System Sounds Certain&lt;/h2&gt;
&lt;p&gt;Chat systems often express ideas in a confident tone, even when the underlying
information is uncertain. This is a normal behaviour of the technology, but it
means that confidence should not be taken as a sign of accuracy.&lt;/p&gt;
&lt;p&gt;A relatable example is when someone asks for the opening hours of a local shop.
The system may provide a clear answer, but unless it has access to current
information, the hours may be outdated or incorrect. The tone does not reflect
the reliability.&lt;/p&gt;
&lt;h2 id="compare-the-output-with-what-you-already-know"&gt;Compare the Output With What You Already Know&lt;/h2&gt;
&lt;p&gt;If the response touches on a topic you understand, a quick comparison can reveal
whether the system is on the right track. If something feels inconsistent with
your knowledge, it may be worth checking further.&lt;/p&gt;
&lt;p&gt;For instance, if you ask about a programming concept you use regularly and the
answer describes it in an unfamiliar way, that is a signal to verify the
information.&lt;/p&gt;
&lt;h2 id="ask-for-clarification-or-a-different-angle"&gt;Ask for Clarification or a Different Angle&lt;/h2&gt;
&lt;p&gt;If a response seems incomplete or unclear, asking the system to explain the idea
in a different way can help. Many people find that asking for an example, a
step‑by‑step explanation, or a simpler description reveals whether the system
actually captured the idea.&lt;/p&gt;
&lt;p&gt;A practical example is when someone asks for an explanation of a financial
term. If the first answer feels abstract, asking for "a simple example using
everyday numbers" often makes the concept clearer.&lt;/p&gt;
&lt;h2 id="be-cautious-with-sensitive-or-highimpact-topics"&gt;Be Cautious With Sensitive or High‑Impact Topics&lt;/h2&gt;
&lt;p&gt;Some areas require extra care. These include medical advice, legal
interpretation, financial decisions, and safety‑critical information. Chat
systems can generate plausible text in these areas, but plausibility is not the
same as accuracy.&lt;/p&gt;
&lt;p&gt;A symptom checker example illustrates this. A system may describe a condition in
a way that sounds precise, but it cannot assess real‑world risk or context. In
such cases, the output should be treated as general information, not as a basis
for action.&lt;/p&gt;
&lt;h2 id="look-for-signs-of-fabrication"&gt;Look for Signs of Fabrication&lt;/h2&gt;
&lt;p&gt;Chat systems sometimes produce details that sound real but are not. These may
include invented citations, incorrect statistics, or descriptions of events
that never occurred. This behaviour is not intentional, but it can mislead
readers who assume the information is factual.&lt;/p&gt;
&lt;p&gt;A common example is when someone asks for a reference to a scientific paper and
receives a title and author that look plausible but do not exist. Checking the
reference quickly reveals the issue.&lt;/p&gt;
&lt;h2 id="use-the-system-as-a-tool-not-an-authority"&gt;Use the System as a Tool, Not an Authority&lt;/h2&gt;
&lt;p&gt;A chat system can be a helpful assistant for drafting, exploring ideas, or
learning about a topic. It is less suited to acting as a final source of
truth.  Treating it as a tool rather than an authority helps keep expectations
realistic and reduces the risk of relying on incorrect information.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Evaluating the output of an AI chat session is a practical skill. Paying
attention to the purpose of the conversation, the clarity of the answer, the
presence of verifiable details, and the sensitivity of the topic can make the
experience more effective and safer. With a few simple habits, it becomes
easier to recognise when the system is providing useful insight and when
additional checking is needed.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="how-to-use.html"&gt;Guidance on using AI safely and effectively, grounded in recent examples of misuse and emerging best practices.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="how-ai-works.html"&gt;An explanation of how large language models actually function and why they should not be treated as miniature humans.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="what-ai-is.html"&gt;A clear explanation of what AI is—and is not—cutting through hype to define its real capabilities and limits.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#how-to-evaluate-the-output-of-an-ai-chat-session"&gt;How to Evaluate the Output of an AI Chat Session&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#introduction"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#start-with-the-purpose-of-the-conversation"&gt;Start With the Purpose of the Conversation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#check-whether-the-output-matches-the-question"&gt;Check Whether the Output Matches the Question&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#look-for-verifiable-details"&gt;Look for Verifiable Details&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#notice-when-the-system-sounds-certain"&gt;Notice When the System Sounds Certain&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#compare-the-output-with-what-you-already-know"&gt;Compare the Output With What You Already Know&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ask-for-clarification-or-a-different-angle"&gt;Ask for Clarification or a Different Angle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#be-cautious-with-sensitive-or-highimpact-topics"&gt;Be Cautious With Sensitive or High‑Impact Topics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#look-for-signs-of-fabrication"&gt;Look for Signs of Fabrication&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#use-the-system-as-a-tool-not-an-authority"&gt;Use the System as a Tool, Not an Authority&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Foundations"></category></entry><entry><title>How to Use AI Safely and Effectively</title><link href="https://phroneses.com/articles/foundations/notes/how-to-use.html" rel="alternate"></link><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-26:/articles/foundations/notes/how-to-use.html</id><summary type="html">&lt;p&gt;Guidance on using AI safely and effectively, grounded in recent examples of misuse and emerging best practices.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc#"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Recent headlines have shown the same unsettling pattern.&lt;/p&gt;
&lt;p&gt;An AI system confidently generated legal cases that never existed, as reported when UK courts received filings built on fictitious case law (The Guardian, Scottish Legal News).&lt;/p&gt;
&lt;p&gt;Health researchers have warned that AI can give medical guidance that is not just inaccurate but dangerously misleading. A British Medical Journal article as reported in the Independent stated that 20% of AI medical answers were "highly problematic".&lt;/p&gt;
&lt;p&gt;And tech reporters have documented AI‑generated news summaries that included entirely fabricated headlines and events (Sky News).&lt;/p&gt;
&lt;p&gt;In every case, the system generated output that communicated total confidence. In every case, the AI was wrong. Fluency is not understanding. Appearing proficient is not accuracy. This confusion is exactly where the real risk lies.&lt;/p&gt;
&lt;h1 id="give-clear-instructions"&gt;Give Clear Instructions&lt;/h1&gt;
&lt;p&gt;AI works best when you tell it exactly what you want. It does not infer your intentions or read between the lines. The output you see is a statistical software prediction based on patterns in the training data of the AI. The clearer your request, the better the output.&lt;/p&gt;
&lt;p&gt;Start by stating your goal. Instead of asking, "Tell me about climate change," try: "Give me a 150‑word summary of the main causes of climate change for a general audience." A specific target gives the system's statistical pattern-matching something concrete to aim at.&lt;/p&gt;
&lt;p&gt;Set the format you want. Simple instructions like "Give me three options," "Write this as a short email," or "List the steps in order" immediately improve the result. Format acts as a constraint, and constraints make the output sharper.&lt;/p&gt;
&lt;p&gt;Define the audience. AI changes tone and detail depending on who you say it is for: beginners, executives, customers, or the general public. A single line about the audience can transform the clarity of the answer.&lt;/p&gt;
&lt;p&gt;If accuracy matters, add constraints such as "Use widely accepted information," "If you’re unsure, say so," or "Do not invent details." These reduce the risk of confident mistakes.&lt;/p&gt;
&lt;p&gt;Clear instructions make the output better and safer, but they do not eliminate the risk of mistakes. Even with perfect prompts, a system can still deliver something that sounds certain but is completely wrong.&lt;/p&gt;
&lt;p&gt;The AI is not weighing evidence or checking facts. AI is programmed to produce an answer that appears most likely based on patterns in its training data. When those patterns point in the wrong direction, the result is a confident mistake. Your prompt has to help the AI navigate any bias or missing data in its training data. Think of your prompt as you nudging the AI in the direction you want to go.&lt;/p&gt;
&lt;p&gt;When your task is large, break it into smaller steps. Ask for an outline first, then expand each section. AI performs far better when guided step‑by‑step.&lt;/p&gt;
&lt;p&gt;Clear instructions don’t just improve the output, they keep you in control of the process.&lt;/p&gt;
&lt;h1 id="provide-enough-context"&gt;Provide Enough Context&lt;/h1&gt;
&lt;p&gt;AI performs noticeably better when it has the background information it needs, such as who the audience is, what the situation involves, or what constraints apply.&lt;/p&gt;
&lt;p&gt;When context is missing, the system often fills in the gaps with incorrect predictions that will look like guesses, and recent reporting shows how easily this can go wrong. The Guardian found that Google AI Overviews gave misleading health advice because the AI responded without understanding the medical circumstances involved, including a case where it advised pancreatic cancer patients to avoid high fat foods, which experts described as really dangerous. This is dangeous advice as some who suffer from pancreatic cancer are malnourished and consuming fat can be a nutritionally efficient way to ingest energy.&lt;/p&gt;
&lt;h1 id="check-the-output-carefully"&gt;Check the Output Carefully&lt;/h1&gt;
&lt;p&gt;AI is not a source of truth, it is a generator of plausible answers, so treat every response as a draft, not a verdict.&lt;/p&gt;
&lt;p&gt;Read the answer to then ask basic questions: Does this match what you already know, does it contradict trusted sources, does anything feel too neat or too extreme?&lt;/p&gt;
&lt;p&gt;For factual topics, spot check key claims against reputable outlets or official documentation, especially numbers, names, dates, web links, and legal or medical details.&lt;/p&gt;
&lt;p&gt;For writing tasks, look for invented quotes, fake references, or details that are oddly specific without any support.&lt;/p&gt;
&lt;p&gt;If something important hinges on the answer, ask the system to show its reasoning, to list uncertainties, or to offer alternative possibilities.&lt;/p&gt;
&lt;p&gt;The core habit is simple: never confuse a confident tone with a reliable answer. Once you see the answer you can ask the AI more questions to check the reliability of that answer. This is especially important if you are going to do something that relies on that answer.&lt;/p&gt;
&lt;h1 id="use-ai-for-the-right-tasks"&gt;Use AI for the Right Tasks&lt;/h1&gt;
&lt;p&gt;AI is most effective when the work involves drafting, summarising, organising ideas, exploring options, or speeding up early stage thinking.&lt;/p&gt;
&lt;p&gt;AI can turn rough notes into a clean paragraph, reshape a long document into a shorter one, or generate several ways to frame a problem so you can choose the best one.&lt;/p&gt;
&lt;p&gt;AI is also useful for outlining reports, comparing approaches, rewriting for different audiences, or helping you see alternatives you might not have considered. These are tasks where speed and structure matter more than perfect accuracy. You can make text accurate later.&lt;/p&gt;
&lt;p&gt;AI is far less reliable when the task requires expert judgment, real world verification, or precise factual detail, so keep it focused on the parts of the job where it can genuinely help rather than the parts where it can get you into trouble.&lt;/p&gt;
&lt;p&gt;Keep in mind that AI is not thinking. AI does not check for truth. It generates plausible text based on its training data.&lt;/p&gt;
&lt;h1 id="avoid-using-ai-for-judgement-or-decisions"&gt;Avoid Using AI for Judgement or Decisions&lt;/h1&gt;
&lt;p&gt;AI cannot weigh values, consequences, or ethics, and it cannot understand the human context that sits behind real decisions.&lt;/p&gt;
&lt;p&gt;AI can offer options, outline trade offs, or summarise information, but it cannot decide what matters most, what is acceptable, or what is fair. Those choices rely on experience, responsibility, and an understanding of people, none of which an AI possesses.&lt;/p&gt;
&lt;p&gt;Use AI to support your thinking, not to replace it. Human judgement must stay in charge, especially when the outcome affects safety, wellbeing, trust, or the outcome has long term consequences.&lt;/p&gt;
&lt;h1 id="be-cautious-with-personal-or-sensitive-information"&gt;Be Cautious with Personal or Sensitive Information&lt;/h1&gt;
&lt;p&gt;Treat AI tools the same way you would treat an online form or an email to someone you do not know.&lt;/p&gt;
&lt;p&gt;Do not share details that could identify you, expose someone else, or create problems if they were ever seen by the wrong person. This includes financial information, medical records, passwords, private conversations, or anything that involves children, colleagues, or business clients.&lt;/p&gt;
&lt;p&gt;Keep the boundary simple. If you would hesitate before typing it into a website, keep it out of an AI prompt. The safest approach is to describe the situation in general terms and remove anything that is not essential to the task. This protects your privacy and prevents sensitive information from being handled in ways you cannot control.&lt;/p&gt;
&lt;h1 id="compare-answers-with-reliable-sources"&gt;Compare Answers with Reliable Sources&lt;/h1&gt;
&lt;p&gt;Treat AI output as a starting point, not a final answer, and cross check anything that matters with sources you trust.&lt;/p&gt;
&lt;p&gt;This is especially important for facts that are time sensitive, technical, or likely to change. A quick comparison with reputable news outlets, official guidance, or well established reference material can reveal errors that are easy to miss when the writing sounds polished.&lt;/p&gt;
&lt;p&gt;This habit is not about distrusting the tool, it is about protecting yourself from mistakes that come from outdated information, missing context, or confident AI guesses. When accuracy matters, a second source is not optional, it is part of the process.&lt;/p&gt;
&lt;h1 id="keep-an-eye-out-for-gaps-or-oddities"&gt;Keep an Eye Out for Gaps or Oddities&lt;/h1&gt;
&lt;p&gt;A useful habit when reading AI generated answers is to notice when something feels slightly off. This might be an explanation that is too vague, a claim that is oddly specific without support, or a confident statement that does not match what you know.&lt;/p&gt;
&lt;p&gt;When you see these signs, pause and ask a follow up question or check the detail elsewhere.&lt;/p&gt;
&lt;p&gt;Recent reporting shows how easily small oddities can signal a deeper problem. The Guardian described how a senior European journalist was suspended after using AI tools to summarise material and then publishing quotes that the people involved had never said. The investigation found dozens of invented statements that looked polished and authoritative but were entirely false, and the journalist admitted he had fallen into the trap of trusting text that only sounded right.&lt;/p&gt;
&lt;p&gt;Examples like this show why readers should stay alert to gaps, inconsistencies, or moments when an answer feels too neat. These are cues to check the AI's output.&lt;/p&gt;
&lt;h1 id="stay-aware-of-the-limits-of-ai"&gt;Stay Aware of the Limits of AI&lt;/h1&gt;
&lt;p&gt;AI does not understand meaning, it has no lived experience, and it cannot draw on intuition or common sense.&lt;/p&gt;
&lt;p&gt;AI works by recognising patterns in data and producing text that fits those patterns, not by grasping the reality behind the words. This means it can miss context, overlook nuance, or present something that sounds authoritative without any understanding.&lt;/p&gt;
&lt;p&gt;AI cannot feel uncertainty, it cannot judge what is important, and it cannot
tell when it has made a mistake. Keeping these limits in mind helps you use the
tool for what it is good at and avoid expecting it to behave like a person.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="how-ai-works.html"&gt;An explanation of how large language models actually function and why they should not be treated as miniature humans.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="evaluate-ai-chatbot.html"&gt;A practical guide to assessing the quality, reliability, and safety of AI chat session outputs.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="ai-chatbot-prompting.html"&gt;Ten simple AI workflows that save minutes each day and compound into hours each week, helping people work more efficiently.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#give-clear-instructions"&gt;Give Clear Instructions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#provide-enough-context"&gt;Provide Enough Context&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#check-the-output-carefully"&gt;Check the Output Carefully&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#use-ai-for-the-right-tasks"&gt;Use AI for the Right Tasks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#avoid-using-ai-for-judgement-or-decisions"&gt;Avoid Using AI for Judgement or Decisions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#be-cautious-with-personal-or-sensitive-information"&gt;Be Cautious with Personal or Sensitive Information&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#compare-answers-with-reliable-sources"&gt;Compare Answers with Reliable Sources&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#keep-an-eye-out-for-gaps-or-oddities"&gt;Keep an Eye Out for Gaps or Oddities&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#stay-aware-of-the-limits-of-ai"&gt;Stay Aware of the Limits of AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#further-reading"&gt;Further Reading&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#fake-legal-cases"&gt;Fake legal cases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#dangerous-or-misleading-medical-advice"&gt;Dangerous or misleading medical advice&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#fabricated-news-summaries"&gt;Fabricated news summaries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#misleading-health-advice"&gt;Misleading health advice&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#senior-european-journalist"&gt;Senior European Journalist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h1 id="further-reading"&gt;Further Reading&lt;/h1&gt;
&lt;h2 id="fake-legal-cases"&gt;Fake legal cases&lt;/h2&gt;
&lt;p&gt;The Guardian — &lt;a href="https://www.theguardian.com/technology/2025/jun/06/high-court-tells-uk-lawyers-to-urgently-stop-misuse-of-ai-in-legal-work"&gt;https://www.theguardian.com/technology/2025/jun/06/high-court-tells-uk-lawyers-to-urgently-stop-misuse-of-ai-in-legal-work&lt;/a&gt; &lt;br/&gt;
Scottish Legal News — &lt;a href="https://www.scottishlegal.com/articles/ai-chatbot-invented-legal-cases-in-taxpayers-failed-appeal-against-hmrc"&gt;https://www.scottishlegal.com/articles/ai-chatbot-invented-legal-cases-in-taxpayers-failed-appeal-against-hmrc&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="dangerous-or-misleading-medical-advice"&gt;Dangerous or misleading medical advice&lt;/h2&gt;
&lt;p&gt;The Independent — &lt;a href="https://www.independent.co.uk/life-style/health-and-families/health-news/chatbots-medical-advice-bmj-study-b2961005.html"&gt;https://www.independent.co.uk/life-style/health-and-families/health-news/chatbots-medical-advice-bmj-study-b2961005.html&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="fabricated-news-summaries"&gt;Fabricated news summaries&lt;/h2&gt;
&lt;p&gt;Sky News — &lt;a href="https://news.sky.com/story/apple-suspends-ai-generated-news-summaries-after-criticism-over-misleading-notifications-13290676"&gt;https://news.sky.com/story/apple-suspends-ai-generated-news-summaries-after-criticism-over-misleading-notifications-13290676&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="misleading-health-advice"&gt;Misleading health advice&lt;/h2&gt;
&lt;p&gt;The Guardian - &lt;a href="https://www.theguardian.com/technology/2026/jan/02/google-ai-overviews-risk-harm-misleading-health-information"&gt;https://www.theguardian.com/technology/2026/jan/02/google-ai-overviews-risk-harm-misleading-health-information&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="senior-european-journalist"&gt;Senior European Journalist&lt;/h2&gt;
&lt;p&gt;The Guardian - &lt;a href="https://www.theguardian.com/technology/2026/mar/20/mediahuis-suspends-senior-journalist-over-ai-generated-quotes?utm_source=copilot.com"&gt;https://www.theguardian.com/technology/2026/mar/20/mediahuis-suspends-senior-journalist-over-ai-generated-quotes?utm_source=copilot.com&lt;/a&gt;&lt;/p&gt;</content><category term="Foundations"></category></entry><entry><title>Latency is architecural</title><link href="https://phroneses.com/articles/build/notes/latecy-is-architectural.html" rel="alternate"></link><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-26:/articles/build/notes/latecy-is-architectural.html</id><summary type="html">&lt;p&gt;Most latency comes from retrieval hops and orchestration, not the model; RAG pipelines often recreate microservice-style chatter that slows systems down.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="latency-is-architectural"&gt;Latency is architectural&lt;/h1&gt;
&lt;p&gt;Most latency comes from retrieval hops, long prompts, and serial tool
calls. The model call is rarely the slow part. The pipeline is the
bottleneck. Optimise orchestration, not just the model.&lt;/p&gt;
&lt;p&gt;Engineers often assume the model is the slow part. It usually is not.  The real
drag comes from the machinery wrapped around it.&lt;/p&gt;
&lt;h2 id="retrieval-hops-cost-more-than-you-expect"&gt;Retrieval hops cost more than you expect&lt;/h2&gt;
&lt;p&gt;Every vector search, metadata filter, re‑rank, and chunk stitch is another
network hop.  Do that a few times and half your latency budget has vanished
before the model has even seen a token.  It is the old "too many microservices"
problem wearing a new badge.&lt;/p&gt;
&lt;h2 id="too-many-microservices"&gt;Too Many microservices&lt;/h2&gt;
&lt;p&gt;A system begins tidy, then grows arms and legs. Someone adds a retriever.
Someone adds a re‑ranker. Someone adds a metadata filter. Someone adds a chunk
stitcher.  Each piece looks harmless. Each piece solves a problem. But once
they are strung together, the whole thing slows to a crawl.&lt;/p&gt;
&lt;p&gt;RAG pipelines follow the same pattern. Instead of ten microservices, you now
have ten retrieval hops. Instead of service chatter, you have index chatter.
Instead of JSON bouncing around a cluster, you have embeddings and chunks being
passed across the network. The labels have changed, but the behaviour has not.&lt;/p&gt;
&lt;p&gt;In a microservice stack, services talk to each other all day long.
They pass JSON around, wait for replies, retry on failure, and generally keep
the network busy. That is service chatter.&lt;/p&gt;
&lt;p&gt;In a RAG stack, the same noise comes from your retrieval layer. The actors are
different, but the behaviour is the same. Your vector index, keyword index,
metadata store, and re‑ranker all talk to each other. They pass embeddings,
scores, filters, and chunks back and forth. Each hop is another round trip. Each
hop adds delay. Each hop adds another place for things to wobble.&lt;/p&gt;
&lt;p&gt;It is chatter because none of it is real work from the user’s point of view.
The user wants an answer. The system spends most of its time gossiping between
indexes about which chunk might be relevant. It is busy, but not productive.&lt;/p&gt;
&lt;p&gt;The point is simple. You have replaced one kind of internal noise with another.
The labels have changed, but the cost has not. If you let the retrieval layer
grow without discipline, it will behave exactly like an over‑eager microservice
mesh. It will talk too much, wait too long, and slow everything down.&lt;/p&gt;
&lt;p&gt;Every hop adds latency. Every hop adds a failure mode.  Every hop adds mental
overhead. Hop latency accumulates in the end-to-end-pipelines. The job becomes
debugging the plumbing rather than improving the product.  The system becomes
sluggish, brittle, and full of odd surprises.&lt;/p&gt;
&lt;p&gt;The lesson is the same as it was during the microservice boom. Keep the number
of moving parts low. Keep the boundaries clear. Keep the data local whenever you
can. If you do not, the pipeline will drag, no matter how fast the model is.&lt;/p&gt;
&lt;h2 id="leaving-the-process-costs-you"&gt;Leaving the process costs you&lt;/h2&gt;
&lt;p&gt;Vector search is typical for RAG, but it is not the only culprit.  Any
retrieval layer that reaches across the network will cost you time.  It does
not matter whether you use a vector index, a keyword index, a hybrid index, or
a bespoke store.  If you have to leave the process, hit a service, wait for it
to return, and then stitch the results back together, you will pay for it in
latency.&lt;/p&gt;
&lt;h2 id="long-prompts-are-silent-killers"&gt;Long prompts are silent killers&lt;/h2&gt;
&lt;p&gt;Sending 200,000 tokens into a model is not free. As of April 2026, GPT-5.5 is
USD 5.00 per 1 million tokens, so USD 1 for 200k tokens. This might not sound
much but if your whole AI system that is made up from multiple pipelines calls
OpenAI a thousand times in an eight-hour period, that is one call every 86
seconds, costing USD 1,000 per day. As you introduce features that rely on AI,
this cost can balloon.&lt;/p&gt;
&lt;p&gt;You pay for tokenisation, network transfer, and ingestion.  It is the
equivalent of posting a novel every time you want a paragraph back.  Shorter
prompts are not only cheaper, they are faster and far easier to reason about.&lt;/p&gt;
&lt;p&gt;Cloud costs balloon because the pricing model rewards scale until it punishes
you. Everything looks cheap at the start. A few API calls here, a small vector
index there, a modest GPU for a prototype. Then the system goes live, traffic
rises, and the bill climbs faster than the usage graph.&lt;/p&gt;
&lt;p&gt;The pattern is predictable. You pay for every hop, every lookup, every token,
every gigabyte, and every idle minute. The cloud does not care whether the work
was useful. It charges for activity, not value.&lt;/p&gt;
&lt;p&gt;RAG pipelines are especially prone to this. Retrieval is chatty. Each query
touches several indexes. Each index has its own storage, compute, and network
fees. The model call is only one line on the invoice. The real cost comes from
the scaffolding wrapped around it.&lt;/p&gt;
&lt;p&gt;Costs balloon because the architecture balloons. More hops. More services. More
indexes. More caching layers. More background jobs. More monitoring. More logs.
Every piece adds a little cost. Together they add a lot.&lt;/p&gt;
&lt;p&gt;The cloud makes it easy to scale up, but it does not make it easy to scale down.
Once the system is busy, you pay for the peaks, not the averages. You pay for
the buffers, the replicas, and the safety margins. You pay for the comfort of
not waking up at three in the morning.&lt;/p&gt;
&lt;p&gt;The cloud invoice is driven by the highest sustained load, not the gentle
baseline you see on a dashboard.&lt;/p&gt;
&lt;p&gt;Cloud platforms charge for capacity, not comfort. When traffic spikes, the
system scales out. Extra replicas spin up. Buffers grow. Queues stretch. More
storage is touched. More network is consumed. The platform does not scale back
the instant the spike ends. It holds the extra capacity for safety, stability,
and headroom. You pay for that headroom.&lt;/p&gt;
&lt;p&gt;The average load might look modest, but the cloud does not bill you on the
average. It bills you on the resources that were provisioned to survive the
worst ten minutes of the day. If your peak is ten times your baseline, your
bill will reflect the peak, not the baseline.&lt;/p&gt;
&lt;p&gt;The only defence is discipline. Keep the design lean. Keep the hops few. Keep
the data local. Keep the retrieval tight. Keep the prompts short. Keep the
pipeline simple. If you do not, the cloud bill will grow faster than the user
base, and it will not stop until you force it to.&lt;/p&gt;
&lt;h1 id="serial-tool-calls-turn-your-pipeline-into-treacle"&gt;Serial tool calls turn your pipeline into treacle&lt;/h1&gt;
&lt;p&gt;If your workflow is LLM → tool → LLM → tool → LLM, you have built a queue, not
a pipeline.  Everything waits for everything else.  It is the same anti‑pattern
that made synchronous RPC chains painful in the early microservice era.&lt;/p&gt;
&lt;p&gt;A queue and a pipeline look similar on a whiteboard, but they behave very
differently once traffic hits them. The distinction matters, because one keeps
work moving and the other forces everything to wait its turn.&lt;/p&gt;
&lt;p&gt;A queue is a stop‑start system. Each step blocks until the previous step has
finished. Nothing can overtake anything else. If one stage slows down, the
entire flow backs up behind it. This is what happens when you chain LLM calls
and tools in a strict sequence. The second LLM call cannot begin until the tool
has replied. The tool cannot run until the first LLM call has finished. The
whole thing becomes a single‑file line.&lt;/p&gt;
&lt;p&gt;A pipeline is a flow system. Work moves through independent stages that can run
at the same time. Stage one can process ithe next item while stage two handles item
one. Throughput rises because the stages overlap. The system does not wait for
each piece to finish before starting the next. This is how high‑volume systems
stay fast even when individual steps are slow.&lt;/p&gt;
&lt;p&gt;A queue waits for the whole journey.  A pipeline hands work off and moves on.&lt;/p&gt;
&lt;p&gt;The handoff is the key. Once a stage can pass work downstream and start the
next item without waiting, you have built a pipeline, not a queue.&lt;/p&gt;
&lt;p&gt;The problem with LLM → tool → LLM → tool → LLM is that it behaves like a queue.
Every step waits for the previous one. There is no overlap, no parallelism, and
no slack. One slow tool call stalls the entire chain. It is the same pattern
that made synchronous RPC chains painful in early microservice designs. The
system is busy, but nothing is flowing.&lt;/p&gt;
&lt;p&gt;The lesson is simple. If you want speed, build a pipeline. If you build a queue,
do not be surprised when everything crawls.&lt;/p&gt;
&lt;h4 id="4-orchestration-overhead-accumulates"&gt;4. Orchestration overhead accumulates&lt;/h4&gt;
&lt;p&gt;Glue code, JSON wrangling, retries, fallbacks, schema checks, and all the other
dull bits. Each one is tiny. Each one feels harmless. Together they slow the
system more than any single model call ever will.&lt;/p&gt;
&lt;p&gt;The overhead hides in plain sight. A few milliseconds to validate a schema. A
few more to serialise a payload. A few more to deserialise it. A few more to
retry a flaky call. A few more to merge two partial results. None of these
steps look expensive on their own. They are not. The cost comes from the fact
that you do them on every request, across every stage, under load.&lt;/p&gt;
&lt;p&gt;This is why orchestration overhead is so deceptive. It does not arrive as one
big hit. It arrives as a hundred small ones. It is death by a thousand cuts.
The pipeline spends more time preparing to do work than doing the work.&lt;/p&gt;
&lt;p&gt;The worst part is that this overhead grows with complexity. Add one more tool
call, and you add one more round of serialisation. Add one more fallback, and
you add one more branch to evaluate. Add one more schema, and you add one more
validation pass. The system becomes a tangle of tiny chores.&lt;/p&gt;
&lt;p&gt;This is usually where the real time goes. Not in the model. Not in the vector
search. Not in the database. In the glue. In the stitching. In the invisible
admin that surrounds every step. The only fix is discipline: fewer hops, fewer
formats, fewer retries, fewer moving parts. The less you orchestrate, the
faster everything becomes.&lt;/p&gt;
&lt;h1 id="the-model-is-rarely-the-bottleneck"&gt;The model is rarely the bottleneck&lt;/h1&gt;
&lt;p&gt;Modern inference is GPU‑accelerated and heavily optimised. Your RAG stack is a
distributed system full of I/O, hops, and blocking calls.  Optimising the model
while ignoring the pipeline is like tuning the engine while the tyres are flat.
The power is there, but the car still drags.&lt;/p&gt;
&lt;p&gt;Modern LLM inference is brutally efficient. The kernels are fused. The memory
access patterns are tuned. The batching is tight. The GPUs run flat out. The
model is rarely the slow part. It is the most optimised component in the entire
stack, because it has to be. Vendors pour millions into shaving microseconds
from calculation paths.&lt;/p&gt;
&lt;p&gt;Your RAG pipeline is the opposite. It is a distributed system stitched together
from storage calls, network hops, serialisation steps, retries, and blocking
operations. Every part of it waits for something else. Every hop crosses a
boundary. Every boundary adds latency. The model is a rocket engine bolted to a
shopping trolley.&lt;/p&gt;
&lt;p&gt;This is why polishing the model is the wrong instinct. You can shave 10 percent
off inference time and never notice it, because the pipeline is burning that
time several times over in glue code and I/O. The GPU is idle while your
retriever fetches chunks. The retriever is idle while your re‑ranker waits for
a schema check. The re‑ranker is idle while your orchestrator serialises JSON.
The whole system is dominated by the slowest, least optimised parts.&lt;/p&gt;
&lt;p&gt;The handbrake is the pipeline. The bonnet is the model. Shining the bonnet does
not make the car move. Releasing the handbrake does. If you want real speed,
you fix the hops, the queues, the blocking calls, the retries, the formats, and
the orchestration. That is where the time goes. That is where the wins are.&lt;/p&gt;
&lt;h1 id="throughput-beats-singlequery-latency"&gt;Throughput beats single‑query latency&lt;/h1&gt;
&lt;p&gt;In a real system, throughput matters more than shaving a few milliseconds off a single request.&lt;br/&gt;
Throughput keeps queues short, users calm, and servers steady.&lt;br/&gt;
A system that flows well will always outperform a system that only looks fast in isolation.&lt;/p&gt;
&lt;p&gt;A design that includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;parallel retrieval  &lt;/li&gt;
&lt;li&gt;batched vector queries  &lt;/li&gt;
&lt;li&gt;cached embeddings  &lt;/li&gt;
&lt;li&gt;pre‑computed context  &lt;/li&gt;
&lt;li&gt;non‑blocking tool calls  &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;will outrun a "fast" single‑query setup every day of the week.&lt;/p&gt;
&lt;p&gt;Think like a backend engineer, not a demo builder.&lt;br/&gt;
Design for flow, not fireworks.&lt;/p&gt;
&lt;h1 id="evaluation-must-be-continuous"&gt;Evaluation must be continuous&lt;/h1&gt;
&lt;p&gt;LLM behaviour drifts. Model updates shift outputs. Data changes. Prompt
templates evolve. Retrieval indexes age. Static tests decay. Continuous
evaluation with real traffic patterns is the only stable approach.&lt;/p&gt;
&lt;p&gt;LLMs are not fixed points. They are moving systems. Vendors update weights.
Safety layers change. Tokenisers shift. Even subtle adjustments can alter how a
model interprets a prompt or ranks retrieved context. A test that passed last
month can fail today without any change in your code.&lt;/p&gt;
&lt;p&gt;Your data is not fixed either. Documents are added, removed, rewritten, or
re‑indexed. Embeddings drift as models change. Metadata grows stale. A retrieval
query that once surfaced the right chunk may surface something weaker six weeks
later. The index ages, and the quality of the answer ages with it.&lt;/p&gt;
&lt;p&gt;An embedding will turn a sentence into a list of numbers where similar items
end up close together.&lt;/p&gt;
&lt;p&gt;Prompt templates evolve as well. You tweak wording. You add guardrails. You
change formatting. You introduce new variables. Each change shifts behaviour in
ways that are hard to predict. A small edit can ripple through the entire
pipeline.&lt;/p&gt;
&lt;p&gt;Static tests cannot keep up with this movement. They freeze expectations in
time. They assume the system is stable. It is not. The tests decay because the
system they measure is drifting underneath them. A green test suite can give a
false sense of confidence while the live system quietly degrades.&lt;/p&gt;
&lt;p&gt;The only reliable approach is continuous evaluation with real traffic patterns.
You must measure quality under the same conditions the system actually faces:
real prompts, real retrieval noise, real user phrasing, real edge cases, real
load. Automated reality is required. This is the only way to detect drift early
and correct it before it becomes visible to users.&lt;/p&gt;
&lt;p&gt;The system is alive. The evaluation must be alive with it.&lt;/p&gt;
&lt;h1 id="guardrails-must-be-layered"&gt;Guardrails must be layered&lt;/h1&gt;
&lt;p&gt;No single guardrail is enough. Combine input checks, retrieval filters, prompt
constraints, output checks, and post‑processing. Each layer catches different
failures. One layer alone invites outages.&lt;/p&gt;
&lt;p&gt;Guardrails fail for different reasons. Input checks catch malformed or hostile
queries, but they cannot see what retrieval will surface. Retrieval filters
remove unsafe or irrelevant chunks, but they cannot stop a prompt template from
mis‑framing the task. Prompt constraints shape model behaviour, but they cannot
guarantee the model will obey them under stress. Output checks catch violations
after the fact, but they cannot prevent the model from producing them in the
first place. Post‑processing can clean up structure, but it cannot repair a
fundamentally wrong answer.&lt;/p&gt;
&lt;p&gt;Each layer has blind spots. Each layer has failure modes. Each layer protects a
different part of the system. When you stack them, the gaps do not align. When
you rely on one, the gaps are exposed.&lt;/p&gt;
&lt;p&gt;This is why single‑layer safety is fragile. A lone input filter cannot stop a
retrieval glitch. A lone output checker cannot stop a prompt injection. A lone
prompt template cannot stop a malformed chunk. A lone retrieval filter cannot
stop a model hallucination. Outages happen when one layer is asked to do the
job of five.&lt;/p&gt;
&lt;p&gt;A robust system uses layered defence:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;input validation to reject malformed or hostile queries  &lt;/li&gt;
&lt;li&gt;retrieval filtering to control what context enters the model  &lt;/li&gt;
&lt;li&gt;prompt constraints to shape behaviour and reduce ambiguity  &lt;/li&gt;
&lt;li&gt;output checks to enforce structure and detect violations  &lt;/li&gt;
&lt;li&gt;post‑processing to normalise, redact, or correct  &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of these layers is perfect. Together they are resilient. That is the
point. Modern LLM systems fail in many small ways, not one big way. The only
stable approach is to catch small failures early, often, and repeatedly across
the pipeline.&lt;/p&gt;
&lt;h1 id="the-future-is-orchestration"&gt;The future is orchestration&lt;/h1&gt;
&lt;p&gt;The next wave is not bigger models. It is coordination across many specialised
models. It is managing context across workflows. It is building predictable
tool‑calling chains. LLMs are components now. The engineers who master
orchestration will shape what comes next.&lt;/p&gt;
&lt;p&gt;The era of single‑model systems is ending. One large model trying to do
everything is slow, expensive, and brittle. The future is a network of smaller,
focused models: one for retrieval, one for classification, one for planning,
one for extraction, one for reasoning, one for generation. Each model does one
job well. The value comes from how they work together.&lt;/p&gt;
&lt;p&gt;This shift changes the engineering challenge. It is no longer about squeezing
more tokens per second out of a GPU. It is about coordinating dozens of moving
parts without losing context, consistency, or latency. You must track state
across hops. You must pass partial results between models. You must ensure that
tools are called in the right order, with the right schema, at the right time.
You must keep the pipeline flowing even when individual components fail or
drift.&lt;/p&gt;
&lt;p&gt;Context management becomes a first‑class problem. You cannot rely on a single
prompt to hold everything. You need shared memory, structured state, and
workflow‑level constraints. You need to decide what each model should know,
what it should not know, and how to hand off information cleanly. The system
must behave like a team, not a monolith.&lt;/p&gt;
&lt;p&gt;Tool‑calling becomes a discipline of its own. You need predictable chains,
clear contracts, and stable interfaces. You need to design workflows that are
parallel where possible, serial only where necessary, and resilient everywhere.
The orchestration layer becomes the real engine of the system.&lt;/p&gt;
&lt;p&gt;This is why the next wave belongs to engineers who understand distributed
systems, workflow design, and pipeline optimisation. The models are powerful,
but the power is unlocked only when they are coordinated. The future is not a
bigger brain. It is a well‑run organisation of smaller brains working together.&lt;/p&gt;
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Latency in LLM systems is dominated by architecture, not model speed. Most of
the delay comes from retrieval hops, network boundaries, prompt expansion, and
token‑level generation, so performance improves when you redesign the pipeline,
not when you tweak the prompt. Once you see this, it becomes obvious that long
prompts, scattered retrieval, and unnecessary round‑trips are the real cost
drivers, and that reducing latency means reducing work, not asking the model to
work faster.&lt;/p&gt;
&lt;p&gt;The practical conclusion is that throughput and batching matter more than
single‑query latency, retrieval must be minimised and localised, and prompts
must be aggressively shortened. Systems that treat latency as an architectural
problem become predictable and scalable; systems that treat it as a model
problem stay slow no matter which model they plug in.&lt;/p&gt;
&lt;p&gt;You can process the same amount of data while using fewer hops, fewer
round‑trips, using fewer tokens, and making fewer retrieval calls, fewer prompt
expansions, and fewer model invocations.&lt;/p&gt;
&lt;p&gt;It is not about shrinking the task. It is about shrinking the machinery
required to accomplish it.&lt;/p&gt;
&lt;p&gt;You keep the data volume the same, but you redesign the path so the system
touches that data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;fewer times&lt;/li&gt;
&lt;li&gt;in fewer places&lt;/li&gt;
&lt;li&gt;with fewer transformations&lt;/li&gt;
&lt;li&gt;with fewer tokens&lt;/li&gt;
&lt;li&gt;with fewer model calls&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Same data, less orchestration.  That is why latency drops.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="evaluate-ai.html"&gt;Evaluating AI systems requires measuring real behaviour — schema reliability, adherence, drift, latency, retrieval quality, and safety — not synthetic benchmarks.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="surface-area.html"&gt;AI systems behave like probabilistic components; engineers must build structured interfaces and layered constraints to make them reliable inside software systems.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="ai-engineering-team-based-ai.html"&gt;The real gains from AI come from improving the shared work between engineers — planning, coordination, review, debugging, and delivery — not from speeding up individual coding.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#latency-is-architectural"&gt;Latency is architectural&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#retrieval-hops-cost-more-than-you-expect"&gt;Retrieval hops cost more than you expect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#too-many-microservices"&gt;Too Many microservices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#leaving-the-process-costs-you"&gt;Leaving the process costs you&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#long-prompts-are-silent-killers"&gt;Long prompts are silent killers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#serial-tool-calls-turn-your-pipeline-into-treacle"&gt;Serial tool calls turn your pipeline into treacle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-model-is-rarely-the-bottleneck"&gt;The model is rarely the bottleneck&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#throughput-beats-singlequery-latency"&gt;Throughput beats single‑query latency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#evaluation-must-be-continuous"&gt;Evaluation must be continuous&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#guardrails-must-be-layered"&gt;Guardrails must be layered&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-future-is-orchestration"&gt;The future is orchestration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Build"></category></entry><entry><title>Chat Interface to System Component</title><link href="https://phroneses.com/articles/build/notes/surface-area.html" rel="alternate"></link><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-26:/articles/build/notes/surface-area.html</id><summary type="html">&lt;p&gt;AI systems behave like probabilistic components; engineers must build structured interfaces and layered constraints to make them reliable inside software systems.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="programmatic-interfaces-to-ai-systems"&gt;Programmatic Interfaces to AI Systems&lt;/h1&gt;
&lt;p&gt;We interact with AI systems through natural language. As engineers, we are
used to structured and predictable interfaces such as REST or gRPC.&lt;/p&gt;
&lt;p&gt;AI systems do not behave like that. Their outputs are probabilistic, and this
creates real challenges when we try to use them as components inside software
systems.&lt;/p&gt;
&lt;p&gt;Most current models behave like chat interfaces. What we need are models that
behave like reliable parts of an application.&lt;/p&gt;
&lt;p&gt;This article explains what is currently practical and how to build interfaces
that bring AI systems closer to the expectations of software engineering.&lt;/p&gt;
&lt;h1 id="the-challenge"&gt;The Challenge&lt;/h1&gt;
&lt;p&gt;Large language models (LLMs) generate text by predicting the next token. They
are not rules engines, parsers, or deterministic programs.&lt;/p&gt;
&lt;p&gt;An LLM's output is a probability distribution over the next token. The
distribution depends on the prompt, any conversation history you include, the
model’s internal weights, and the sampling parameters.&lt;/p&gt;
&lt;p&gt;Even with strict instructions, the model still performs this operation:&lt;/p&gt;
&lt;p&gt;"Select the next token that has the highest probability given the input so
far."&lt;/p&gt;
&lt;p&gt;That is probability, not logic.&lt;/p&gt;
&lt;p&gt;The practical approach is to apply prompt constraints that reduce the
likelihood of outputs that are not fit for purpose.&lt;/p&gt;
&lt;h1 id="prompt-constraints"&gt;Prompt Constraints&lt;/h1&gt;
&lt;p&gt;An LLM may return a result that does not fit the calling side. This is a
failure mode of the model.&lt;/p&gt;
&lt;p&gt;Each of the eight layers reduces the likelihood of a specific failure mode.
Together, they form a structured interface between the client code and the
model.&lt;/p&gt;
&lt;p&gt;This approach will make your code more:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;predictable&lt;/li&gt;
&lt;li&gt;grounded in the provided context&lt;/li&gt;
&lt;li&gt;structured in both input and output&lt;/li&gt;
&lt;li&gt;controllable through explicit constraints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Because LLMs are probabilistic, these layers cannot &lt;em&gt;eliminate&lt;/em&gt; failure modes.&lt;/p&gt;
&lt;p&gt;Other failure modes exist, but they are outside the scope of this section. The focus here is on the eight layers that address the most common issues.&lt;/p&gt;
&lt;h1 id="the-eight-layers"&gt;The Eight Layers&lt;/h1&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="#identity"&gt;Identity&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="#safety--compliance"&gt;Safety &amp;amp; Compliance&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="#capability-boundaries"&gt;Capability Boundaries&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="#output-format"&gt;Output Format&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="#citation-rules"&gt;Citation Rules&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="#rag-grounding"&gt;RAG Grounding&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="#reasoning-strategy"&gt;Reasoning Strategy&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="#task-logic"&gt;Task Logic&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;a id="identity"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="1-identity"&gt;1. Identity&lt;/h3&gt;
&lt;p&gt;Identity anchors the model’s role and prevents behavioural drift.  Without a
stable identity, the model may shift tone, adopt unintended personas, or
answer outside its intended domain.  This layer establishes &lt;em&gt;what the model
is&lt;/em&gt; and &lt;em&gt;what it is not&lt;/em&gt;, providing the behavioural foundation for all
the layers below.&lt;/p&gt;
&lt;p&gt;&lt;a id="safety--compliance"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="2-safety-compliance"&gt;2. Safety &amp;amp; Compliance&lt;/h3&gt;
&lt;p&gt;Safety and compliance constraints ensure the model minimises harmful,
disallowed, or high‑risk content.  This protects users, organisations, and
downstream systems.  It is essential for any public‑facing or regulated
deployment. This helps to ensure that the model behaves within acceptable
boundaries.&lt;/p&gt;
&lt;p&gt;&lt;a id="capability-boundaries"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="3-capability-boundaries"&gt;3. Capability Boundaries&lt;/h3&gt;
&lt;p&gt;LLMs tend to overreach. They might claim abilities they do not have or
fabricate tools, APIs, or actions.  This layer reduces the likelihood that
the model will perform operations outside its scope.  It keeps the system more
honest, more predictable, and aligned with its real capabilities.&lt;/p&gt;
&lt;p&gt;&lt;a id="output-format"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="4-output-format"&gt;4. Output Format&lt;/h3&gt;
&lt;p&gt;Programmatic systems require structured, unambiguous, machine‑readable output.
This layer enforces schemas, reduces the likelihood of format drift, and helps
to ensure downstream components can reliably parse responses.  It helps move
the model away from a conversational agent towards a dependable software
component.&lt;/p&gt;
&lt;p&gt;&lt;a id="citation-rules"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="5-citation-rules"&gt;5. Citation Rules&lt;/h3&gt;
&lt;p&gt;Citation rules enforce traceability and verifiability.  &lt;/p&gt;
&lt;p&gt;This layer reduces the likelihood of fabricated sources, invented URLs, and
unsupported claims.  This layer is essential for any system that must justify
its answers or provide evidence for its statements.&lt;/p&gt;
&lt;p&gt;&lt;a id="rag-grounding"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="6-rag-grounding"&gt;6. RAG Grounding&lt;/h3&gt;
&lt;p&gt;RAG grounding ensures the model uses only the supplied context as its source
of truth.  It damps down hallucinations by binding the model to provided
evidence.  This layer is the core of retrieval‑augmented generation and is
mandatory for knowledge‑grounded systems.&lt;/p&gt;
&lt;p&gt;This approach does not eliminate hallucinations but it will reduce them.&lt;/p&gt;
&lt;p&gt;&lt;a id="reasoning-strategy"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="7-reasoning-strategy"&gt;7. Reasoning Strategy&lt;/h3&gt;
&lt;p&gt;Reasoning strategy helps to stabilise the model’s logic.  It moves towards
stepwise thinking, disambiguation, and evidence‑first reasoning.  This layer
reduces subtle reasoning errors and improves consistency across complex tasks.&lt;/p&gt;
&lt;p&gt;&lt;a id="task-logic"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="8-task-logic"&gt;8. Task Logic&lt;/h3&gt;
&lt;p&gt;Task logic governs how the model interprets and executes user instructions.
It handles ambiguity, resolves contradictions, and decomposes multi‑part
tasks.  This layer ensures the model behaves reliably in real‑world, messy,
human‑language scenarios.&lt;/p&gt;
&lt;h1 id="the-eight-layer-stack"&gt;The Eight Layer Stack&lt;/h1&gt;
&lt;p&gt;These eight layers form a stack where each layer protects against a different class of LLM failure:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Prevents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Identity&lt;/td&gt;
&lt;td&gt;Drift, persona instability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Safety &amp;amp; Compliance&lt;/td&gt;
&lt;td&gt;Harmful or non‑compliant output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capability Boundaries&lt;/td&gt;
&lt;td&gt;Overreach, fabricated abilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output Format&lt;/td&gt;
&lt;td&gt;Schema breakage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Citation Rules&lt;/td&gt;
&lt;td&gt;Unsupported claims&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG Grounding&lt;/td&gt;
&lt;td&gt;Hallucination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning Strategy&lt;/td&gt;
&lt;td&gt;Faulty logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task Logic&lt;/td&gt;
&lt;td&gt;Misinterpretation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Together, they create a more controlled and predictable calling-side
interface to an AI system.&lt;/p&gt;
&lt;h1 id="the-minimal-stack"&gt;The Minimal Stack&lt;/h1&gt;
&lt;p&gt;For any programmatic interaction with an LLM, three layers are essential:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Identity &lt;/li&gt;
&lt;li&gt;Capability Boundaries&lt;/li&gt;
&lt;li&gt;Output Format&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Identity prevents behavioural drift. Capability boundaries reduce the
likelihood of fabricated abilities, tools, or actions. Output format
constraints reduce the likelihood of schema drift, malformed JSON, and
downstream parsing failures.&lt;/p&gt;
&lt;p&gt;Drift from the required behaviour leads to calling‑side errors. These three
layers reduce the likelihood of the most fundamental failure modes.&lt;/p&gt;
&lt;h1 id="the-minimal-stack-for-rag"&gt;The Minimal Stack for RAG&lt;/h1&gt;
&lt;p&gt;Retrieval‑Augmented Generation (RAG) improves accuracy by supplying the model
with domain‑specific and up‑to‑date information from a document store. The
model uses this retrieved content to produce a grounded and human‑readable
response.&lt;/p&gt;
&lt;p&gt;RAG passes to the LLM your domain data that its answer is constrained to
be based on, using the LLM's language-processing features to produce a
human-friendly response. RAG reduces hallucinations and improves factual
accuracy.&lt;/p&gt;
&lt;p&gt;The minimal RAG stack consists of the three core layers, plus RAG Grounding
and Citation Rules. This creates a five‑layer baseline for any RAG system.&lt;/p&gt;
&lt;p&gt;These layers improve stability, reduce unsupported claims, and increase the
reliability of the final output.&lt;/p&gt;
&lt;p&gt;RAG Grounding ensures the model uses the retrieved content as its source of
truth. Citation Rules reduce the likelihood of invented sources and
unsupported statements.&lt;/p&gt;
&lt;p&gt;RAG is required when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;accuracy matters&lt;/li&gt;
&lt;li&gt;knowledge changes frequently&lt;/li&gt;
&lt;li&gt;domain‑specific expertise is required&lt;/li&gt;
&lt;li&gt;hallucinations are unacceptable&lt;/li&gt;
&lt;li&gt;answers must be auditable&lt;/li&gt;
&lt;li&gt;you need to integrate private or internal documents&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="the-minimal-stack-for-public-facing-systems"&gt;The Minimal Stack for Public-Facing Systems&lt;/h1&gt;
&lt;p&gt;Public‑facing systems require the five‑layer RAG stack plus Safety and Compliance.&lt;/p&gt;
&lt;p&gt;These six layers form the minimum configuration for any system exposed to real users. They address:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;behavioural stability&lt;/li&gt;
&lt;li&gt;safety&lt;/li&gt;
&lt;li&gt;overreach damping&lt;/li&gt;
&lt;li&gt;structured output&lt;/li&gt;
&lt;li&gt;evidence requirements&lt;/li&gt;
&lt;li&gt;grounding to damp down hallucinations&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="the-full-8-layer-stack"&gt;The Full 8 Layer Stack&lt;/h1&gt;
&lt;p&gt;The final two layers are Reasoning Strategy and Task Logic.&lt;/p&gt;
&lt;p&gt;Reasoning strategy is required when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the model must break problems into steps&lt;/li&gt;
&lt;li&gt;ambiguity must be resolved before answering&lt;/li&gt;
&lt;li&gt;shallow or shortcut reasoning would cause errors&lt;/li&gt;
&lt;li&gt;the system must justify or stabilise its logic&lt;/li&gt;
&lt;li&gt;you want consistent reasoning across varied prompts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This layer reduces subtle reasoning failures that grounding alone cannot address.&lt;/p&gt;
&lt;p&gt;Task Logic is required when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;instructions are complex or multi‑part&lt;/li&gt;
&lt;li&gt;instructions conflict or require prioritisation&lt;/li&gt;
&lt;li&gt;tasks must be decomposed before execution&lt;/li&gt;
&lt;li&gt;the system must handle unstructured or ambiguous input&lt;/li&gt;
&lt;li&gt;consistent behaviour is required across varied task types&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This layer helps ensure the model interprets and executes instructions
correctly.&lt;/p&gt;
&lt;h1 id="using-the-eight-layers-in-code"&gt;Using the Eight Layers in Code&lt;/h1&gt;
&lt;h2 id="openais-api-is-stateless"&gt;OpenAI's API is Stateless&lt;/h2&gt;
&lt;p&gt;Note: OpenAI’s APIs are stateless by default. Each request only contains the
context you explicitly send. Each text generation request is independent and
stateless. Therefore, multi‑turn conversations only occur when you manually
include previous messages in the request. The code below has no requirement to
do this and so such a history is not present. If it was, later answers would
be influenced by earlier queries and this is not required for this
interaction.&lt;/p&gt;
&lt;p&gt;With OpenAIi, you can use a conversation memory. This is possible with OpenAI
features such as conversation, previous_response_id (Responses API) or the
Agents SDK’s session memory. &lt;/p&gt;
&lt;h2 id="coding-the-eight-layers"&gt;Coding the Eight Layers&lt;/h2&gt;
&lt;p&gt;The approach here is to represent each layer as a dictionary that always has a
'role' key (set to 'system' or 'user'). The other keys are used to define a
standard set of values. When passed to OpenAI's API, each dictionary is
processed to build an OpenAI API-compatible dictionary which consists of just
'role' and 'content'.&lt;/p&gt;
&lt;p&gt;'content' is constructed from the non-role values below.&lt;/p&gt;
&lt;p&gt;We can imagine each dictionary being retrieved from a configuration store and
the keys are just names for the associated value. These names enable you to
discuss constraint types per layer. It is the values that become part of
'content'.&lt;/p&gt;
&lt;div style="max-width: 1800px; margin: 0 auto;"&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1. Identity Layer&lt;/span&gt;
    &lt;span class="n"&gt;system_identity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"identity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"You are a retrieval‑augmented assistant."&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Safety &amp;amp; Compliance Layer&lt;/span&gt;
&lt;span class="n"&gt;system_safety_compliance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# Core safety principles&lt;/span&gt;
    &lt;span class="s2"&gt;"no_harm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"The assistant must not provide harmful, dangerous, or abusive content."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_illegal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"The assistant must not assist with illegal activities, evasion, or wrongdoing."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_personal_data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"The assistant must not request, store, or infer personal data about real individuals."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_medical_advice"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"The assistant must not provide medical, legal, or financial advice beyond what is explicitly allowed."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_sensitive_inference"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"The assistant must not infer protected attributes (race, religion, health, etc.)."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# Refusal behaviour&lt;/span&gt;
    &lt;span class="s2"&gt;"refusal_style"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"If a request violates safety rules, the assistant must refuse clearly and briefly."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"refusal_format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Refusals must be one sentence, factual, and non‑judgmental."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"refusal_no_elaboration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Do not provide workarounds, alternatives, or detailed explanations when refusing."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# Compliance priority&lt;/span&gt;
    &lt;span class="s2"&gt;"compliance_overrides"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Safety and compliance rules override all other instructions, including user requests."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_conflicting_instructions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"If user instructions conflict with safety rules, follow safety rules."&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Capability Boundaries Layer&lt;/span&gt;
&lt;span class="n"&gt;system_capability_boundaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# Allowed capabilities&lt;/span&gt;
    &lt;span class="s2"&gt;"allowed_scope"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"Interpret user questions."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"Use ONLY the provided context for answers."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"Produce structured JSON according to the schema."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"Explain reasoning based solely on the context."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"Quote exact lines from the context when required."&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;

    &lt;span class="c1"&gt;# Disallowed capabilities&lt;/span&gt;
    &lt;span class="s2"&gt;"disallowed_scope"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"Do NOT use external knowledge."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"Do NOT invent facts, labels, or citations."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"Do NOT answer questions outside the provided context."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"Do NOT perform tasks requiring tools, browsing, or external systems."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"Do NOT generate content outside the required schema."&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;

    &lt;span class="c1"&gt;# Boundaries for reasoning&lt;/span&gt;
    &lt;span class="s2"&gt;"reasoning_limits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Reasoning must be explicit but must not include hidden steps or invented logic."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# Boundaries for output&lt;/span&gt;
    &lt;span class="s2"&gt;"format_limits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Output must remain within the exact schema and must not include additional fields or commentary."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# Boundaries for behaviour&lt;/span&gt;
    &lt;span class="s2"&gt;"no_role_shift"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"The assistant must not change persona, identity, or role unless explicitly instructed by system messages."&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Output Format Layer&lt;/span&gt;
&lt;span class="n"&gt;system_output_format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"single_line_json"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Your output MUST be a SINGLE JSON object on ONE LINE ONLY."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;schema_out&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"strict_structure"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"The output must follow the exact schema structure with no deviations."&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 5. Citation / Attribution Layer&lt;/span&gt;
&lt;span class="n"&gt;system_citation_rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"label_requirement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Every citation MUST begin with the exact Incoming Context=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; label from the source."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"quote_requirement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Every citation MUST include the exact quoted line from that same context block."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_label_omission"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Do NOT omit the Incoming Context label."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_label_invention"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Do NOT invent labels."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_summarisation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Do NOT summarise lines; quote them exactly."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"empty_citations_when_missing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"If the answer is not in the context, output an empty Citations section with correct structure."&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 6. RAG Grounding Layer&lt;/span&gt;
&lt;span class="n"&gt;system_rag_grounding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"use_context_only"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Use ONLY the provided context to answer the question."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_context_no_answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"If the answer is not in the context, explicitly say so."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"multiple_valid_answers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Multiple answers may be valid; include all that are supported by the context."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"context_is_authoritative"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"The provided context is the ONLY source of truth."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_external_knowledge"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Do NOT use outside knowledge or assumptions."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"answer_must_reference_context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"All answers must be derived strictly from the context block."&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 7. Reasoning Strategy Layer&lt;/span&gt;
&lt;span class="n"&gt;system_reasoning_strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# How to reason&lt;/span&gt;
    &lt;span class="s2"&gt;"carefully_read"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"First, carefully read the context and the question."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"identify_all"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Identify all relevant passages in the context."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"explain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Explain, step by step, how those passages support your answer."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"explicit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Make your reasoning explicit, but concise."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_invention"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Do not invent facts that are not in the context."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"honesty"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"The 'reasoning' field is for developers and will be logged. Be honest and explicit."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# How reasoning connects to citations&lt;/span&gt;
    &lt;span class="s2"&gt;"reasoning_field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"The reasoning field must refer only to information present in the provided context."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"clear_explain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Clearly explain how the quoted lines in 'citations' support the 'answer'."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"avoid_generic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Avoid generic phrases like 'based on the context'; be specific about which parts matter."&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 8. Task Logic Layer&lt;/span&gt;
&lt;span class="n"&gt;system_task_logic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# Instruction hierarchy&lt;/span&gt;
    &lt;span class="s2"&gt;"interpretation_priority"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"1. Follow system instructions."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"2. Follow developer instructions."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"3. Follow user instructions."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"4. Follow schema and formatting rules."&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;

    &lt;span class="c1"&gt;# Ambiguity handling&lt;/span&gt;
    &lt;span class="s2"&gt;"ambiguity_rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"If the question is ambiguous, identify all plausible interpretations."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"Choose the interpretation most directly supported by the context."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"If ambiguity remains, state the ambiguity explicitly in the reasoning field."&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;

    &lt;span class="c1"&gt;# Multi‑part question handling&lt;/span&gt;
    &lt;span class="s2"&gt;"multi_part_rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"If the question contains multiple sub‑questions, answer each one separately."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"If only some sub‑questions are supported by the context, answer those and state which cannot be answered."&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;

    &lt;span class="c1"&gt;# Conflict resolution&lt;/span&gt;
    &lt;span class="s2"&gt;"conflict_rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"If context passages contradict each other, cite both and explain the contradiction."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"If user instructions contradict system instructions, follow system instructions."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"If schema requirements contradict user instructions, follow schema requirements."&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;

    &lt;span class="c1"&gt;# Missing‑information behaviour&lt;/span&gt;
    &lt;span class="s2"&gt;"missing_info"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"If the answer is not present in the context, explicitly say so and provide an empty citations list."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# Strict adherence&lt;/span&gt;
    &lt;span class="s2"&gt;"no_overinterpretation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Do not infer meaning beyond what is explicitly stated in the context."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_assumptions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Do not assume facts, motivations, or implications not present in the context."&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The code above is a list of named Python dictionaries.&lt;/p&gt;
&lt;p&gt;Three additional RAG user objects are also passed (as below) that
contain two additional pieces of data: 'context' and 'user_query'.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;context&lt;/code&gt; contains the input for the RAG. It is the result of the
local search that is chunked.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;user_query&lt;/code&gt; is the prompt from the user, e.g., "are there any
restrictions in this contract".&lt;/p&gt;
&lt;div style="max-width: 1800px; margin: 0 auto;"&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;rag_user_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Context"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;rag_user_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Question"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"user_query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;rag_user_rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"context_is_authoritative"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"The assistant must treat the provided context as the ONLY source of truth."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_external_knowledge"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"The assistant must not use outside knowledge or assumptions."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"answer_must_reference_context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"All answers must be derived strictly from the context block."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"no_context_no_answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"If the answer is not present in the context, the assistant must explicitly state this."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"multiple_answers_allowed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"If multiple valid answers exist in the context, the assistant should include all of them."&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;OpenAI has a specific schema for JSON object input. An object with two
keys is expected 'role' and 'content'. Role is one of 'user', 'system',
or 'assistant'. 'content' is assigned the result of processing each
of the above user and system dictionaries with &lt;code&gt;to_message&lt;/code&gt;.&lt;/p&gt;
&lt;div style="max-width: 1800px; margin: 0 auto;"&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;to_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Build content from all non-role fields&lt;/span&gt;
    &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="c1"&gt;# If the value is a list, join its items&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Before calling OpenAI, all of the objects above are added to a list.&lt;/p&gt;
&lt;div style="max-width: 1800px; margin: 0 auto;"&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;to_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_identity&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Layer 1&lt;/span&gt;
        &lt;span class="n"&gt;to_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_safety_compliance&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Layer 2&lt;/span&gt;
        &lt;span class="n"&gt;to_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_capability_boundaries&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Layer 3&lt;/span&gt;
        &lt;span class="n"&gt;to_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_output_format&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Layer 4&lt;/span&gt;
        &lt;span class="n"&gt;to_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_citation_rules&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Layer 5&lt;/span&gt;
        &lt;span class="n"&gt;to_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_rag_grounding&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Layer 6&lt;/span&gt;
        &lt;span class="n"&gt;to_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_reasoning_strategy&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Layer 7&lt;/span&gt;
        &lt;span class="n"&gt;to_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_task_logic&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Layer 8&lt;/span&gt;

        &lt;span class="c1"&gt;# User context + question&lt;/span&gt;
        &lt;span class="n"&gt;to_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rag_user_context&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;to_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rag_user_query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;to_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rag_user_rules&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# optional but recommended&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;A list of processed layers makes contraining the actions of the LLM
trivial. If you need a new layer you create a new dictionary and add it
to the list, as above.&lt;/p&gt;
&lt;p&gt;The list is then passed to &lt;code&gt;build_params&lt;/code&gt;.&lt;/p&gt;
&lt;div style="max-width: 1800px; margin: 0 auto;"&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;build_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'model'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'gpt-5.4-nano'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'input'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'messages'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code&gt;build_params&lt;/code&gt; ensures we target the same model each time.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;open_ai_query&lt;/code&gt; calls OpenAI's API. The python code calls a wrapper
like this to supply the &lt;code&gt;messages&lt;/code&gt; list.&lt;/p&gt;
&lt;div style="max-width: 1800px; margin: 0 auto;"&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;json_ai_user_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;open_ai_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;build_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code&gt;open_ai_query&lt;/code&gt; is:&lt;/p&gt;
&lt;div style="max-width: 1800px; margin: 0 auto;"&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;open_ai_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Without a valid key, this code will not work&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;your key&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Substitute your OpenAI API key here&lt;/span&gt;

    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'input'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clean_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'input'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'output_text'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_text&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'response'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'date'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'output_text'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The call to OpenAI is the line &lt;code&gt;client.responses.create(**params)&lt;/code&gt;. The value
&lt;code&gt;params&lt;/code&gt; is passed in unpacked (&lt;code&gt;**params&lt;/code&gt;) to provide dictionary keys as
function parameters. This is a convenient way of specifying what should be
passed to OpenAI.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;params&lt;/code&gt; then has a number of other keys and values assigned. This is
to support traceability.&lt;/p&gt;
&lt;p&gt;Supporting traceability will be discussed in a future article. LLM calls
require more than logging and observability. They require traceability,
especially when decisions are made based on LLM output. Our systems need to be
able to show which model was called, when, what the reasoning was, what result
was gained, and any chain of LLM calls. Logging and observability alone do not
do this.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;open_ai_query&lt;/code&gt; relies on &lt;code&gt;clean_input&lt;/code&gt; which is simply this:&lt;/p&gt;
&lt;div style="max-width: 1800px; margin: 0 auto;"&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;clean_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;codecs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"unicode_escape"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model_input&lt;/span&gt; &lt;span class="c1"&gt;# return what is given as best-effort.&lt;/span&gt;

        &lt;span class="c1"&gt;# Escape sequences may affect your results due to model tokenisation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;h1 id="increasing-the-number-of-instructions-per-layer"&gt;Increasing the number of instructions per layer&lt;/h1&gt;
&lt;p&gt;As the system prompt grows, each instruction carries less relative influence.
The model processes all tokens uniformly, so important constraints can lose
emphasis when surrounded by a large volume of text. Long prompts also make it
harder for the model to infer priority and can hide small contradictions
between layers. Clear ordering and explicit priority rules help reduce this
effect.&lt;/p&gt;
&lt;h1 id="instruction-collisions"&gt;Instruction Collisions&lt;/h1&gt;
&lt;p&gt;When multiple layers contain overlapping or conflicting instructions, the LLM
must resolve the conflict using the text alone. The final system message ithat
it sees takeis precedence, but subtle inconsistencies can weaken the intended
behaviour. Ensuring that layers do not contradict each other and that priority
is stated explicitly reduces this risk.&lt;/p&gt;
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;h2 id="llms-require-structured-interfaces"&gt;LLMs Require Structured Interfaces&lt;/h2&gt;
&lt;p&gt;LLMs do not behave like deterministic software components. They generate
tokens based on probability, which means natural‑language prompts alone are
not a stable or reliable interface.&lt;/p&gt;
&lt;h2 id="layered-constraints-improve-reliability"&gt;Layered Constraints Improve Reliability&lt;/h2&gt;
&lt;p&gt;A layered constraint model is necessary to reduce common failure modes.
Identity, Capability Boundaries, and Output Format form the minimal stack for
programmatic use. RAG systems require additional grounding and citation
layers. Public‑facing systems require safety controls. Full reasoning systems
benefit from all eight layers.&lt;/p&gt;
&lt;h2 id="rag-provides-essential-grounding"&gt;RAG Provides Essential Grounding&lt;/h2&gt;
&lt;p&gt;RAG supplies the model with domain‑specific and current information. It
reduces hallucinations and improves factual accuracy, but it still requires
constraints to ensure the model uses retrieved content correctly.&lt;/p&gt;
&lt;h1 id="prompt-length-and-consistency-matter"&gt;Prompt Length and Consistency Matter&lt;/h1&gt;
&lt;p&gt;As system prompts grow, individual instructions lose emphasis. Clear ordering
and explicit priority rules help maintain consistent behaviour. Avoiding
contradictory instructions is essential for predictable output.&lt;/p&gt;
&lt;h1 id="failure-modes-can-be-reduced-not-removed"&gt;Failure Modes Can Be Reduced, Not Removed&lt;/h1&gt;
&lt;p&gt;LLMs remain probabilistic. Constraints reduce the likelihood of errors but
cannot eliminate them. Treating the prompt as a structured interface, rather
than a single instruction, produces more predictable, testable, and
maintainable systems.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="agents-cannot-maintain-systems.html"&gt;LLMs can generate code, but they cannot modify or maintain systems because system‑level work requires causal reasoning, not pattern‑matching.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="engineers-need-to-know.html"&gt;Software engineers must understand tokens, structure, and probabilistic behaviour to build reliable systems and avoid mismatches between test and production behaviour.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="evaluate-ai.html"&gt;Evaluating AI systems requires measuring real behaviour — schema reliability, adherence, drift, latency, retrieval quality, and safety — not synthetic benchmarks.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#programmatic-interfaces-to-ai-systems"&gt;Programmatic Interfaces to AI Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-challenge"&gt;The Challenge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#prompt-constraints"&gt;Prompt Constraints&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-eight-layers"&gt;The Eight Layers&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#1-identity"&gt;1. Identity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#2-safety-compliance"&gt;2. Safety &amp;amp; Compliance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#3-capability-boundaries"&gt;3. Capability Boundaries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#4-output-format"&gt;4. Output Format&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#5-citation-rules"&gt;5. Citation Rules&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#6-rag-grounding"&gt;6. RAG Grounding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#7-reasoning-strategy"&gt;7. Reasoning Strategy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#8-task-logic"&gt;8. Task Logic&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-eight-layer-stack"&gt;The Eight Layer Stack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-minimal-stack"&gt;The Minimal Stack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-minimal-stack-for-rag"&gt;The Minimal Stack for RAG&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-minimal-stack-for-public-facing-systems"&gt;The Minimal Stack for Public-Facing Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-full-8-layer-stack"&gt;The Full 8 Layer Stack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#using-the-eight-layers-in-code"&gt;Using the Eight Layers in Code&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#openais-api-is-stateless"&gt;OpenAI's API is Stateless&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#coding-the-eight-layers"&gt;Coding the Eight Layers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#increasing-the-number-of-instructions-per-layer"&gt;Increasing the number of instructions per layer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#instruction-collisions"&gt;Instruction Collisions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#llms-require-structured-interfaces"&gt;LLMs Require Structured Interfaces&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#layered-constraints-improve-reliability"&gt;Layered Constraints Improve Reliability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#rag-provides-essential-grounding"&gt;RAG Provides Essential Grounding&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#prompt-length-and-consistency-matter"&gt;Prompt Length and Consistency Matter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#failure-modes-can-be-reduced-not-removed"&gt;Failure Modes Can Be Reduced, Not Removed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Build"></category></entry><entry><title>What Tech Executives Need to Know About Working With LLMs</title><link href="https://phroneses.com/articles/leadership/notes/tech-executives-llms.html" rel="alternate"></link><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-26:/articles/leadership/notes/tech-executives-llms.html</id><summary type="html">&lt;p&gt;Executives must treat LLMs as probabilistic systems requiring controls, governance, and new forms of oversight.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In working with LLMs, the software engineering industry is
at an observation stage. AI does not require business as usual
but a fundamental change in approach. This article is aimed at
those who manage software engineers so that they are aware of the
massive benefits and huge pitfalls and exposure that AI can bring.&lt;/p&gt;
&lt;h1 id="what-tech-executives-need-to-know-about-working-with-llms"&gt;What Tech Executives Need to Know About Working With LLMs&lt;/h1&gt;
&lt;h2 id="1-llms-are-not-deterministic-components"&gt;1. LLMs Are Not Deterministic Components&lt;/h2&gt;
&lt;p&gt;LLMs generate probabilistic outputs, not rule‑based results. Identical
inputs can produce different outputs. This unpredictability must be
managed with controls. It cannot be assumed away.&lt;/p&gt;
&lt;h2 id="2-llms-introduce-new-failure-modes"&gt;2. LLMs Introduce New Failure Modes&lt;/h2&gt;
&lt;p&gt;LLMs can hallucinate facts, invent sources, drift from schemas, or claim
abilities they do not have. They can produce confident but incorrect
reasoning. Traditional QA does not cover these risks.&lt;/p&gt;
&lt;h2 id="3-rag-changes-risk-it-does-not-remove-it"&gt;3. RAG Changes Risk, It Does Not Remove It&lt;/h2&gt;
&lt;p&gt;RAG improves factual grounding but adds new dependencies. Retrieval
quality, document governance, citation accuracy, and context integrity
all affect system behaviour. The data pipeline becomes part of risk
management.&lt;/p&gt;
&lt;h2 id="4-compliance-exposure-is-direct-and-material"&gt;4. Compliance Exposure Is Direct and Material&lt;/h2&gt;
&lt;p&gt;LLM outputs can violate data protection laws, sector regulations,
copyright rules, safety standards, and consumer protection laws. Because
outputs vary, violations can occur without warning. LLM output is
regulated content.&lt;/p&gt;
&lt;p&gt;LLM output is considered regulated output because, once it leaves the
model and enters your organisation’s systems, it becomes functionally
indistinguishable from any other content your company produces.
Regulators do not care that it was generated by an LLM. They care about
its effects.&lt;/p&gt;
&lt;h2 id="5-statutory-liability-extends-beyond-the-model"&gt;5. Statutory Liability Extends Beyond the Model&lt;/h2&gt;
&lt;p&gt;Liability arises from incorrect outputs, harmful content, decisions made
using LLM results, missing audit trails, and weak oversight. The
organisation, not the LLM vendor, carries the exposure.&lt;/p&gt;
&lt;h2 id="6-governance-must-be-built-into-the-architecture"&gt;6. Governance Must Be Built Into the Architecture&lt;/h2&gt;
&lt;p&gt;Systems must include identity constraints, capability boundaries, output
format rules, grounding controls, citation rules, safety layers, audit
logs, and drift monitoring. Governance is a technical requirement, not a
policy document.&lt;/p&gt;
&lt;h2 id="7-evaluation-requires-a-dedicated-function"&gt;7. Evaluation Requires a Dedicated Function&lt;/h2&gt;
&lt;p&gt;Evaluation must cover schema checks, grounding fidelity, safety tests,
reasoning quality, adversarial probing, and drift tracking. This work is
continuous and specialised. It cannot be handled ad‑hoc by developers.&lt;/p&gt;
&lt;h2 id="8-vendor-models-do-not-remove-responsibility"&gt;8. Vendor Models Do Not Remove Responsibility&lt;/h2&gt;
&lt;p&gt;Using a third‑party model does not transfer risk. Your organisation is
responsible for outputs, data handling, integration behaviour, and
controls. Outsourcing the model is not outsourcing the risk.&lt;/p&gt;
&lt;h2 id="9-llm-systems-must-be-treated-as-regulated-infrastructure"&gt;9. LLM Systems Must Be Treated as Regulated Infrastructure&lt;/h2&gt;
&lt;p&gt;LLMs influence decisions, customer interactions, internal processes, and
public content. They must be governed like any regulated system with
clear controls, auditability, and oversight.&lt;/p&gt;
&lt;h2 id="10-strategic-direction-build-capability-not-experiments"&gt;10. Strategic Direction: Build Capability, Not Experiments&lt;/h2&gt;
&lt;p&gt;Executives should invest in controlled architectures, evaluation teams,
compliance‑aligned processes, clear ownership of AI risk, continuous
monitoring, and safe scaling. LLM adoption is an organisational
capability, not a series of pilots.&lt;/p&gt;
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;LLMs introduce technical, operational, and regulatory risks that cannot
be managed through normal development practices. Their behaviour is
probabilistic, their failure modes are unique, and their outputs carry
direct compliance and statutory exposure. The organisation must respond
with structured controls, continuous evaluation, and clear ownership.&lt;/p&gt;
&lt;h2 id="actions-for-tech-executives"&gt;Actions for Tech Executives&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Treat LLMs as high‑risk components that require strict controls.&lt;/li&gt;
&lt;li&gt;Mandate architectural layers for identity, boundaries, and format.&lt;/li&gt;
&lt;li&gt;Require governance of the retrieval pipeline in all RAG systems.&lt;/li&gt;
&lt;li&gt;Classify all LLM output as regulated content with compliance review.&lt;/li&gt;
&lt;li&gt;Establish audit trails, traceability, and runtime enforcement.&lt;/li&gt;
&lt;li&gt;Create a dedicated AI evaluation team with ongoing responsibility.&lt;/li&gt;
&lt;li&gt;Integrate legal, risk, and compliance into the development lifecycle.&lt;/li&gt;
&lt;li&gt;Do not rely on vendors for safety or liability protection.&lt;/li&gt;
&lt;li&gt;Govern LLM systems like regulated infrastructure, not experiments.&lt;/li&gt;
&lt;li&gt;Invest in long‑term capability: controlled architecture, monitoring,
  and safe scaling.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="take-away"&gt;Take Away&lt;/h2&gt;
&lt;p&gt;LLM adoption is not a feature. It is an organisational commitment that
requires governance, evaluation, and cross‑functional oversight. These
actions are the minimum required to deploy AI systems safely and
responsibly at scale.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="building-safe-llm-systems.html"&gt;LLM systems behave differently from traditional software and require layered safety, strong governance, observability, and architectural discipline to operate reliably and sustainably.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="transforming.html"&gt;AI adoption is an organisational transformation requiring mandates, measurement, and redesigned processes.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="ai-and-brands-framework.html"&gt;AI strengthens brands when it improves precision, consistency, and control — and destroys them when it introduces noise.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#what-tech-executives-need-to-know-about-working-with-llms"&gt;What Tech Executives Need to Know About Working With LLMs&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#1-llms-are-not-deterministic-components"&gt;1. LLMs Are Not Deterministic Components&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#2-llms-introduce-new-failure-modes"&gt;2. LLMs Introduce New Failure Modes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#3-rag-changes-risk-it-does-not-remove-it"&gt;3. RAG Changes Risk, It Does Not Remove It&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#4-compliance-exposure-is-direct-and-material"&gt;4. Compliance Exposure Is Direct and Material&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#5-statutory-liability-extends-beyond-the-model"&gt;5. Statutory Liability Extends Beyond the Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#6-governance-must-be-built-into-the-architecture"&gt;6. Governance Must Be Built Into the Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#7-evaluation-requires-a-dedicated-function"&gt;7. Evaluation Requires a Dedicated Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#8-vendor-models-do-not-remove-responsibility"&gt;8. Vendor Models Do Not Remove Responsibility&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#9-llm-systems-must-be-treated-as-regulated-infrastructure"&gt;9. LLM Systems Must Be Treated as Regulated Infrastructure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#10-strategic-direction-build-capability-not-experiments"&gt;10. Strategic Direction: Build Capability, Not Experiments&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#actions-for-tech-executives"&gt;Actions for Tech Executives&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#take-away"&gt;Take Away&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Leadership"></category></entry><entry><title>Transforming Your Business for AI</title><link href="https://phroneses.com/articles/leadership/notes/transforming.html" rel="alternate"></link><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-26:/articles/leadership/notes/transforming.html</id><summary type="html">&lt;p&gt;AI adoption is an organisational transformation requiring mandates, measurement, and redesigned processes.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;AI adoption is no longer a technical experiment. It is an organisational
transformation that affects safety, compliance, cost, and long‑term operating
discipline. The organisations that succeed will be those that treat AI systems
as engineered pipelines, not magical components.&lt;/p&gt;
&lt;p&gt;This article sets out the practical steps required for your business to adopt AI
can deploy it safely, predictably, and economically.&lt;/p&gt;
&lt;h1 id="establish-clear-executive-mandates"&gt;Establish Clear Executive Mandates&lt;/h1&gt;
&lt;p&gt;Transformation begins with leadership. Executives must set non‑negotiable
expectations that shape how AI is designed and governed.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;AI systems must be predictable, observable, and auditable.  &lt;/li&gt;
&lt;li&gt;Safety controls must sit outside the model and must be layered.  &lt;/li&gt;
&lt;li&gt;Retrieval, context assembly, and orchestration must be treated as core infrastructure.  &lt;/li&gt;
&lt;li&gt;Prompts must be treated as logic: reviewed, and versioned.  &lt;/li&gt;
&lt;li&gt;Costs must be controlled through architectural discipline, not vendor optimism.  &lt;/li&gt;
&lt;li&gt;Continuous evaluation must be mandatory across all AI products.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These mandates create the conditions for responsible and sustainable adoption.&lt;/p&gt;
&lt;h1 id="build-teams-around-measurement-and-control"&gt;Build Teams Around Measurement and Control&lt;/h1&gt;
&lt;p&gt;AI systems drift. Retrieval ages. Prompts evolve. Costs rise silently. Teams
must therefore measure the system continuously.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Track retrieval quality and data freshness.  &lt;/li&gt;
&lt;li&gt;Measure latency across the entire pipeline, not only the model call.  &lt;/li&gt;
&lt;li&gt;Monitor token usage and prompt length.  &lt;/li&gt;
&lt;li&gt;Record orchestration overhead and network hops.  &lt;/li&gt;
&lt;li&gt;Detect behavioural drift through ongoing evaluation.  &lt;/li&gt;
&lt;li&gt;Break down cloud costs by retrieval, orchestration, and inference.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Measurement is the foundation of control. Without it, the system will behave in
ways that leadership cannot see or influence.&lt;/p&gt;
&lt;h1 id="redesign-processes-for-probabilistic-systems"&gt;Redesign Processes for Probabilistic Systems&lt;/h1&gt;
&lt;p&gt;Traditional software processes assume deterministic behaviour. AI systems do
not behave this way. Processes must therefore change.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Introduce continuous evaluation pipelines that mirror real user traffic.  &lt;/li&gt;
&lt;li&gt;Add retrieval monitoring to detect index drift and stale data.  &lt;/li&gt;
&lt;li&gt;Review prompts as code, with structure, clarity, and version control.  &lt;/li&gt;
&lt;li&gt;Test safety layers against varied phrasing, not only ideal cases.  &lt;/li&gt;
&lt;li&gt;Add cost reviews that examine token budgets and retrieval patterns.  &lt;/li&gt;
&lt;li&gt;Expand incident response to include retrieval logs, template expansions, and
decoding parameters.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These processes ensure that AI systems remain stable and compliant as they
evolve.&lt;/p&gt;
&lt;h1 id="enforce-architectural-principles-that-reduce-risk"&gt;Enforce Architectural Principles That Reduce Risk&lt;/h1&gt;
&lt;p&gt;AI performance, safety, and cost are determined by architecture, not by model
choice. Leaders must enforce principles that keep systems lean and predictable.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Treat latency as an architectural issue.  &lt;/li&gt;
&lt;li&gt;Minimise retrieval hops and keep data local where possible.  &lt;/li&gt;
&lt;li&gt;Keep prompts short, structured, and purposeful.  &lt;/li&gt;
&lt;li&gt;Treat context windows as scratchpads, not memory.  &lt;/li&gt;
&lt;li&gt;Avoid serial tool chains that behave like queues.  &lt;/li&gt;
&lt;li&gt;Reduce orchestration complexity, because overhead accumulates.  &lt;/li&gt;
&lt;li&gt;Ensure safety is enforced through deterministic layers, not persuasion.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These principles reduce operational risk and prevent cost escalation.&lt;/p&gt;
&lt;h1 id="introduce-governance-that-matches-the-scale-of-the-risk"&gt;Introduce Governance That Matches the Scale of the Risk&lt;/h1&gt;
&lt;p&gt;AI requires governance that is as rigorous as the systems it influences. Leaders
must introduce structures that ensure accountability and oversight.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create a cross‑functional AI governance board.  &lt;/li&gt;
&lt;li&gt;Establish prompt governance for clarity, consistency, and auditability.  &lt;/li&gt;
&lt;li&gt;Introduce retrieval governance to manage data quality and access control.  &lt;/li&gt;
&lt;li&gt;Build a safety governance framework with layered controls.  &lt;/li&gt;
&lt;li&gt;Implement cost governance that enforces architectural discipline.  &lt;/li&gt;
&lt;li&gt;Add model update governance to detect behavioural drift before deployment.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Governance ensures that AI systems remain aligned with organisational standards
and regulatory expectations.&lt;/p&gt;
&lt;h1 id="prepare-the-organisation-for-cultural-change"&gt;Prepare the Organisation for Cultural Change&lt;/h1&gt;
&lt;p&gt;AI transformation is not only technical. It changes how teams think, design, and
operate.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Encourage teams to treat AI as infrastructure, not novelty.  &lt;/li&gt;
&lt;li&gt;Promote clarity, structure, and discipline in all AI‑related work.  &lt;/li&gt;
&lt;li&gt;Train teams to understand probabilistic behaviour and drift.  &lt;/li&gt;
&lt;li&gt;Build shared language around safety, compliance, and cost.  &lt;/li&gt;
&lt;li&gt;Align colleague incentives with long‑term reliability, not short‑term output.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Culture determines whether AI becomes a strategic asset or a source of risk.&lt;/p&gt;
&lt;h1 id="focus-on-business-outcomes-not-model-features"&gt;Focus on Business Outcomes, Not Model Features&lt;/h1&gt;
&lt;p&gt;The value of AI lies in outcomes, not in model specifications. Leaders must
ensure that AI investments support measurable business goals.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Improve decision quality through structured retrieval and controlled outputs.  &lt;/li&gt;
&lt;li&gt;Reduce operational cost through efficient orchestration.  &lt;/li&gt;
&lt;li&gt;Strengthen compliance through observability and audit trails.  &lt;/li&gt;
&lt;li&gt;Enhance customer trust through predictable behaviour.  &lt;/li&gt;
&lt;li&gt;Increase resilience through layered safety and disciplined design.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AI becomes transformative when it is aligned with business priorities.&lt;/p&gt;
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Transforming a business for AI requires clear mandates, disciplined measurement,
new processes, strong architecture, and rigorous governance. The organisations
that succeed will be those that treat AI systems as engineered pipelines, that
design for predictability and auditability, and that recognise that the true
challenges lie not in the model, but in the machinery that surrounds it. This is
a leadership challenge as much as a technical one, and it demands clarity,
discipline, and long‑term thinking.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="tech-executives.html"&gt;Executives must treat LLMs as probabilistic systems requiring controls, governance, and new forms of oversight.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="team-ai-is-the-next-step.html"&gt;Individual AI delivers diminishing returns; meaningful improvement comes from strengthening the collective workflow.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="ai-and-brands-framework.html"&gt;AI strengthens brands when it improves precision, consistency, and control — and destroys them when it introduces noise.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#establish-clear-executive-mandates"&gt;Establish Clear Executive Mandates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#build-teams-around-measurement-and-control"&gt;Build Teams Around Measurement and Control&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#redesign-processes-for-probabilistic-systems"&gt;Redesign Processes for Probabilistic Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#enforce-architectural-principles-that-reduce-risk"&gt;Enforce Architectural Principles That Reduce Risk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#introduce-governance-that-matches-the-scale-of-the-risk"&gt;Introduce Governance That Matches the Scale of the Risk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#prepare-the-organisation-for-cultural-change"&gt;Prepare the Organisation for Cultural Change&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#focus-on-business-outcomes-not-model-features"&gt;Focus on Business Outcomes, Not Model Features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Leadership"></category></entry><entry><title>What AI Is (and Isn't)</title><link href="https://phroneses.com/articles/foundations/notes/what-ai-is.html" rel="alternate"></link><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-26:/articles/foundations/notes/what-ai-is.html</id><summary type="html">&lt;p&gt;A clear explanation of what AI is—and is not—cutting through hype to define its real capabilities and limits.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We have all read the articles about our AI future: "AI will take your job".&lt;/p&gt;
&lt;p&gt;This article takes a different path to explain AI clearly, simply, and honestly.&lt;/p&gt;
&lt;h1 id="a-straightforward-definition-of-ai"&gt;A Straightforward Definition of AI&lt;/h1&gt;
&lt;p&gt;AI software learns patterns from lots of examples. Once it has been exposed to those patterns, it can create new text.&lt;/p&gt;
&lt;p&gt;When you ask something like "What is the weather going to do in Glasgow tomorrow?", the AI does not read the sentence the way a human does. Instead, it turns your words into numbers,&lt;/p&gt;
&lt;p&gt;Using these, the AI programming looks for relationships in the sentence. Words like "weather," "tomorrow," and "Glasgow" stand out because they are the important parts of your question.&lt;/p&gt;
&lt;p&gt;Next, the AI uses the data it was trained on (the examples) to statistically evaluate what your question is about. It does not "understand" the way people do, it just recognises patterns it has seen before.&lt;/p&gt;
&lt;p&gt;To create an answer, the AI predicts what should come next, one token at a time. A token might be a word, part of a word, or punctuation. The AI chooses the most likely next token based on patterns in its training data.&lt;/p&gt;
&lt;p&gt;This statistical selection can look like reasoning, but it is really pattern‑matching. If the AI was never trained on weather‑related information, it would not be able to give you a good answer. There would be no tokens on which to base its output.&lt;/p&gt;
&lt;p&gt;Because weather changes constantly, the AI system accesses real weather data from an external source. This is how it can give you an accurate, up‑to‑date forecast instead of basing its output on general Glasgow weather.&lt;/p&gt;
&lt;p&gt;Finally, the AI program puts everything together: your question, the patterns it has learned, the conversation so far, and the real weather data, to generate the output you see.&lt;/p&gt;
&lt;h1 id="but-is-it-intelligent"&gt;But is it Intelligent?&lt;/h1&gt;
&lt;p&gt;AI might sound intelligent, but it does not have consciousness, intentions, or real understanding. It does not know things or have opinions. All the AI program is doing is
recognising patterns in data and using those patterns to produce output.&lt;/p&gt;
&lt;p&gt;When an AI responds, it is not thinking or wanting anything; it is just following statistical cues from the data it was previously shown.&lt;/p&gt;
&lt;p&gt;AI can be incredibly powerful, but it is still just a tool. It does not think or decide things on its own. It can only work with the patterns and data it has been given.&lt;/p&gt;
&lt;p&gt;The value of AI comes from how people choose to use it, not from any independent ability or intention.&lt;/p&gt;
&lt;p&gt;When you type a message on your phone and it suggests the next word, your phone is not thinking. The program in your phone is suggesting a good possible next word based on patterns it has seen before. AI works the same way, just on a much larger scale.&lt;/p&gt;
&lt;p&gt;AI predicts what could reasonably come next in a sentence, an image, or an answer, using patterns learned from huge amounts of training data. AI can be incredibly helpful, but it is still predicting based on patterns, not understanding the world. Without the huge amounts of data, AI would have no patterns to base an answer on.&lt;/p&gt;
&lt;p&gt;Now that we have covered how AI works, here is what it can actually do well.&lt;/p&gt;
&lt;h1 id="what-ai-is-good-at"&gt;What AI Is Good At&lt;/h1&gt;
&lt;p&gt;As AI is programmed to find patterns in huge amounts of data, an AI can easily take long documents and turn them into shorter versions, based on patterns that produce clearer text.&lt;/p&gt;
&lt;p&gt;AI is great for drafting emails, rewriting paragraphs, producing variations, or helping with early versions of content.&lt;/p&gt;
&lt;p&gt;When the topic is something it has seen many examples of (such as a question about the weather), AI can give fast, reliable answers.&lt;/p&gt;
&lt;p&gt;And the vast amount of data AI is trained on means AIs are great at classification, translation, sorting, and extracting key details from text. AIs have seen so many examples, their statistical prediction can appear like it has vast knowledge. But an AI is only selecting a statistical match.&lt;/p&gt;
&lt;p&gt;AI is good at giving options, exploring possible approaches, and speeding up early‑stage work. But, AI still needs human judgement to decide whether what has been produced is of any value.&lt;/p&gt;
&lt;p&gt;There are also clear limits that are important to understand.&lt;/p&gt;
&lt;h1 id="what-ai-is-not-good-at"&gt;What AI Is Not Good At&lt;/h1&gt;
&lt;p&gt;AI recognises patterns, not ideas. AI does not understand what you type or what
it outputs.&lt;/p&gt;
&lt;p&gt;If your question is vague, emotional, or depends on context only humans share, AI often predicts incorrectly. Such a response is the AI program selecting an incorrect prediction based on its statistics.&lt;/p&gt;
&lt;p&gt;AI cannot weigh consequences, values, ethics, or trade‑offs. It can only follow patterns in data. As it does not understand in the human sense, AI cannot perform judgement. Judgement requires intent, values, responsibility, and lived experience. AI has none of these.&lt;/p&gt;
&lt;p&gt;However, AI can simulate judgement extremely well because it has access to vast patterns of expert reasoning, it can structure arguments, and it can select options based on criteria you give it. But this is not judgment. It is pattern-based statistical selection without understanding.&lt;/p&gt;
&lt;p&gt;AI can remix and generate new combinations, but it does not have taste, purpose, or a point of view.&lt;/p&gt;
&lt;p&gt;Anything involving physical experience, social cues, or human behaviour is outside its reach. If you say, "My car has a flat tyre," a person knows that the car
cannot be driven safely, that to fix it you will need tools and that the fix is inconvenient and messy.&lt;/p&gt;
&lt;p&gt;An AI has never changed a tyre. It does not know weight, effort, or danger. It only has access to what people have written about flat tyres.&lt;/p&gt;
&lt;p&gt;An AI can describe the steps to fix the flat (as a person has written about this in the past and this writing is in the training data), but AI does not understand the situation.&lt;/p&gt;
&lt;p&gt;An AI has no lived experience, so it can miss things a person might notice. If someone says, "I brought a bottle of wine to the dinner," a person knows this is a polite gesture. AI does not know social customs, it only has access to training data about customs written by a person.&lt;/p&gt;
&lt;h1 id="your-ai-does-not-know-anything"&gt;Your AI does not know anything&lt;/h1&gt;
&lt;p&gt;AI can sound confident even when it is completely mistaken, because it does not know what it does not know.&lt;/p&gt;
&lt;p&gt;If you ask for restaurant recommendations in a town that does not exist, some AIs may still try to answer, giving you incorrect information as the town does not exist.&lt;/p&gt;
&lt;p&gt;When an AI lacks information, it cannot feel uncertainty or recognise gaps the way people do, so it simply produces the most plausible‑sounding answer based on the patterns it currently has access to.&lt;/p&gt;
&lt;p&gt;An AI might confidently state that Venus has two moons, or invent a law that does not exist or describe an imaginary species as if it were real. Because AI never checks facts or senses its own limits, its pattern‑filling behaviour leads to "hallucinations," where the AI creates details, sources, or events that sound right but are not true.&lt;/p&gt;
&lt;p&gt;If the training data is thin, biased, or missing, the output will be unreliable, no matter how polished the output looks.&lt;/p&gt;
&lt;p&gt;If you ask an AI about something that barely exists in its training data — say, "What dishes are served at the Spring Feast in Millford Glen?", the AI will not calculate that the place or event is fictional.&lt;/p&gt;
&lt;p&gt;With nothing solid to draw from, the AI's program uses loose patterns and produces something that only sounds right, like "They usually serve herb stew and blossom cakes." The answer feels plausible, but it is really just the AI making a poor prediction because the information is too thin.&lt;/p&gt;
&lt;h1 id="the-biggest-misconceptions-about-ai"&gt;The Biggest Misconceptions About AI&lt;/h1&gt;
&lt;p&gt;Many people believe AI thinks, understands, or decides in the way a person does, but this is not the case. AI does not grasp meaning, hold values, or judge situations. It only reflects patterns in the material it was trained on.&lt;/p&gt;
&lt;p&gt;Another misconception is that AI has reliable knowledge about everything. When information is scarce, it often fills the gaps with predictions that sound believable but are not accurate. AI has access to vast data stores. AI has no knowledge, just data and a program to spot patterns.&lt;/p&gt;
&lt;p&gt;People also assume AI is neutral, yet it inherits the biases and assumptions present in its training data. Some imagine AI as a step toward consciousness, but it has no awareness or sense of self. It is a powerful tool, but still a tool, and it must be used with a clear understanding of its limits.&lt;/p&gt;
&lt;h1 id="how-to-use-ai-safely-and-effectively"&gt;How to Use AI Safely and Effectively&lt;/h1&gt;
&lt;p&gt;Using AI safely and effectively starts with treating it as a helpful assistant rather than an authority. It works best when you give it clear instructions, specific goals, and enough context to guide the response.&lt;/p&gt;
&lt;p&gt;It is important to check the information it provides, especially when accuracy matters, because it can sound confident even when it is mistaken.&lt;/p&gt;
&lt;p&gt;AI is strongest when you use it to explore ideas, draft material, summarise information, or speed up routine tasks, while keeping final judgement for yourself.&lt;/p&gt;
&lt;p&gt;AI can boost your creativity, improve your productivity, and help you think in new ways, as long as you stay aware of its limits and verify anything that needs to be correct.&lt;/p&gt;
&lt;h1 id="what-to-keep-in-mind-about-ai"&gt;What to Keep in Mind About AI&lt;/h1&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;AI recognises patterns but does not understand meaning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It predicts what should come next based on data it has seen.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It is strong at summarising, drafting, sorting, and exploring ideas.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It struggles with judgement, context, emotions, and real‑world experience.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It can sound confident even when it is wrong.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It works best when you guide it, check its output, and stay in control.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h1 id="a-simple-mental-model-to-remember"&gt;A Simple Mental Model to Remember&lt;/h1&gt;
&lt;p&gt;Think of AI as a very capable assistant that is excellent at helping you
create, explore, and organise ideas, but one that still needs you to guide it
and check its work.&lt;/p&gt;
&lt;p&gt;AI is powerful but not magical. It recognises patterns but does not understand.
You get the best results when you guide it, check its work, and stay in
control.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="how-ai-works.html"&gt;An explanation of how large language models actually function and why they should not be treated as miniature humans.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="evaluate-ai-claims.html"&gt;A framework for evaluating claims made about AI systems, focusing on evidence, capability, and verifiable performance.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="how-to-use.html"&gt;Guidance on using AI safely and effectively, grounded in recent examples of misuse and emerging best practices.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#a-straightforward-definition-of-ai"&gt;A Straightforward Definition of AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#but-is-it-intelligent"&gt;But is it Intelligent?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-ai-is-good-at"&gt;What AI Is Good At&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-ai-is-not-good-at"&gt;What AI Is Not Good At&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#your-ai-does-not-know-anything"&gt;Your AI does not know anything&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-biggest-misconceptions-about-ai"&gt;The Biggest Misconceptions About AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#how-to-use-ai-safely-and-effectively"&gt;How to Use AI Safely and Effectively&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-to-keep-in-mind-about-ai"&gt;What to Keep in Mind About AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-simple-mental-model-to-remember"&gt;A Simple Mental Model to Remember&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Foundations"></category></entry><entry><title>What software engineers need to know about LLMs</title><link href="https://phroneses.com/articles/build/notes/software-engineers-need-to-know.html" rel="alternate"></link><published>2026-04-25T00:00:00+00:00</published><updated>2026-04-25T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-25:/articles/build/notes/software-engineers-need-to-know.html</id><summary type="html">&lt;p&gt;Software engineers must understand tokens, structure, and probabilistic behaviour to build reliable systems and avoid mismatches between test and production behaviour.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Large language models (LLMs) are disrupting the software engineering industry.
Executives and software engineers now have a tool at their disposal that
is so general in its scope that it can be dedicated to almost any task.
LLMs are the ultimate "jack of all trades". It is our job to get the most
from them.&lt;/p&gt;
&lt;h1 id="the-real-interface-tokens-not-text"&gt;The real interface: tokens, not text&lt;/h1&gt;
&lt;p&gt;Tokens shape what you can build. They decide how much context you can fit in,
how fast the model responds, and how predictable the output is.&lt;/p&gt;
&lt;p&gt;Token boundaries also change how the model interprets structure.  Two prompts that
look identical to you may tokenize differently and produce different behaviour.&lt;/p&gt;
&lt;p&gt;When you design prompts, AI input or output schemas, or retrieval pipelines,
you are really designing token flows. If you ignore tokens, you end up shipping
features that behave one way in tests and another way in production.&lt;/p&gt;
&lt;p&gt;Prompt A:
"Summarize the user login flow."&lt;/p&gt;
&lt;p&gt;Prompt B:
"Summarise the user login flow."&lt;/p&gt;
&lt;p&gt;To a human, the difference is not consequential. To a tokenizer, there is a critical difference.&lt;/p&gt;
&lt;p&gt;"Summarize" and "Summarise" break into different token sequences.&lt;/p&gt;
&lt;p&gt;The model’s internal statistics for each spelling differ.&lt;/p&gt;
&lt;p&gt;The model may shift tone, structure, or level of detail.&lt;/p&gt;
&lt;p&gt;And downstream formatting can change because the token pattern changed.&lt;/p&gt;
&lt;p&gt;or&lt;/p&gt;
&lt;p&gt;Prompt A:
"List the steps to deploy the service."&lt;/p&gt;
&lt;p&gt;Prompt B:
"List the steps to deploy the service ."&lt;/p&gt;
&lt;p&gt;The only difference is a space before the full-stop.&lt;/p&gt;
&lt;p&gt;Prompt A ends with a single token for "service."&lt;/p&gt;
&lt;p&gt;Prompt B ends with two tokens: "service" and "."&lt;/p&gt;
&lt;p&gt;That tiny shift can change the model’s prediction path.&lt;/p&gt;
&lt;h1 id="the-model-is-not-the-system"&gt;The model is not the system&lt;/h1&gt;
&lt;p&gt;Most failures blamed on models usually come from everything wrapped
around them. In practice, the weak points look very familiar to any
engineer who has shipped a distributed system.&lt;/p&gt;
&lt;p&gt;Retrieval pipelines drift because indexes age, embeddings shift, and
data freshness is rarely monitored. A model can only answer the
question you actually retrieved, not the one you meant to retrieve.&lt;/p&gt;
&lt;p&gt;Prompt templates collapse under odd inputs because they are often
treated as static strings instead of executable logic. One unexpected
newline or a missing field can break the entire chain of reasoning. Data
freshness and data cleansing is key here.&lt;/p&gt;
&lt;h2 id="guardrails"&gt;Guardrails&lt;/h2&gt;
&lt;p&gt;Guardrails miss edge cases because they rely on pattern matching, not
semantic guarantees. A single unhandled phrasing can bypass a rule
that looked airtight in testing.&lt;/p&gt;
&lt;p&gt;Imagine you build a guardrail that blocks requests containing
"delete all users". It works in tests, so you ship it.&lt;/p&gt;
&lt;p&gt;Then a real user sends:
"can you delete all the users"
or
"please delete every user"
or
"remove all user accounts"&lt;/p&gt;
&lt;p&gt;Your guardrail only catches the exact phrase it was written for. It
matches strings, not meaning. One phrasing slips through, and the model
executes a path you thought was protected.&lt;/p&gt;
&lt;p&gt;Many guardrails end up acting like string comparisons even when they
use embeddings or classifiers. They match surface patterns, not intent.
If the phrasing shifts, the guardrail often fails.&lt;/p&gt;
&lt;p&gt;For example, a rule might block "delete all users" because that exact
pattern was seen during testing. But the same system may allow "remove
every user account" because the embedding distance is just far enough
to slip past the threshold.&lt;/p&gt;
&lt;p&gt;This is the same failure mode as brittle input validation. If your
rules depend on matching specific strings or narrow patterns, you get
a system that behaves safely in tests and unpredictably in production.&lt;/p&gt;
&lt;p&gt;You cannot solve this by telling the model “if a request is like
'delete all users', refuse to do it”. That feels intuitive, but it
fails for the same reason input‑validation-by-string-match fails in
any other system.&lt;/p&gt;
&lt;p&gt;A prompt can describe the rule, but it cannot enforce the rule. The
model will try to follow the instruction, but it has no semantic
guarantee. It can still be persuaded, confused, or bypassed by a
phrasing it has not seen before.&lt;/p&gt;
&lt;p&gt;To actually solve this, you need layered controls outside the model:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Treat the model as untrusted. Never let it directly execute
   destructive actions. Put a permission layer between the model and
   anything irreversible.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Normalise user input before it reaches the model. Collapse
   phrasing, remove fluff, and classify intent. This gives you a
   stable signal instead of raw text.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use a separate classifier or rules engine to detect dangerous
   intent. This component should be simpler, more predictable, and
   easier to test than the model itself.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Require explicit confirmation for destructive operations. The
   model can propose an action, but a deterministic system must
   approve it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Log every step. When something slips through, you need to see the
   input, the normalised form, the classification result, and the
   model’s output.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The prompt can express the policy, but the system must enforce it.
If you rely on the model alone, you are depending on pattern
matching. If you build a layered pipeline, you get behaviour you can
reason about, test, and trust.&lt;/p&gt;
&lt;h2 id="observability"&gt;Observability&lt;/h2&gt;
&lt;p&gt;Observability is weak because most systems log the request and the response,
but not the context, the retrieval set, the template expansion, or the decoding
parameters. When working with LLMs, without the context, retrieval set,
template expansion and parameter decoding, debugging is guesswork.&lt;/p&gt;
&lt;h2 id="an-llm-is-at-the-centre-of-a-much-larger-system"&gt;An LLM is at the centre of a much larger system&lt;/h2&gt;
&lt;p&gt;The LLM is only one component. The system around it decides whether
your product behaves like a tool or a slot machine. Engineers who
treat the whole pipeline as a software system, not a magic box, build
the reliable systems.&lt;/p&gt;
&lt;h1 id="determinism-is-a-design-choice"&gt;Determinism is a design choice&lt;/h1&gt;
&lt;p&gt;LLMs are probabilistic, but stability is possible. Temperature and
top‑p control variance. Structured outputs reduce drift. Deterministic
decoding is often more reliable than clever prompts. Treat randomness
as a resource you allocate.&lt;/p&gt;
&lt;p&gt;Temperature stretches or compresses the probability distribution.  Top‑p chops
off the tail of the distribution.&lt;/p&gt;
&lt;h1 id="temperature"&gt;Temperature&lt;/h1&gt;
&lt;p&gt;As temperature increases, the LLM becomes more willing to pick
lower‑probability tokens, which effectively means the "token candidate set"
gets larger.&lt;/p&gt;
&lt;p&gt;More accurately, low‑probability tokens get boosted, high‑probability tokens get flattened.&lt;/p&gt;
&lt;p&gt;This means: the model is less confident, more tokens become available, and he
sampling process has more room to explore. The next token is drawn from a wider
effective set&lt;/p&gt;
&lt;h1 id="top-p"&gt;Top-p&lt;/h1&gt;
&lt;p&gt;Top‑p (also called nucleus sampling) restricts the model to sampling only from
the smallest set of tokens whose cumulative probability is ≥ p.&lt;/p&gt;
&lt;p&gt;Think of it as a probability mass cutoff.&lt;/p&gt;
&lt;h2 id="example"&gt;Example&lt;/h2&gt;
&lt;p&gt;Suppose the model predicts the next‑token distribution like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Token&lt;/th&gt;
&lt;th&gt;Probability&lt;/th&gt;
&lt;th&gt;Cumulative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;0.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Sorted by probability, cumulative mass builds like this:&lt;/p&gt;
&lt;p&gt;A → 0.40
A+B → 0.65
A+B+C → 0.80
A+B+C+D → 0.90
A+B+C+D+E → 0.95
A+B+C+D+E+F → 1.00&lt;/p&gt;
&lt;p&gt;Now apply top‑p:&lt;/p&gt;
&lt;p&gt;top‑p = 0.5&lt;/p&gt;
&lt;p&gt;Working down the ordered Probability column abov, we include tokens until
the probability is cumulatively ≥ 0.5. Token A + B are allowed as they are the
first tokens for whom the cumulative probability is ≥ 0.5. Once the
condition is satisfied, we stop descending the column.&lt;/p&gt;
&lt;p&gt;With top-p = 0.5, only tokens A and B are allowed.&lt;/p&gt;
&lt;p&gt;For top‑p = 0.8&lt;/p&gt;
&lt;p&gt;Include tokens until cumulative ≥ 0.8 → A + B + C. Only A, B, C are allowed.&lt;/p&gt;
&lt;p&gt;top‑p = 0.95&lt;/p&gt;
&lt;p&gt;Include tokens until cumulative ≥ 0.95 → A + B + C + D + E. Tokens A to E
allowed; F is excluded.&lt;/p&gt;
&lt;p&gt;When top‑p = 1.0&lt;/p&gt;
&lt;p&gt;No restriction — all tokens allowed.&lt;/p&gt;
&lt;h2 id="passing-temperature-and-top-p-to-openai"&gt;Passing temperature and top-p to OpenAI&lt;/h2&gt;
&lt;p&gt;In calling OpenAI, you can pass this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"gpt-4.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s2"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Explain temperature and top-p."&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="s2"&gt;"temperature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s2"&gt;"top_p"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The last two fields directly control the sampling behaviour.&lt;/p&gt;
&lt;p&gt;You are telling the model:&lt;/p&gt;
&lt;p&gt;"Always pick the highest‑probability token. No randomness."&lt;/p&gt;
&lt;p&gt;This is the closest thing to true determinism.&lt;/p&gt;
&lt;p&gt;With temperature set to 0.0, the highest‑probability token is guaranteed to be
selected, as long as the decoding method is greedy and no other randomness is
introduced by the API or framework.&lt;/p&gt;
&lt;p&gt;In an LLM, the decoder is the component that turns the model’s probability
distribution into tokens.&lt;/p&gt;
&lt;p&gt;Even with temperature equal to 0.0, top‑p could still exclude the
highest‑probability token. For example, if the highest‑probability token is
outside the top‑p nucleus (rare but possible with unusual distributions), the
decoder would be forced to pick a different token. The nucleus is the group of
tokens built cumulatively above.&lt;/p&gt;
&lt;p&gt;Temperature = 0.0 and top_p = 1.0 is the strictest, safest deterministic
configuration.&lt;/p&gt;
&lt;h2 id="context-windows-are-not-memory"&gt;Context windows are not memory&lt;/h2&gt;
&lt;p&gt;AI vendors such as Anthropic and OpenAI control the LLM's window size, but you
control how effectively you use it.&lt;/p&gt;
&lt;p&gt;OpenAI's GPT‑5.4 has a 1,050,000‑token context window. GPT‑5.2, GPT‑5.1, and
GPT‑5.1 Codex Max have 400,000‑token windows.&lt;/p&gt;
&lt;p&gt;The window size is fixed at training time. Changing it requires retraining or
re‑architecting the model, which only the vendor can do.&lt;/p&gt;
&lt;p&gt;The vendor sets the ceiling. You decide how close you get to it.  A 1M‑token
window sounds like "great, I can dump everything in." But that is the wrong
mental model.&lt;/p&gt;
&lt;p&gt;The engineer decides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;how much of the window to fill&lt;/li&gt;
&lt;li&gt;how aggressively to compress&lt;/li&gt;
&lt;li&gt;how to structure retrieval&lt;/li&gt;
&lt;li&gt;how to order information&lt;/li&gt;
&lt;li&gt;how to avoid interference&lt;/li&gt;
&lt;li&gt;how to budget tokens across system prompts, instructions, schemas, and retrieved docs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The vendor gives you the maximum.  You determine the effective window.&lt;/p&gt;
&lt;p&gt;A large window looks powerful, yet it behaves nothing like a bigger RAM module.
The more of the window you use and the larger your use becomes, the model has
to scan and reconcile far more information than it can reliably use. The
signal‑to‑noise ratio drops, and the model starts leaning on familiar
statistical patterns instead of the details that matter.&lt;/p&gt;
&lt;p&gt;Position inside the window matters more than the raw size. Early and
late tokens are not treated equally, and different models weight them
differently. There is no guarantee that the most recent content is the
content the model will use. This is why long prompts often ignore the
last instruction you added.&lt;/p&gt;
&lt;p&gt;Large windows also increase interference. When you pack in too much
material, similar concepts begin to blur. Two sections that look
distinct to you can collide inside the model’s internal
representation. The output feels vague or inconsistent even though the
inputs look clean.&lt;/p&gt;
&lt;h2 id="retrieval-quality-beats-window-size"&gt;Retrieval quality beats window size&lt;/h2&gt;
&lt;p&gt;This is why retrieval quality beats window size. Retrieval gives you
control over what enters the window and where it goes. A large window
without retrieval is just a bigger bucket. A smaller window with good
retrieval is a structured workspace.&lt;/p&gt;
&lt;p&gt;Retrieval here is any form of data retrieval that is performed before
being sent to the LLM. This may be the result of a classic RAG pipeline
where a local search of a document store is performed and the results
chunked before being passed to the LLM that is instructed to restrict
its analysis to the uploaded search data.&lt;/p&gt;
&lt;p&gt;But retrieval here is more general than RAG. It refers to the smart
selection of data for an LLM to process. Retrieval may bring data back
from a SQL, Graph or NoSQL query, or it may be the smart selection of
summaries or user's notes pulled from storage.&lt;/p&gt;
&lt;p&gt;The opposite of retrieval is dumping everything in raw.&lt;/p&gt;
&lt;p&gt;The most reliable mental model is to treat the window as a scratchpad.
It is a temporary working area, not a knowledge store. You place only
what the model needs for the current task, in the order that helps it
reason. If you treat the window like long‑term memory, you get
unpredictable behaviour. If you treat it like a scratchpad, you get
control.&lt;/p&gt;
&lt;h1 id="llms-compress-patterns-not-facts"&gt;LLMs compress patterns, not facts&lt;/h1&gt;
&lt;p&gt;When an LLM is trained, the input training data will be measured in terabytes.
The output is billions of weights that encode the statistical structure of the
training data.  Those weights are the model es the weights: patterns (common
sequences, phrasing, structures, and correlations); relationships (semantic
similarity, analogies); generalisation behaviour (moving between examples via
statistical interpolation); and task-relevant transformations to assist with
instruction following, data formatting. and conversational norms.&lt;/p&gt;
&lt;p&gt;LLMs do not store data; they are not databases. They store weights that represent
patterns from the training data.&lt;/p&gt;
&lt;p&gt;Many different training examples can be represented internally by the same (or very
similar) set of weights.&lt;/p&gt;
&lt;p&gt;As different examples can be represented by the same weights, LLMs have a tendancy
to hallucinate. Hallucinations are baked into the design of LLMs.&lt;/p&gt;
&lt;p&gt;Training takes terabytes of text and produces billions of updates into a fixed‑size model
and outputs the weights that approximates the training data.&lt;/p&gt;
&lt;p&gt;In doing this the transformation is many‑to‑one (different examples collapse together), and
irreversible as you cannot reconstruct the originl training data from the weights. But,
more importantly, the output is statistical as the weights encode likelihoods, not facts.&lt;/p&gt;
&lt;p&gt;Because of this, the model cannot store exact information.  It can only store patterns.&lt;/p&gt;
&lt;p&gt;Where patterns overlap, details are lost.  Where details are lost, the model fills in the gaps.&lt;/p&gt;
&lt;p&gt;That filling‑in is what we call hallucination. The many-to-one transformation also explains
why rare facts vanish and plausible but false details appear.&lt;/p&gt;
&lt;p&gt;A fluent answer is not necessaily a correct one. A fluent answer should not be over-trusted.&lt;/p&gt;
&lt;p&gt;An LLM is not a database or lookup table.  They are function approximators
trained on vast data, forced to compress it into a limited parameter space (weights), and
optimised for prediction, not truth.&lt;/p&gt;
&lt;h1 id="prompting-is-programming"&gt;Prompting is programming&lt;/h1&gt;
&lt;p&gt;Prompts act like programs for a probabilistic interpreter. And as they
are written in natural language, prompts are prone to the mistakes that
humans make in written instructions: ambiguity, no being explicit on what
is required; not stating what is not required; and failing to mention who
the output is for.&lt;/p&gt;
&lt;p&gt;Structure beats style so that you can be sure your prompt acts more like
a foundation for a robust interface, rather than one without structur built
on shifting sand.&lt;/p&gt;
&lt;h1 id="constraints"&gt;Constraints&lt;/h1&gt;
&lt;p&gt;Constraints beat persuasion. Constraining your LLM is essential. It is not about "being firm"
with the model. It is about shaping the space of valid outputs so the model cannot wander.&lt;/p&gt;
&lt;p&gt;In a prompt, when you say:&lt;/p&gt;
&lt;p&gt;"Please answer carefully.”;"Try not to hallucinate.”;"Make sure you follow the
instructions.”; "Be precise."&lt;/p&gt;
&lt;p&gt;You are appealing to behaviour the model cannot guarantee, because persuasion
relies on the model choosing to comply. "Please answer carefully" is a request. The LLM
should "try not to hallucinate". What if it does? You have not said. This is like
neglecting to define an &lt;code&gt;else&lt;/code&gt; on an &lt;code&gt;if&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Persuasion is weak because it competes with every other pattern the model has learned.&lt;/p&gt;
&lt;p&gt;Constraints, by contrast, reshape the output space.&lt;/p&gt;
&lt;p&gt;A constraint is something that reduces the degrees of freedom the model has when generating.&lt;/p&gt;
&lt;p&gt;Examples of constraints are having the prompt specify that the LLM &lt;em&gt;must&lt;/em&gt; output its
result using a schema or specifying a role with explicit boundaries such as a 'user',
'system', or 'assistant' or by specifying the LLM "must cite X before Y".&lt;/p&gt;
&lt;p&gt;Instead of trying to "convince" the model to behave, you damp down as close
to zero as possible the possibility of misbehaviour.&lt;/p&gt;
&lt;p&gt;Schemas beat prose. Treat prompts as code and debug them as code. Systems
behave better when you design prompts as logic, not decoration.&lt;/p&gt;
&lt;h1 id="conclusions"&gt;Conclusions&lt;/h1&gt;
&lt;p&gt;Tokens drive model behaviour, so any dependable LLM system must be engineered
around token‑level effects rather than surface text; the fragile parts of the
stack are the retrieval, templates, guardrails, and data plumbing wrapped around
the model, not the model itself; guardrails only become reliable when enforced
by deterministic system logic instead of relying on the model’s cooperation;
observability must reveal every transformation in the pipeline to make failures
diagnosable; context windows function as short‑lived workspaces rather than any
form of memory; retrieval quality has a larger impact on correctness than window
size; hallucination is an unavoidable consequence of pattern compression and
must be mitigated through system design rather than trust; and prompting only
becomes stable when treated as programming with explicit constraints instead of
attempts at persuasion.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="agents-cannot-maintain-systems.html"&gt;LLMs can generate code, but they cannot modify or maintain systems because system‑level work requires causal reasoning, not pattern‑matching.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="ai-engineering-team-based-ai.html"&gt;The real gains from AI come from improving the shared work between engineers — planning, coordination, review, debugging, and delivery — not from speeding up individual coding.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="surface-area.html"&gt;AI systems behave like probabilistic components; engineers must build structured interfaces and layered constraints to make them reliable inside software systems.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#the-real-interface-tokens-not-text"&gt;The real interface: tokens, not text&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-model-is-not-the-system"&gt;The model is not the system&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#guardrails"&gt;Guardrails&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#observability"&gt;Observability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#an-llm-is-at-the-centre-of-a-much-larger-system"&gt;An LLM is at the centre of a much larger system&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#determinism-is-a-design-choice"&gt;Determinism is a design choice&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#temperature"&gt;Temperature&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#top-p"&gt;Top-p&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#example"&gt;Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#passing-temperature-and-top-p-to-openai"&gt;Passing temperature and top-p to OpenAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#context-windows-are-not-memory"&gt;Context windows are not memory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#retrieval-quality-beats-window-size"&gt;Retrieval quality beats window size&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#llms-compress-patterns-not-facts"&gt;LLMs compress patterns, not facts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#prompting-is-programming"&gt;Prompting is programming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#constraints"&gt;Constraints&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Build"></category></entry><entry><title>A Beginner's Guide to AI Chatbot Prompting</title><link href="https://phroneses.com/articles/foundations/notes/ai-chatbot-prompting.html" rel="alternate"></link><published>2026-04-22T00:00:00+00:00</published><updated>2026-04-22T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2026-04-22:/articles/foundations/notes/ai-chatbot-prompting.html</id><summary type="html">&lt;p&gt;Clear, practical prompting habits to help you get faster, more reliable results from everyday AI tasks.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="a-beginners-guide-to-ai-chatbot-prompting"&gt;A Beginner’s Guide to AI Chatbot Prompting&lt;/h1&gt;
&lt;p&gt;This guide gives beginners a clear, practical foundation for working with AI chatbots. Each section focuses on one skill, why it matters, and how to apply it.&lt;/p&gt;
&lt;h2 id="1-what-prompting-is-and-why-it-matters"&gt;1. What Prompting Is and Why It Matters&lt;/h2&gt;
&lt;p&gt;Prompting is the skill of giving clear instructions to a chatbot so that
you are more likely to get a useful response.&lt;/p&gt;
&lt;p&gt;Good prompts will reduce confusion and save you time. A poor
prompt can waste time as you work you way through an answer that does
not hit the spot.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Vague: "Explain photosynthesis"&lt;/li&gt;
&lt;li&gt;Clear: "Explain photosynthesis in simple terms for a 12‑year‑old"&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you try these you will see that the second one is a completelt different
response from the first. It is more direct and easier to read.&lt;/p&gt;
&lt;h2 id="2-start-with-a-direct-request"&gt;2. Start With a Direct Request&lt;/h2&gt;
&lt;p&gt;A simple, explicit request sets the direction.&lt;/p&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"Write a short summary of this article"&lt;/li&gt;
&lt;li&gt;"Give me three ideas for a birthday message"&lt;/li&gt;
&lt;li&gt;"Explain how this code works"&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With the short summary prompt, startinmg on a new line, pase in the
article you are referring to.&lt;/p&gt;
&lt;h2 id="3-add-context-to-aim-the-response"&gt;3. Add Context to Aim the Response&lt;/h2&gt;
&lt;p&gt;Context helps the chatbot match your level, purpose, or constraints.&lt;/p&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"I am new to London, UK. Explain what I can do on a wet Sunday."&lt;/li&gt;
&lt;li&gt;"I am preparing for a job interview. Give me sample questions."&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;London, UK is specified to keep the prompt clear as there are many
places in the world called London. How many?&lt;/p&gt;
&lt;p&gt;"Give the total number of places in the world called London, no variants. List the names"&lt;/p&gt;
&lt;h2 id="4-specify-the-format-you-want"&gt;4. Specify the Format You Want&lt;/h2&gt;
&lt;p&gt;Format guides structure and makes the output easier to use.&lt;/p&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"Give me a bullet‑point list"&lt;/li&gt;
&lt;li&gt;"Write a short paragraph"&lt;/li&gt;
&lt;li&gt;"Produce a step‑by‑step explanation"&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="5-set-clear-constraints"&gt;5. Set Clear Constraints&lt;/h2&gt;
&lt;p&gt;Constraints keep the answer focused and predictable.&lt;/p&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"Keep it under 150 words"&lt;/li&gt;
&lt;li&gt;"Use plain English"&lt;/li&gt;
&lt;li&gt;"No jargon"&lt;/li&gt;
&lt;li&gt;"Be concise"&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="6-use-examples-to-anchor-tone-and-style"&gt;6. Use Examples to Anchor Tone and Style&lt;/h2&gt;
&lt;p&gt;Examples show the chatbot what "good" looks like.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"Write it in the style of this: 'Short, direct, and practical.'"&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="7-adjust-over-time-instead-of-restarting"&gt;7. Adjust Over Time Instead of Restarting&lt;/h2&gt;
&lt;p&gt;Treat the chatbot as a collaborator. Adjust the output rather than rewriting the whole prompt.&lt;/p&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"Shorten this"&lt;/li&gt;
&lt;li&gt;"Make it more formal"&lt;/li&gt;
&lt;li&gt;"Add one more example in the first paragraph"&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="8-ask-for-alternatives-when-you-need-options"&gt;8. Ask for Alternatives When You Need Options&lt;/h2&gt;
&lt;p&gt;Variations help you compare and choose.&lt;/p&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"Give me two more options"&lt;/li&gt;
&lt;li&gt;"Rewrite this with a friendlier tone"&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="9-break-complex-tasks-into-steps"&gt;9. Break Complex Tasks Into Steps&lt;/h2&gt;
&lt;p&gt;Step‑by‑step prompting keeps large tasks managoeable.&lt;/p&gt;
&lt;p&gt;AI chatbots are pattern matching. If your prompt is long, the AI may appear 
to skip something you say as it does not have a strong pattern to match to it.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"First, outline the structure. Then we will fill in each section."&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="10-common-mistakes-to-avoid"&gt;10. Common Mistakes to Avoid&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Being too vague&lt;/li&gt;
&lt;li&gt;Asking for everything at once&lt;/li&gt;
&lt;li&gt;Forgetting to specify the audience&lt;/li&gt;
&lt;li&gt;Not having the AI give examples&lt;/li&gt;
&lt;li&gt;Expecting perfect output on the first try&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="11-quick-prompt-templates"&gt;11. Quick Prompt Templates&lt;/h2&gt;
&lt;p&gt;These templates give learners a starting point that you can adapt.&lt;/p&gt;
&lt;h3 id="explain-something"&gt;Explain Something&lt;/h3&gt;
&lt;p&gt;"Explain [topic] to [audience] in [format]. Keep it [constraints]."&lt;/p&gt;
&lt;p&gt;"Explain beaches to a 10 year-old in one pargraph. Keep it positive and clear." &lt;br/&gt;
"Explain beaches to an adult in one pargraph. Keep it positive and clear." &lt;br/&gt;
"Explain beaches."&lt;/p&gt;
&lt;h3 id="rewrite-something"&gt;Rewrite Something&lt;/h3&gt;
&lt;p&gt;"Rewrite this text to be more [tone]. Keep the meaning the same."&lt;/p&gt;
&lt;p&gt;"Give first line of Pride and Prejudice by Jane Austen." &lt;br/&gt;
"Rewrite using corporate speak. Keep the meaning the same but push the buzzwords to 11."&lt;/p&gt;
&lt;h3 id="generate-ideas"&gt;Generate Ideas&lt;/h3&gt;
&lt;p&gt;"Give me [number] ideas for [goal]. Keep them practical."o&lt;/p&gt;
&lt;p&gt;"Give me 5 ideas for walking down the sidewalk. Keep them practical."&lt;/p&gt;
&lt;h3 id="troubleshoot"&gt;Troubleshoot&lt;/h3&gt;
&lt;p&gt;"I am seeing this issue: [a detailed description]. Give me possible causes and simple steps to check."&lt;/p&gt;
&lt;p&gt;"I am seeing this issue: my grass is too yellow. Give me possible causes and simple checks to check."&lt;/p&gt;
&lt;h2 id="12-practice-prompts"&gt;12. Practice Prompts&lt;/h2&gt;
&lt;p&gt;Use these to build confidence and develop prompting habits.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"Explain how a mortgage works as if I am new to finance."&lt;/li&gt;
&lt;li&gt;"Give me three ways to describe my job in a CV. I have pasted my CV."&lt;/li&gt;
&lt;li&gt;"Summarise the following paragraph in one sentence."&lt;/li&gt;
&lt;li&gt;"Suggest improvements to this email without changing the intent."&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="10-things.html"&gt;Ten simple AI workflows that save minutes each day and compound into hours each week, helping people work more efficiently.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="designing-ai-prompts.html"&gt;Modern AI systems require structured, multi‑step prompts that guide planning, critique, and long‑context reasoning.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="evaluate-ai-chatbot.html"&gt;A practical guide to assessing the quality, reliability, and safety of AI chat session outputs.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#a-beginners-guide-to-ai-chatbot-prompting"&gt;A Beginner’s Guide to AI Chatbot Prompting&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#1-what-prompting-is-and-why-it-matters"&gt;1. What Prompting Is and Why It Matters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#2-start-with-a-direct-request"&gt;2. Start With a Direct Request&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#3-add-context-to-aim-the-response"&gt;3. Add Context to Aim the Response&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#4-specify-the-format-you-want"&gt;4. Specify the Format You Want&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#5-set-clear-constraints"&gt;5. Set Clear Constraints&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#6-use-examples-to-anchor-tone-and-style"&gt;6. Use Examples to Anchor Tone and Style&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#7-adjust-over-time-instead-of-restarting"&gt;7. Adjust Over Time Instead of Restarting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#8-ask-for-alternatives-when-you-need-options"&gt;8. Ask for Alternatives When You Need Options&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#9-break-complex-tasks-into-steps"&gt;9. Break Complex Tasks Into Steps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#10-common-mistakes-to-avoid"&gt;10. Common Mistakes to Avoid&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#11-quick-prompt-templates"&gt;11. Quick Prompt Templates&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#explain-something"&gt;Explain Something&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#rewrite-something"&gt;Rewrite Something&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#generate-ideas"&gt;Generate Ideas&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#troubleshoot"&gt;Troubleshoot&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#12-practice-prompts"&gt;12. Practice Prompts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Foundations"></category></entry><entry><title>How to Evaluate A Company's AI Claims</title><link href="https://phroneses.com/articles/foundations/notes/evaluate-ai-claims.html" rel="alternate"></link><published>2025-01-01T00:00:00+00:00</published><updated>2025-01-01T00:00:00+00:00</updated><author><name>JH Evans</name></author><id>tag:phroneses.com,2025-01-01:/articles/foundations/notes/evaluate-ai-claims.html</id><summary type="html">&lt;p&gt;A framework for evaluating claims made about AI systems, focusing on evidence, capability, and verifiable performance.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="#toc"&gt;Table of contents&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="how-to-evaluate-claims-made-about-an-ai-based-system"&gt;How to Evaluate Claims Made About an AI-based System&lt;/h1&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Artificial intelligence now appears in many areas of daily life. It is used in
search engines, writing tools, customer service systems, healthcare
applications, and many other services. Many people encounter it without
thinking about it, such as when a phone suggests a reply to a message or when
an ecommerce website summarises customer feedback about a product.&lt;/p&gt;
&lt;p&gt;Public descriptions of systems based in part or whole on AI often highlight
ambitious capabilities.  Some describe their products as human level, fully
autonomous, or capable of replacing expert judgement.&lt;/p&gt;
&lt;p&gt;Promotional language and real performance do not always align, which makes it
useful to look closely at how such claims are formed.&lt;/p&gt;
&lt;h2 id="understanding-the-claim"&gt;Understanding the Claim&lt;/h2&gt;
&lt;p&gt;The first step is to understand what is actually being promised.&lt;/p&gt;
&lt;p&gt;Many statements about artificial intelligence are broad or ambiguous, so it is
useful to translate them into specific questions. A claim such as "our tool
detects fraud" sounds clear, but it raises many questions about what kind of
fraud, in what context, and with what level of accuracy.&lt;/p&gt;
&lt;p&gt;Many people begin by considering what task the system is meant to perform,
under what conditions it is expected to work, how well it performs that task,
and what it is being compared against. Once the claim is expressed in concrete
terms, it becomes much easier to evaluate.&lt;/p&gt;
&lt;h2 id="looking-for-evidence"&gt;Looking for Evidence&lt;/h2&gt;
&lt;p&gt;Claims about performance usually rest on some form of evidence. A credible
statement about artificial intelligence is supported by clear information about
how the system was tested.&lt;/p&gt;
&lt;p&gt;Independent evaluations, published research, recognised benchmarks, and real
world trials all provide meaningful support. For example, a reading
comprehension benchmark or a driving simulation can show how a system behaves
under controlled conditions. By contrast, phrases such as "industry leading
accuracy" or "our internal tests show excellent results" offer very little
without further detail.&lt;/p&gt;
&lt;p&gt;Reliability often depends on who carried out the measurement and how the
testing was designed.&lt;/p&gt;
&lt;h2 id="considering-the-data"&gt;Considering the Data&lt;/h2&gt;
&lt;p&gt;Every artificial intelligence system depends heavily on the data used to train
it.&lt;/p&gt;
&lt;p&gt;The quality, diversity, and representativeness of that data shape the system’s
strengths and weaknesses. A photo classifier trained mostly on daytime images
may struggle with night scenes, and a language tool trained mainly on formal
writing may find slang or informal messages difficult to interpret.&lt;/p&gt;
&lt;p&gt;When assessing a claim, it is worth asking whether the data reflects the real
world situations in which the system will be used. Narrow or unrepresentative
data can limit how well the system performs in real situations.&lt;/p&gt;
&lt;h2 id="recognising-limitations"&gt;Recognising Limitations&lt;/h2&gt;
&lt;p&gt;All systems have limitations, and responsible companies acknowledge them.&lt;/p&gt;
&lt;p&gt;It is helpful to look for information about situations where the system
performs poorly, where it may misinterpret inputs, or where it may produce
incorrect or misleading results. A voice assistant that mishears a request
because of background noise is a simple example of how small changes in
context can affect performance.&lt;/p&gt;
&lt;p&gt;Balanced descriptions usually include both strengths and known limitations.&lt;/p&gt;
&lt;h2 id="avoiding-human-like-descriptions-of-ai"&gt;Avoiding Human-like Descriptions of AI&lt;/h2&gt;
&lt;p&gt;Marketing language sometimes presents artificial intelligence in ways that
resemble human thinking.&lt;/p&gt;
&lt;p&gt;Words such as "understands", "reasons", or "knows" can create an impression
that the system possesses abilities it does not have. A system that predicts
the next word in a sentence may appear to "understand" the topic, but it is
following patterns rather than forming ideas.&lt;/p&gt;
&lt;p&gt;A more accurate approach is to focus on what the system actually does, how it
processes inputs, how it generates outputs, and how it behaves under different
conditions.&lt;/p&gt;
&lt;h2 id="seeking-independent-validation"&gt;Seeking Independent Validation&lt;/h2&gt;
&lt;p&gt;Independent evaluations often provide a clearer picture of how a system
performs.&lt;/p&gt;
&lt;p&gt;When researchers, regulators, journalists, or external auditors have examined a
system, their findings provide a valuable counterbalance to promotional
material.&lt;/p&gt;
&lt;p&gt;Real world deployment is equally important. A navigation app may work
perfectly in a staged demonstration, but everyday use can involve roadworks,
poor signal, or unexpected detours that reveal weaknesses.&lt;/p&gt;
&lt;p&gt;Genuine reliability is shown through consistent performance with diverse users
and unpredictable inputs.&lt;/p&gt;
&lt;h2 id="considering-the-consequences-of-error"&gt;Considering the Consequences of Error&lt;/h2&gt;
&lt;p&gt;It is important to consider the consequences of error. Some tasks are low risk,
while others involve significant personal, financial, or social impact.&lt;/p&gt;
&lt;p&gt;A system used for entertainment can tolerate occasional mistakes. A music
recommendation that misses the mark is usually harmless.&lt;/p&gt;
&lt;p&gt;A system used for medical advice, financial decisions, or legal interpretation
requires far stronger evidence and clear safeguards. A symptom checker that
offers an overly confident suggestion illustrates how errors can matter more in
high stakes settings.&lt;/p&gt;
&lt;p&gt;The impact of errors can vary widely, so the way a system handles mistakes
often shapes how it should be used.&lt;/p&gt;
&lt;h2 id="the-importance-of-transparency"&gt;The Importance of Transparency&lt;/h2&gt;
&lt;p&gt;Transparency and accountability are essential qualities.&lt;/p&gt;
&lt;p&gt;Companies who provide clear explanations, publish evaluation results, describe
limitations, and offer channels for feedback demonstrate a commitment to
responsible practice.&lt;/p&gt;
&lt;p&gt;Greater transparency makes it easier to understand how a system works and how
its results should be interpreted. For example, a tool that explains which
factors influenced a recommendation gives users a clearer sense of how to
interpret the output.&lt;/p&gt;
&lt;h2 id="a-practical-way-to-judge-a-claim"&gt;A Practical Way to Judge a Claim&lt;/h2&gt;
&lt;p&gt;These themes often lead people to consider questions about what is being
promised, what evidence supports it, and how the system behaves in real
conditions.&lt;/p&gt;
&lt;p&gt;It is useful to ask what is being promised, what evidence supports the promise,
who carried out the evaluation, what data was used, what limitations are
acknowledged, whether the system has been tested independently, how it performs
outside controlled demonstrations, and what the consequences are if it fails.&lt;/p&gt;
&lt;p&gt;This is a long list, but systems powered in some way by artificial intelligence
are becoming more common and tehy are having a larger impact on everyday life.o&lt;/p&gt;
&lt;p&gt;If we are all better placed to evaluate AI-based systems, the better.&lt;/p&gt;
&lt;p&gt;If several of these questions cannot be answered, any claim is possibly likely
to be overstated.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Artificial intelligence is a powerful set of technologies, but it is not magic.&lt;/p&gt;
&lt;p&gt;Careful consideration and evaluation makes it easier to distinguish genuine
progress from exaggerated claims.&lt;/p&gt;
&lt;h1 id="related-work"&gt;Related Work&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="how-ai-works.html"&gt;An explanation of how large language models actually function and why they should not be treated as miniature humans.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="what-ai-is.html"&gt;A clear explanation of what AI is—and is not—cutting through hype to define its real capabilities and limits.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="evaluate-ai-chatbot.html"&gt;A practical guide to assessing the quality, reliability, and safety of AI chat session outputs.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a id="toc"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;div class="toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#how-to-evaluate-claims-made-about-an-ai-based-system"&gt;How to Evaluate Claims Made About an AI-based System&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#introduction"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#understanding-the-claim"&gt;Understanding the Claim&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#looking-for-evidence"&gt;Looking for Evidence&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#considering-the-data"&gt;Considering the Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#recognising-limitations"&gt;Recognising Limitations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#avoiding-human-like-descriptions-of-ai"&gt;Avoiding Human-like Descriptions of AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#seeking-independent-validation"&gt;Seeking Independent Validation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#considering-the-consequences-of-error"&gt;Considering the Consequences of Error&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-importance-of-transparency"&gt;The Importance of Transparency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-practical-way-to-judge-a-claim"&gt;A Practical Way to Judge a Claim&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work"&gt;Related Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#table-of-contents"&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</content><category term="Foundations"></category></entry></feed>