LLMs can generate correct code, but they still cannot safely change real software systems. The hard part of software delivery is not producing code; it is preserving a system’s invariants while modifying a live, interdependent codebase.

The Promise of Automated Software Delivery

By 2026, the imagined workflow looks like this:

read a repository
understand the project structure
plan a multi‑step change
write code, tests, and docs
run the code and fix mistakes
produce a PR‑ready diff

The first three steps are additive: reading, mapping, planning. They do not alter the system’s causal structure.

The last three are transformative. They change behaviour in a running system, and that requires understanding constraints, invariants, and integration boundaries that the model cannot see, cannot infer, and cannot reason about.

That gap — between pattern‑matching code and understanding system‑level consequences — is where current LLMs fail.

Writing new code is self‑contained, additive work. Modifying an existing system is transformative work that depends on understanding dependencies, invariants, and consequences. This additive‑vs‑transformative distinction is the core reason LLMs can assist with software delivery but cannot deliver it autonomously.

Parts of the workflow above can be done, but only in tightly controlled demos on simple code that is tens of lines long, not on real-world repositories with thousands of lines of code that have existed for years and been updated by dozens of people.

What the Labs Have Actually Delivered

The agentic work from OpenAI, Google, Cognition Labs, GitHub (Microsoft), Sourcegraph, JetBrains, Replit, Amazon, Meta, and Anthropic listed in Further Reading was published in 2023 and 2024.

Depending on where you look, you may have been given another impression: that "agents are here". However, reality tells a different story.

Agents are improving, but are not reliable, not autonomous, and not production‑safe.

LLMs can assist with software delivery, but they cannot own it.

Why is this?

LLMs generate statistically plausible continuations of text. This works well for self-contained tasks like writing a function or drafting documentation because these are pattern‑extension problems. But pattern‑matching is not system understanding, and plausibility is not correctness.

Software systems are causal: components depend on each other, invariants constrain behaviour, and changes propagate through the system. The moment a task stops being self‑contained and becomes system‑dependent — requiring dependency coherence, persistent state, or awareness of how changes ripple through a real codebase — pattern‑matching is no longer sufficient.

LLMs can currently imitate the shape of engineering work, but they cannot maintain a stable internal representation of a system that must be changed coherently. That gap is exactly why they fail the moment a task becomes system‑level.

Persistent state creates temporal dependencies

A self‑contained task has no past and no future. A system‑dependent task does.

As soon as a change depends on:

previous writes
accumulated data
cached values
long‑lived objects
external system state

any agentic model must reason about how the system got here and how it will behave after the change.

LLMs cannot maintain that internal causal chain.
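A minimal sketch of such a temporal dependency follows. It is an illustration only: the store, the function names, and the dollars-to-cents change are all hypothetical and not taken from any real codebase. It shows a change that passes a self-contained test yet misreads data written before the change.

```python
# Hypothetical example: every name and value here is illustrative.

# State accumulated by an earlier version of the system: amounts stored in dollars.
existing_store = {"order-1": {"amount": 12.50}}


def save_order(store, order_id, dollars):
    """New write path: stores the amount as integer cents."""
    store[order_id] = {"amount": round(dollars * 100)}


def total_in_dollars(store):
    """New read path: assumes every record is already in cents."""
    return sum(record["amount"] for record in store.values()) / 100


# Self-contained test: a fresh store that only ever sees new writes.
fresh_store = {}
save_order(fresh_store, "order-2", 7.25)
fresh_total = total_in_dollars(fresh_store)

# Live system: the same code running against a record written before the change.
save_order(existing_store, "order-2", 7.25)
live_total = total_in_dollars(existing_store)

print("fresh store total:", fresh_total)
print("live store total: ", live_total)
```

On the fresh store the total is right. On the live store the old 12.50 dollar record is read as 12.5 cents, so the total is wrong even though no line of the new code is incorrect in isolation. Catching this requires knowing what was written in the past, which is exactly the history a self-contained task does not have.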

From Writing Code to Agentic Systems: The Fundamental Gap

The gap becomes clear when you compare two activities: writing new code and modifying an existing system.

Code generation is local and additive: the model extends a pattern without needing to understand the system.

But agentic work is global and transformative: the LLM must change the system itself, which requires understanding dependencies, invariants, interactions, and downstream consequences.

This is causal reasoning, not pattern extension. LLMs predict tokens, not consequences — and that is why the leap from writing code to producing a safe, system‑aware PR‑ready diff is not incremental but a shift into a fundamentally different problem space.
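The contrast can be sketched in a few lines. This is an illustration under stated assumptions: `find_user`, `greeting`, `user_count`, and the user data are hypothetical names invented for this example. The additive change touches nothing that already exists. The transformative change looks reasonable on its own but breaks a caller that relied on the old behaviour.

```python
# Hypothetical example: every name and value here is illustrative.
USERS = {1: "ada"}


def find_user(user_id):
    """Existing function. Callers rely on it raising KeyError for unknown ids."""
    return USERS[user_id]


def greeting(user_id, lookup=find_user):
    """Existing caller, written against the original contract of find_user."""
    try:
        return "hello " + lookup(user_id)
    except KeyError:
        return "hello guest"


# Additive change: a new function. No existing caller depends on it.
def user_count():
    return len(USERS)


# Transformative change: a plausible rewrite that returns None instead of raising.
def find_user_changed(user_id):
    return USERS.get(user_id)


known_user = greeting(1, lookup=find_user_changed)  # the happy path still works
before = greeting(2)                                # original behaviour for unknown ids
try:
    after = greeting(2, lookup=find_user_changed)
except TypeError as exc:
    after = "TypeError: " + str(exc)

print(known_user)
print(before)
print(after)
```

The rewritten lookup passes any test that only uses known ids. The failure appears only in the unchanged caller, for an input the change's author never looked at. Nothing in the text of `find_user_changed` reveals that; only knowledge of how the rest of the system uses it does.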

Producing a PR‑ready diff

A pull request (PR) is a proposed set of code changes that, once merged, will change a system.

For that change to be safe, the change must respect the system's current architecture, its intent, and all downstream consequences.

Software engineers work hard to ensure that such a change is safe, through testing and their own judgement and experience, before having a colleague review it.

Applying a change is no longer a matter of pattern-matching but of understanding causal behaviour: how will the system change if this PR is applied?

The correctness of the PR depends on understanding the whole system, not just generating text.

The LLM must change the system, which requires understanding dependencies, invariants, interactions and consequences, all of which demand causal reasoning, not pattern matching.

Pattern‑matching can write code; only causal reasoning can maintain systems.

What can I do?

Confirm for yourself any claim that you see. Choose a realistic, real-world repository of your own to test against: one with thousands of lines of code and a history of real work by real teams.

Your own results, from your own repository, will tell you far more than any press release or online anecdote.
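One way to keep those results honest is to log every attempt in the same shape and count the outcomes. The sketch below is only a suggestion: the `Attempt` fields and the `summarise` helper are assumptions made for illustration, and the log entries are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One agent-produced change attempted on your own repository."""
    task: str
    tests_passed: bool           # did the existing test suite pass?
    merged_without_rework: bool  # did a human reviewer accept it as-is?


def summarise(attempts):
    """Count outcomes so claims can be compared against your own evidence."""
    return {
        "attempts": len(attempts),
        "tests_passed": sum(a.tests_passed for a in attempts),
        "merged_without_rework": sum(a.merged_without_rework for a in attempts),
    }


# Placeholder entries only, to show the shape of the log. Replace with real records.
example_log = [
    Attempt("placeholder task A", tests_passed=True, merged_without_rework=True),
    Attempt("placeholder task B", tests_passed=True, merged_without_rework=False),
    Attempt("placeholder task C", tests_passed=False, merged_without_rework=False),
]

summary = summarise(example_log)
print(summary)
```

Keeping "tests passed" separate from "merged without rework" matters here: the first measures whether the code runs, the second whether an engineer judged the change safe for the system.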

For the moment:

treat agentic AI as a strategic direction
treat current tools as assistants, not engineers
invest in clarity, architecture, and test discipline
expect progress, but not miracles
do not plan delivery pipelines around unproven capabilities

Maintain human judgement as the centre of the system.

The dream is intact. The evidence is not yet here.

Why this matters: code is cheap, judgement is not

LLM-augmented software delivery does not remove engineering.

It moves engineering up a level.

Humans need to focus on:

intent
constraints
architecture
correctness
safety
trade‑offs

The desired end state is not "AI writes code" but "AI maintains systems". If we get there, humans will still need to maintain intent.

The consequence of an agentic system is not to remove engineering, but to elevate it, so that teams spend less time on mechanical construction and more time on judgement, alignment, and shaping the environment in which agents operate.

The organisations that benefit most will be those that treat agentic development not as automation, but as a structural shift in how software is conceived, validated, and maintained.

Final Thought

Until AI can reason causally about systems, human judgement remains the foundation of software delivery.



Agents Cannot Maintain Systems: The Additive–Transformative Gap in LLM Software Delivery

Jh Evans
