LLMs can generate correct code, but they cannot safely change real software systems. The hard part of software delivery is not producing code but updating systems.
This article is not against AI. It is for engineering, and it makes the case that, as of May 2026, fully autonomous agents do not exist.
The hard part has always been preserving a system’s invariants while modifying a live, interdependent codebase. System invariants must always be in place, and if one is not, production fails.
The system is the environment that code executes within: the running processes and service-providing APIs, together with the collective operational rules, failure behaviour, sequencing, and non-functional constraints such as performance, reliability, security and compliance.
How retries, timeouts, backoff, and idempotency are handled in code, and how that handling varies from context to context, are examples of a company's preferred way of writing code for its system.
An LLM can write retry and backoff code, but an LLM cannot write the correct retry, timeout, backoff, or idempotency code for a particular system.
A better prompt is not the answer because a prompt can only tell the model what pattern to output. A prompt cannot give the model the system knowledge required to choose the correct pattern.
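As a minimal sketch of this point, assume a hypothetical `RetryPolicy` and `call_with_retry` helper (neither comes from a real library, and the values are illustrative only). The retry loop is the generic pattern a model can produce. The values in the policy, and whether retrying is safe at all, come from knowledge of the particular system.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical per-system settings; the right values are not in the code."""
    max_attempts: int     # how many tries the downstream service tolerates
    base_delay_s: float   # first backoff delay, in seconds
    idempotent: bool      # is repeating this operation safe in this system?


def call_with_retry(op: Callable[[], T], policy: RetryPolicy,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Generic exponential backoff: the part that is pure pattern."""
    # A non-idempotent operation gets exactly one attempt.
    attempts = policy.max_attempts if policy.idempotent else 1
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(policy.base_delay_s * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


# Illustrative values only: a read that is safe to repeat, a payment that is not.
READ_POLICY = RetryPolicy(max_attempts=3, base_delay_s=0.1, idempotent=True)
PAYMENT_POLICY = RetryPolicy(max_attempts=3, base_delay_s=0.1, idempotent=False)
```

The loop is identical for both policies. Which policy applies to which call is a fact about the system, and nothing in the function body can supply it.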
The Promise of Automated Software Delivery
By 2026, the imagined automated delivery workflow looks like this:
- read a repository
- understand the project structure
- plan a multi‑step change
- write code, tests, and docs
- run the code and fix mistakes
- produce a PR‑ready diff
The first three steps are additive: reading, mapping, planning. They do not alter the system’s behavioural causal structure. They do not change the system.
The last three are transformative. They change behaviour in a running system, and getting this right requires understanding constraints, invariants, and integration boundaries that the model cannot see, cannot infer, and cannot reason about.
Introducing new code into a production system changes that system. Updating the code or configuration of a production system is changing that system.
The LLM that produced the code has no sight of the running system.
Some kinds of change require less oversight than others. Adding new code is generally easier because the current system, by definition, has no reference to the new code until it is added. Adding new code is more self-contained.
Changing code is transformative and can be more difficult, as two invariants must hold: the code being changed must be correct (both valid in the language and logically sound), and the change to that code must not break any other part of the system.
This additive‑vs‑transformative distinction is the core reason LLMs can assist, but cannot autonomously deliver software.
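A minimal sketch of the second invariant, using hypothetical function names and data. `load_user_ids_v2` is valid code and does what its author intended, yet it breaks `is_known_user`, an untouched caller that silently relied on ascending order.

```python
from bisect import bisect_left


def load_user_ids_v1() -> list[int]:
    """Original: returns ids sorted ascending (an unstated invariant)."""
    return sorted([42, 7, 19])


def load_user_ids_v2() -> list[int]:
    """Changed: returns ids most recent first. Correct in isolation."""
    return [19, 42, 7]


def is_known_user(user_id: int, ids: list[int]) -> bool:
    """Untouched caller: binary search is only valid on ascending input."""
    i = bisect_left(ids, user_id)
    return i < len(ids) and ids[i] == user_id


before = is_known_user(7, load_user_ids_v1())  # True
after = is_known_user(7, load_user_ids_v2())   # False: user 7 has "vanished"
```

Nothing in the changed function is wrong on its own terms. The failure lives in the relationship between two pieces of code, and only one of them appears in the diff.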
Agentic systems are becoming more autonomous. Some automated change can be performed, but only in tightly controlled demos on simple code that is tens of lines long, not on real-world repositories with thousands of lines of code that have existed for years and been updated by dozens of people.
What the Labs Have Actually Delivered
The agentic work of OpenAI, Google, Cognition Labs, GitHub (Microsoft), Sourcegraph, JetBrains, Replit, Amazon, Meta, and Anthropic listed in Further Reading was published in 2023 and 2024.
Depending on where you look, you may have been given the impression that "fully automated agents are here". However, reality tells a different story.
Agents are improving, but are not yet production‑safe without significant human oversight.
A possible future for agentic systems is described by Mohamad Abou Ali and Fadi Dornaika in their PRISMA-based survey of 90 LLM studies covering 2018 to 2025. They state that the symbolic/classical systems surveyed can provide the type of reasoning required for an agentic system, whereas the neural/generative systems surveyed are sufficiently expressive. The promise of future agentic systems appears to lie in the integration of these two approaches.
Currently, LLMs can assist with software delivery, but they cannot own it, although improvements have been made to increase reasoning ability, such as OpenAI o1/o3, DeepMind AlphaGeometry and FunSearch, and Anthropic's Claude 3 "chain-of-thought".
Why is this?
LLMs generate statistically plausible continuations of text. This works well for self-contained tasks like writing a function or drafting documentation because these are pattern‑extension problems. But pattern‑matching to generate code does not address wider issues within the system that will be changed.
Software systems are causal: components depend on each other, invariants constrain behaviour, and changes propagate through the system.
The moment a task stops being additive and becomes system‑dependent, requiring dependency coherence, pattern‑matching is no longer the whole answer.
Currently, LLMs can imitate the shape of engineering work, but they cannot maintain a stable internal representation of a system that must be coherently changed, and that gap is exactly why LLMs fail the moment the task becomes system‑level.
Persistent state creates temporal dependencies
As soon as a change depends on:
- previous writes
- accumulated data
- cached values
- long‑lived objects
- external system state
any agentic model must reason about how the system got here and how it will behave after the change.
LLMs cannot maintain that internal causal chain.
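A minimal sketch of a temporal dependency, assuming a hypothetical in-memory `store` that stands in for any state that outlives a deployment. The new code is consistent with itself, but it misreads data written before the change.

```python
# Hypothetical persistent store: it outlives any single deployment.
store: dict[str, int] = {}


def record_login_v1(user: str, now_s: int) -> None:
    """Old code: writes the login time in seconds."""
    store[user] = now_s


def record_login_v2(user: str, now_ms: int) -> None:
    """New code: writes the login time in milliseconds."""
    store[user] = now_ms


def seconds_since_login_v2(user: str, now_ms: int) -> float:
    """New code: assumes every stored value is in milliseconds."""
    return (now_ms - store[user]) / 1000


record_login_v1("ada", 1_000)      # written before the change, at t = 1000 s
record_login_v2("bob", 1_000_000)  # written after the change, same instant

# Both users logged in 60 seconds ago, at the same moment.
stale = seconds_since_login_v2("ada", 1_060_000)  # 1059.0, wrong
fresh = seconds_since_login_v2("bob", 1_060_000)  # 60.0, right
```

A test suite that starts from an empty store passes. The failure appears only against data whose history predates the change, which is exactly the history that a diff does not show.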
Writing code to Agentic Systems: The Fundamental Gap
The gap becomes clear when you compare two activities: writing new code and modifying an existing system.
Code generation is local and additive: the model extends a pattern without needing to understand the system.
But agentic work is global and transformative: the LLM must change the system itself, which requires understanding dependencies, invariants, interactions, and downstream consequences.
This is causal reasoning, not pattern extension. LLMs predict tokens, not consequences, and that is why the leap from writing code to producing a safe, system‑aware PR‑ready diff is not incremental but a shift into a fundamentally different problem space.
Producing a PR‑ready diff
A pull request (PR) is a proposed change to a system's code.
For that change to be safe, the change must respect the system's current architecture, its intent, and all downstream consequences.
Software engineers work hard to ensure that such a change is safe through testing and their own judgement and experience before having a colleague review the change.
Applying a change is no longer pattern-matching but understanding causal behaviour: how will the system change if this PR is applied?
The correctness of the PR depends on understanding the whole system, not just generating text.
The LLM must change the system, which requires understanding dependencies, invariants, interactions, and consequences, all of which demand causal reasoning, not pattern-matching.
Pattern‑matching can write code; only causal reasoning can maintain systems.
State of the art, June 2026
As of June 2026, this is the state of the art in agentic software engineering. We are at the Task Owner stage: an agent can produce a PR-ready diff, but a human must apply it. The next stage, Feature Owner, is emerging.
| Level | Capability | Real Today | Vendor Examples |
|---|---|---|---|
| Code Helper | Explain code, small edits, snippets | Yes | GitHub Copilot Chat, Claude, Cursor, Sourcegraph Cody |
| Task Assistant | Single‑file tasks, lint/format, simple tool use | Yes | Copilot Chat, Cursor, Cody, Aider |
| Change Executor | Multi‑file diffs, run tests, iterate on failures | Yes | Aider, Cody Enterprise, Cursor, Devin (partial) |
| Task Owner | Plans, multi‑step execution, PR‑ready changes | Yes | GitHub Copilot Workspace, Devin, Claude Projects + Tools, Cody Enterprise |
| Feature Owner | Workflow‑scale ownership, persistent state, CI integration | Emerging | GitHub Copilot Workspace (early), Devin (marketing claims), Claude Projects (partial) |
| System Collaborator | Architectural awareness, invariants, cross‑module reasoning | No | None |
| System Owner | Autonomous maintenance, long‑horizon refactors, incident mgmt | No | None |
Feature Owner is multi-task, whereas Task Owner is single-task. Multiple tasks introduce the challenge of ensuring that multiple changes fit together as a coherent whole. Coherence requires architectural consistency, test strategy alignment, dependency awareness and the ability to work within performance and security constraints.
A bigger LLM will not help in moving from single to multiple tasks. Different behaviour is required, and this missing behaviour is not "inside the model".
A viable multi-task solution may become feasible by wrapping an LLM in a stateful, constraint‑aware, tool‑driven agent architecture. This adds the components around the LLM that are needed to go from single task to multi-task.
As an industry we have reached a hard limit of what an LLM can provide on its own. LLMs are fundamentally stateless, pattern-matching, next token predictors that, by design, have no internal causal model and no persistent memory. This means an LLM cannot maintain a plan, a task graph, invariants, constraints or the long-lived context necessary to maintain consistency across multiple pull requests.
This limit is the reason that, to get to Feature Owner, we need additional capabilities that are emerging as components built around the LLM. This is system engineering, not model scaling.
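A minimal sketch of what "components built around the LLM" could mean, under stated assumptions: `Task`, `AgentState`, `run_tasks` and the `propose` callback are all hypothetical names, and `propose` is a stub standing in for a model call. The task graph, the record of accepted changes, and the constraint checks all live outside the model.

```python
from dataclasses import dataclass, field
from typing import Callable

# A constraint returns True if a proposed change is acceptable.
Constraint = Callable[[str], bool]


@dataclass
class Task:
    name: str
    depends_on: list[str] = field(default_factory=list)


@dataclass
class AgentState:
    """State kept outside the model: what is done and what was accepted."""
    done: list[str] = field(default_factory=list)
    accepted: dict[str, str] = field(default_factory=dict)


def run_tasks(tasks: list[Task],
              propose: Callable[[Task, AgentState], str],
              constraints: list[Constraint]) -> AgentState:
    state = AgentState()
    pending = list(tasks)
    while pending:
        # Only tasks whose dependencies are complete may run.
        ready = [t for t in pending
                 if all(d in state.done for d in t.depends_on)]
        if not ready:
            raise ValueError("unsatisfiable dependencies")
        task = ready[0]
        change = propose(task, state)  # stand-in for the LLM call
        # Every proposal is checked before it becomes part of the state.
        if not all(check(change) for check in constraints):
            raise RuntimeError(f"{task.name}: change violates a constraint")
        state.accepted[task.name] = change
        state.done.append(task.name)
        pending.remove(task)
    return state


# Usage with a stubbed proposer; a real system would call a model here.
plan = [Task("update-api", depends_on=["migrate-schema"]), Task("migrate-schema")]
no_drops: Constraint = lambda change: "DROP TABLE" not in change
result = run_tasks(plan, lambda t, s: f"diff for {t.name}", [no_drops])
```

The sketch leaves out the hard part: the constraints here are trivial string checks, while real invariants are rarely this easy to state. It shows only where such checks would sit relative to the model.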
What can I do?
Confirm for yourself any claim that you see. Define your own realistic repository to work on: one with thousands of lines of code and a history of real-world work patterns.
Your own results, on your own repository, will tell you far more than any press release or online anecdote.
For the moment:
- treat agentic AI as a strategic direction
- treat current tools as assistants, not engineers
- invest in clarity, architecture, and test discipline
- expect progress, but not miracles
- do not plan delivery pipelines around unproven capabilities
Maintain human judgement as the centre of the system.
The promise of agentic systems is intact; improvements have been made. But the evidence for fully autonomous agents is not yet here. The software delivery industry will have to adjust its approach based on what agentic tools can do and any future improvements.
Why this matters: code is cheap, judgement is not
LLM-augmented software delivery does not remove engineering.
It moves engineering up a level.
Humans need to focus on:
- intent
- constraints
- architecture
- correctness
- safety
- trade‑offs
The desired end state is not "AI writes code" but "AI maintains systems". If we get there, humans will still need to maintain intent.
The consequence of an agentic system is not to remove engineering, but to elevate it, so that teams spend less time on mechanical construction and more time on judgement, alignment, and shaping the environment in which agents operate.
Fully agentic development promises a structural shift in how software is conceived, validated, and maintained.
Final Thought
Until AI can reason causally about an entire system, human judgement remains the foundation of software delivery.
Related Work
- The real gains for AI come from teams.
- Think in tokens or your system will break in production.
- AI is probabilistic; reliability comes from constraints.
Further Reading
- Abou Ali, M., & Dornaika, F. (2025) Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions.
  https://doi.org/10.48550/arXiv.2510.25445
- Aider
  https://aider.chat
- AI Assistant in JetBrains IDEs, JetBrains, December, 2023
  https://blog.jetbrains.com/blog/2023/12/06/jetbrains-ai-assistant-is-now-available/
- Amazon CodeWhisperer, Amazon, April, 2023
  https://aws.amazon.com/codewhisperer/
- Anthropic Claude
  https://www.anthropic.com/claude
- Claude 3 Code Reasoning, Anthropic, March, 2024
  https://www.anthropic.com/news/claude-3-family
- Code Llama, Meta, August, 2023
  https://ai.meta.com/blog/code-llama-large-language-model-coding/
- Cody, Sourcegraph, April, 2024
  https://sourcegraph.com/blog/cody-2-0
- Cognition Labs Devin
  https://www.cognition-labs.com
- Cursor
  https://cursor.com
- Devin, Cognition Labs, March, 2024
  https://www.cognition-labs.com/
- Gemini Code Demos, Google, December, 2023
  https://blog.google/technology/ai/google-gemini-ai/
- GitHub Copilot, GitHub (Microsoft), November, 2023
  https://github.blog/2023-11-08-the-new-github-copilot-your-ai-pair-programmer/
- GitHub Copilot Chat
  https://github.com/features/copilot
- GitHub Copilot Workspace
  https://github.com/features/copilot
- OpenAI o1/o3, OpenAI, September, 2024
  https://openai.com/index/introducing-openai-o1-preview/
- Replit Agents, Replit, November, 2023
  https://blog.replit.com/agents
- Sourcegraph Cody Enterprise
  https://sourcegraph.com/cody