The Limits of Stateless LLMs

We have reached the end of major gains from large language models (LLMs). LLMs are fundamentally stateless, pattern-matching, next-token predictors that, by design, have no internal causal model and no persistent memory.

Teams building systems that require a plan, the coherent handling of constraints, or a long-lived context to maintain consistency across several updates are all turning towards building such components around the LLM. A larger model will no longer suffice.

An LLM can answer medical questions, but it cannot: maintain patient state, track longitudinal data, enforce clinical guidelines or reason about drug interactions. Medical LLM products are shifting focus towards clinical decision engines, structured patient records, safety layers and protocol checkers.

AIv2 will require great product and software engineering to give industries what they need: higher-level planning and multi-stage coherence.

The Limits of Stateless LLMs

As of June 2026, the software engineering industry has benefited from rapid gains in LLM capability, driven by building bigger, richer models capable of accurately matching an ever larger number of patterns. Today, autonomous AI engineers excel at single-task ownership: the ability to complete a single software engineering task so that a human software engineer can decide whether that change should be applied to the wider system.

The next stage for autonomous AI engineers is to excel at feature ownership: the ability to complete the multiple tasks necessary to ensure a whole feature can be safely applied to the wider system. A feature requires multiple tasks, such as safely updating the code that runs in a web browser; updating the backend code that the browser code interacts with; and updating any databases needed to store feature data. All of these changes must be mutually consistent, so that one component of the new feature does not break another part of that same feature, and so that no part of the new feature breaks anything already working in the current system.

LLMs are fundamentally stateless, pattern-matching, next-token predictors that, by design, have no internal causal model and no persistent memory. This means an LLM cannot maintain a plan, a task graph, invariants, constraints or the long-lived context necessary to maintain consistency across multiple updates.
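A task graph of this kind has to live outside the model. As a minimal sketch, assuming three hypothetical task names for a single feature, Python's standard `graphlib` module can hold the dependencies and produce an order in which every task follows the tasks it depends on:

```python
from graphlib import TopologicalSorter

# Hypothetical tasks for one feature; each maps to the tasks it depends on.
feature_tasks = {
    "update_database_schema": set(),
    "update_backend_api": {"update_database_schema"},
    "update_browser_code": {"update_backend_api"},
}


def plan_feature(tasks):
    """Return an execution order in which each task follows its dependencies."""
    return list(TopologicalSorter(tasks).static_order())


order = plan_feature(feature_tasks)
print(order)
```

The graph is ordinary program state: it persists between model calls, and each call only needs to be handed the next task in the order.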

The safe update of the system to incorporate the multi-task feature requires: persistent state, multi-step reasoning, constraint enforcement, and cross-task coherence. These are all part of long-horizon enforcement: the capability to ensure that a plan, a set of constraints, or a set of invariants remains true across an extended sequence of tasks.
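One way to picture long-horizon enforcement is as a small stateful wrapper that checks every invariant after each task and rejects any change that breaks one. The sketch below is illustrative only: `FeatureState`, the two invariants and the field name `nickname` are assumptions made for the example, and hand-written lambdas stand in for changes that an LLM would propose.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class FeatureState:
    """Persistent state that outlives any single model call (hypothetical)."""
    db_columns: set = field(default_factory=set)   # columns the database stores
    api_fields: set = field(default_factory=set)   # fields the backend returns
    ui_fields: set = field(default_factory=set)    # fields the browser code reads
    completed: list = field(default_factory=list)  # names of accepted tasks


def api_fields_are_stored(state):
    return state.api_fields <= state.db_columns


def ui_fields_are_served(state):
    return state.ui_fields <= state.api_fields


INVARIANTS = [api_fields_are_stored, ui_fields_are_served]


def apply_task(state, name, change):
    """Apply a change to a copy; keep it only if every invariant still holds."""
    candidate = copy.deepcopy(state)
    change(candidate)
    broken = [check.__name__ for check in INVARIANTS if not check(candidate)]
    if broken:
        return state, broken  # reject: the previous state is kept unchanged
    candidate.completed.append(name)
    return candidate, []


state = FeatureState()
state, _ = apply_task(state, "add column", lambda s: s.db_columns.add("nickname"))
# Attempted too early: the browser would read a field the backend does not serve.
state, rejected = apply_task(state, "show in browser", lambda s: s.ui_fields.add("nickname"))
state, _ = apply_task(state, "expose in API", lambda s: s.api_fields.add("nickname"))
state, _ = apply_task(state, "show in browser", lambda s: s.ui_fields.add("nickname"))
print(rejected)
print(state.completed)
```

The same browser change is rejected the first time and accepted the second. The only difference is the accumulated state, which is exactly what a single stateless call does not have.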

Building a bigger model will not solve this. Vendors in the autonomous AI engineer space are therefore building components around the LLM to provide the necessary support.

Single‑task autonomy is solved but multi‑task feature coherence remains unsolved.

We have reached a hard limit of what an LLM can provide on its own.

This Pattern in Other Industries

LLMs can do one step but, due to their design, cannot maintain state across steps, nor can they enforce constraints or ensure coherence across steps.

We see this LLM hard limit in other industries.

Robotics

Robotics teams report that LLMs alone cannot: maintain a world model, track state across multiple steps, plan reliably or adapt to unexpected changes.

Work such as CodeAct, among others, is adding symbolic planners, state estimators, constraint solvers and safety layers around the LLM to provide the missing abilities.

Legal document review

LLMs can summarise single contracts, but they cannot: maintain consistency across multiple documents, track obligations across clauses, enforce regulatory constraints or reason about dependencies between legal entities.

Rules engines, knowledge graphs, audit trails and structured memory (e.g., the work on CUAD) are being added to LLMs.

In their work on CUAD, Hendrycks and others note that many real-world document analysis tasks still do not make use of machine learning, and that whether large models can transfer to highly specialized domains remains an open question. Resolving that question requires large specialized datasets.

Healthcare decision support

An LLM can answer medical questions, but it cannot: maintain patient state, track longitudinal data, enforce clinical guidelines or reason about drug interactions.

In their work on MedAction, Hsin-Ling Hsu and others report three recurring failure modes in current LLMs: ungrounded test ordering, unreliable diagnostic update, and degraded multi-turn coherence. Together, these reveal a core deficit: existing medical training data teaches models to reason from complete information but not to act under evolving, partial evidence. Such evolving evidence implies multiple steps.

Clinical decision engines, structured patient records, safety layers and protocol checkers are being put into place to address these issues.

Finance and trading

LLMs can analyse a single report, but they cannot: maintain portfolio state, track risk constraints, reason about multi‑step strategies or enforce regulatory rules.

As reported by Mahdavi and others in their survey (citation 11 in that survey), Zhixuan Chu and others have taken a data-centric approach to improving LLMs on financial tasks, addressing the models' limitations in integrating and reasoning over complex financial data.

Firms are adding risk engines, rule‑based planners, and stateful trading systems to address this.
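A risk engine of this kind can be pictured as a stateful check that sits between a proposed order and its execution. The sketch below is a toy under stated assumptions: the `Portfolio` and `Order` types, the symbol `ACME` and the exposure limit are all hypothetical, and a real system would price positions and model risk far more carefully.

```python
from dataclasses import dataclass, field

MAX_POSITION_VALUE = 10_000.0  # hypothetical per-symbol exposure limit


@dataclass
class Portfolio:
    cash: float
    positions: dict = field(default_factory=dict)  # symbol -> quantity held


@dataclass(frozen=True)
class Order:
    symbol: str
    quantity: int  # positive buys, negative sells
    price: float


def check_order(portfolio, order):
    """Return the list of rule violations; an empty list means the order is allowed."""
    violations = []
    new_quantity = portfolio.positions.get(order.symbol, 0) + order.quantity
    if new_quantity < 0:
        violations.append("short selling not permitted")
    if new_quantity * order.price > MAX_POSITION_VALUE:
        violations.append("position limit exceeded")
    if order.quantity * order.price > portfolio.cash:
        violations.append("insufficient cash")
    return violations


def execute(portfolio, order):
    """Update the portfolio only if the order passes every rule."""
    violations = check_order(portfolio, order)
    if violations:
        return violations
    portfolio.cash -= order.quantity * order.price
    held = portfolio.positions.get(order.symbol, 0)
    portfolio.positions[order.symbol] = held + order.quantity
    return []


portfolio = Portfolio(cash=20_000.0)
first = execute(portfolio, Order("ACME", 80, 100.0))
second = execute(portfolio, Order("ACME", 40, 100.0))
print(first, second, portfolio)
```

Taken alone, the second order is small and affordable. It is rejected only because the engine remembers the position opened by the first order.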

Customer support automation

LLMs can answer a single ticket, but they cannot: maintain a case history over multiple tickets, track multi‑step resolutions, enforce policy constraints or coordinate across channels.

Balaji and others state that across 703 conversations spanning three domains, their structured workflow orchestration (Dynamic-Prompt-Agent) significantly outperforms prompt-based approaches, enabling even smaller models to exceed larger ones in policy compliance.

To address these, companies are adding integrated memory to customer relationship management, workflow engines, and policy validators.

Conclusion

LLMs are fundamentally stateless, pattern-matching, next-token predictors. They have no internal causal model and no persistent memory.

This design has gotten us this far. LLMs excel at single-stage tasks, but when stepping into the richer domain of multiple steps, pattern-matching on its own is no longer enough.

Vendors are now building the missing capabilities around their LLMs.

We are now entering AIv2. AIv1 was single-task pattern-matching. Version 2 will go further, addressing multiple stages and higher-level capabilities such as coherence across steps.

Current products build support for v2 around the v1 LLM. The low-hanging fruit of v1 has been consumed. The leading products of tomorrow will rely on engineering ability and sound product decisions.

Further Reading

  • CodeAct: Executable Code Actions Elicit Better LLM Agents https://arxiv.org/abs/2402.01030

  • CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review https://arxiv.org/abs/2103.06268

  • MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs https://arxiv.org/html/2605.07305v1

  • Mahdavi and others: Integrating Large Language Models in Financial Investments and Market Analysis: A Survey
    https://arxiv.org/pdf/2507.01990

  • Sumanth Balaji and others: Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence
    https://arxiv.org/pdf/2601.00596
