AI has increased delivery speed, but without measurement, reliability becomes opaque. How do you understand reliability in your own system?
Collect and act on metrics.
Why metrics matter
Metrics matter because they turn an unpredictable system into a manageable one. They give you reliable sight of what is really happening.
Organizational risk is increasing due to AI-driven delivery, and this can impact customer experience and your personal accountability.
Metrics support defensible governance during a period of rapid industry change.
Failure rate metrics
There is no industry-wide dataset that directly measures change failure rate before and after the adoption of AI for a delivery process that leads to production change.
To decide if your use of AI is making your delivery better or worse you need to measure your own situation to define a baseline.
To do this, you need to collect metrics that reflect the changing health of your delivery process and its effects on your production environment.
Four metrics
To surface a picture of your delivery and production health, you need to capture:
- Change failure rate
- MTTR
- MTTD
- Incident volume and severity
In the same way you have observability in place, the above metrics capture the relationship between how you produce solutions and what effect they have on your production system.
Change failure rate (CFR)
This tells you how often your changes break production.
Production reliability problems ultimately begin with one of two things:
- A change that should not have been deployed
- A change that was deployed correctly but behaved incorrectly
The first is a decision failure as the change was wrong before it reached production: the code was incorrect, incomplete or logically flawed.
The second is a system-interaction failure. The change was valid in isolation (all quality checks showed the change was good to go) but, once deployed, the change interacted with production in an unexpected way that was not picked up earlier.
When using AI, generated code can be:
- plausible but wrong
- incomplete
- inconsistent
- missing edge cases
- violating invariants
Such code will increase the number of changes that should not have been deployed.
AI use also increases change volume and this will affect more parts of your system. This increases:
- integration risk
- emergent behaviour
- subtle regressions
- interactions with legacy code
This increases the number of changes that behave incorrectly only after deployment.
Interpreting CFR
If CFR rises, you need to know which of two things happened: did your use of AI generate more incorrect changes, or did AI accelerate delivery and expose more integration failures?
Without separating the two, you cannot attribute the cause.
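As a minimal sketch of that separation, the following computes an overall CFR and then a rate per failure cause. The `Change` record, its `failure_cause` field and the labels "decision" and "interaction" are hypothetical names chosen for illustration, and the sample data is invented; in practice the cause would come from your own incident reviews.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Change:
    """One production deployment. Field names are hypothetical."""
    change_id: str
    # None means the change did not fail in production.
    # "decision": the change should not have been deployed.
    # "interaction": the change was valid in isolation but misbehaved once deployed.
    failure_cause: Optional[str] = None


def change_failure_rate(changes, cause=None):
    """Return failures per change, optionally restricted to one cause."""
    if not changes:
        return 0.0
    if cause is None:
        failed = [c for c in changes if c.failure_cause is not None]
    else:
        failed = [c for c in changes if c.failure_cause == cause]
    return len(failed) / len(changes)


# Invented sample: ten changes, three of which failed.
changes = [
    Change("c1"),
    Change("c2", "decision"),
    Change("c3"),
    Change("c4", "interaction"),
    Change("c5"),
    Change("c6"),
    Change("c7", "decision"),
    Change("c8"),
    Change("c9"),
    Change("c10"),
]

overall = change_failure_rate(changes)
decision_rate = change_failure_rate(changes, "decision")
interaction_rate = change_failure_rate(changes, "interaction")

print(f"Overall CFR: {overall:.0%}")
print(f"Decision failures: {decision_rate:.0%}")
print(f"Interaction failures: {interaction_rate:.0%}")
```

Tracking the two rates side by side over time shows which of them moves when the overall CFR moves.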
Mean time to recovery
MTTR tells you how long you stay broken. It is the single best measure of operational resilience. If you are broken for 24 hours but your customers can still use your systems effectively, you have a degree of resilience to system flaws, because those flaws do not negatively impact your customers.
Even if CFR stays constant, MTTR can become worse because using AI can introduce more subtle defects (that are likely to be missed in a large volume of code generation), and your system becomes harder to correct.
Consider this SQL statement whose behaviour only becomes clear when production table sizes are in the millions.
Subtle defects
Imagine your AI rewrites a database query for readability:
SELECT * FROM users WHERE id IN (SELECT user_id FROM sessions)
You test the above and everything passes. The test uses 50 rows for users and 10 for sessions.
But in production, users contains 10 million rows and sessions contains 50 million.
Given this different data environment, the first thing that happens is that the subquery (in parentheses) becomes unbounded. It will return every user_id in sessions.
The database will create an in-memory version of sessions to check for value membership. This is because the SQL query tests for this using IN.
Every value from sessions must be read.
Even when we assume both that sessions has been built once in memory (with a membership test costing one unit of time) and that users is indexed, every row in users must still be scanned. And sequential scans are inherently expensive.
The performance outcome in this case is that a sequential scan of users is expensive, but when the IN list is huge (50 million values), every alternative database query plan is even more expensive, so the database optimiser chooses the scan as the least costly option. A scan will be faster than the alternatives but such a scan is still costly in terms of input/output and CPU use.
A large sessions table leads to a huge IN list, which means any index on users is of less value. Because of this, the database query optimiser scans the whole of users. And in production, users contains 10,000,000 rows.
In short, the size of sessions triggers the database to choose a query plan that scans 10,000,000 rows.
This is a subtle issue to catch in test because the test used a tiny dataset.
But the key here is that the AI has no awareness of your production table sizes.
AI has written a theoretically correct query that breaks down when exposed to the realities of production.
There is more to writing code than just the text. A full awareness of the environment in which that text is running is required. And the AI does not have that awareness.
Your business becomes dependent on generated code that is not fully understood.
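The passing test described above can be sketched as follows. This is an illustration under stated assumptions, not the original test: it uses SQLite in memory as a stand-in for your real database, and the table schemas (including the `name` column) and the `build_test_db` helper are hypothetical. Only the query itself comes from the example.

```python
import sqlite3

# The query from the example, unchanged.
QUERY = "SELECT * FROM users WHERE id IN (SELECT user_id FROM sessions)"


def build_test_db(n_users, n_sessions):
    """Create a tiny in-memory database. Schema is assumed for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(i, f"user{i}") for i in range(1, n_users + 1)],
    )
    # One session per user for the first n_sessions users.
    conn.executemany(
        "INSERT INTO sessions (user_id) VALUES (?)",
        [(i,) for i in range(1, n_sessions + 1)],
    )
    conn.commit()
    return conn


# Same sizes as the test in the text: 50 users and 10 sessions.
conn = build_test_db(n_users=50, n_sessions=10)
rows = conn.execute(QUERY).fetchall()
matched_ids = sorted(row[0] for row in rows)

# The functional check passes: exactly the users with a session are returned.
assert matched_ids == list(range(1, 11))
print(f"Test passed with {len(matched_ids)} matching users")
```

The assertion checks only that the right rows come back. It says nothing about which query plan the production database will choose at production table sizes, which is the part the test cannot see.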
The interaction of two table sizes on performance
Engineers have an appreciation of this matrix.
The same query will operate differently each time it is run as the sizes of users and sessions vary. If they are both large, a worst case performance may occur.
Large here depends on the database you are using and the hardware environment it is running within.
The eventual performance of your code is dependent on factors outside of the code. This is why it is crucial to check the behaviour of your code in a test environment that is an accurate reflection of your production environment. Your engineers and QA staff are aware of this. Your AI is not.
| Table size | Small Sessions | Large Sessions |
|---|---|---|
| Small Users | • Fast query plans • Index use likely | • Subquery grows large but outer scan still cheap • Hash table from sessions is large but overall, still manageable |
| Large Users | • Index use leads to good performance • Optimiser avoids full scan | • This is the worst case: • A huge IN list and a full scan of a large users table • Result: a high query time • The optimiser is forced into a sequential scan on users |
Interpreting MTTR
If MTTR rises, users experience longer outages. If MTTR falls, reliability is improving even if the change failure rate is unchanged.
Mean time to detection
MTTD tells you how long you remain unaware that you are broken.
AI can affect this in two ways:
- more subtle regressions so a broken production is harder to detect
- more automated monitoring so a broken production is easier to detect
More subtle regressions
AI can generate code that can hide defects. And the defect may be subtle, as we have seen, because it only appears under:
- real production data, not engineer-run tests
- real concurrency, not local runs
- real load, not pre-production quality staging
plus:
- logs may not show anything suspicious
- any negative behaviour may be intermittent, so alerts do not fire
Interpreting MTTD
The system is broken, but nobody realises for longer. That is an increase in MTTD.
If MTTD increases, you are blind for longer. If MTTD decreases, you catch issues earlier.
Using MTTD to interpret MTTR
Without MTTD, you cannot interpret MTTR correctly. This is because MTTD is a component of MTTR.
MTTD is the time between a failure occurring and the failure being detected. This is the blindness window.
MTTR after failure detection is the time from failure detection to recovery. This is the repair window.
MTTR can refer to the sum of both these windows of time. But, the two windows behave differently, and they are influenced by AI in different ways.
Separating the two is important because if you only look at MTTR, you cannot tell whether:
- detection slowed down
- recovery slowed down
- both slowed down
- one improved while the other got worse
Two organisations with the same MTTR can have two totally different operational realities.
Affecting MTTD with two types of MTTR
There are two types of MTTR:
- detection-inclusive MTTR that includes the time you were unaware of the issue
- post-detection MTTR: how fast you fix things once you are aware of the issue
The first shows how long users were affected. The second describes how fast engineering can recover.
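To make the two windows concrete, here is a minimal sketch that derives MTTD, post-detection MTTR and detection-inclusive MTTR from incident timestamps. The `Incident` record, its field names and the two sample organisations are hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Timestamps for one incident. Field names are hypothetical."""
    started: datetime    # when the failure began
    detected: datetime   # when someone, or something, noticed
    recovered: datetime  # when service was restored


def mean_minutes(durations):
    """Average a series of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def summarise(incidents):
    """Return the blindness window, the repair window and their sum."""
    return {
        "mttd": mean_minutes(i.detected - i.started for i in incidents),
        "mttr_post_detection": mean_minutes(
            i.recovered - i.detected for i in incidents
        ),
        "mttr_detection_inclusive": mean_minutes(
            i.recovered - i.started for i in incidents
        ),
    }


def incident(detect_min, repair_min):
    """Build an invented incident from two durations in minutes."""
    started = datetime(2024, 1, 1, 9, 0)
    detected = started + timedelta(minutes=detect_min)
    return Incident(started, detected, detected + timedelta(minutes=repair_min))


# Two invented organisations with the same detection-inclusive MTTR.
org_a = summarise([incident(50, 10), incident(40, 20)])  # slow to notice
org_b = summarise([incident(5, 55), incident(10, 50)])   # slow to repair

print("Org A:", org_a)
print("Org B:", org_b)
```

Both organisations report 60 minutes of detection-inclusive MTTR, yet one has a detection problem and the other a repair problem. Only the split makes that visible.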
AI's effect on MTTD
AI can both increase and decrease MTTD, depending on how you use it.
AI can introduce subtle defects (as above with the users and sessions database tables) that:
- pass tests
- look plausible, so pass code review
- only appear under real load or real data
- do not trigger alerts immediately
These will increase your blindness window.
AI can improve:
- anomaly detection
- log analysis
- metric correlation (not causation)
- alert generation
The improvements reduce the blindness window.
AI may increase or decrease the time for MTTR after detection, depending on whether:
- your use of AI helps engineers debug faster
- your use of AI produces code that is harder to reason about when run in production
And you cannot know which effect dominates unless you measure the components separately.
Incident Volume and Severity
This tells you how often and how badly things go wrong.
Even if your change failure rate and your mean time to recovery look stable, incident volume can rise because:
- deployment frequency increases
- AI accelerates code generation
- system complexity increases
- more third‑party dependencies fail
Incident volume is the only metric that captures the total operational load on the organisation.
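A simple way to start tracking this is to tally incidents by month and severity. The sketch below assumes an incident log reduced to (date, severity) pairs; the data and the P1 to P3 severity labels are invented for illustration.

```python
from collections import Counter
from datetime import date

# Invented incident log: (date the incident was opened, severity label).
incident_log = [
    (date(2024, 1, 5), "P1"),
    (date(2024, 1, 12), "P2"),
    (date(2024, 1, 20), "P2"),
    (date(2024, 2, 3), "P1"),
    (date(2024, 2, 9), "P1"),
    (date(2024, 2, 14), "P2"),
    (date(2024, 2, 27), "P3"),
]


def volume_by_month_and_severity(log):
    """Count incidents per (YYYY-MM, severity) pair."""
    return Counter((opened.strftime("%Y-%m"), severity) for opened, severity in log)


counts = volume_by_month_and_severity(incident_log)

for (month, severity), n in sorted(counts.items()):
    print(f"{month} {severity}: {n}")
```

Reading volume and severity together matters: the same monthly total can hide a shift from minor incidents to severe ones.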
How has your use of AI affected your business?
If you are using AI operationally, you can measure what effect it is having by considering these metrics.
Each metric must be normalised so that the before and after values reflect real changes in performance and are not due to changes in volume or other unrelated factors such as:
- team size
- deployment frequency
- service footprint
- code volume
- organisational growth
| Metric | Before AI | After AI | Normalisation explanation |
|---|---|---|---|
| Deployments/week | X | Y | Adjusted for team size and release cadence |
| Change Failure Rate | A% | B% | Calculated as failures per change, not absolute counts |
| MTTR | M | N | Split into detection and recovery components |
| P1/P2 incidents/month | U | V | Adjusted for deployment volume and service footprint |
| Lines of code changed | L1 | L2 | Normalised per engineer to remove team size effects |
| AI‑generated code (%) | 0% | K% | Expressed as a proportion of total code changes |
For example, without normalisation, the metrics might be misleading:
- if deployments double, incident count may rise even if quality improves
- if the team grows, code generation volume increases even without AI
- if the system footprint expands, MTTR may rise simply because more services exist
Consider this: you double deployments, and your number of incidents doubles. Your quality has remained the same. But if you double the number of deployments and the number of incidents increases by only 50%, your quality has improved. Normalisation is essential for interpreting each metric in the context of overall volume.
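The arithmetic behind that comparison is a simple ratio. The sketch below uses invented counts (100 deployments and 10 incidents as the baseline) and a hypothetical helper, `incidents_per_deployment`, to show the two scenarios.

```python
def incidents_per_deployment(incidents, deployments):
    """Normalise an incident count by the number of deployments."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    return incidents / deployments


# Invented baseline: 100 deployments, 10 incidents.
baseline = incidents_per_deployment(10, 100)

# Deployments double and incidents double: quality is unchanged.
doubled = incidents_per_deployment(20, 200)

# Deployments double but incidents rise by only 50%: quality has improved.
improved = incidents_per_deployment(15, 200)

print(f"Baseline: {baseline:.3f} incidents per deployment")
print(f"Doubled:  {doubled:.3f} incidents per deployment")
print(f"Improved: {improved:.3f} incidents per deployment")
```

The raw incident count rises in both scenarios. Only the normalised rate shows that the second one is an improvement.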
In the US there are 131 million households and 86.9 million of them own a pet. In the UK, there are 28.2 million households, with 16.2 million owning a pet.
On the face of it, pet ownership in the US far outstrips the UK. However, we have to normalise this data by taking into account the different number of households: the US has about 4.6 times as many.
Taking this into account, 66% of US households own a pet, and in the UK, this figure is 57%. Once normalised per household, the UK and US figures are shown to be broadly similar.
Why collect metrics?
Collecting and publishing metrics is essential because it replaces subjective, individual experience with objective, organisation‑wide evidence. Anecdotes such as "my use of AI has made it better for me" are useful feedback one-on-one, but they cannot explain what is happening across your entire delivery system.
Conclusion
Reliability in an AI‑accelerated delivery system cannot be managed by intuition or anecdote.
To know whether AI is strengthening or weakening your production environment requires measuring the effects.
Change failure rate, MTTR, MTTD, and incident volume give you a clear and defensible view of how your production system is responding to any changes in your approach to software delivery.
As complexity rises and subtle failures become more common, these metrics become your foundation of operational truth. Tracking them will give you control.
Metrics give leaders a defensible view of how AI is reshaping delivery, reliability, and operational risk.