Measuring Reliability in the Age of AI

AI has increased delivery speed, but without measurement, reliability becomes opaque. How do you understand reliability in your own system?

Collect and act on metrics.

Why metrics matter

Metrics matter because they turn an unpredictable system into a manageable one. They give you a reliable view of what is really happening.

Organizational risk is increasing due to AI-driven delivery, and this can impact customer experience and your personal accountability.

Metrics support defensible governance during a period of rapid industry change.

Failure rate metrics

There is no industry-wide dataset that directly measures change failure rate before and after the adoption of AI for a delivery process that leads to production change.

To decide if your use of AI is making your delivery better or worse you need to measure your own situation to define a baseline.

To do this, you need to collect metrics that reflect the changing health of your delivery process and its effects on your production environment.


Four metrics

To surface a picture of your delivery and production health, you need to capture:

Change failure rate
MTTR
MTTD
Incident volume and severity

Just as observability shows you what your production system is doing, these metrics capture the relationship between how you produce solutions and the effect they have on your production system.

Change failure rate (CFR)

This tells you how often your changes break production.

Production reliability problems ultimately begin with one of two things:

A change that should not have been deployed
A change that was deployed correctly but behaved incorrectly

The first is a decision failure as the change was wrong before it reached production: the code was incorrect, incomplete or logically flawed.

The second is a system-interaction failure. The change was valid in isolation (all quality checks showed the change was good to go) but, once deployed, the change interacted with production in an unexpected way that was not picked up earlier.

When using AI, generated code can be:

plausible but wrong
incomplete
inconsistent
missing edge cases
violating invariants

Such code will increase the number of changes that should not have been deployed.

AI use also increases change volume and this will affect more parts of your system. This increases:

integration risk
emergent behaviour
subtle regressions
interactions with legacy code

This increases the number of changes that behave incorrectly only after deployment.

Interpreting CFR

If CFR rises, you need to know which of these happened: did your use of AI generate more incorrect changes, or did AI accelerate delivery and expose more integration failures?

Without separating the two, you cannot attribute the cause.
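One way to keep the two causes apart is to label each failed change with its origin when you review the incident. The sketch below is a minimal illustration under that assumption: the `Change` record and the `decision` and `interaction` labels are hypothetical names, not part of any standard tool, and the sample data is made up.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for the two failure origins described above.
DECISION = "decision"        # the change should not have been deployed
INTERACTION = "interaction"  # valid in isolation, failed once in production


@dataclass
class Change:
    change_id: str
    failure_type: Optional[str] = None  # None means the change did not fail


def change_failure_rates(changes: list[Change]) -> dict[str, float]:
    """Return the overall CFR plus the rate for each failure origin."""
    total = len(changes)
    if total == 0:
        return {"overall": 0.0, DECISION: 0.0, INTERACTION: 0.0}
    counts = Counter(c.failure_type for c in changes if c.failure_type)
    return {
        "overall": sum(counts.values()) / total,
        DECISION: counts[DECISION] / total,
        INTERACTION: counts[INTERACTION] / total,
    }


# Made-up sample: ten changes, one decision failure, two interaction failures.
changes = [Change(f"c{i}") for i in range(1, 8)]
changes += [Change("c8", DECISION), Change("c9", INTERACTION), Change("c10", INTERACTION)]

rates = change_failure_rates(changes)
print(rates)
```

Tracking the two rates side by side over time shows whether a rising CFR comes from incorrect changes or from integration failures.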

Mean time to recovery

MTTR tells you how long you stay broken. It is the single best measure of operational resilience because what matters is how long a flaw affects your customers. If part of your system is broken for 24 hours but your customers can still use your systems effectively, you have a degree of resilience: the flaw does not negatively impact them.

Even if CFR stays constant, MTTR can become worse because using AI can introduce more subtle defects (that are likely to be missed in a large volume of code generation), and your system becomes harder to correct.

Consider a SQL statement whose behaviour only becomes clear when production table sizes are in the millions.

Subtle defects

Imagine your AI rewrites a database query for readability:

SELECT * FROM users WHERE id IN (SELECT user_id FROM sessions)

You test the above and everything passes. The test uses 50 rows for users and 10 for sessions.

But in production, users contains 10 million rows and sessions contains 50 million.

Given this different data environment, the first thing that happens is that the subquery (in parentheses) becomes unbounded. It will return every user_id in sessions.

Because the query tests for membership using IN, the database will typically build an in-memory set of the user_id values from sessions. To do so, every value from sessions must be read.

Even if we assume that the set of values from sessions is built once in memory (so that each membership test costs one unit of time) and that users is indexed, every row in users must still be scanned. And sequential scans are inherently expensive.

The performance outcome in this case is that a sequential scan of users is expensive, but when the IN list is huge (50 million values), every alternative database query plan is even more expensive, so the database optimiser chooses the scan as the least costly option. A scan will be faster than the alternatives but such a scan is still costly in terms of input/output and CPU use.

A large sessions table leads to a huge IN list which means any index on users is of less value. Because of this the database query optimiser scans the whole of users. And in production, users contains 10,000,000 rows.

In short, the size of sessions triggers the database to choose a query plan that scans 10,000,000 rows.

This issue is hard to catch in testing because the test used a tiny dataset.

But the key here is that the AI has no awareness of your production table sizes.

AI has written a theoretically correct query that breaks down when exposed to the realities of production.

There is more to writing code than just the text. A full awareness of the environment in which that text is running is required. And the AI does not have that awareness.

Your business becomes dependent on generated code that is not fully understood.

The interaction of two table sizes on performance

Engineers have an appreciation of this matrix.

The same query will perform differently from run to run as the sizes of users and sessions vary. If both are large, worst-case performance may occur.

Large here depends on the database you are using and the hardware environment it is running within.

The eventual performance of your code is dependent on factors outside of the code. This is why it is crucial to check the behaviour of your code in a test environment that is an accurate reflection of your production environment. Your engineers and QA staff are aware of this. Your AI is not.

Table size	Small Sessions	Large Sessions
Small Users	• Fast query plans • Index use likely	• Subquery grows large but outer scan still cheap • Hash table from sessions is large but overall, still manageable
Large Users	• Index use leads to good performance • Optimiser avoids full scan	• This is the worst case: • A huge IN list and a full scan of a large users table • Result: a high query time • The optimiser is forced into a sequential scan on users
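The table-size sensitivity described above can be explored by running the same query against datasets of different sizes and recording the plan and the elapsed time. The sketch below uses an in-memory SQLite database purely for convenience. SQLite's planner is not the planner of your production database, so the plans it prints will differ from those described here. The table layout, the `build_db` and `profile` helper names and the row counts are all assumptions made for illustration.

```python
import sqlite3
import time

QUERY = "SELECT * FROM users WHERE id IN (SELECT user_id FROM sessions)"


def build_db(n_users: int, n_sessions: int) -> sqlite3.Connection:
    """Create an in-memory database with synthetic users and sessions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        ((i, f"user{i}") for i in range(1, n_users + 1)),
    )
    # Spread sessions across users so every session points at a real user.
    conn.executemany(
        "INSERT INTO sessions (id, user_id) VALUES (?, ?)",
        ((i, (i % n_users) + 1) for i in range(1, n_sessions + 1)),
    )
    conn.commit()
    return conn


def profile(n_users: int, n_sessions: int):
    """Return (plan lines, rows returned, seconds elapsed) for QUERY."""
    conn = build_db(n_users, n_sessions)
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
    start = time.perf_counter()
    rows = conn.execute(QUERY).fetchall()
    elapsed = time.perf_counter() - start
    conn.close()
    return plan, len(rows), elapsed


# A test-sized dataset, then a larger one. Scale these towards production sizes.
for n_users, n_sessions in [(50, 10), (2_000, 10_000)]:
    plan, row_count, elapsed = profile(n_users, n_sessions)
    print(f"users={n_users} sessions={n_sessions} rows={row_count} time={elapsed:.4f}s")
    for line in plan:
        print("   ", line)
```

The value is in the habit rather than the numbers: run the same comparison on your own database engine, with row counts close to production, before trusting a rewritten query.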

Interpreting MTTR

If MTTR rises, users experience longer outages. If MTTR falls, reliability is improving even if the change failure rate is unchanged.

Mean time to detection

MTTD tells you how long you remain unaware that you are broken.

AI can affect this in two ways:

more subtle regressions, so a broken production system is harder to detect
more automated monitoring, so a broken production system is easier to detect

More subtle regressions

AI can generate code that can hide defects. And the defect may be subtle, as we have seen, because it only appears under:

real production data, not engineer-run tests
real concurrency, not local runs
real load, not pre-production quality staging

plus:

logs may not show anything suspicious
any negative behaviour may be intermittent, so alerts do not fire

Interpreting MTTD

The system is broken, but nobody realises for longer. That is an increase in MTTD.

If MTTD increases, you are blind for longer. If MTTD decreases, you catch issues earlier.

Using MTTD to interpret MTTR

Without MTTD, you cannot interpret MTTR correctly. This is because MTTD is a component of MTTR.

MTTD is the time between a failure occurring and the failure being detected. This is the blindness window.

MTTR after failure detection is the time from failure detection to recovery. This is the repair window.

MTTR can refer to the sum of both these windows of time. But, the two windows behave differently, and they are influenced by AI in different ways.

Separating the two is important because if you only look at MTTR, you cannot tell whether:

detection slowed down
recovery slowed down
both slowed down
one improved while the other got worse

Two organisations with the same MTTR can have two totally different operational realities.

Separating MTTD using two types of MTTR

There are two types of MTTR:

detection-inclusive MTTR: the total time to recover, including the time you were unaware of the issue
post-detection MTTR: how fast you fix things once you are aware of the issue

The first shows how long users were affected. The second describes how fast engineering can recover.
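The two windows can be computed from three timestamps per incident: when the failure started, when it was detected and when it was resolved. The sketch below assumes you can record those three timestamps. The `Incident` record, the helper names and the sample values are hypothetical and chosen only to show two organisations with the same detection-inclusive MTTR but very different blindness and repair windows.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the failure began
    detected: datetime  # when someone or something noticed
    resolved: datetime  # when service was restored


def _mean_minutes(deltas: list[timedelta]) -> float:
    return mean(d.total_seconds() for d in deltas) / 60


def summarise(incidents: list[Incident]) -> dict[str, float]:
    """Return MTTD and both types of MTTR, in minutes."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return {
        "mttd": _mean_minutes([i.detected - i.started for i in incidents]),
        "post_detection_mttr": _mean_minutes([i.resolved - i.detected for i in incidents]),
        "detection_inclusive_mttr": _mean_minutes([i.resolved - i.started for i in incidents]),
    }


T0 = datetime(2024, 1, 1, 9, 0)  # arbitrary illustrative start time


def incident(detect_minutes: int, repair_minutes: int) -> Incident:
    detected = T0 + timedelta(minutes=detect_minutes)
    return Incident(T0, detected, detected + timedelta(minutes=repair_minutes))


# Made-up data: both organisations take 120 minutes from failure to recovery.
org_a = summarise([incident(detect_minutes=10, repair_minutes=110)])
org_b = summarise([incident(detect_minutes=100, repair_minutes=20)])
print("A:", org_a)
print("B:", org_b)
```

In this made-up data, organisation A detects quickly and repairs slowly, while organisation B is blind for most of the outage and then repairs quickly. A single MTTR figure would hide that difference.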

AI's effect on MTTD

AI can both increase and decrease MTTD, depending on how you use it.

AI can introduce subtle defects (as above with the users and sessions database tables) that can:

pass tests
look plausible and so pass code review
only appear under real load or real data
fail to trigger alerts immediately

These will increase your blindness window.

AI can improve:

anomaly detection
log analysis
metric correlation (not causation)
alert generation

The improvements reduce the blindness window.

AI may increase or decrease post-detection MTTR, depending on whether:

your use of AI helps engineers debug faster
your use of AI produces code that is harder to reason about when run in production

And you cannot know which effect dominates unless you measure the components separately.

Incident volume and severity

This tells you how often and how badly things go wrong.

Even if your change failure rate and your mean time to recovery look stable, incident volume can rise because:

deployment frequency increases
AI accelerates code generation
system complexity increases
more third‑party dependencies fail

Incident volume is the only metric that captures the total operational load on the organisation.
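Tracking this metric can be as simple as counting incidents per month and per severity level. The sketch below assumes each incident is recorded with the date it was opened and a severity label. The P1 and P2 labels follow common practice, and the `incident_volume` helper and the sample data are made up for illustration.

```python
from collections import Counter
from datetime import date


def incident_volume(incidents: list[tuple[date, str]]) -> Counter:
    """Count incidents per (month, severity) pair."""
    counts: Counter = Counter()
    for opened, severity in incidents:
        counts[(opened.strftime("%Y-%m"), severity)] += 1
    return counts


# Made-up sample data: (date opened, severity).
sample = [
    (date(2024, 1, 3), "P1"),
    (date(2024, 1, 17), "P2"),
    (date(2024, 1, 25), "P2"),
    (date(2024, 2, 2), "P1"),
    (date(2024, 2, 9), "P1"),
    (date(2024, 2, 20), "P2"),
]

volume = incident_volume(sample)
for (month, severity), count in sorted(volume.items()):
    print(month, severity, count)
```

In this sample the monthly total is three in both months, but the mix shifts towards P1. That is why volume and severity need to be read together.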

How has your use of AI affected your business?

If you are using AI operationally, you can measure what effect it is having by considering these metrics.

Each metric must be normalised so that the before and after values reflect real changes in performance and are not due to changes in volume or other unrelated factors such as:

team size
deployment frequency
service footprint
code volume
organisational growth

Metric	Before AI	After AI	Normalisation explanation
Deployments/week	X	Y	Adjusted for team size and release cadence
Change Failure Rate	A%	B%	Calculated as failures per change, not absolute counts
MTTR	M	N	Split into detection and recovery components
P1/P2 incidents/month	U	V	Adjusted for deployment volume and service footprint
Lines of code changed	L1	L2	Normalised per engineer to remove team size effects
AI‑generated code (%)	0%	K%	Expressed as a proportion of total code changes

For example, without normalisation, the metrics might be misleading:

if deployments double, incident count may rise even if quality improves
if the team grows, code generation volume increases even without AI
if the system footprint expands, MTTR may rise simply because more services exist

Consider this: you double deployments, and your number of incidents doubles. Your quality has remained the same. But if you double the number of deployments and the number of incidents increases by only 50%, your quality has improved. Normalisation is essential to interpret each metric in the context of overall volume.
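The deployment example can be worked through by dividing incidents by deployments. The sketch below is a minimal illustration: the `incidents_per_deployment` helper is a hypothetical name, and the counts are made up to match the doubling scenario.

```python
def incidents_per_deployment(incidents: int, deployments: int) -> float:
    """Normalise an incident count by the number of deployments."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    return incidents / deployments


# Made-up baseline: 10 incidents across 100 deployments.
before = incidents_per_deployment(incidents=10, deployments=100)

# Deployments double and incidents double: the rate is unchanged.
unchanged = incidents_per_deployment(incidents=20, deployments=200)

# Deployments double and incidents rise by 50%: the rate falls.
improved = incidents_per_deployment(incidents=15, deployments=200)

print(f"before={before:.3f} unchanged={unchanged:.3f} improved={improved:.3f}")
```

The raw incident count rises in both of the later scenarios. Only the normalised rate shows that quality held steady in one and improved in the other.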

The same effect appears outside software. In the US there are 131 million households and 86.9 million of them own a pet. In the UK, there are 28.2 million households, with 16.2 million owning a pet.

On the face of it, pet ownership in the US far outstrips the UK. However, we have to normalise this data by taking into account the different number of households: the US has about 4.6 times as many.

Taking this into account, 66% of US households own a pet, and in the UK, this figure is 57%. Once normalised per household, the UK and US figures are shown to be broadly similar.

Why collect metrics?

Collecting and publishing metrics is essential because it replaces subjective, individual experience with objective, organisation‑wide evidence. Anecdotes such as "my use of AI has made it better for me" are useful feedback one-on-one, but they cannot explain what is happening across your entire delivery system.

Conclusion

Reliability in an AI‑accelerated delivery system cannot be managed by intuition or anecdote.

Knowing whether AI is strengthening or weakening your production environment requires measuring the effects.

Change failure rate, MTTR, MTTD, and incident volume give you a clear and defensible view of how your production system is responding to any changes in your approach to software delivery.

As complexity rises and subtle failures become more common, these metrics become your foundation of operational truth. Tracking them will give you control.

Metrics give leaders a defensible view of how AI is reshaping delivery, reliability, and operational risk.
