How to Evaluate Claims Made About an AI-based System
Introduction
Artificial intelligence now appears in many areas of daily life. It is used in search engines, writing tools, customer service systems, healthcare applications, and many other services. Many people encounter it without thinking about it, such as when a phone suggests a reply to a message or when an ecommerce website summarises customer feedback about a product.
Public descriptions of systems based in part or whole on AI often highlight ambitious capabilities. Some vendors describe their products as human level, fully autonomous, or capable of replacing expert judgement.
Promotional language and real performance do not always align, which makes it useful to look closely at how such claims are formed.
Understanding the Claim
The first step is to understand what is actually being promised.
Many statements about artificial intelligence are broad or ambiguous, so it is useful to translate them into specific questions. A claim such as "our tool detects fraud" sounds clear, but it raises many questions about what kind of fraud, in what context, and with what level of accuracy.
Many people begin by considering what task the system is meant to perform, under what conditions it is expected to work, how well it performs that task, and what it is being compared against. Once the claim is expressed in concrete terms, it becomes much easier to evaluate.
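To see why "what level of accuracy" matters for a claim like "our tool detects fraud", it can help to work through some numbers. The sketch below is purely illustrative: the transaction counts are invented, and the function name summarise is a hypothetical helper, not part of any product. It assumes 10,000 transactions of which 100 are fraudulent, and compares a tool that never flags anything with one that flags 200 transactions.

```python
def summarise(tp, fp, fn, tn):
    """Return accuracy, precision and recall from confusion-matrix counts.

    tp: fraud correctly flagged      fp: legitimate wrongly flagged
    fn: fraud missed                 tn: legitimate correctly passed
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is flagged or nothing is fraud.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical data: 10,000 transactions, 100 of them fraudulent.
# A "tool" that never flags anything still scores 99% accuracy.
never_flags = summarise(tp=0, fp=0, fn=100, tn=9900)

# A tool that flags 200 transactions and catches 60 of the 100 frauds.
tool = summarise(tp=60, fp=140, fn=40, tn=9760)

for name, result in [("never flags", never_flags), ("tool", tool)]:
    print(name, {k: round(v, 3) for k, v in result.items()})
```

In this made-up case the tool that does nothing has the higher accuracy (0.99 against 0.982) while catching no fraud at all, which is why a single headline figure says little until the task and the measure are pinned down.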
Looking for Evidence
Claims about performance usually rest on some form of evidence. A credible statement about artificial intelligence is supported by clear information about how the system was tested.
Independent evaluations, published research, recognised benchmarks, and real world trials all provide meaningful support. For example, a reading comprehension benchmark or a driving simulation can show how a system behaves under controlled conditions. By contrast, phrases such as "industry leading accuracy" or "our internal tests show excellent results" offer very little without further detail.
Reliability often depends on who carried out the measurement and how the testing was designed.
Considering the Data
Every artificial intelligence system depends heavily on the data used to train it.
The quality, diversity, and representativeness of that data shape the system’s strengths and weaknesses. A photo classifier trained mostly on daytime images may struggle with night scenes, and a language tool trained mainly on formal writing may find slang or informal messages difficult to interpret.
When assessing a claim, it is worth asking whether the data reflects the real world situations in which the system will be used. Narrow or unrepresentative data can limit how well the system performs in real situations.
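One way to check whether a headline figure hides a weakness is to break results down by the conditions the system will actually face. The following is a minimal sketch using invented results for the photo classifier example: the records, the "day" and "night" labels, and the function accuracy_by_group are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy separately for each condition.

    records: iterable of (condition, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        correct[condition] += int(is_correct)
    return {condition: correct[condition] / totals[condition] for condition in totals}


# Hypothetical test set: 95 daytime photos and only 5 night photos.
records = (
    [("day", True)] * 90
    + [("day", False)] * 5
    + [("night", True)] * 2
    + [("night", False)] * 3
)

overall = sum(is_correct for _, is_correct in records) / len(records)
per_group = accuracy_by_group(records)

print("overall:", round(overall, 3))
for condition, accuracy in per_group.items():
    print(condition, round(accuracy, 3))
```

With these invented numbers the overall accuracy is 0.92, yet the classifier is right on only 2 of the 5 night photos. The overall figure looks healthy mainly because night scenes are barely represented in the test data.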
Recognising Limitations
All systems have limitations, and responsible companies acknowledge them.
It is helpful to look for information about situations where the system performs poorly, where it may misinterpret inputs, or where it may produce incorrect or misleading results. A voice assistant that mishears a request because of background noise is a simple example of how small changes in context can affect performance.
Balanced descriptions usually include both strengths and known limitations.
Avoiding Human-like Descriptions of AI
Marketing language sometimes presents artificial intelligence in ways that resemble human thinking.
Words such as "understands", "reasons", or "knows" can create an impression that the system possesses abilities it does not have. A system that predicts the next word in a sentence may appear to "understand" the topic, but it is following patterns rather than forming ideas.
A more accurate approach is to focus on what the system actually does, how it processes inputs, how it generates outputs, and how it behaves under different conditions.
Seeking Independent Validation
Independent evaluations often provide a clearer picture of how a system performs.
When researchers, regulators, journalists, or external auditors have examined a system, their findings provide a valuable counterbalance to promotional material.
Real world deployment is equally important. A navigation app may work perfectly in a staged demonstration, but everyday use can involve roadworks, poor signal, or unexpected detours that reveal weaknesses.
Genuine reliability is shown through consistent performance with diverse users and unpredictable inputs.
Considering the Consequences of Error
It is important to consider the consequences of error. Some tasks are low risk, while others involve significant personal, financial, or social impact.
A system used for entertainment can tolerate occasional mistakes. A music recommendation that misses the mark is usually harmless.
A system used for medical advice, financial decisions, or legal interpretation requires far stronger evidence and clear safeguards. A symptom checker that offers an overly confident suggestion illustrates how errors can matter more in high stakes settings.
The impact of errors can vary widely, so the way a system handles mistakes often shapes how it should be used.
The Importance of Transparency
Transparency and accountability are essential qualities.
Companies that provide clear explanations, publish evaluation results, describe limitations, and offer channels for feedback demonstrate a commitment to responsible practice.
Greater transparency makes it easier to understand how a system works and how its results should be interpreted. For example, a tool that explains which factors influenced a recommendation gives users a clearer sense of how to interpret the output.
A Practical Way to Judge a Claim
Taken together, these themes suggest a set of questions to ask of any claim: what is being promised, what evidence supports the promise, who carried out the evaluation, what data was used, what limitations are acknowledged, whether the system has been tested independently, how it performs outside controlled demonstrations, and what the consequences are if it fails.
This is a long list, but systems powered in some way by artificial intelligence are becoming more common and are having a larger impact on everyday life.
The better placed we all are to evaluate AI-based systems, the better.
If several of these questions cannot be answered, the claim is likely to be overstated.
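For readers who like to keep notes in a structured form, the questions above can be turned into a simple checklist. This is only a sketch: the names QUESTIONS, unanswered and looks_overstated are hypothetical, the example answers are invented, and reading "several" as three or more is an assumption made here rather than a rule.

```python
QUESTIONS = [
    "What is being promised?",
    "What evidence supports the promise?",
    "Who carried out the evaluation?",
    "What data was used?",
    "What limitations are acknowledged?",
    "Has the system been tested independently?",
    "How does it perform outside controlled demonstrations?",
    "What are the consequences if it fails?",
]


def unanswered(answers):
    """Return the questions that have no answer, or only a blank one."""
    return [q for q in QUESTIONS if not answers.get(q, "").strip()]


def looks_overstated(answers, threshold=3):
    """Flag a claim when `threshold` or more questions are unanswered.

    The default of 3 is an assumed reading of "several", not a fixed rule.
    """
    return len(unanswered(answers)) >= threshold


# Invented example: a vendor's material answers only two of the questions.
example_answers = {
    "What is being promised?": "Detects card fraud in online payments.",
    "What evidence supports the promise?": "Internal tests only; no figures given.",
}

gaps = unanswered(example_answers)
print(f"{len(gaps)} unanswered; looks overstated: {looks_overstated(example_answers)}")
for question in gaps:
    print(" -", question)
```

The sketch only records whether an answer exists. Judging whether an answer is any good, such as "internal tests only", still takes the kind of scrutiny described in the sections above.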
Conclusion
Artificial intelligence is a powerful set of technologies, but it is not magic.
Careful consideration and evaluation make it easier to distinguish genuine progress from exaggerated claims.