Programmatic Interfaces to AI Systems
We interact with AI systems through natural language. As engineers, we are used to structured and predictable interfaces such as REST or gRPC.
AI systems do not behave like that. Their outputs are probabilistic, and this creates real challenges when we try to use them as components inside software systems.
Most current models behave like chat interfaces. What we need are models that behave like reliable parts of an application.
This article explains what is currently practical and how to build interfaces that bring AI systems closer to the expectations of software engineering.
The Challenge
Large language models (LLMs) generate text by predicting the next token. They are not rules engines, parsers, or deterministic programs.
An LLM's output is a probability distribution over the next token. The distribution depends on the prompt, any conversation history you include, the model’s internal weights, and the sampling parameters.
Even with strict instructions, the model still performs this operation:
"Sample the next token from the probability distribution conditioned on the input so far."
That is probability, not logic.
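The effect of sampling can be illustrated with a toy example. The sketch below is not how any particular model is implemented; the candidate tokens and their scores are invented for illustration, and `softmax` and `sample_token` are hypothetical helper names. It shows that the same input can yield different tokens across runs, and that a lower temperature concentrates probability on the top candidate without making the others impossible.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical candidates and scores, for illustration only
tokens = ['{"', "Sure", "Here"]
logits = [2.0, 1.0, 0.5]

probs = softmax(logits)
sharper = softmax(logits, temperature=0.1)

rng = random.Random(0)  # seeded so the experiment is repeatable
draws = [sample_token(tokens, logits, rng=rng) for _ in range(1000)]
counts = {t: draws.count(t) for t in tokens}

print("probabilities:", probs)
print("at temperature 0.1:", sharper)
print("draws per token:", counts)
```

Even when the desired token (here the opening of a JSON object) is the most likely one, the conversational alternatives keep a non-zero share of the draws. Prompt constraints aim to shrink that share.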
The practical approach is to apply prompt constraints that reduce the likelihood of outputs that are not fit for purpose.
Prompt Constraints
An LLM may return a result that the calling code cannot use. Each way in which this can happen is a failure mode.
This article groups prompt constraints into eight layers. Each layer reduces the likelihood of a specific failure mode. Together, they form a structured interface between the client code and the model.
This approach makes the interaction with the model more:
- predictable
- grounded in the provided context
- structured in both input and output
- controllable through explicit constraints
Because LLMs are probabilistic, these layers cannot eliminate failure modes.
Other failure modes exist, but they are outside the scope of this section. The focus here is on the eight layers that address the most common issues.
The Eight Layers
- Identity
- Safety & Compliance
- Capability Boundaries
- Output Format
- Citation Rules
- RAG Grounding
- Reasoning Strategy
- Task Logic
1. Identity
Identity anchors the model’s role and prevents behavioural drift. Without a stable identity, the model may shift tone, adopt unintended personas, or answer outside its intended domain. This layer establishes what the model is and what it is not, providing the behavioural foundation for all the layers below.
2. Safety & Compliance
Safety and compliance constraints reduce the likelihood of harmful, disallowed, or high‑risk content. This protects users, organisations, and downstream systems, and helps keep the model within acceptable boundaries. It is essential for any public‑facing or regulated deployment.
3. Capability Boundaries
LLMs tend to overreach. They might claim abilities they do not have or fabricate tools, APIs, or actions. This layer reduces the likelihood that the model will perform operations outside its scope. It keeps the system more honest, more predictable, and aligned with its real capabilities.
4. Output Format
Programmatic systems require structured, unambiguous, machine‑readable output. This layer enforces schemas, reduces the likelihood of format drift, and helps to ensure downstream components can reliably parse responses. It helps move the model away from a conversational agent towards a dependable software component.
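Because the prompt can only reduce format drift, the calling side still has to check what it receives. The following is a minimal sketch of such a check using only the standard library. The function name `parse_model_output` and the key names `answer`, `reasoning`, and `citations` are assumptions made for illustration; the real keys come from whatever schema you give the model.

```python
import json

def parse_model_output(raw, required_keys):
    """Return (data, errors). data is None when the text is not a usable JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not a JSON object"]

    errors = []
    missing = sorted(set(required_keys) - set(data))
    extra = sorted(set(data) - set(required_keys))
    if missing:
        errors.append(f"missing keys: {missing}")
    if extra:
        errors.append(f"unexpected keys: {extra}")
    return data, errors

# Hypothetical schema keys, for illustration only
REQUIRED = ["answer", "reasoning", "citations"]

good = '{"answer": "Yes", "reasoning": "Clause 4 applies.", "citations": []}'
chatty = 'Sure! Here is the JSON: {"answer": "Yes"}'
partial = '{"answer": "Yes"}'

for sample in (good, chatty, partial):
    data, errors = parse_model_output(sample, REQUIRED)
    print("ok" if data is not None and not errors else errors)
```

A caller can use the error list to decide whether to retry, fall back, or fail the request.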
5. Citation Rules
Citation rules enforce traceability and verifiability.
This layer reduces the likelihood of fabricated sources, invented URLs, and unsupported claims. This layer is essential for any system that must justify its answers or provide evidence for its statements.
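Citation rules in the prompt can be backed by a check on the calling side. The sketch below assumes a simplified citation shape, a list of objects with `label` and `quote` keys, and a dictionary of context blocks keyed by label. Both shapes and the name `check_citations` are hypothetical and would need adapting to your own schema.

```python
def check_citations(citations, context_blocks):
    """Return a list of problems; an empty list means every quote was found verbatim."""
    problems = []
    for i, cite in enumerate(citations):
        label = cite.get("label")
        quote = cite.get("quote", "")
        if label not in context_blocks:
            problems.append(f"citation {i}: unknown label {label!r}")
        elif not quote or quote not in context_blocks[label]:
            problems.append(f"citation {i}: quote not found in {label!r}")
    return problems

# Hypothetical context and citations, for illustration only
context_blocks = {
    "contract.txt": "The tenant may not sublet the property.\n"
                    "Rent is due on the first of each month."
}
supported = [{"label": "contract.txt", "quote": "Rent is due on the first of each month."}]
unsupported = [
    {"label": "contract.txt", "quote": "Rent is due weekly."},
    {"label": "invented.txt", "quote": "The tenant may sublet."},
]

print(check_citations(supported, context_blocks))
print(check_citations(unsupported, context_blocks))
```

An exact substring match is deliberately strict: it rejects paraphrased quotes as well as invented ones.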
6. RAG Grounding
RAG grounding instructs the model to use only the supplied context as its source of truth. It damps down hallucinations by binding the model to the provided evidence. This layer is the core of retrieval‑augmented generation and is mandatory for knowledge‑grounded systems.
This approach does not eliminate hallucinations, but it does reduce them.
7. Reasoning Strategy
Reasoning strategy helps to stabilise the model’s logic. It moves towards stepwise thinking, disambiguation, and evidence‑first reasoning. This layer reduces subtle reasoning errors and improves consistency across complex tasks.
8. Task Logic
Task logic governs how the model interprets and executes user instructions. It handles ambiguity, resolves contradictions, and decomposes multi‑part tasks. This layer ensures the model behaves reliably in real‑world, messy, human‑language scenarios.
The Eight Layer Stack
These eight layers form a stack where each layer protects against a different class of LLM failure:
| Layer | Failure mode addressed |
|---|---|
| Identity | Drift, persona instability |
| Safety & Compliance | Harmful or non‑compliant output |
| Capability Boundaries | Overreach, fabricated abilities |
| Output Format | Schema breakage |
| Citation Rules | Unsupported claims |
| RAG Grounding | Hallucination |
| Reasoning Strategy | Faulty logic |
| Task Logic | Misinterpretation |
Together, they create a more controlled and predictable calling-side interface to an AI system.
The Minimal Stack
For any programmatic interaction with an LLM, three layers are essential:
- Identity
- Capability Boundaries
- Output Format
Identity prevents behavioural drift. Capability boundaries reduce the likelihood of fabricated abilities, tools, or actions. Output format constraints reduce the likelihood of schema drift, malformed JSON, and downstream parsing failures.
Drift from the required behaviour leads to calling‑side errors. These three layers reduce the likelihood of the most fundamental failure modes.
The Minimal Stack for RAG
Retrieval‑Augmented Generation (RAG) improves accuracy by supplying the model with domain‑specific and up‑to‑date information from a document store. The model uses this retrieved content to produce a grounded and human‑readable response.
In other words, RAG passes your domain data to the LLM, constrains the answer to that data, and uses the LLM's language capabilities to phrase the response. This reduces hallucinations and improves factual accuracy.
The minimal RAG stack consists of the three core layers, plus RAG Grounding and Citation Rules. This creates a five‑layer baseline for any RAG system.
These layers improve stability, reduce unsupported claims, and increase the reliability of the final output.
RAG Grounding ensures the model uses the retrieved content as its source of truth. Citation Rules reduce the likelihood of invented sources and unsupported statements.
RAG is required when:
- accuracy matters
- knowledge changes frequently
- domain‑specific expertise is required
- hallucinations are unacceptable
- answers must be auditable
- you need to integrate private or internal documents
The Minimal Stack for Public-Facing Systems
Public‑facing systems require the five‑layer RAG stack plus Safety and Compliance.
These six layers form the minimum configuration for any system exposed to real users. They address:
- behavioural stability
- safety
- overreach damping
- structured output
- evidence requirements
- grounding to damp down hallucinations
The Full 8 Layer Stack
The final two layers are Reasoning Strategy and Task Logic.
Reasoning strategy is required when:
- the model must break problems into steps
- ambiguity must be resolved before answering
- shallow or shortcut reasoning would cause errors
- the system must justify or stabilise its logic
- you want consistent reasoning across varied prompts
This layer reduces subtle reasoning failures that grounding alone cannot address.
Task Logic is required when:
- instructions are complex or multi‑part
- instructions conflict or require prioritisation
- tasks must be decomposed before execution
- the system must handle unstructured or ambiguous input
- consistent behaviour is required across varied task types
This layer helps ensure the model interprets and executes instructions correctly.
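The four stacks described above can be written down as data, so that a system selects its layers by profile rather than by hand. This is only a sketch: the profile names (`minimal`, `rag`, `public`, `full`) and the function `layers_for` are hypothetical, while the layer names and the membership of each stack follow the text.

```python
# The eight layers, in the order used throughout this article
LAYERS = [
    "Identity",
    "Safety & Compliance",
    "Capability Boundaries",
    "Output Format",
    "Citation Rules",
    "RAG Grounding",
    "Reasoning Strategy",
    "Task Logic",
]

# Each stack extends the previous one
MINIMAL = {"Identity", "Capability Boundaries", "Output Format"}
RAG = MINIMAL | {"RAG Grounding", "Citation Rules"}
PUBLIC = RAG | {"Safety & Compliance"}
FULL = set(LAYERS)

PROFILES = {"minimal": MINIMAL, "rag": RAG, "public": PUBLIC, "full": FULL}

def layers_for(profile):
    """Return the layers of a profile, kept in the article's layer order."""
    selected = PROFILES[profile]
    return [name for name in LAYERS if name in selected]

for profile in PROFILES:
    print(profile, len(layers_for(profile)), layers_for(profile))
```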
Using the Eight Layers in Code
OpenAI's API is Stateless
Note: OpenAI’s text generation requests are stateless by default. Each request contains only the context you explicitly send, so a multi‑turn conversation exists only when you include the previous messages in the request yourself. The code below does not do this, because this interaction does not need earlier queries to influence later answers.
If you do need conversation memory, OpenAI provides features such as conversation and previous_response_id (Responses API), or the Agents SDK’s session memory.
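The difference between a stateless request and a manually maintained conversation can be sketched without calling any API. The helpers `add_turn` and `build_request_messages` below are hypothetical names; the only assumption about the API is that each message is a dictionary with `role` and `content`.

```python
def add_turn(history, role, content):
    """Return a new history list with one more message appended."""
    return history + [{"role": role, "content": content}]

def build_request_messages(system_messages, history, new_question, keep_history=False):
    """Assemble the message list for a single request.

    With keep_history=False the request is independent of earlier turns.
    With keep_history=True the earlier turns are re-sent and can influence the answer.
    """
    prior = history if keep_history else []
    return system_messages + prior + [{"role": "user", "content": new_question}]

system_messages = [{"role": "system", "content": "You are a retrieval-augmented assistant."}]

history = add_turn([], "user", "Q1")
history = add_turn(history, "assistant", "A1")

stateless = build_request_messages(system_messages, history, "Q2")
with_memory = build_request_messages(system_messages, history, "Q2", keep_history=True)

print(len(stateless), len(with_memory))
```

The stateless form is the one used in the rest of this article: every request carries the layers, the context, and the question, and nothing else.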
Coding the Eight Layers
The approach here is to represent each layer as a dictionary that always has a 'role' key (set to 'system' or 'user'). The other keys are used to define a standard set of values. When passed to OpenAI's API, each dictionary is processed to build an OpenAI API-compatible dictionary which consists of just 'role' and 'content'.
'content' is constructed from the non-role values below.
We can imagine each dictionary being retrieved from a configuration store and the keys are just names for the associated value. These names enable you to discuss constraint types per layer. It is the values that become part of 'content'.
# 1. Identity Layer
system_identity = {
"role": "system",
"identity": "You are a retrieval‑augmented assistant."
}
# 2. Safety & Compliance Layer
system_safety_compliance = {
"role": "system",
# Core safety principles
"no_harm": "The assistant must not provide harmful, dangerous, or abusive content.",
"no_illegal": "The assistant must not assist with illegal activities, evasion, or wrongdoing.",
"no_personal_data": "The assistant must not request, store, or infer personal data about real individuals.",
"no_medical_advice": "The assistant must not provide medical, legal, or financial advice beyond what is explicitly allowed.",
"no_sensitive_inference": "The assistant must not infer protected attributes (race, religion, health, etc.).",
# Refusal behaviour
"refusal_style": "If a request violates safety rules, the assistant must refuse clearly and briefly.",
"refusal_format": "Refusals must be one sentence, factual, and non‑judgmental.",
"refusal_no_elaboration": "Do not provide workarounds, alternatives, or detailed explanations when refusing.",
# Compliance priority
"compliance_overrides": "Safety and compliance rules override all other instructions, including user requests.",
"no_conflicting_instructions": "If user instructions conflict with safety rules, follow safety rules."
}
# 3. Capability Boundaries Layer
system_capability_boundaries = {
"role": "system",
# Allowed capabilities
"allowed_scope": [
"Interpret user questions.",
"Use ONLY the provided context for answers.",
"Produce structured JSON according to the schema.",
"Explain reasoning based solely on the context.",
"Quote exact lines from the context when required."
],
# Disallowed capabilities
"disallowed_scope": [
"Do NOT use external knowledge.",
"Do NOT invent facts, labels, or citations.",
"Do NOT answer questions outside the provided context.",
"Do NOT perform tasks requiring tools, browsing, or external systems.",
"Do NOT generate content outside the required schema."
],
# Boundaries for reasoning
"reasoning_limits": "Reasoning must be explicit but must not include hidden steps or invented logic.",
# Boundaries for output
"format_limits": "Output must remain within the exact schema and must not include additional fields or commentary.",
# Boundaries for behaviour
"no_role_shift": "The assistant must not change persona, identity, or role unless explicitly instructed by system messages."
}
# 4. Output Format Layer
system_output_format = {
"role": "system",
"single_line_json": "Your output MUST be a SINGLE JSON object on ONE LINE ONLY.",
"schema": f"{schema_out}",
"strict_structure": "The output must follow the exact schema structure with no deviations."
}
# 5. Citation / Attribution Layer
system_citation_rules = {
"role": "system",
"label_requirement": "Every citation MUST begin with the exact Incoming Context=\"...\" label from the source.",
"quote_requirement": "Every citation MUST include the exact quoted line from that same context block.",
"no_label_omission": "Do NOT omit the Incoming Context label.",
"no_label_invention": "Do NOT invent labels.",
"no_summarisation": "Do NOT summarise lines; quote them exactly.",
"empty_citations_when_missing": "If the answer is not in the context, output an empty Citations section with correct structure."
}
# 6. RAG Grounding Layer
system_rag_grounding = {
"role": "system",
"use_context_only": "Use ONLY the provided context to answer the question.",
"no_context_no_answer": "If the answer is not in the context, explicitly say so.",
"multiple_valid_answers": "Multiple answers may be valid; include all that are supported by the context.",
"context_is_authoritative": "The provided context is the ONLY source of truth.",
"no_external_knowledge": "Do NOT use outside knowledge or assumptions.",
"answer_must_reference_context": "All answers must be derived strictly from the context block."
}
# 7. Reasoning Strategy Layer
system_reasoning_strategy = {
"role": "system",
# How to reason
"carefully_read": "First, carefully read the context and the question.",
"identify_all": "Identify all relevant passages in the context.",
"explain": "Explain, step by step, how those passages support your answer.",
"explicit": "Make your reasoning explicit, but concise.",
"no_invention": "Do not invent facts that are not in the context.",
"honesty": "The 'reasoning' field is for developers and will be logged. Be honest and explicit.",
# How reasoning connects to citations
"reasoning_field": "The reasoning field must refer only to information present in the provided context.",
"clear_explain": "Clearly explain how the quoted lines in 'citations' support the 'answer'.",
"avoid_generic": "Avoid generic phrases like 'based on the context'; be specific about which parts matter."
}
# 8. Task Logic Layer
system_task_logic = {
"role": "system",
# Instruction hierarchy
"interpretation_priority": [
"1. Follow system instructions.",
"2. Follow developer instructions.",
"3. Follow user instructions.",
"4. Follow schema and formatting rules."
],
# Ambiguity handling
"ambiguity_rules": [
"If the question is ambiguous, identify all plausible interpretations.",
"Choose the interpretation most directly supported by the context.",
"If ambiguity remains, state the ambiguity explicitly in the reasoning field."
],
# Multi‑part question handling
"multi_part_rules": [
"If the question contains multiple sub‑questions, answer each one separately.",
"If only some sub‑questions are supported by the context, answer those and state which cannot be answered."
],
# Conflict resolution
"conflict_rules": [
"If context passages contradict each other, cite both and explain the contradiction.",
"If user instructions contradict system instructions, follow system instructions.",
"If schema requirements contradict user instructions, follow schema requirements."
],
# Missing‑information behaviour
"missing_info": "If the answer is not present in the context, explicitly say so and provide an empty citations list.",
# Strict adherence
"no_overinterpretation": "Do not infer meaning beyond what is explicitly stated in the context.",
"no_assumptions": "Do not assume facts, motivations, or implications not present in the context."
}
The code above defines a set of named Python dictionaries, one per layer.
Three additional RAG user objects are also passed (as below). Between them they carry two additional pieces of data: 'context' and 'user_query'.
context contains the input for the RAG. It is the chunked result of the local search.
user_query is the prompt from the user, e.g., "are there any restrictions in this contract".
rag_user_context = {
"role": "user",
"label": "Context",
"content": f"{context}"
}
rag_user_query = {
"role": "user",
"label": "Question",
"user_query": f"{user_query}"
}
rag_user_rules = {
"role": "user",
"context_is_authoritative": "The assistant must treat the provided context as the ONLY source of truth.",
"no_external_knowledge": "The assistant must not use outside knowledge or assumptions.",
"answer_must_reference_context": "All answers must be derived strictly from the context block.",
"no_context_no_answer": "If the answer is not present in the context, the assistant must explicitly state this.",
"multiple_answers_allowed": "If multiple valid answers exist in the context, the assistant should include all of them."
}
OpenAI expects each input message to be an object with two keys: 'role' and 'content'. 'role' is one of 'user', 'system', or 'assistant'. 'content' is built by passing each of the above user and system dictionaries through to_message.
def to_message(obj):
    role = obj.get("role", "system")
    # Build content from all non-role fields
    parts = []
    for key, value in obj.items():
        if key == "role":
            continue
        # If the value is a list, join its items
        if isinstance(value, list):
            parts.append("\n".join(value))
        else:
            parts.append(str(value))
    content = "\n".join(parts).strip()
    return {"role": role, "content": content}
Before calling OpenAI, all of the objects above are added to a list.
messages = [
to_message(system_identity), # Layer 1
to_message(system_safety_compliance), # Layer 2
to_message(system_capability_boundaries), # Layer 3
to_message(system_output_format), # Layer 4
to_message(system_citation_rules), # Layer 5
to_message(system_rag_grounding), # Layer 6
to_message(system_reasoning_strategy), # Layer 7
to_message(system_task_logic), # Layer 8
# User context + question
to_message(rag_user_context),
to_message(rag_user_query),
to_message(rag_user_rules) # optional but recommended
]
A list of processed layers makes constraining the actions of the LLM straightforward. If you need a new layer, you create a new dictionary and add it to the list, as above.
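As an illustration of adding a layer, the sketch below defines a hypothetical ninth layer, `system_language`, and appends it to a message list. So that the example runs on its own, it restates to_message in condensed form; the behaviour is intended to match the helper shown earlier. The layer's keys and wording are invented for this example.

```python
def to_message(obj):
    """Condensed restatement of the article's to_message helper."""
    parts = []
    for key, value in obj.items():
        if key == "role":
            continue
        # Lists become one instruction per line; everything else is stringified
        parts.append("\n".join(value) if isinstance(value, list) else str(value))
    return {"role": obj.get("role", "system"), "content": "\n".join(parts).strip()}

# Hypothetical additional layer, for illustration only
system_language = {
    "role": "system",
    "language": "Respond in British English.",
    "tone": ["Be concise.", "Avoid marketing language."],
}

messages = [
    to_message({"role": "system", "identity": "You are a retrieval-augmented assistant."}),
]
messages.append(to_message(system_language))

for message in messages:
    print(message)
```

The key names (`language`, `tone`) never reach the model. Only the values are joined into 'content', so the keys exist purely for the people maintaining the configuration.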
The list is then passed to build_params.
def build_params(input=None, messages=None):
    params = {'model': 'gpt-5.4-nano'}
    if input is not None:
        params['input'] = input
    if messages is not None:
        params['messages'] = messages
    return params
build_params ensures we target the same model each time.
open_ai_query calls OpenAI's API. The python code calls a wrapper
like this to supply the messages list.
json_ai_user_result = open_ai_query(build_params(input=messages))
open_ai_query is:
from datetime import datetime
from openai import OpenAI

def open_ai_query(params):
    # Without a valid key, this code will not work
    client = OpenAI(api_key='<your key>')  # Substitute your OpenAI API key here
    params['input'] = clean_input(params['input'])
    response = client.responses.create(**params)
    # Record the result on params to support traceability
    params['output_text'] = response.output_text
    params['response'] = str(response)
    params['date'] = datetime.now().isoformat()
    return params['output_text']
The call to OpenAI is the line client.responses.create(**params). The value
params is passed in unpacked (**params) to provide dictionary keys as
function parameters. This is a convenient way of specifying what should be
passed to OpenAI.
params then has a number of other keys and values assigned. This is
to support traceability.
Supporting traceability will be discussed in a future article. LLM calls require more than logging and observability. They require traceability, especially when decisions are made based on LLM output. Our systems need to be able to show which model was called, when, what the reasoning was, what result was gained, and any chain of LLM calls. Logging and observability alone do not do this.
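As a rough indication of what such a record might hold, the sketch below captures the model, the time, the input, the output, and a link to a parent call so that chains of LLM calls can be reconstructed. It is an assumption about one possible shape, not the design the future article will present; `TraceRecord`, `to_log_line`, and the model name `example-model` are hypothetical.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class TraceRecord:
    model: str
    input_messages: list
    output_text: str
    parent_id: Optional[str] = None  # links this call to the call that triggered it
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def to_log_line(record):
    """Serialise a record as one JSON line, suitable for an append-only store."""
    return json.dumps(asdict(record), ensure_ascii=False)

first = TraceRecord(
    model="example-model",
    input_messages=[{"role": "user", "content": "Q1"}],
    output_text='{"answer": "A1"}',
)
follow_up = TraceRecord(
    model="example-model",
    input_messages=[{"role": "user", "content": "Q2"}],
    output_text='{"answer": "A2"}',
    parent_id=first.trace_id,
)

print(to_log_line(first))
print(to_log_line(follow_up))
```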
open_ai_query relies on clean_input which is simply this:
import codecs

def clean_input(model_input):
    try:
        return codecs.decode(model_input, "unicode_escape")
    except Exception:
        # Non-string input, such as a list of messages, cannot be decoded
        return model_input  # return what is given as best-effort.
# Escape sequences may affect your results due to model tokenisation
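It is worth seeing what clean_input does with the two kinds of input it may receive. The sketch below restates the helper so that it runs on its own, then applies it to a string containing a literal backslash followed by n, and to a list of messages. The sample values are invented for illustration.

```python
import codecs

def clean_input(model_input):
    """Same logic as the helper above, restated so this example is self-contained."""
    try:
        return codecs.decode(model_input, "unicode_escape")
    except Exception:
        return model_input

# A string holding the two characters backslash and n, not a real newline
escaped = "line one\\nline two"
decoded = clean_input(escaped)

# A list of messages cannot be decoded, so it is returned unchanged
messages = [{"role": "user", "content": "hello"}]
passed_through = clean_input(messages)

print(repr(escaped))
print(repr(decoded))
print(passed_through is messages)
```

When the whole message list is passed as the input, as in the call shown earlier, the helper therefore acts as a pass-through. If the intent is to clean each message, it would need to be applied to the individual 'content' strings.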
Increasing the number of instructions per layer
As the system prompt grows, each instruction carries less relative influence. The model processes all tokens uniformly, so important constraints can lose emphasis when surrounded by a large volume of text. Long prompts also make it harder for the model to infer priority and can hide small contradictions between layers. Clear ordering and explicit priority rules help reduce this effect.
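One way to keep an eye on this effect is to measure how much text each layer contributes. The sketch below counts characters per layer as a crude proxy; it is not a tokeniser, and real token counts depend on the model. The function names `layer_sizes` and `share_of_prompt` and the two sample layers are hypothetical.

```python
def layer_sizes(layers):
    """Map each layer name to the character count of its instruction text."""
    sizes = {}
    for name, layer in layers.items():
        total = 0
        for key, value in layer.items():
            if key == "role":
                continue
            items = value if isinstance(value, list) else [str(value)]
            total += sum(len(item) for item in items)
        sizes[name] = total
    return sizes

def share_of_prompt(sizes):
    """Express each layer's size as a fraction of the whole prompt."""
    total = sum(sizes.values()) or 1
    return {name: round(size / total, 2) for name, size in sizes.items()}

# Tiny hypothetical layers, for illustration only
layers = {
    "identity": {"role": "system", "identity": "abcd"},
    "rules": {"role": "system", "rules": ["ab", "cd", "ef"]},
}

sizes = layer_sizes(layers)
print(sizes)
print(share_of_prompt(sizes))
```

If a layer that carries a critical constraint accounts for only a small share of the prompt, that is a hint to state its priority explicitly or to trim the layers around it.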
Instruction Collisions
When multiple layers contain overlapping or conflicting instructions, the LLM must resolve the conflict using the text alone. The final system message that it sees tends to take precedence, but subtle inconsistencies can weaken the intended behaviour. Ensuring that layers do not contradict each other and that priority is stated explicitly reduces this risk.
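Overlap between layers can be partly detected before the prompt is sent. The sketch below finds instructions whose text appears in more than one layer, after normalising case and whitespace. It catches only literal repetition, not instructions that contradict each other in different words. The function name `find_repeated_instructions` is hypothetical; the two sample layers are cut-down versions of the RAG grounding and RAG user rules dictionaries shown earlier.

```python
from collections import defaultdict

def find_repeated_instructions(layers):
    """Return {instruction text: [layer names]} for text found in more than one layer."""
    seen = defaultdict(list)
    for name, layer in layers.items():
        for key, value in layer.items():
            if key == "role":
                continue
            items = value if isinstance(value, list) else [value]
            for item in items:
                normalised = " ".join(str(item).lower().split())
                if name not in seen[normalised]:
                    seen[normalised].append(name)
    return {text: names for text, names in seen.items() if len(names) > 1}

layers = {
    "system_rag_grounding": {
        "role": "system",
        "context_is_authoritative": "The provided context is the ONLY source of truth.",
        "answer_must_reference_context": "All answers must be derived strictly from the context block.",
    },
    "rag_user_rules": {
        "role": "user",
        "no_external_knowledge": "The assistant must not use outside knowledge or assumptions.",
        "answer_must_reference_context": "All answers must be derived strictly from the context block.",
    },
}

repeated = find_repeated_instructions(layers)
print(repeated)
```

Repetition is not always a fault, since restating a rule in the user message can be deliberate. Listing the repeats makes that choice visible rather than accidental.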
Conclusion
LLMs Require Structured Interfaces
LLMs do not behave like deterministic software components. They generate tokens based on probability, which means natural‑language prompts alone are not a stable or reliable interface.
Layered Constraints Improve Reliability
A layered constraint model is necessary to reduce common failure modes. Identity, Capability Boundaries, and Output Format form the minimal stack for programmatic use. RAG systems require additional grounding and citation layers. Public‑facing systems require safety controls. Full reasoning systems benefit from all eight layers.
RAG Provides Essential Grounding
RAG supplies the model with domain‑specific and current information. It reduces hallucinations and improves factual accuracy, but it still requires constraints to ensure the model uses retrieved content correctly.
Prompt Length and Consistency Matter
As system prompts grow, individual instructions lose emphasis. Clear ordering and explicit priority rules help maintain consistent behaviour. Avoiding contradictory instructions is essential for predictable output.
Failure Modes Can Be Reduced, Not Removed
LLMs remain probabilistic. Constraints reduce the likelihood of errors but cannot eliminate them. Treating the prompt as a structured interface, rather than a single instruction, produces more predictable, testable, and maintainable systems.
Related Work
- LLMs can generate code, but they cannot modify or maintain systems because system‑level work requires causal reasoning, not pattern‑matching.
- Software engineers must understand tokens, structure, and probabilistic behaviour to build reliable systems and avoid mismatches between test and production behaviour.
- Evaluating AI systems requires measuring real behaviour — schema reliability, adherence, drift, latency, retrieval quality, and safety — not synthetic benchmarks.