Everyone has used ChatGPT. Almost every engineering team is now being asked to "add AI" to something. But there's a wide gap between using an LLM and integrating one into a system that runs reliably in production.
This post is about that gap.
I've spent the past year building and maintaining systems that rely on LLMs — pipelines with structured outputs, RAG-backed knowledge bases, function-calling agents, and multi-model routing layers. The demo always looks clean. Production rarely does.
Here's what I've learned.
---
The Integration Patterns You'll Actually Use
Before writing a single line of code, you need to understand how LLMs fit into your system. There are four dominant patterns, and choosing the wrong one early is expensive.
1. Completion — Direct Prompt, Direct Output
The simplest pattern. You send a prompt, you get text back. Good for tasks where the output is inherently unstructured — drafting, summarizing, classifying, translating.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarize this support ticket in one sentence: ..."}
    ],
)
print(response.content[0].text)
The pitfall: you'll be tempted to use this for everything. Don't. As soon as downstream code needs to act on the output, you need structured output — not string parsing.
2. Structured Output — LLMs as Data Extractors
When the output feeds into a database, an API call, or a conditional branch, you need a predictable structure. Define the schema as a Pydantic model, instruct the LLM to produce JSON that matches it, and validate whatever comes back.
from anthropic import Anthropic
from pydantic import BaseModel
import json

client = Anthropic()

class TicketClassification(BaseModel):
    category: str
    priority: str  # low | medium | high | critical
    summary: str
    requires_escalation: bool

def classify_ticket(ticket_text: str) -> TicketClassification:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=(
            "You are a support ticket classifier. Always respond with valid JSON "
            'matching this schema exactly: {"category": string, "priority": string, '
            '"summary": string, "requires_escalation": boolean}'
        ),
        messages=[{"role": "user", "content": ticket_text}],
    )
    raw = response.content[0].text
    # json.loads raises on malformed JSON; Pydantic raises on a schema mismatch.
    return TicketClassification(**json.loads(raw))
Add retry logic here. LLMs occasionally produce malformed JSON, for example when the output is cut off at max_tokens or the context is long. Validate, retry once with a corrective prompt, then fail loudly.
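The validate, retry once, fail loudly sequence can be sketched as below. This is a minimal illustration under a few assumptions: `Ticket`, `parse_ticket`, and `classify_with_retry` are hypothetical names, a plain dataclass stands in for the Pydantic model so the sketch stays dependency-free, and `ask_model` is any function that takes a message list and returns the model's text.

```python
import json
from dataclasses import dataclass
from typing import Callable

ALLOWED_PRIORITIES = ("low", "medium", "high", "critical")

@dataclass
class Ticket:  # hypothetical stand-in for the Pydantic model
    category: str
    priority: str
    summary: str
    requires_escalation: bool

def parse_ticket(raw: str) -> Ticket:
    """Parse and validate model output; raise ValueError on any problem."""
    try:
        ticket = Ticket(**json.loads(raw))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, a non-object payload, or missing/extra keys.
        raise ValueError(f"invalid ticket JSON: {exc}") from exc
    if ticket.priority not in ALLOWED_PRIORITIES:
        raise ValueError(f"unknown priority: {ticket.priority!r}")
    if not isinstance(ticket.requires_escalation, bool):
        raise ValueError("requires_escalation must be a boolean")
    return ticket

def classify_with_retry(ask_model: Callable[[list[dict]], str], ticket_text: str) -> Ticket:
    """Validate, retry once with a corrective prompt, then fail loudly."""
    messages = [{"role": "user", "content": ticket_text}]
    raw = ask_model(messages)
    try:
        return parse_ticket(raw)
    except ValueError as first_error:
        # Show the model its own output together with the validation error.
        messages += [
            {"role": "assistant", "content": raw},
            {"role": "user", "content": f"That was not valid: {first_error}. "
                                        "Reply with the corrected JSON only."},
        ]
    # Second and final attempt: any ValueError propagates to the caller.
    return parse_ticket(ask_model(messages))
```

Passing the model call in as a function keeps the retry logic testable without network access.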
3. RAG — Retrieval-Augmented Generation
When the model needs to reason over your private data — documents, knowledge bases, support histories — you can't fine-tune on every update. RAG is the practical answer: retrieve relevant context at query time, inject it into the prompt.
from anthropic import Anthropic
from sentence_transformers import SentenceTransformer
import numpy as np

client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_context(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # Normalized embeddings make the dot product equal to cosine similarity.
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    # Re-embedding every chunk per query is fine for a demo; precompute and store in production.
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    scores = np.dot(chunk_vecs, query_vec)
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [chunks[i] for i in top_indices]

def answer_with_context(query: str, knowledge_chunks: list[str]) -> str:
    context = retrieve_context(query, knowledge_chunks)
    context_block = "\n\n".join(f"[Source {i+1}]: {c}" for i, c in enumerate(context))
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="Answer questions using only the provided sources. Cite source numbers. Say 'I don't know' if the answer isn't in the sources.",
        messages=[
            {"role": "user", "content": f"Sources:\n{context_block}\n\nQuestion: {query}"}
        ],
    )
    return response.content[0].text
The quality of a RAG system lives in the retrieval layer, not the generation layer. Poor chunking strategy, weak embeddings, or a shallow vector store will produce hallucinations that look confident. That's worse than no AI at all.
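Chunking is the first place retrieval quality is won or lost. As a rough sketch of one common approach, the helper below splits text into fixed-size word windows that overlap, so a sentence straddling a boundary still appears whole in at least one chunk. The function name and the default sizes are illustrative assumptions, not recommendations.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `size` words; consecutive chunks share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Splitting on headings or paragraphs usually works better than raw word counts when the documents have structure; this version is only a baseline to evaluate against.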
4. Tool Use — LLMs as Orchestrators
The most powerful pattern, and the most complex. The model decides what actions to take, calls your functions, observes results, and continues reasoning. This is the foundation of agents.
from anthropic import Anthropic
import json

client = Anthropic()

tools = [
    {
        "name": "get_order_status",
        "description": "Fetch current status of a customer order by order ID.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The order identifier"}
            },
            "required": ["order_id"]
        }
    },
    {
        "name": "issue_refund",
        "description": "Issue a partial or full refund for an order.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount": {"type": "number", "description": "Refund amount in USD"},
                "reason": {"type": "string"}
            },
            "required": ["order_id", "amount", "reason"]
        }
    }
]

def handle_tool_call(tool_name: str, tool_input: dict) -> str:
    if tool_name == "get_order_status":
        return json.dumps({"status": "shipped", "eta": "2026-05-08"})
    if tool_name == "issue_refund":
        return json.dumps({"success": True, "refund_id": "ref_abc123"})
    return json.dumps({"error": "unknown tool"})
def run_support_agent(user_message: str, max_turns: int = 10) -> str:
    messages = [{"role": "user", "content": user_message}]
    # Bound the loop so a confused model cannot call tools forever.
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        # Any stop reason other than "tool_use" (end_turn, max_tokens, ...) ends the loop;
        # continuing would send an empty tool_result message back to the API.
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = handle_tool_call(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
    raise RuntimeError("agent did not finish within max_turns")
Design your tools with the principle of least privilege. The model calls what you expose. Handing an LLM a raw database query tool is as dangerous as exposing that database through an unauthenticated API endpoint.
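One way to apply least privilege is to route every tool call through an explicit registry and enforce business limits inside the handlers rather than in the prompt. The sketch below assumes stubbed handlers and a hypothetical policy cap (`MAX_AUTO_REFUND_USD`); `TOOL_REGISTRY` and `dispatch` are illustrative names.

```python
import json
from typing import Callable

MAX_AUTO_REFUND_USD = 50.0  # hypothetical policy limit, not from any real system

def get_order_status(order_id: str) -> dict:
    return {"status": "shipped"}  # stub: a real handler would do a scoped lookup

def issue_refund(order_id: str, amount: float, reason: str) -> dict:
    # The limit lives in code, so no prompt can talk the model past it.
    if not 0 < amount <= MAX_AUTO_REFUND_USD:
        return {"error": "refund needs human approval"}
    return {"success": True}

# Only functions registered here are callable by the model.
TOOL_REGISTRY: dict[str, Callable[..., dict]] = {
    "get_order_status": get_order_status,
    "issue_refund": issue_refund,
}

def dispatch(tool_name: str, tool_input: dict) -> str:
    handler = TOOL_REGISTRY.get(tool_name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {tool_name}"})
    try:
        return json.dumps(handler(**tool_input))
    except TypeError:
        # Missing, unexpected, or wrongly typed arguments from the model.
        return json.dumps({"error": "invalid arguments"})
```

Returning errors as tool results, instead of raising, lets the model see the refusal and explain it to the user.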
---
The Production Problems No One Demos
Latency Is a First-Class Concern
LLM calls are slow by HTTP standards — 1 to 10 seconds for a typical response, more for long outputs. If you're blocking a user request on a synchronous LLM call, you're building a bad product.
Patterns that help:
Use stream=True and pipe tokens to the client as they arrive. The perceived latency drops dramatically even if total time doesn't.
Use asyncio and an async client.
import hashlib

import redis.asyncio as redis  # async client, so cache lookups don't block the event loop
from anthropic import AsyncAnthropic

cache = redis.Redis()
client = AsyncAnthropic()

def prompt_cache_key(system: str, user: str) -> str:
    # Cache responses keyed on a hash of the full prompt.
    return hashlib.sha256(f"{system}::{user}".encode()).hexdigest()

async def cached_complete(system: str, user: str, ttl: int = 3600) -> str:
    key = prompt_cache_key(system, user)
    cached = await cache.get(key)
    if cached is not None:
        return cached.decode()
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    result = response.content[0].text
    await cache.setex(key, ttl, result)
    return result
Cost Compounds Fast
At scale, token costs are real. A few practices that matter:
Reliability and Retry Strategy
LLMs are probabilistic infrastructure. They time out. They return malformed outputs. They occasionally just... refuse. Build for this.
import asyncio
import random

from anthropic import AsyncAnthropic, APIStatusError, APITimeoutError

client = AsyncAnthropic()

async def resilient_call(prompt: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            response = await client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
                timeout=30.0,  # hard per-request timeout
            )
            return response.content[0].text
        except APITimeoutError:
            if attempt == retries - 1:
                raise
        except APIStatusError as e:
            # Only 429 (rate limit) and 529 (overloaded) are retried.
            # APIStatusError always carries status_code; the APIError base class does not.
            if e.status_code not in (429, 529) or attempt == retries - 1:
                raise
        # Exponential backoff with jitter before the next attempt.
        await asyncio.sleep(2 ** attempt + random.uniform(0, 1))
    raise RuntimeError("LLM call failed after retries")
Exponential backoff with jitter on 429s. Hard timeout on every call. Never let an LLM call block indefinitely.
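For the jitter part, a common variant is "full jitter": pick a random delay between zero and the capped exponential ceiling, so that many clients retrying at once don't wake up in lockstep. A minimal sketch, where the function name and the `base` and `cap` defaults are illustrative assumptions:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

In a retry loop this would replace a fixed `2 ** attempt` sleep, for example `await asyncio.sleep(backoff_delay(attempt))`.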
Observability
If you can't see what your model is doing in production, you're operating blind. At minimum, log the model, input and output token counts, latency, and stop reason for every call.
Structure these as events, not text logs. You'll want to query them.
import time
import logging

from anthropic import AsyncAnthropic

client = AsyncAnthropic()
logger = logging.getLogger("llm.calls")

async def observed_call(prompt: str, system: str = "") -> str:
    start = time.monotonic()
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.monotonic() - start
    # One event per call, with fields you can filter and aggregate on.
    logger.info({
        "event": "llm_call",
        "model": response.model,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "latency_ms": round(elapsed * 1000),
        "stop_reason": response.stop_reason,
    })
    return response.content[0].text
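With the standard library's default formatter, a dict passed to logger.info is rendered as its Python repr, which is awkward to query. One option, sketched here with a hypothetical `JsonEventFormatter` class, is a formatter that writes each record as a single JSON line; the sample values in the last line are made up.

```python
import json
import logging

class JsonEventFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        # Dict messages become top-level fields; anything else goes under "message".
        if isinstance(record.msg, dict):
            payload = record.msg
        else:
            payload = {"message": record.getMessage()}
        event = {"logger": record.name, "level": record.levelname, **payload}
        return json.dumps(event, default=str)  # default=str keeps odd values loggable

handler = logging.StreamHandler()
handler.setFormatter(JsonEventFormatter())
events = logging.getLogger("llm.calls")
events.addHandler(handler)
events.setLevel(logging.INFO)

# Illustrative values only.
events.info({"event": "llm_call", "latency_ms": 812, "input_tokens": 120})
```

Most log pipelines can ingest JSON lines directly, which makes per-model latency and token queries straightforward.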
---
When Not to Use an LLM
This matters as much as knowing when to use one.
Don't use an LLM when a rule will do. If you can write an if statement that handles the case correctly, write the if statement. It's faster, cheaper, predictable, and testable.
Don't use an LLM for tasks that require exact recall. LLMs hallucinate. If a user asks for the exact price of an item, query your database. Don't ask a model that might confabulate a plausible-sounding number.
Don't use an LLM in your hot path without caching. Adding 3 seconds of LLM latency to an endpoint that used to return in 50ms is a regression, even if the output is impressive.
Don't use an LLM when you don't have evaluation. If you can't measure whether the model's output is good, you can't improve it and you can't catch regressions. Build evals before you ship, not after.
---
The Eval Problem
Evaluation is the unsolved hard problem of AI engineering. You need it. Most teams skip it. Then they push a prompt change and something breaks in production two weeks later with no one knowing why.
At minimum, build a regression suite of representative inputs with expected outputs and run it on every prompt change.
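from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected_category: str

def run_eval(cases: list[EvalCase], model_fn: Callable[[str], str]) -> float:
    """Return the fraction of cases whose output mentions the expected category."""
    if not cases:
        raise ValueError("no eval cases provided")
    correct = 0
    for case in cases:
        result = model_fn(case.input)
        # Case-insensitive substring match: crude, but enough to catch regressions.
        if case.expected_category.lower() in result.lower():
            correct += 1
    return correct / len(cases)

# Run this in CI before any prompt or model change ships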
For open-ended outputs where correctness isn't binary, use an LLM-as-judge pattern — a secondary model that scores outputs against a rubric. It's imperfect, but it's better than nothing.
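A rough sketch of the judge pattern follows. Everything here is an assumption for illustration: the rubric wording, the 1 to 5 scale, and the names `build_judge_prompt`, `parse_score`, and `judge_outputs`. The `judge` argument is any function that sends a prompt to the secondary model and returns its reply as text.

```python
import re
from typing import Callable

# Hypothetical rubric; a real one should describe what each score means.
RUBRIC = (
    "Rate the answer from 1 (unusable) to 5 (excellent) for accuracy and clarity. "
    "Reply with the number only."
)

def build_judge_prompt(question: str, answer: str) -> str:
    return f"{RUBRIC}\n\nQuestion: {question}\n\nAnswer: {answer}"

def parse_score(reply: str) -> int:
    """Extract the first standalone digit 1-5 from the judge's reply."""
    match = re.search(r"\b([1-5])\b", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return int(match.group(1))

def judge_outputs(pairs: list[tuple[str, str]], judge: Callable[[str], str]) -> float:
    """Return the mean rubric score over (question, answer) pairs."""
    if not pairs:
        raise ValueError("no outputs to judge")
    scores = [parse_score(judge(build_judge_prompt(q, a))) for q, a in pairs]
    return sum(scores) / len(scores)
```

Tracking the mean score across prompt versions gives a trend line even when individual judgments are noisy.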
---
The Architecture That Scales
When LLM features start multiplying across a codebase, you need a thin abstraction layer — not a framework, just a seam.
Centralize:
Keep prompt logic in the feature code. The abstraction layer should not know about your domain; your domain code should not know about retry backoff.
The goal is to make swapping models, tweaking timeouts, or adding a cache transparent to feature teams. You will need all three of those things. The abstraction costs one afternoon and saves weeks later.
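As a sketch of how small that seam can be, the class below owns the model name, timeout, retries, and cache, while feature code only supplies prompts. The names `LLMGateway` and `Transport` are hypothetical, the in-memory dict stands in for a real cache, and the transport is injected so the layer can be tested without network access.

```python
import hashlib
import time
from dataclasses import dataclass, field
from typing import Callable

# (model, system, user, timeout_s) -> text. In production this would wrap the SDK call.
Transport = Callable[[str, str, str, float], str]

@dataclass
class LLMGateway:
    transport: Transport
    model: str = "claude-sonnet-4-6"
    timeout_s: float = 30.0
    retries: int = 3
    sleep: Callable[[float], None] = time.sleep
    _cache: dict[str, str] = field(default_factory=dict, init=False, repr=False)

    def complete(self, system: str, user: str) -> str:
        # The model is part of the key so a model swap never serves stale answers.
        key = hashlib.sha256(f"{self.model}::{system}::{user}".encode()).hexdigest()
        if key in self._cache:
            return self._cache[key]
        last_error = None
        for attempt in range(self.retries):
            try:
                text = self.transport(self.model, system, user, self.timeout_s)
            except TimeoutError as exc:
                last_error = exc
                if attempt < self.retries - 1:
                    self.sleep(2 ** attempt)  # backoff lives here, not in feature code
            else:
                self._cache[key] = text
                return text
        raise RuntimeError("LLM call failed after retries") from last_error
```

Feature code then calls `gateway.complete(system_prompt, user_text)` and never sees retry or cache details; changing the model or timeout is a one-line configuration change.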
---
Final Thought
LLM integration is a genuine engineering discipline. It has reliability concerns, cost dynamics, latency constraints, and evaluation challenges that don't exist in conventional software. Most of them are solvable with the same tools engineers already use — good abstraction, logging, caching, retries, and tests.
What's different is the nondeterminism. You're shipping a system whose behavior you can influence but not fully control. That demands more rigor around evaluation and observability, not less.
Build as if the model will occasionally lie, time out, or refuse. Because it will.