Yudi Nugraha

Everyone has used ChatGPT. Almost every engineering team is now being asked to "add AI" to something. But there's a wide gap between using an LLM and integrating one into a system that runs reliably in production.

This post is about that gap.

I've spent the past year building and maintaining systems that rely on LLMs — pipelines with structured outputs, RAG-backed knowledge bases, function-calling agents, and multi-model routing layers. The demo always looks clean. Production rarely does.

Here's what I've learned.

---

The Integration Patterns You'll Actually Use

Before writing a single line of code, you need to understand how LLMs fit into your system. There are four dominant patterns, and choosing the wrong one early is expensive.

1. Completion — Direct Prompt, Direct Output

The simplest pattern. You send a prompt, you get text back. Good for tasks where the output is inherently unstructured — drafting, summarizing, classifying, translating.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarize this support ticket in one sentence: ..."}
    ]
)

print(response.content[0].text)

The pitfall: you'll be tempted to use this for everything. Don't. As soon as downstream code needs to act on the output, you need structured output — not string parsing.

2. Structured Output — LLMs as Data Extractors

When the output feeds into a database, an API call, or a conditional branch, you need a predictable structure. Define the schema as a Pydantic model, instruct the LLM to produce matching JSON, and validate the response against the model before anything downstream touches it.

from anthropic import Anthropic
from pydantic import BaseModel
import json

client = Anthropic()

class TicketClassification(BaseModel):
    category: str
    priority: str  # low | medium | high | critical
    summary: str
    requires_escalation: bool

def classify_ticket(ticket_text: str) -> TicketClassification:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system="""You are a support ticket classifier. Always respond with valid JSON 
matching this schema exactly: {"category": string, "priority": string, 
"summary": string, "requires_escalation": boolean}""",
        messages=[{"role": "user", "content": ticket_text}]
    )

    raw = response.content[0].text
    return TicketClassification(**json.loads(raw))

Add retry logic here. LLMs occasionally produce malformed JSON, or wrap valid JSON in Markdown fences or extra prose, and the risk grows with long context. Validate, retry once with a corrective prompt, then fail loudly.
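
The validate, retry once, fail loudly sequence can be sketched as follows. This is a minimal illustration, not the classifier above: `call_model` is a hypothetical callable standing in for the API call, and a plain dataclass replaces the Pydantic model so the sketch has no dependencies.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical signature: takes the chat messages, returns the model's raw text.
ModelFn = Callable[[list[dict]], str]

ALLOWED_PRIORITIES = {"low", "medium", "high", "critical"}

@dataclass
class Ticket:
    category: str
    priority: str
    summary: str
    requires_escalation: bool

def parse_ticket(raw: str) -> Ticket:
    """Parse and validate raw model output; raise ValueError on any problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    try:
        ticket = Ticket(**data)
    except TypeError as exc:  # missing or unexpected keys
        raise ValueError(f"schema mismatch: {exc}") from exc
    if ticket.priority not in ALLOWED_PRIORITIES:
        raise ValueError(f"unknown priority: {ticket.priority!r}")
    if not isinstance(ticket.requires_escalation, bool):
        raise ValueError("requires_escalation must be a boolean")
    return ticket

def classify_with_retry(ticket_text: str, call_model: ModelFn) -> Ticket:
    messages = [{"role": "user", "content": ticket_text}]
    raw = call_model(messages)
    try:
        return parse_ticket(raw)
    except ValueError as first_error:
        # One corrective turn: show the model its own output and the error.
        messages += [
            {"role": "assistant", "content": raw},
            {"role": "user", "content": (
                f"That reply was not valid: {first_error}. "
                "Reply with only the corrected JSON object."
            )},
        ]
    # A second failure propagates to the caller: fail loudly.
    return parse_ticket(call_model(messages))
```

Capping the loop at one corrective turn keeps the worst-case cost at two calls per ticket; anything that still fails is better surfaced as an error than silently patched.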

3. RAG — Retrieval-Augmented Generation

When the model needs to reason over your private data — documents, knowledge bases, support histories — you can't fine-tune on every update. RAG is the practical answer: retrieve relevant context at query time, inject it into the prompt.

from anthropic import Anthropic
from sentence_transformers import SentenceTransformer
import numpy as np

client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_context(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
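    # Normalize so the dot product below is cosine similarity.
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    # Re-embedding every chunk per query is fine for a demo; precompute and store these in production.
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    scores = np.dot(chunk_vecs, query_vec)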
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [chunks[i] for i in top_indices]

def answer_with_context(query: str, knowledge_chunks: list[str]) -> str:
    context = retrieve_context(query, knowledge_chunks)
    context_block = "\n\n".join(f"[Source {i+1}]: {c}" for i, c in enumerate(context))

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="Answer questions using only the provided sources. Cite source numbers. Say 'I don't know' if the answer isn't in the sources.",
        messages=[
            {"role": "user", "content": f"Sources:\n{context_block}\n\nQuestion: {query}"}
        ]
    )
    return response.content[0].text

The quality of a RAG system lives in the retrieval layer, not the generation layer. Poor chunking strategy, weak embeddings, or a shallow vector store will produce hallucinations that look confident. That's worse than no AI at all.
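
Chunking is the first retrieval decision that tends to go wrong. As one possible starting point (an assumption, not a recommendation from any particular library), here is a word-window chunker with overlap, so that a sentence cut at a boundary still appears whole in the neighboring chunk. The `size` and `overlap` defaults are illustrative.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Splitting on words ignores document structure; splitting on headings or paragraphs first and falling back to windows for long sections is often a better fit for knowledge-base content.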

4. Tool Use — LLMs as Orchestrators

The most powerful pattern, and the most complex. The model decides what actions to take, calls your functions, observes results, and continues reasoning. This is the foundation of agents.

from anthropic import Anthropic
import json

client = Anthropic()

tools = [
    {
        "name": "get_order_status",
        "description": "Fetch current status of a customer order by order ID.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The order identifier"}
            },
            "required": ["order_id"]
        }
    },
    {
        "name": "issue_refund",
        "description": "Issue a partial or full refund for an order.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount": {"type": "number", "description": "Refund amount in USD"},
                "reason": {"type": "string"}
            },
            "required": ["order_id", "amount", "reason"]
        }
    }
]

def handle_tool_call(tool_name: str, tool_input: dict) -> str:
    if tool_name == "get_order_status":
        return json.dumps({"status": "shipped", "eta": "2026-05-08"})
    if tool_name == "issue_refund":
        return json.dumps({"success": True, "refund_id": "ref_abc123"})
    return json.dumps({"error": "unknown tool"})

def run_support_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )

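        # Any stop reason other than "tool_use" (end_turn, max_tokens, ...) ends the loop;
        # otherwise a truncated reply would send an empty tool_result turn back to the model.
        if response.stop_reason != "tool_use":
            return next((b.text for b in response.content if b.type == "text"), "")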

        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = handle_tool_call(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

Design your tools with the principle of least privilege. The model calls what you expose. Exposing a raw database query tool to an LLM is as dangerous as exposing it to an unauthenticated API endpoint.
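
One way to apply least privilege at the dispatch layer is sketched below: an explicit allowlist plus argument checks that run before any handler does. The handlers are stubs, and `MAX_AUTO_REFUND_USD` is a hypothetical policy limit invented for the example.

```python
import json
from typing import Callable

MAX_AUTO_REFUND_USD = 50.0  # hypothetical policy limit

def get_order_status(order_id: str) -> dict:
    return {"status": "shipped"}  # stub standing in for a real lookup

def issue_refund(order_id: str, amount: float, reason: str) -> dict:
    return {"success": True}  # stub standing in for a real payment call

# Explicit allowlist: the model can reach only what is registered here.
TOOL_REGISTRY: dict[str, Callable[..., dict]] = {
    "get_order_status": get_order_status,
    "issue_refund": issue_refund,
}

def guarded_tool_call(tool_name: str, tool_input: dict) -> str:
    handler = TOOL_REGISTRY.get(tool_name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {tool_name}"})
    if tool_name == "issue_refund":
        amount = tool_input.get("amount")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
            return json.dumps({"error": "amount must be a positive number"})
        if amount > MAX_AUTO_REFUND_USD:
            return json.dumps({"error": "amount exceeds automatic limit; needs human approval"})
    try:
        return json.dumps(handler(**tool_input))
    except TypeError as exc:  # missing or unexpected arguments
        return json.dumps({"error": f"bad arguments: {exc}"})
```

Returning errors as tool results, rather than raising, lets the model see why a call was rejected and explain it to the user instead of crashing the loop.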

---

The Production Problems No One Demos

Latency Is a First-Class Concern

LLM calls are slow by HTTP standards — 1 to 10 seconds for a typical response, more for long outputs. If you're blocking a user request on a synchronous LLM call, you're building a bad product.

Patterns that help:

Streaming: Use stream=True and pipe tokens to the client as they arrive. The perceived latency drops dramatically even if total time doesn't.

Async first: Never run LLM calls synchronously on a thread that could serve other requests. Use asyncio and an async client.

Caching: Identical prompts should hit a cache, not the model. A Redis layer with prompt hashing handles this cheaply.

import hashlib
import redis.asyncio as redis  # async client, so cache I/O doesn't block the event loop
from anthropic import AsyncAnthropic

MODEL = "claude-sonnet-4-6"

cache = redis.Redis()
client = AsyncAnthropic()

def prompt_cache_key(model: str, system: str, user: str) -> str:
    # Include the model so a model swap never serves stale answers.
    return hashlib.sha256(f"{model}::{system}::{user}".encode()).hexdigest()

async def cached_complete(system: str, user: str, ttl: int = 3600) -> str:
    key = prompt_cache_key(MODEL, system, user)
    cached = await cache.get(key)
    if cached is not None:
        return cached.decode()

    response = await client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user}]
    )
    result = response.content[0].text
    await cache.setex(key, ttl, result)
    return result

Cost Compounds Fast

At scale, token costs are real. A few practices that matter:

Prompt compression: Trim system prompts ruthlessly. Every token in a repeated system prompt is multiplied by your request volume.

Model routing: Use a smaller, cheaper model for tasks that don't need full capability — classification, intent detection, entity extraction. Reserve the larger model for reasoning-heavy tasks.

Context window discipline: Don't stuff the entire conversation history into every request. Summarize older turns. The LLM doesn't need perfect recall of 50 messages to answer the current question.
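
Context window discipline can be sketched as a small compaction step: keep the most recent turns verbatim and replace everything older with a summary. In this sketch `summarize` is a hypothetical callable (in practice probably a call to a cheaper model), and `keep_last` is an illustrative default.

```python
from typing import Callable

Message = dict[str, str]

def compact_history(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
    keep_last: int = 6,
) -> list[Message]:
    """Return the last `keep_last` messages, preceded by a summary of the older ones."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if len(messages) <= keep_last:
        return list(messages)  # nothing to compact
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {
        "role": "user",
        "content": f"Summary of earlier conversation: {summarize(older)}",
    }
    return [summary, *recent]
```

The summary is placed in a user turn so the history still begins with a user message; putting it in the system prompt is an equally reasonable choice.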

Reliability and Retry Strategy

LLMs are probabilistic infrastructure. They time out. They return malformed outputs. They occasionally just... refuse. Build for this.

import asyncio
import random
from anthropic import AsyncAnthropic, APIStatusError, APITimeoutError

# The SDK also retries some failures on its own (max_retries); tune it to avoid stacked retries.
client = AsyncAnthropic()

async def resilient_call(prompt: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            response = await client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
                timeout=30.0
            )
            return response.content[0].text
        except APITimeoutError:
            if attempt == retries - 1:
                raise
        except APIStatusError as e:
            # Only status errors carry status_code; retry rate limits (429) and overload (529).
            if e.status_code not in (429, 529) or attempt == retries - 1:
                raise
        # Exponential backoff plus up to 1s of jitter before the next attempt.
        await asyncio.sleep(2 ** attempt + random.random())
    raise RuntimeError("LLM call failed after retries")

Exponential backoff with jitter on 429s. Hard timeout on every call. Never let an LLM call block indefinitely.

Observability

If you can't see what your model is doing in production, you're operating blind. At minimum, log:

Full prompt and response for every LLM call

Input and output token counts

Latency per call

Model and version used

Whether the call hit a cache

Any retry attempts

Structure these as events, not text logs. You'll want to query them.

import time
import logging
from anthropic import AsyncAnthropic

client = AsyncAnthropic()
logger = logging.getLogger("llm.calls")

async def observed_call(prompt: str, system: str = "") -> str:
    start = time.monotonic()
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}]
    )
    elapsed = time.monotonic() - start

    logger.info({
        "event": "llm_call",
        "model": response.model,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "latency_ms": round(elapsed * 1000),
        "stop_reason": response.stop_reason,
    })

    return response.content[0].text
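
To make these logs queryable as events, one option is a typed event plus a formatter that writes one JSON object per line. The sketch below uses only the standard library; the `LLMCallEvent` and `JsonEventFormatter` names and the field set are assumptions for illustration, and prompt and response fields could be added the same way.

```python
import json
import logging
from dataclasses import asdict, dataclass

@dataclass
class LLMCallEvent:
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    cache_hit: bool = False
    retries: int = 0
    event: str = "llm_call"

class JsonEventFormatter(logging.Formatter):
    """Render LLMCallEvent records (and plain messages) as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        if isinstance(record.msg, LLMCallEvent):
            payload = asdict(record.msg)
        else:
            payload = {"message": record.getMessage()}
        payload["level"] = record.levelname
        return json.dumps(payload)

def build_event_logger(name: str = "llm.events") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonEventFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call site would then log `LLMCallEvent(model=..., input_tokens=..., ...)` through `logger.info`, and the log pipeline receives fields it can filter and aggregate on.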

---

When Not to Use an LLM

This matters as much as knowing when to use one.

Don't use an LLM when a rule will do. If you can write an if statement that handles the case correctly, write the if statement. It's faster, cheaper, predictable, and testable.

Don't use an LLM for tasks that require exact recall. LLMs hallucinate. If a user asks for the exact price of an item, query your database. Don't ask a model that might confabulate a plausible-sounding number.

Don't use an LLM in your hot path without caching. Adding 3 seconds of LLM latency to an endpoint that used to return in 50ms is a regression, even if the output is impressive.

Don't use an LLM when you don't have evaluation. If you can't measure whether the model's output is good, you can't improve it and you can't catch regressions. Build evals before you ship, not after.
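
The "rule first" advice can be sketched as a router that tries cheap deterministic rules and only falls back to a model when none match. The patterns and intent names below are illustrative, and `llm_fallback` is a hypothetical callable wrapping whatever model call you use.

```python
import re
from typing import Callable

# Illustrative rules: (pattern, intent). Order matters; the first match wins.
RULES: list[tuple[re.Pattern[str], str]] = [
    (re.compile(r"\b(refund|money back)\b", re.IGNORECASE), "refund"),
    (re.compile(r"\b(where is|track|tracking)\b", re.IGNORECASE), "order_status"),
    (re.compile(r"\bcancel\b", re.IGNORECASE), "cancellation"),
]

def detect_intent(text: str, llm_fallback: Callable[[str], str]) -> tuple[str, str]:
    """Return (intent, source), where source is "rule" or "llm"."""
    for pattern, intent in RULES:
        if pattern.search(text):
            return intent, "rule"
    return llm_fallback(text), "llm"
```

Returning the source alongside the intent makes it easy to measure how much traffic the rules absorb and whether the model is earning its latency on the remainder.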

---

The Eval Problem

Evaluation is the unsolved hard problem of AI engineering. You need it. Most teams skip it. Then they push a prompt change and something breaks in production two weeks later with no one knowing why.

At minimum, build a regression suite of representative inputs with expected outputs and run it on every prompt change.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected_category: str

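def run_eval(cases: list[EvalCase], model_fn: Callable[[str], str]) -> float:
    if not cases:
        raise ValueError("run_eval needs at least one case")
    correct = 0
    for case in cases:
        result = model_fn(case.input)
        # Substring match is lenient: "not billing" still counts as "billing".
        if case.expected_category.lower() in result.lower():
            correct += 1
    return correct / len(cases)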

# Run this in CI before any prompt or model change ships

For open-ended outputs where correctness isn't binary, use an LLM-as-judge pattern — a secondary model that scores outputs against a rubric. It's imperfect, but it's better than nothing.
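
A minimal sketch of the judge pattern, under stated assumptions: `judge` is a hypothetical callable that sends a prompt to the secondary model and returns its text, and the 1 to 5 scale and `SCORE:` reply format are conventions chosen for this example.

```python
import re
from typing import Callable

JUDGE_TEMPLATE = """You are grading an answer against a rubric.
Rubric: {rubric}
Question: {question}
Answer: {answer}
Reply with a single line of the form SCORE: <integer from 1 to 5>."""

SCORE_RE = re.compile(r"SCORE:\s*([1-5])\b")

def judge_score(question: str, answer: str, rubric: str, judge: Callable[[str], str]) -> int:
    """Ask the judge model for a score and parse it; raise if the reply is unusable."""
    reply = judge(JUDGE_TEMPLATE.format(rubric=rubric, question=question, answer=answer))
    match = SCORE_RE.search(reply)
    if match is None:
        raise ValueError(f"judge reply had no parsable score: {reply!r}")
    return int(match.group(1))

def mean_judge_score(samples: list[tuple[str, str]], rubric: str, judge: Callable[[str], str]) -> float:
    """Average the judge score over (question, answer) pairs."""
    if not samples:
        raise ValueError("no samples to score")
    return sum(judge_score(q, a, rubric, judge) for q, a in samples) / len(samples)
```

Treating an unparsable reply as an error rather than a zero keeps judge failures from quietly dragging the average down.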

---

The Architecture That Scales

When LLM features start multiplying across a codebase, you need a thin abstraction layer — not a framework, just a seam.

Centralize:

Client initialization and configuration

Retry and timeout logic

Logging and tracing

Model selection by task type

Caching

Keep prompt logic in the feature code. The abstraction layer should not know about your domain; your domain code should not know about retry backoff.

The goal is to make swapping models, tweaking timeouts, or adding a cache transparent to feature teams. You will need all three of those things. The abstraction costs one afternoon and saves weeks later.
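
As a rough sketch of what that seam might look like, the class below centralizes model selection, caching, and retries behind a single `complete` method. Everything here is an assumption for illustration: the `LLMGateway` name, the `transport` callable that wraps the real client, the in-memory cache, and the placeholder model names in the usage note.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical transport signature: (model, system, user) -> response text.
Transport = Callable[[str, str, str], str]

@dataclass
class LLMGateway:
    transport: Transport
    models_by_task: dict[str, str]
    default_model: str
    max_attempts: int = 3
    base_delay_s: float = 0.5
    sleep: Callable[[float], None] = time.sleep  # injectable so tests don't wait
    _cache: dict[tuple[str, str, str], str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")

    def complete(self, task: str, system: str, user: str) -> str:
        model = self.models_by_task.get(task, self.default_model)
        key = (model, system, user)
        if key in self._cache:
            return self._cache[key]
        for attempt in range(self.max_attempts):
            try:
                result = self.transport(model, system, user)
                break
            except TimeoutError:
                if attempt == self.max_attempts - 1:
                    raise
                self.sleep(self.base_delay_s * 2 ** attempt)  # exponential backoff
        self._cache[key] = result
        return result
```

Feature code would call something like `gateway.complete("classify", system_prompt, ticket_text)` and never mention a model name or a backoff delay; swapping `models_by_task` entries (say, from a placeholder `"small-model"` to `"large-model"`) then happens in one place.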

---

Final Thought

LLM integration is a genuine engineering discipline. Its reliability concerns, cost dynamics, latency constraints, and evaluation challenges take a different shape than in conventional software, but most of them are solvable with the tools engineers already use: good abstraction, logging, caching, retries, and tests.

What's different is the nondeterminism. You're shipping a system whose behavior you can influence but not fully control. That demands more rigor around evaluation and observability, not less.

Build as if the model will occasionally lie, time out, or refuse. Because it will.

LLM Integration in Production — What Nobody Tells You