LLM Integration in Production — What Nobody Tells You

Integrating large language models into a real system is a different discipline than using them via a chat interface. Here's what that actually looks like from an engineering standpoint.

Yudi Nugraha
May 6, 2026

Everyone has used ChatGPT. Almost every engineering team is now being asked to "add AI" to something. But there's a wide gap between using an LLM and integrating one into a system that runs reliably in production.

This post is about that gap.

I've spent the past year building and maintaining systems that rely on LLMs — pipelines with structured outputs, RAG-backed knowledge bases, function-calling agents, and multi-model routing layers. The demo always looks clean. Production rarely does.

Here's what I've learned.

---

The Integration Patterns You'll Actually Use

Before writing a single line of code, you need to understand how LLMs fit into your system. There are four dominant patterns, and choosing the wrong one early is expensive.

1. Completion — Direct Prompt, Direct Output

The simplest pattern. You send a prompt, you get text back. Good for tasks where a person reads the output directly: drafting, summarizing, translating, or classifying for human review.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarize this support ticket in one sentence: ..."}
    ]
)

print(response.content[0].text)

The pitfall: you'll be tempted to use this for everything. Don't. As soon as downstream code needs to act on the output, you need structured output — not string parsing.

2. Structured Output — LLMs as Data Extractors

When the output feeds into a database, an API call, or a conditional branch, you need a predictable structure. Define the schema as a Pydantic model, instruct the LLM to return JSON that matches it, and validate the result before anything downstream touches it.

from typing import Literal

from anthropic import Anthropic
from pydantic import BaseModel

client = Anthropic()

class TicketClassification(BaseModel):
    category: str
    priority: Literal["low", "medium", "high", "critical"]
    summary: str
    requires_escalation: bool

def classify_ticket(ticket_text: str) -> TicketClassification:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system="""You are a support ticket classifier. Respond with only a JSON object
matching this schema exactly: {"category": string,
"priority": "low" | "medium" | "high" | "critical",
"summary": string, "requires_escalation": boolean}""",
        messages=[{"role": "user", "content": ticket_text}]
    )

    raw = response.content[0].text
    # Pydantic v2: raises ValidationError on malformed JSON or a schema mismatch.
    return TicketClassification.model_validate_json(raw)

Add retry logic here. LLMs occasionally produce malformed JSON — especially under high load or with long context. Validate, retry once with a corrective prompt, then fail loudly.
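
The validate, retry once, fail loudly sequence can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `call_model` is a hypothetical callable standing in for the real API call, and validation is done with the standard library so the sketch runs without dependencies. In the classifier above, the Pydantic model would play the role of `parse_classification`.

```python
import json
from typing import Callable

# Field names mirror the TicketClassification schema.
REQUIRED_FIELDS = {
    "category": str,
    "priority": str,
    "summary": str,
    "requires_escalation": bool,
}
ALLOWED_PRIORITIES = {"low", "medium", "high", "critical"}


def parse_classification(raw: str) -> dict:
    """Parse and validate model output; raise ValueError on any mismatch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            raise ValueError(f"field {name!r} missing or not {expected_type.__name__}")
    if data["priority"] not in ALLOWED_PRIORITIES:
        raise ValueError(f"priority must be one of {sorted(ALLOWED_PRIORITIES)}")
    return data


def classify_with_retry(ticket_text: str, call_model: Callable[[list[dict]], str]) -> dict:
    """Try once, retry once with a corrective prompt, then fail loudly."""
    messages = [{"role": "user", "content": ticket_text}]
    raw = call_model(messages)
    try:
        return parse_classification(raw)
    except ValueError as first_error:
        # Show the model its own output together with the validation error.
        messages += [
            {"role": "assistant", "content": raw or "(empty response)"},
            {"role": "user", "content": f"That response was rejected: {first_error}. "
                                        "Reply with only the corrected JSON object."},
        ]
    # Second and final attempt: a ValueError here propagates to the caller.
    return parse_classification(call_model(messages))
```

Capping the retry at one attempt keeps the worst-case latency and cost bounded. A second failure usually points at the prompt or the input, and more retries won't fix either.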

3. RAG — Retrieval-Augmented Generation

When the model needs to reason over your private data — documents, knowledge bases, support histories — you can't fine-tune on every update. RAG is the practical answer: retrieve relevant context at query time, inject it into the prompt.

from anthropic import Anthropic
from sentence_transformers import SentenceTransformer
import numpy as np

client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

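def retrieve_context(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # Normalized vectors make the dot product equal to cosine similarity.
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    # Re-embedding every chunk per query is for illustration only;
    # in production, embed once at ingest time and store the vectors.
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    scores = np.dot(chunk_vecs, query_vec)
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [chunks[i] for i in top_indices]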

def answer_with_context(query: str, knowledge_chunks: list[str]) -> str:
    context = retrieve_context(query, knowledge_chunks)
    context_block = "\n\n".join(f"[Source {i+1}]: {c}" for i, c in enumerate(context))

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="Answer questions using only the provided sources. Cite source numbers. Say 'I don't know' if the answer isn't in the sources.",
        messages=[
            {"role": "user", "content": f"Sources:\n{context_block}\n\nQuestion: {query}"}
        ]
    )
    return response.content[0].text

The quality of a RAG system lives in the retrieval layer, not the generation layer. Poor chunking strategy, weak embeddings, or a shallow vector store will produce hallucinations that look confident. That's worse than no AI at all.
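
To make "chunking strategy" concrete, here is one common baseline: fixed-size word windows with overlap, so a sentence that straddles a boundary still appears whole in at least one chunk. The function name and the default sizes are assumptions for illustration. Real systems often split on headings or sentences instead.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of at most max_words words, overlapping by `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The output of `chunk_text` can be passed straight to `retrieve_context` as `chunks`. Tune the window and overlap against your own retrieval evals, not by intuition.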

4. Tool Use — LLMs as Orchestrators

The most powerful pattern, and the most complex. The model decides what actions to take, calls your functions, observes results, and continues reasoning. This is the foundation of agents.

from anthropic import Anthropic
import json

client = Anthropic()

tools = [
    {
        "name": "get_order_status",
        "description": "Fetch current status of a customer order by order ID.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The order identifier"}
            },
            "required": ["order_id"]
        }
    },
    {
        "name": "issue_refund",
        "description": "Issue a partial or full refund for an order.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount": {"type": "number", "description": "Refund amount in USD"},
                "reason": {"type": "string"}
            },
            "required": ["order_id", "amount", "reason"]
        }
    }
]

def handle_tool_call(tool_name: str, tool_input: dict) -> str:
    if tool_name == "get_order_status":
        return json.dumps({"status": "shipped", "eta": "2026-05-08"})
    if tool_name == "issue_refund":
        return json.dumps({"success": True, "refund_id": "ref_abc123"})
    return json.dumps({"error": "unknown tool"})

def run_support_agent(user_message: str, max_turns: int = 10) -> str:
    messages = [{"role": "user", "content": user_message}]

    # Cap the loop so a confused model cannot call tools forever.
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )

        # Any stop reason other than "tool_use" (end_turn, max_tokens, ...)
        # means there is nothing left to execute, so return the text so far.
        if response.stop_reason != "tool_use":
            return next((b.text for b in response.content if b.type == "text"), "")

        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = handle_tool_call(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    raise RuntimeError(f"agent did not finish within {max_turns} turns")

Design your tools with the principle of least privilege. The model calls what you expose. Exposing a raw database query tool to an LLM is as dangerous as exposing it through an unauthenticated API endpoint.
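
One way to apply least privilege to the `issue_refund` tool is to enforce limits in the handler itself, not in the prompt. The sketch below is illustrative only: `MAX_AUTO_REFUND_USD`, the in-memory `ORDERS` table, and the `customer_id` argument are assumptions standing in for your own policy, data store, and authenticated session.

```python
import json

MAX_AUTO_REFUND_USD = 50.00  # hypothetical policy limit

# Stand-in for an order lookup scoped to the authenticated customer.
ORDERS = {"ord_1": {"customer_id": "cust_9", "total": 80.00}}


def guarded_issue_refund(tool_input: dict, customer_id: str) -> str:
    """Validate a model-requested refund before anything is executed."""
    order = ORDERS.get(tool_input.get("order_id"))
    amount = tool_input.get("amount")
    if order is None or order["customer_id"] != customer_id:
        return json.dumps({"error": "order not found for this customer"})
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        return json.dumps({"error": "amount must be a positive number"})
    if amount > order["total"]:
        return json.dumps({"error": "amount exceeds order total"})
    if amount > MAX_AUTO_REFUND_USD:
        # Above the limit, hand off to a human instead of executing.
        return json.dumps({"status": "needs_human_approval"})
    return json.dumps({"status": "approved", "amount": amount})
```

The customer identity comes from the session, never from the model's tool input, so a prompt-injected order ID cannot reach another customer's order.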

---

The Production Problems No One Demos

Latency Is a First-Class Concern

LLM calls are slow by HTTP standards — 1 to 10 seconds for a typical response, more for long outputs. If you're blocking a user request on a synchronous LLM call, you're building a bad product.

Patterns that help:

  • Streaming: Use stream=True and pipe tokens to the client as they arrive. The perceived latency drops dramatically even if total time doesn't.
  • Async first: Never run LLM calls synchronously on a thread that could serve other requests. Use asyncio and an async client.
  • Caching: Identical prompts should hit a cache, not the model. A Redis layer with prompt hashing handles this cheaply.

import hashlib

import redis.asyncio as redis
from anthropic import AsyncAnthropic

MODEL = "claude-sonnet-4-6"

cache = redis.Redis()  # async client, so cache I/O doesn't block the event loop
client = AsyncAnthropic()

def prompt_cache_key(system: str, user: str) -> str:
    # Include the model so a model change never serves stale answers.
    return hashlib.sha256(f"{MODEL}::{system}::{user}".encode()).hexdigest()

async def cached_complete(system: str, user: str, ttl: int = 3600) -> str:
    key = prompt_cache_key(system, user)
    cached = await cache.get(key)
    if cached is not None:
        return cached.decode()

    response = await client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user}]
    )
    result = response.content[0].text
    await cache.setex(key, ttl, result)
    return result

Cost Compounds Fast

At scale, token costs are real. A few practices that matter:

  • Prompt compression: Trim system prompts ruthlessly. Every token in a repeated system prompt is multiplied by your request volume.
  • Model routing: Use a smaller, cheaper model for tasks that don't need full capability — classification, intent detection, entity extraction. Reserve the larger model for reasoning-heavy tasks.
  • Context window discipline: Don't stuff the entire conversation history into every request. Summarize older turns. The LLM doesn't need perfect recall of 50 messages to answer the current question.
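
Context window discipline can be as simple as keeping the most recent turns verbatim and collapsing everything older into one summary message. The sketch below assumes a hypothetical `summarize` callable, which in practice might be a call to a smaller, cheaper model. The `keep_last` default is arbitrary.

```python
from typing import Callable


def compact_history(
    messages: list[dict],
    summarize: Callable[[list[dict]], str],
    keep_last: int = 6,
) -> list[dict]:
    """Replace all but the last `keep_last` messages with a single summary message."""
    if keep_last < 0:
        raise ValueError("keep_last must be >= 0")
    if len(messages) <= keep_last:
        return list(messages)
    split = len(messages) - keep_last
    older, recent = messages[:split], messages[split:]
    summary = {
        "role": "user",
        "content": f"Summary of the earlier conversation: {summarize(older)}",
    }
    return [summary] + recent
```

Depending on your provider's rules about role ordering, the summary may fit better in the system prompt than as a leading user message. Check before adopting this shape as-is.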

Reliability and Retry Strategy

LLMs are probabilistic infrastructure. They time out. They return malformed outputs. They occasionally just... refuse. Build for this.

import asyncio
import random

from anthropic import AsyncAnthropic, APIStatusError, APITimeoutError

client = AsyncAnthropic()

async def resilient_call(prompt: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            response = await client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
                timeout=30.0
            )
            return response.content[0].text
        except APITimeoutError:
            if attempt == retries - 1:
                raise
        except APIStatusError as e:
            # Only rate-limit (429) and overloaded (529) responses are retryable.
            if e.status_code not in (429, 529) or attempt == retries - 1:
                raise
        # Exponential backoff plus up to 1s of random jitter.
        await asyncio.sleep(2 ** attempt + random.uniform(0, 1))
    raise RuntimeError("LLM call failed after retries")

Exponential backoff with jitter on 429s. Hard timeout on every call. Never let an LLM call block indefinitely.

Observability

If you can't see what your model is doing in production, you're operating blind. At minimum, log:

  • Full prompt and response for every LLM call
  • Input and output token counts
  • Latency per call
  • Model and version used
  • Whether the call hit a cache
  • Any retry attempts

Structure these as events, not text logs. You'll want to query them.

import json
import logging
import time

from anthropic import AsyncAnthropic

client = AsyncAnthropic()
logger = logging.getLogger("llm.calls")

async def observed_call(prompt: str, system: str = "") -> str:
    start = time.monotonic()
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}]
    )
    elapsed = time.monotonic() - start

    # Emit one JSON object per call so the log line can be parsed and queried.
    logger.info(json.dumps({
        "event": "llm_call",
        "model": response.model,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "latency_ms": round(elapsed * 1000),
        "stop_reason": response.stop_reason,
    }))

    return response.content[0].text
    

---

When Not to Use an LLM

This matters as much as knowing when to use one.

Don't use an LLM when a rule will do. If you can write an if statement that handles the case correctly, write the if statement. It's faster, cheaper, predictable, and testable.

Don't use an LLM for tasks that require exact recall. LLMs hallucinate. If a user asks for the exact price of an item, query your database. Don't ask a model that might confabulate a plausible-sounding number.

Don't use an LLM in your hot path without caching. Adding 3 seconds of LLM latency to an endpoint that used to return in 50ms is a regression, even if the output is impressive.

Don't use an LLM when you don't have evaluation. If you can't measure whether the model's output is good, you can't improve it and you can't catch regressions. Build evals before you ship, not after.
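
The "rule first" advice composes well with a model fallback: handle the cases you can express deterministically, and call the model only for the remainder. A minimal sketch, where the `RULES` patterns and the `llm_classify` callable are hypothetical placeholders:

```python
import re
from typing import Callable

# Hypothetical deterministic rules: (pattern, category).
RULES = [
    (re.compile(r"\b(refund|charged twice|chargeback)\b", re.I), "billing"),
    (re.compile(r"\b(password|2fa|locked out)\b", re.I), "account_access"),
]


def classify(text: str, llm_classify: Callable[[str], str]) -> tuple[str, str]:
    """Return (category, source); the model is called only when no rule matches."""
    for pattern, category in RULES:
        if pattern.search(text):
            return category, "rule"
    return llm_classify(text), "llm"
```

Returning the source alongside the category makes it easy to log how much traffic the rules absorb, which is also how much latency and token cost you avoided.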

---

The Eval Problem

Evaluation is the unsolved hard problem of AI engineering. You need it. Most teams skip it. Then they push a prompt change and something breaks in production two weeks later with no one knowing why.

At minimum, build a regression suite of representative inputs with expected outputs and run it on every prompt change.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected_category: str

def run_eval(cases: list[EvalCase], model_fn: Callable[[str], str]) -> float:
    if not cases:
        raise ValueError("eval suite is empty")
    correct = 0
    for case in cases:
        result = model_fn(case.input)
        # Substring match is lenient: "billing" also matches "not billing".
        # Compare parsed fields instead once the output is structured.
        if case.expected_category.lower() in result.lower():
            correct += 1
    return correct / len(cases)

# Run this in CI before any prompt or model change ships

For open-ended outputs where correctness isn't binary, use an LLM-as-judge pattern: a secondary model that scores outputs against a rubric. It's imperfect, but it's better than nothing.
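
A rough sketch of the LLM-as-judge pattern, under stated assumptions: `judge` is a hypothetical callable that sends a prompt to the secondary model and returns its text, and the rubric wording and 1 to 5 scale are illustrative choices, not a standard.

```python
import re
from typing import Callable

RUBRIC = (
    "Score the answer from 1 to 5 for factual accuracy and completeness "
    "against the reference. Reply with 'SCORE: <n>' on the first line."
)


def judge_score(question: str, answer: str, reference: str,
                judge: Callable[[str], str]) -> int:
    """Ask a secondary model for a rubric score and parse it strictly."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nReference: {reference}\nAnswer: {answer}"
    reply = judge(prompt)
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        # An unparseable verdict is an eval failure, not a silent zero.
        raise ValueError(f"judge reply had no parseable score: {reply!r}")
    return int(match.group(1))


def mean_judge_score(cases: list[dict], judge: Callable[[str], str]) -> float:
    """Average score over cases shaped like {'question', 'answer', 'reference'}."""
    if not cases:
        raise ValueError("no eval cases")
    scores = [judge_score(c["question"], c["answer"], c["reference"], judge) for c in cases]
    return sum(scores) / len(scores)
```

Track the mean over a fixed case set and alert on drops. It is also worth spot-checking a sample of judge verdicts by hand, since the judge is itself a model that can be wrong.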

---

The Architecture That Scales

When LLM features start multiplying across a codebase, you need a thin abstraction layer: not a framework, just a seam.

Centralize:

  • Client initialization and configuration
  • Retry and timeout logic
  • Logging and tracing
  • Model selection by task type
  • Caching

Keep prompt logic in the feature code. The abstraction layer should not know about your domain; your domain code should not know about retry backoff.

The goal is to make swapping models, tweaking timeouts, or adding a cache transparent to feature teams. You will need all three of those things. The abstraction costs one afternoon and saves weeks later.
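
As a sketch of what such a seam might look like, the class below centralizes model selection, caching, retries, and event logging behind one method. Everything here is illustrative: `LLMGateway`, `MODEL_BY_TASK`, the placeholder model names, and the injected `transport` callable (which would wrap the real SDK call) are assumptions, not an established API.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical task-to-model routing table; the model names are placeholders.
MODEL_BY_TASK = {"classify": "small-model", "reason": "large-model"}

# transport(model, system, user) -> response text; wraps the real SDK call.
Transport = Callable[[str, str, str], str]


@dataclass
class LLMGateway:
    transport: Transport
    max_attempts: int = 3
    sleep: Callable[[float], None] = time.sleep  # injectable for tests
    cache: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def complete(self, task: str, system: str, user: str) -> str:
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be >= 1")
        model = MODEL_BY_TASK[task]
        key = (model, system, user)
        if key in self.cache:
            self.events.append({"task": task, "model": model,
                                "cache_hit": True, "attempts": 0})
            return self.cache[key]

        start = time.monotonic()
        for attempt in range(1, self.max_attempts + 1):
            try:
                text = self.transport(model, system, user)
                break
            except TimeoutError:
                if attempt == self.max_attempts:
                    raise
                self.sleep(2 ** (attempt - 1))  # exponential backoff

        self.cache[key] = text
        self.events.append({
            "task": task, "model": model, "cache_hit": False, "attempts": attempt,
            "latency_ms": round((time.monotonic() - start) * 1000),
        })
        return text
```

Feature code calls `gateway.complete("classify", system_prompt, ticket_text)` and never mentions a model name, a timeout, or a cache. Swapping any of those becomes a change in one file.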

---

Final Thought

LLM integration is a genuine engineering discipline. It has reliability concerns, cost dynamics, latency constraints, and evaluation challenges that conventional software rarely faces in the same form. Most of them are solvable with the same tools engineers already use: good abstraction, logging, caching, retries, and tests.

What's different is the nondeterminism. You're shipping a system whose behavior you can influence but not fully control. That demands more rigor around evaluation and observability, not less.

Build as if the model will occasionally lie, time out, or refuse. Because it will.
