Building RAG Systems That Actually Work in Production

Retrieval-Augmented Generation sounds simple — retrieve context, inject it, generate. The production reality is more complex. Here's a field guide to building RAG pipelines that are accurate, fast, and maintainable.

Yudi Nugraha
May 6, 2026
12 min read

Retrieval-Augmented Generation is the right answer to a specific problem: you have an LLM that knows how to reason, and private data that changes too often to fine-tune on. RAG connects them at query time — retrieve what's relevant, hand it to the model, let the model reason over it.

The concept fits in one sentence. The implementation does not.

Most teams underestimate the pipeline complexity and overestimate the model's ability to compensate for retrieval failures. The model is the last mile. If retrieval is broken, the model hallucinates confidently. The user never knows the difference. That's the failure mode that makes RAG dangerous to ship without rigorous engineering.

This post is about building the retrieval layer correctly.

---

The RAG Pipeline — What's Actually Happening

A production RAG system has two distinct phases that run at different times.

Indexing (offline, runs on document ingestion):

  • Load raw documents from source
  • Clean and normalize the text
  • Chunk documents into retrieval units
  • Embed each chunk into a vector representation
  • Store chunks and vectors in a vector database

Retrieval + Generation (online, runs per query):

  • Embed the user query
  • Search the vector database for similar chunks
  • Re-rank and filter retrieved chunks
  • Build a context-injected prompt
  • Call the LLM and return the response

Every step has failure modes. Most RAG problems trace back to the third indexing step (chunking) or the second query-time step (vector search), that is, to chunking strategy and retrieval quality. Engineers tend to focus on the last step, the LLM call, first. That's backwards.

    ---

    Chunking — The Decision That Determines Everything

    Chunking is where most RAG systems fail silently. A poor chunking strategy produces chunks that are either too small (missing context) or too large (burying relevant signal in noise). Both cause retrieval failures that look like model failures.

    Fixed-Size Chunking — Simple but Fragile

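    def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
        # "Tokens" here are whitespace-separated words, not model tokens.
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must be non-negative and smaller than chunk_size")
        tokens = text.split()
        chunks = []
        start = 0
        while start < len(tokens):
            end = min(start + chunk_size, len(tokens))
            chunks.append(" ".join(tokens[start:end]))
            if end == len(tokens):
                break  # avoid a trailing chunk that only repeats the overlap
            start += chunk_size - overlap
        return chunks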
    

    The overlap preserves context across boundaries. Without it, a sentence split across two chunks loses coherence in both. Fixed-size chunking is the baseline: it works for homogeneous text (logs, product descriptions) but struggles with structured documents, where a chunk that straddles a paragraph boundary blends two topics and represents neither well.

    Semantic Chunking — Split on Meaning, Not Token Count

    Better approach: split at natural semantic boundaries. Sentences whose embedding shifts significantly from the previous sentence signal a topic boundary.

    from sentence_transformers import SentenceTransformer
    import numpy as np
    
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    
    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    
    def chunk_semantic(
        text: str,
        similarity_threshold: float = 0.75,
        max_chunk_tokens: int = 400
    ) -> list[str]:
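        # Naive split on "."; this also breaks on decimals and abbreviations,
        # so swap in a real sentence tokenizer for production text.
        sentences = [s.strip() for s in text.split(".") if s.strip()]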
        if not sentences:
            return []
    
        embeddings = embedder.encode(sentences)
        chunks, current = [], [sentences[0]]
    
        for i in range(1, len(sentences)):
            sim = cosine_similarity(embeddings[i - 1], embeddings[i])
            current_len = sum(len(s.split()) for s in current)
    
            if sim < similarity_threshold or current_len >= max_chunk_tokens:
                chunks.append(". ".join(current) + ".")
                current = [sentences[i]]
            else:
                current.append(sentences[i])
    
        if current:
            chunks.append(". ".join(current) + ".")
    
        return chunks
    

    This costs more at indexing time but produces chunks aligned with the document's structure. The retrieved chunk is more likely to contain the full thought the query is asking about.

    Hierarchical Chunking — Index Multiple Granularities

    For long, structured documents (legal contracts, technical manuals, API docs), index at multiple granularities: the full document summary, section-level chunks, and paragraph-level chunks. At retrieval time, match at the paragraph level but return the surrounding section as context.

    from dataclasses import dataclass, field
    
    @dataclass
    class HierarchicalChunk:
        id: str
        text: str
        level: str          # "document" | "section" | "paragraph"
        parent_id: str | None = None
        metadata: dict = field(default_factory=dict)
    
    def build_hierarchy(document: dict) -> list[HierarchicalChunk]:
        chunks = []
        doc_id = document["id"]
    
        chunks.append(HierarchicalChunk(
            id=doc_id,
            text=document["summary"],
            level="document",
            parent_id=None,
            metadata={"title": document["title"]}
        ))
    
        for i, section in enumerate(document.get("sections", [])):
            section_id = f"{doc_id}::s{i}"
            chunks.append(HierarchicalChunk(
                id=section_id,
                text=section["content"][:600],
                level="section",
                parent_id=doc_id,
                metadata={"section_title": section["title"]}
            ))
    
            for j, para in enumerate(section.get("paragraphs", [])):
                chunks.append(HierarchicalChunk(
                    id=f"{section_id}::p{j}",
                    text=para,
                    level="paragraph",
                    parent_id=section_id,
                    metadata={"section_title": section["title"]}
                ))
    
        return chunks
    

    Retrieve at paragraph level, then walk up the parent chain to include section-level context in the prompt. The model gets a precise match and the surrounding framing.
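The parent walk can be sketched as follows. This is a minimal illustration, assuming chunks are held in an in-memory dict keyed by ID. `Chunk`, `expand_to_section`, and `sections_for_hits` are hypothetical names, and `Chunk` is a trimmed stand-in for `HierarchicalChunk`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    # Trimmed stand-in for HierarchicalChunk (hypothetical, for illustration)
    id: str
    text: str
    level: str  # "document" | "section" | "paragraph"
    parent_id: Optional[str] = None


def expand_to_section(hit_id: str, by_id: dict) -> Chunk:
    """Follow parent links from a paragraph hit up to its section chunk."""
    node = by_id[hit_id]
    while node.level == "paragraph" and node.parent_id in by_id:
        node = by_id[node.parent_id]
    return node


def sections_for_hits(hit_ids: list, by_id: dict) -> list:
    """Map paragraph hits to their sections, keeping rank order and dropping repeats."""
    seen, sections = set(), []
    for hit_id in hit_ids:
        section = expand_to_section(hit_id, by_id)
        if section.id not in seen:
            seen.add(section.id)
            sections.append(section)
    return sections


chunks = [
    Chunk("d1", "Summary", "document"),
    Chunk("d1::s0", "Section on refunds", "section", "d1"),
    Chunk("d1::s0::p0", "Refunds take 5 days.", "paragraph", "d1::s0"),
    Chunk("d1::s0::p1", "Refunds need a receipt.", "paragraph", "d1::s0"),
]
by_id = {c.id: c for c in chunks}
context_sections = sections_for_hits(["d1::s0::p1", "d1::s0::p0"], by_id)
```

Two paragraph hits from the same section collapse into one section-level context block, so the prompt does not carry the same section twice.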

    ---

    Embedding — Choosing and Using the Right Model

    The embedding model determines what "similar" means in your vector space. A general-purpose embedder trained on web text will usually underperform a domain-specific one on medical, legal, or financial text.

    from sentence_transformers import SentenceTransformer
    import numpy as np
    
    class EmbeddingPipeline:
        def __init__(self, model_name: str = "all-mpnet-base-v2", batch_size: int = 64):
            self.model = SentenceTransformer(model_name)
            self.batch_size = batch_size
    
        def embed_chunks(self, chunks: list[str]) -> np.ndarray:
            all_embeddings = []
            for i in range(0, len(chunks), self.batch_size):
                batch = chunks[i:i + self.batch_size]
                embeddings = self.model.encode(
                    batch,
                    normalize_embeddings=True,
                    show_progress_bar=False
                )
                all_embeddings.append(embeddings)
            return np.vstack(all_embeddings)
    
        def embed_query(self, query: str) -> np.ndarray:
            return self.model.encode(
                [query],
                normalize_embeddings=True
            )[0]
    

    Always normalize embeddings before storing. It makes cosine similarity equivalent to dot product, which most vector stores optimize for.
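The equivalence is easy to check by hand. A small sketch in plain Python, using made-up two-dimensional vectors rather than real embeddings:

```python
import math


def dot(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list) -> list:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]  # assumes v is not the zero vector


def cosine(a: list, b: list) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
cos_raw = cosine(a, b)                            # cosine on the raw vectors
dot_unit = dot(l2_normalize(a), l2_normalize(b))  # plain dot product on unit vectors
assert math.isclose(cos_raw, dot_unit)            # both are 24/25
```

Once every stored vector has unit length, the store can skip the two norm computations per comparison and rank by dot product alone.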

    Embedding model selection checklist:

  • Does the model have a training domain that matches your data?
  • What is the maximum sequence length? Chunks longer than the model's context are silently truncated.
  • What is the embedding dimension? Higher dimensions cost more storage and are slower to search, but can carry more signal.
  • Is it the same model used at indexing time and query time? A mismatch produces garbage retrieval.
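Two items on the checklist can be enforced in code instead of by convention. The sketch below assumes you persist a small manifest next to the index at build time. `IndexManifest`, `check_query_compat`, and `oversized` are hypothetical names, and the dimension and sequence-length values should be confirmed against the model card for whatever model you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexManifest:
    # Hypothetical record saved next to the index at build time
    model_name: str
    dimension: int
    max_seq_length: int


def check_query_compat(manifest: IndexManifest, model_name: str, query_vector: list) -> None:
    """Raise if the query-side embedder does not match the one used for indexing."""
    if model_name != manifest.model_name:
        raise ValueError(
            f"index built with {manifest.model_name!r}, query embedded with {model_name!r}"
        )
    if len(query_vector) != manifest.dimension:
        raise ValueError(
            f"expected {manifest.dimension} dimensions, got {len(query_vector)}"
        )


def oversized(chunks: list, manifest: IndexManifest) -> list:
    """Return indices of chunks whose rough word count exceeds the model limit."""
    # Word count only approximates tokens; use the model tokenizer for exact numbers
    return [i for i, c in enumerate(chunks) if len(c.split()) > manifest.max_seq_length]


manifest = IndexManifest(model_name="all-mpnet-base-v2", dimension=768, max_seq_length=384)
check_query_compat(manifest, "all-mpnet-base-v2", [0.0] * 768)  # passes silently
```

Failing loudly on a model mismatch is cheaper than debugging retrieval that quietly returns unrelated chunks.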

    ---

    Most tutorials stop at "embed your chunks and put them in a vector store." Production RAG needs more.

    Metadata Filtering

    Similarity alone is insufficient. You also need to filter by document type, date range, access permissions, or user context. A question about Q1 2026 earnings should not retrieve Q1 2024 documents even if they're semantically similar.

    import chromadb
    
    client = chromadb.Client()
    collection = client.get_or_create_collection("knowledge_base")
    
    def index_chunk(chunk_id: str, text: str, embedding: list[float], metadata: dict):
        collection.add(
            ids=[chunk_id],
            embeddings=[embedding],
            documents=[text],
            metadatas=[metadata]
        )
    
    def retrieve_with_filter(
        query_embedding: list[float],
        top_k: int = 10,
        filters: dict | None = None
    ) -> list[dict]:
        where_clause = filters or {}
        results = collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            where=where_clause if where_clause else None,
            include=["documents", "metadatas", "distances"]
        )
        return [
            {
                "text": doc,
                "metadata": meta,
                "distance": dist
            }
            for doc, meta, dist in zip(
                results["documents"][0],
                results["metadatas"][0],
                results["distances"][0]
            )
        ]
    

    Design your metadata schema deliberately before indexing. Retrofitting metadata onto an existing index means re-indexing everything.
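As one illustration of a deliberate schema, the sketch below builds a `where` filter in the operator style Chroma documents (`$eq`, `$in`, `$and`). The field names `doc_type`, `fiscal_year`, `quarter`, and `access_group` are assumptions made up for this example, as is the `build_where` helper. Check the operator names against the Chroma version you run.

```python
from typing import Optional


def build_where(
    doc_type: Optional[str] = None,
    fiscal_year: Optional[int] = None,
    quarter: Optional[str] = None,
    access_groups: Optional[list] = None,
) -> Optional[dict]:
    """Build a Chroma-style `where` filter from optional constraints."""
    clauses = []
    if doc_type is not None:
        clauses.append({"doc_type": {"$eq": doc_type}})
    if fiscal_year is not None:
        clauses.append({"fiscal_year": {"$eq": fiscal_year}})
    if quarter is not None:
        clauses.append({"quarter": {"$eq": quarter}})
    if access_groups:
        clauses.append({"access_group": {"$in": access_groups}})

    if not clauses:
        return None          # no filter at all
    if len(clauses) == 1:
        return clauses[0]    # $and needs at least two expressions
    return {"$and": clauses}


q1_2026_earnings = build_where(doc_type="earnings", fiscal_year=2026, quarter="Q1")
```

The result can be passed as `filters` to `retrieve_with_filter`. Storing the year and quarter as separate scalar fields, rather than one free-text date string, is what makes the Q1 2026 versus Q1 2024 distinction filterable.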

    Hybrid Search — Dense + Sparse

    Pure vector search misses exact keyword matches that BM25 (sparse retrieval) handles well. A user searching for "RFC 7519" expects exact string matches, not semantic neighbors. Hybrid search runs both and combines the scores.

    import numpy as np
    from rank_bm25 import BM25Okapi
    
    class HybridRetriever:
        def __init__(self, chunks: list[str], embeddings: np.ndarray):
            self.chunks = chunks
            self.embeddings = embeddings
            tokenized = [c.lower().split() for c in chunks]
            self.bm25 = BM25Okapi(tokenized)
    
        def retrieve(
            self,
            query: str,
            query_embedding: np.ndarray,
            top_k: int = 10,
            alpha: float = 0.6
        ) -> list[dict]:
            # Sparse scores
            bm25_scores = self.bm25.get_scores(query.lower().split())
            bm25_norm = bm25_scores / (bm25_scores.max() + 1e-9)
    
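            # Dense scores: dot product equals cosine similarity because the
            # embeddings were L2-normalized at indexing and query time
            dense_scores = self.embeddings @ query_embedding
            dense_norm = (dense_scores - dense_scores.min()) / (dense_scores.max() - dense_scores.min() + 1e-9)
    
            # Weighted linear blend of normalized scores (not reciprocal rank fusion)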
            combined = alpha * dense_norm + (1 - alpha) * bm25_norm
            top_indices = np.argsort(combined)[-top_k:][::-1]
    
            return [
                {"text": self.chunks[i], "score": float(combined[i])}
                for i in top_indices
            ]
    

    alpha controls the dense/sparse balance. 0.6–0.7 is a reasonable starting point; tune it against your eval suite.
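`HybridRetriever` blends normalized scores. Reciprocal rank fusion is a different combiner: it ignores raw scores and sums `1 / (k + rank)` across the ranked lists, which sidesteps the problem of BM25 and cosine scores living on different scales. A minimal sketch, assuming each retriever returns an ordered list of chunk IDs. The function name is hypothetical and `k = 60` is a commonly used default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of chunk IDs into one list of (id, score) pairs."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes more for items it ranks near the top
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_ranking = ["a", "b", "c"]   # best-first IDs from vector search
sparse_ranking = ["b", "d", "a"]  # best-first IDs from BM25
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
fused_ids = [chunk_id for chunk_id, _ in fused]
```

Chunks that appear in both lists rise to the top. There is no `alpha` to tune here, which is convenient early on but removes a knob you may want later.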

    ---

    Re-ranking — The Quality Gate Before the LLM

    The vector store returns the top-K candidates. Not all of them are equally useful. A cross-encoder re-ranker takes each candidate and scores it against the query with full attention — it's slower but far more accurate than embedding similarity.

    from sentence_transformers import CrossEncoder
    
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    
    def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
        pairs = [(query, c["text"]) for c in candidates]
        scores = reranker.predict(pairs)
    
        for candidate, score in zip(candidates, scores):
            candidate["rerank_score"] = float(score)
    
        ranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
        return ranked[:top_n]
    

    The pattern: retrieve top-20 with vector search (cheap), re-rank to top-5 with a cross-encoder (accurate). The LLM sees only the 5 best. This cuts context noise and, with it, the hallucinations that irrelevant context triggers.

    ---

    Context Construction — What the LLM Actually Sees

    The retrieved chunks go into the prompt. How you format them determines whether the model can use them effectively.

    from anthropic import Anthropic
    
    client = Anthropic()
    
    def build_context_prompt(chunks: list[dict]) -> str:
        sections = []
        for i, chunk in enumerate(chunks, 1):
            meta = chunk.get("metadata", {})
            source = meta.get("title", f"Source {i}")
            date = meta.get("date", "")
            header = f"[{i}] {source}" + (f" ({date})" if date else "")
            sections.append(f"{header}\n{chunk['text']}")
        return "\n\n---\n\n".join(sections)
    
    def answer_query(query: str, chunks: list[dict]) -> str:
        context = build_context_prompt(chunks)
    
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system="""You are a precise assistant. Answer using only the provided sources.
    Cite sources by their number in brackets, e.g. [1], [2].
    If the answer is not in the sources, say: "I don't have enough information to answer this."
    Do not speculate or use knowledge beyond what is provided.""",
            messages=[
                {
                    "role": "user",
                    "content": f"Sources:\n\n{context}\n\n---\n\nQuestion: {query}"
                }
            ]
        )
        return response.content[0].text
    

    Three rules for context construction:

  • Label every source. The model uses citations better when sources are numbered and titled.
  • Put the most relevant chunk first. Models attend more to the beginning and end of context. Don't bury the best match in the middle.
  • Strip irrelevant context aggressively. Every chunk that doesn't help is noise that makes the model less accurate. Re-ranking handles this, but also review your top-N threshold.
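The second and third rules can be applied in one small step before the prompt is built. A sketch, assuming each chunk dict carries the `rerank_score` set by `rerank`. The `select_for_prompt` name and the score floor of `0.0` are assumptions, and a sensible floor depends on the re-ranker's score scale.

```python
def select_for_prompt(chunks: list, top_n: int = 5, min_score: float = 0.0) -> list:
    """Keep the best-scoring chunks, best first, and drop anything under the floor."""
    def score(chunk: dict) -> float:
        # Chunks without a score sort last and fail any finite floor
        return chunk.get("rerank_score", float("-inf"))

    kept = [c for c in chunks if score(c) >= min_score]
    kept.sort(key=score, reverse=True)
    return kept[:top_n]


candidates = [
    {"text": "a", "rerank_score": 0.5},
    {"text": "b", "rerank_score": 2.1},
    {"text": "c", "rerank_score": -3.0},
]
selected = select_for_prompt(candidates)
```

A fixed top-N always fills the prompt, even when only two candidates are any good. A score floor lets the context shrink when retrieval comes back weak.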

    ---

    Evaluating RAG — Measuring What Matters

    RAG quality has three independent failure points. Measure them separately.

    from dataclasses import dataclass
    
    @dataclass
    class RAGEvalCase:
        query: str
        relevant_chunk_ids: list[str]
        expected_answer_keywords: list[str]
    
    def retrieval_recall(case: RAGEvalCase, retrieved_ids: list[str]) -> float:
        relevant = set(case.relevant_chunk_ids)
        retrieved = set(retrieved_ids)
        if not relevant:
            return 1.0
        return len(relevant & retrieved) / len(relevant)
    
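    def answer_faithfulness(answer: str, source_chunks: list[str]) -> float:
        # Crude lexical proxy: a sentence counts as grounded if any of its longer
        # words appears in the sources. Treat it as a smoke test, not a verdict.
        combined_sources = " ".join(source_chunks).lower()
        sentences = [s.strip() for s in answer.split(".") if s.strip()]
        grounded = sum(
            1 for s in sentences
            if any(
                word in combined_sources
                for word in (w.strip(",;:!?()[]\"'") for w in s.lower().split())
                if len(word) > 4
            )
        )
        return grounded / len(sentences) if sentences else 0.0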
    
    def answer_relevance(answer: str, case: RAGEvalCase) -> float:
        answer_lower = answer.lower()
        matched = sum(1 for kw in case.expected_answer_keywords if kw.lower() in answer_lower)
        return matched / len(case.expected_answer_keywords) if case.expected_answer_keywords else 0.0
    

    Retrieval recall measures whether the right chunks were retrieved. If this is low, fix chunking or retrieval — not the prompt.

    Answer faithfulness measures whether the answer is grounded in the retrieved context. Low faithfulness means the model is hallucinating beyond what was provided.

    Answer relevance measures whether the answer addresses the query. Low relevance with high faithfulness means retrieval returned chunks that the model reported accurately but that did not address the question.

    Each failure mode has a different fix. Diagnosing which one you have saves weeks of debugging in the wrong layer.
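The triage order above can be written down as a small function. This is a sketch only: `diagnose` is a hypothetical helper and the `0.7` cutoff is an arbitrary placeholder to calibrate on your own eval set.

```python
def diagnose(recall: float, faithfulness: float, relevance: float, threshold: float = 0.7) -> str:
    """Point at the layer to debug first, checking retrieval before generation."""
    if recall < threshold:
        return "retrieval: fix chunking or search before touching the prompt"
    if faithfulness < threshold:
        return "generation: the answer goes beyond the retrieved context"
    if relevance < threshold:
        return "retrieval: returned chunks are off-topic for the query"
    return "ok"


verdict = diagnose(recall=0.4, faithfulness=0.9, relevance=0.9)
```

Recall is checked first on purpose. When the right chunks never reach the model, the other two numbers say little about the generation layer.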

    ---

    Production Concerns

    Index Freshness

    Your documents change. Your index needs to reflect that. Build an incremental indexing pipeline — don't re-embed the entire corpus on every update.

    import hashlib
    
    def document_fingerprint(text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()
    
    def needs_reindexing(doc_id: str, current_text: str, index_registry: dict) -> bool:
        current_fp = document_fingerprint(current_text)
        stored_fp = index_registry.get(doc_id)
        return current_fp != stored_fp
    

    Track a fingerprint of every indexed document. On ingest, check the fingerprint — only re-embed documents that changed. Delete stale chunks by document ID before adding new ones.
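Put together, the ingest step might look like the sketch below. It assumes an in-memory dict as the registry and two callables that wrap your vector store. `sync_document`, `delete_chunks`, and `index_document` are hypothetical names, not part of any library.

```python
import hashlib
from typing import Callable


def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


def sync_document(
    doc_id: str,
    text: str,
    registry: dict,
    delete_chunks: Callable[[str], None],
    index_document: Callable[[str, str], None],
) -> bool:
    """Re-index one document only if its content changed. Returns True if it did."""
    current = fingerprint(text)
    if registry.get(doc_id) == current:
        return False                # unchanged: skip embedding entirely
    delete_chunks(doc_id)           # remove stale chunks first
    index_document(doc_id, text)    # chunk, embed, and store the new version
    registry[doc_id] = current      # record only after indexing succeeded
    return True


calls = []
registry = {}
first = sync_document("doc-1", "v1", registry, lambda d: calls.append(("del", d)), lambda d, t: calls.append(("add", d)))
second = sync_document("doc-1", "v1", registry, lambda d: calls.append(("del", d)), lambda d, t: calls.append(("add", d)))
```

Updating the registry last means a failed indexing run leaves the old fingerprint in place, so the next ingest retries the document.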

    Latency Budget

    A RAG request has multiple network hops: embedding the query, querying the vector store, calling the re-ranker, calling the LLM. Profile each stage separately.

    import time
    from contextlib import contextmanager
    
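    @contextmanager
    def timed(label: str, log: dict):
        start = time.monotonic()
        try:
            yield
        finally:
            # Record elapsed milliseconds even if the stage raises
            log[label] = round((time.monotonic() - start) * 1000)
    
    # Load the embedding model once at startup, not on every request
    pipeline = EmbeddingPipeline()
    
    def rag_pipeline(query: str) -> tuple[str, dict]:
        timings = {}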
    
        with timed("embed_query", timings):
            query_vec = pipeline.embed_query(query)
    
        with timed("vector_search", timings):
            candidates = retrieve_with_filter(query_vec.tolist(), top_k=20)
    
        with timed("rerank", timings):
            top_chunks = rerank(query, candidates, top_n=5)
    
        with timed("llm_call", timings):
            answer = answer_query(query, top_chunks)
    
        return answer, timings
    

    In practice, the LLM call dominates. But the vector search and re-ranking add up at scale. Cache embedding results for repeated queries. Consider an ANN index (HNSW) over exact search as your corpus grows.
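Query-embedding caching can be as small as an `lru_cache` around the embed call. The sketch below uses a stand-in embedder so it runs without a model. `embed_uncached` and `embed_query_cached` are hypothetical names, and lowercasing the cache key assumes an uncased embedding model.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts real embed calls, for illustration only


def embed_uncached(query: str) -> tuple:
    """Stand-in for a real embedding call; replace with your model's encode()."""
    CALLS["count"] += 1
    return (float(len(query)), float(sum(map(ord, query)) % 997))


@lru_cache(maxsize=10_000)
def _embed_by_key(key: str) -> tuple:
    # Tuples are immutable, so a cached vector cannot be modified by a caller
    return embed_uncached(key)


def embed_query_cached(query: str) -> tuple:
    # Collapse whitespace and case so trivially different queries share an entry
    return _embed_by_key(" ".join(query.lower().split()))


v1 = embed_query_cached("What is RAG?")
v2 = embed_query_cached("  what is   rag? ")
```

Both calls hit the same cache entry, so the embedder runs once. An in-process cache like this is per worker. A shared cache is a separate decision.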

    Guardrails

    RAG reduces hallucination but doesn't eliminate it. Add a verification pass for high-stakes outputs: re-run the answer against the retrieved chunks and check whether each factual claim is supported.

    import json
    
    def verify_answer(answer: str, context_chunks: list[dict]) -> dict:
        context = build_context_prompt(context_chunks)
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=256,
            system='Respond only with JSON: {"grounded": true|false, "unsupported_claims": ["list"]}',
            messages=[
                {
                    "role": "user",
                    "content": f"Sources:\n{context}\n\nAnswer to verify:\n{answer}"
                }
            ]
        )
        try:
            return json.loads(response.content[0].text)
        except json.JSONDecodeError:
            # Treat unparseable verifier output as unverified rather than crashing
            return {"grounded": False, "unsupported_claims": ["verifier returned invalid JSON"]}
    

    This costs an extra LLM call. For a medical, legal, or financial use case, it's not optional.

    ---

    The Architecture That Holds

    A production RAG system is a pipeline with clear stages and clear ownership.

    The indexing pipeline owns document ingestion, chunking, embedding, and storage. It runs asynchronously, triggered by document changes. It maintains a fingerprint registry for incremental updates.

    The retrieval pipeline owns query embedding, vector search, hybrid scoring, and re-ranking. It runs synchronously per query. It logs retrieved chunk IDs and scores on every request.
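Per-request retrieval logging can be one structured line. A sketch, assuming each result dict carries an `id` and a `score` (the `retrieve_with_filter` function shown earlier would need to return IDs as well). `log_retrieval` and the logger name are hypothetical.

```python
import json
import logging
import time

logger = logging.getLogger("rag.retrieval")


def log_retrieval(request_id: str, query: str, results: list) -> str:
    """Emit one JSON line per request with the retrieved chunk IDs and scores."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "query": query,
        "chunks": [{"id": r["id"], "score": round(r["score"], 4)} for r in results],
    }
    line = json.dumps(record)
    logger.info(line)
    return line


line = log_retrieval("req-1", "refund policy", [{"id": "d1::s0::p1", "score": 0.8123456}])
```

With chunk IDs in the log, a bad answer can be traced back to what retrieval actually returned, without re-running the query against an index that may have changed since.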

    The generation pipeline owns context construction, prompt assembly, LLM calls, and optionally verification. It is the consumer of retrieval outputs — it does not search, it does not embed.

    Keep these layers separate. The retrieval pipeline should be testable without the LLM. The indexing pipeline should be runnable without a live query. Most RAG bugs are found by isolating the layer that owns the failure.

    The model is not the system. The retrieval is.

    Tags

    RAG, LLM, AI Engineering, Vector Search, Python
    Yudi Nugraha

    Software Engineer | Builder
