Yudi Nugraha

Retrieval-Augmented Generation is the right answer to a specific problem: you have an LLM that knows how to reason, and private data that changes too often to fine-tune on. RAG connects them at query time — retrieve what's relevant, hand it to the model, let the model reason over it.

The concept fits in one sentence. The implementation does not.

Most teams underestimate the pipeline complexity and overestimate the model's ability to compensate for retrieval failures. The model is the last mile. If retrieval is broken, the model hallucinates confidently. The user never knows the difference. That's the failure mode that makes RAG dangerous to ship without rigorous engineering.

This post is about building the retrieval layer correctly.

---

The RAG Pipeline — What's Actually Happening

A production RAG system has two distinct phases that run at different times.

Indexing (offline, runs on document ingestion):

1. Load raw documents from source

2. Clean and normalize the text

3. Chunk documents into retrieval units

4. Embed each chunk into a vector representation

5. Store chunks and vectors in a vector database

Retrieval + Generation (online, runs per query):

1. Embed the user query

2. Search the vector database for similar chunks

3. Re-rank and filter retrieved chunks

4. Build a context-injected prompt

5. Call the LLM and return the response

Every step has failure modes. Most RAG problems trace back to the third indexing step (chunking) or the second retrieval step (vector search). Engineers tend to focus on the final step, the LLM call, first. That's backwards.

---

Chunking — The Decision That Determines Everything

Chunking is where most RAG systems fail silently. A poor chunking strategy produces chunks that are either too small (missing context) or too large (burying relevant signal in noise). Both cause retrieval failures that look like model failures.

Fixed-Size Chunking — Simple but Fragile

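def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    # "Tokens" here are whitespace-separated words, a rough proxy for model tokens.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(" ".join(tokens[start:end]))
        if end == len(tokens):
            break  # avoid a trailing chunk made only of overlap
        start += chunk_size - overlap
    return chunks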

The overlap preserves context across boundaries. Without it, a sentence split across two chunks loses coherence in both. Fixed-size chunking is the baseline: it works for homogeneous text (logs, product descriptions) but struggles with structured documents, where a window can straddle a paragraph boundary and blend two unrelated ideas into one chunk.
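To make the overlap concrete, here is a small sketch that windows a ten-word string with a window of 5 words and an overlap of 2. The helper name `sliding_windows` and the sample input are illustrative only, and "tokens" are whitespace-separated words, as in the function above.

```python
def sliding_windows(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Return windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return windows


words = "a b c d e f g h i j".split()
windows = sliding_windows(words, size=5, overlap=2)
for window in windows:
    print(" ".join(window))
# a b c d e
# d e f g h
# g h i j
```

Each window repeats the last two words of the one before it, so a sentence that crosses a boundary appears whole in at least one chunk as long as it is shorter than the overlap.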

Semantic Chunking — Split on Meaning, Not Token Count

Better approach: split at natural semantic boundaries. Sentences whose embedding shifts significantly from the previous sentence signal a topic boundary.

import re

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def chunk_semantic(
    text: str,
    similarity_threshold: float = 0.75,
    max_chunk_tokens: int = 400
) -> list[str]:
    # Split after sentence-ending punctuation followed by whitespace, so decimals
    # like "3.14" stay intact and "?" / "!" are kept. Abbreviations such as "e.g."
    # can still split early; swap in a real sentence tokenizer if that matters.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []

    embeddings = embedder.encode(sentences)
    chunks, current = [], [sentences[0]]
    current_len = len(sentences[0].split())  # word count, a proxy for tokens

    for i in range(1, len(sentences)):
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])

        # Start a new chunk on a topic shift or when the size cap is reached.
        if sim < similarity_threshold or current_len >= max_chunk_tokens:
            chunks.append(" ".join(current))
            current, current_len = [sentences[i]], 0
        else:
            current.append(sentences[i])
        current_len += len(sentences[i].split())

    chunks.append(" ".join(current))
    return chunks

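This costs more at indexing time but produces chunks aligned with the document's structure. The retrieved chunk is more likely to contain the full thought the query is asking about. Treat the 0.75 threshold as a starting point: similarity between adjacent sentences varies by embedding model and corpus, so calibrate it on your own documents.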

Hierarchical Chunking — Index Multiple Granularities

For long, structured documents (legal contracts, technical manuals, API docs), index at multiple granularities: the full document summary, section-level chunks, and paragraph-level chunks. At retrieval time, match at the paragraph level but return the surrounding section as context.

from dataclasses import dataclass, field

@dataclass
class HierarchicalChunk:
    id: str
    text: str
    level: str          # "document" | "section" | "paragraph"
    parent_id: str | None = None
    metadata: dict = field(default_factory=dict)

def build_hierarchy(document: dict) -> list[HierarchicalChunk]:
    chunks = []
    doc_id = document["id"]

    chunks.append(HierarchicalChunk(
        id=doc_id,
        text=document["summary"],
        level="document",
        parent_id=None,
        metadata={"title": document["title"]}
    ))

    for i, section in enumerate(document.get("sections", [])):
        section_id = f"{doc_id}::s{i}"
        chunks.append(HierarchicalChunk(
            id=section_id,
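            # Only the first 600 characters are indexed here; keep the full
            # section text elsewhere if you return it as context at query time.
            text=section["content"][:600],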
            level="section",
            parent_id=doc_id,
            metadata={"section_title": section["title"]}
        ))

        for j, para in enumerate(section.get("paragraphs", [])):
            chunks.append(HierarchicalChunk(
                id=f"{section_id}::p{j}",
                text=para,
                level="paragraph",
                parent_id=section_id,
                metadata={"section_title": section["title"]}
            ))

    return chunks

Retrieve at paragraph level, then walk up the parent chain to include section-level context in the prompt. The model gets a precise match and the surrounding framing.
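The parent walk can be sketched as follows. This is a minimal illustration, assuming chunks are held in an in-memory dict keyed by ID; the `Chunk` class mirrors the fields used above, and `ancestors` and `expand_hits` are hypothetical helper names. Hits that share a section are grouped so the section text enters the prompt once.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    id: str
    text: str
    level: str  # "document" | "section" | "paragraph"
    parent_id: Optional[str] = None


def ancestors(chunk_id: str, by_id: dict) -> list:
    """Follow parent_id links upward and return [parent, grandparent, ...]."""
    chain, seen = [], {chunk_id}
    current = by_id[chunk_id].parent_id
    while current is not None and current in by_id and current not in seen:
        chain.append(by_id[current])
        seen.add(current)  # guards against accidental cycles
        current = by_id[current].parent_id
    return chain


def expand_hits(hit_ids: list, by_id: dict) -> list:
    """Group paragraph hits under their enclosing section, keeping hit order."""
    grouped = {}
    for hit_id in hit_ids:
        section = next((a for a in ancestors(hit_id, by_id) if a.level == "section"), None)
        key = section.id if section else hit_id
        entry = grouped.setdefault(
            key, {"section": section.text if section else None, "matches": []}
        )
        entry["matches"].append(by_id[hit_id].text)
    return list(grouped.values())


chunks = [
    Chunk("d1", "Summary of the contract.", "document"),
    Chunk("d1::s0", "Section on payment terms.", "section", "d1"),
    Chunk("d1::s0::p0", "Invoices are due in 30 days.", "paragraph", "d1::s0"),
    Chunk("d1::s0::p1", "Late fees accrue monthly.", "paragraph", "d1::s0"),
    Chunk("d1::s1", "Section on termination.", "section", "d1"),
    Chunk("d1::s1::p0", "Either party may terminate.", "paragraph", "d1::s1"),
]
by_id = {c.id: c for c in chunks}
expanded = expand_hits(["d1::s0::p1", "d1::s1::p0", "d1::s0::p0"], by_id)
```

Grouping by section keeps the prompt from repeating the same framing text for every paragraph hit, which matters once sections run to several hundred tokens.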

---

Embedding — Choosing and Using the Right Model

The embedding model determines what "similar" means in your vector space. A general-purpose embedder trained on web text will usually underperform a domain-specific one on medical, legal, or financial text.

from sentence_transformers import SentenceTransformer
import numpy as np

class EmbeddingPipeline:
    def __init__(self, model_name: str = "all-mpnet-base-v2", batch_size: int = 64):
        self.model = SentenceTransformer(model_name)
        self.batch_size = batch_size

    def embed_chunks(self, chunks: list[str]) -> np.ndarray:
        all_embeddings = []
        for i in range(0, len(chunks), self.batch_size):
            batch = chunks[i:i + self.batch_size]
            embeddings = self.model.encode(
                batch,
                normalize_embeddings=True,
                show_progress_bar=False
            )
            all_embeddings.append(embeddings)
        return np.vstack(all_embeddings)

    def embed_query(self, query: str) -> np.ndarray:
        return self.model.encode(
            [query],
            normalize_embeddings=True
        )[0]

Always normalize embeddings before storing. It makes cosine similarity equivalent to dot product, which most vector stores optimize for.
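The equivalence is easy to check by hand. The sketch below uses two toy 2-D vectors and only the standard library; the helper names are illustrative. Cosine similarity of the raw vectors and the dot product of their unit-length versions both come out to 24/25.

```python
import math


def dot(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list) -> list:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else v


def cosine(a: list, b: list) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

cos_raw = cosine(a, b)                               # 24 / (5 * 5)
dot_unit = dot(l2_normalize(a), l2_normalize(b))     # 0.6*0.8 + 0.8*0.6

# Equal up to floating-point rounding.
print(math.isclose(cos_raw, dot_unit))
```

Once every stored vector has length 1, the store can skip the two norm computations per comparison and rank by dot product alone.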

Embedding model selection checklist:

Does the model have a training domain that matches your data?

What is the maximum sequence length? Chunks longer than the model's context are silently truncated.

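What is the embedding dimension? Higher dimensions cost more storage and are slower to search. They can carry more signal, but only if the model was trained to use them, so compare models on your own eval set rather than by dimension.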

Is it the same model used at indexing time and query time? A mismatch produces garbage retrieval.
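The last item on the checklist is cheap to enforce in code. One way to do it, sketched here under the assumption that you write a small manifest next to the index at build time: record the model name and dimension, then check them before every search. `IndexManifest`, `EmbedderMismatch`, and `check_query_embedder` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexManifest:
    """Hypothetical record written next to the index when it is built."""
    model_name: str
    dimension: int


class EmbedderMismatch(RuntimeError):
    """Raised when the query-time embedder does not match the index."""


def check_query_embedder(manifest: IndexManifest, model_name: str, query_vector: list) -> None:
    # A different model means a different vector space, even when the
    # dimensions happen to agree, so compare names first.
    if model_name != manifest.model_name:
        raise EmbedderMismatch(
            f"index built with {manifest.model_name!r}, query uses {model_name!r}"
        )
    if len(query_vector) != manifest.dimension:
        raise EmbedderMismatch(
            f"index dimension is {manifest.dimension}, query vector has {len(query_vector)}"
        )


manifest = IndexManifest(model_name="all-mpnet-base-v2", dimension=768)
check_query_embedder(manifest, "all-mpnet-base-v2", [0.0] * 768)  # passes silently
```

Failing loudly here is the point: a mismatched embedder does not raise on its own, it just returns plausible-looking but meaningless neighbors.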

---

The Vector Store — Beyond Similarity Search

Most tutorials stop at "embed your chunks and put them in a vector store." Production RAG needs more.

Metadata Filtering

Similarity alone is insufficient. You also need to filter by document type, date range, access permissions, or user context. A question about Q1 2026 earnings should not retrieve Q1 2024 documents even if they're semantically similar.

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("knowledge_base")

def index_chunk(chunk_id: str, text: str, embedding: list[float], metadata: dict):
    collection.add(
        ids=[chunk_id],
        embeddings=[embedding],
        documents=[text],
        metadatas=[metadata]
    )

def retrieve_with_filter(
    query_embedding: list[float],
    top_k: int = 10,
    filters: dict | None = None
) -> list[dict]:
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        where=filters or None,  # an empty filter dict means "no filter"
        include=["documents", "metadatas", "distances"]
    )
    # Results are nested per query; index 0 is the single query sent above.
    # IDs are returned alongside the included fields and are needed for logging and evals.
    return [
        {
            "id": chunk_id,
            "text": doc,
            "metadata": meta,
            "distance": dist
        }
        for chunk_id, doc, meta, dist in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        )
    ]

Design your metadata schema deliberately before indexing. Retrofitting metadata onto an existing index means re-indexing everything.
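One way to make the schema deliberate is to pin it down as a type and build filters from it. The sketch below is an assumption-heavy illustration: the field names in `ChunkMetadata` are invented for the earnings example, and `build_where` is a hypothetical helper that emits a filter in Chroma's `$eq` / `$and` style. Check the operator syntax against the Chroma version you run.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    """Illustrative schema; flat scalar fields only, so it can be stored as-is."""
    doc_id: str
    doc_type: str
    fiscal_year: int
    fiscal_quarter: int
    access_group: str


def build_where(**equals) -> Optional[dict]:
    """Require every given field to equal its value; None values are skipped."""
    clauses = [{key: {"$eq": value}} for key, value in equals.items() if value is not None]
    if not clauses:
        return None          # no filter at all
    if len(clauses) == 1:
        return clauses[0]    # a single condition needs no $and wrapper
    return {"$and": clauses}


meta = ChunkMetadata("earnings-2026-q1", "earnings_report", 2026, 1, "finance")
stored = asdict(meta)  # plain dict, ready to pass as a chunk's metadata

where = build_where(fiscal_year=2026, fiscal_quarter=1, access_group=None)
print(where)
```

The resulting dict can be passed as the `filters` argument of `retrieve_with_filter` above, so the Q1 2026 question never sees Q1 2024 chunks regardless of how similar they are.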

Hybrid Search — Dense + Sparse

Pure vector search misses exact keyword matches that BM25 (sparse retrieval) handles well. A user searching for "RFC 7519" expects exact string matches, not semantic neighbors. Hybrid search runs both and combines the scores.

import numpy as np
from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self, chunks: list[str], embeddings: np.ndarray):
        self.chunks = chunks
        self.embeddings = embeddings
        tokenized = [c.lower().split() for c in chunks]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve(
        self,
        query: str,
        query_embedding: np.ndarray,
        top_k: int = 10,
        alpha: float = 0.6
    ) -> list[dict]:
        # Sparse scores
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_norm = bm25_scores / (bm25_scores.max() + 1e-9)

        # Dense scores
        dense_scores = self.embeddings @ query_embedding
        dense_norm = (dense_scores - dense_scores.min()) / (dense_scores.max() - dense_scores.min() + 1e-9)

        # Weighted sum of the normalized scores (score fusion, not reciprocal rank fusion)
        combined = alpha * dense_norm + (1 - alpha) * bm25_norm
        top_indices = np.argsort(combined)[-top_k:][::-1]

        return [
            {"text": self.chunks[i], "score": float(combined[i])}
            for i in top_indices
        ]

alpha controls the dense/sparse balance. 0.6–0.7 is a reasonable starting point; tune it against your eval suite.
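The class above fuses normalized scores. If the two score scales prove hard to balance, reciprocal rank fusion is a rank-based alternative: each retriever contributes 1 / (k + rank) for every chunk it returns, and the contributions are summed. The sketch below is a minimal version; `k = 60` is the conventional default from the RRF literature, and the chunk IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60, top_k: int = 10) -> list:
    """Fuse several best-first ID lists into one list of (id, score) pairs."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_k]


dense_ranking = ["c3", "c1", "c7"]    # best-first IDs from vector search
sparse_ranking = ["c1", "c9", "c3"]   # best-first IDs from BM25

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print([chunk_id for chunk_id, _ in fused])
# ['c1', 'c3', 'c9', 'c7']
```

Because only ranks are used, there is no normalization step to get wrong, at the cost of discarding how far apart the scores were. It also has no `alpha` to tune, which may or may not be what you want.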

---

Re-ranking — The Quality Gate Before the LLM

The vector store returns the top-K candidates. Not all of them are equally useful. A cross-encoder re-ranker takes each candidate and scores it against the query with full attention — it's slower but far more accurate than embedding similarity.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)

    for candidate, score in zip(candidates, scores):
        candidate["rerank_score"] = float(score)

    ranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
    return ranked[:top_n]

The pattern: retrieve top-20 with vector search (cheap), re-rank to top-5 with a cross-encoder (accurate). The LLM sees only the 5 best. Less irrelevant context in the prompt means fewer opportunities for the model to build an answer on the wrong passage.

---

Context Construction — What the LLM Actually Sees

The retrieved chunks go into the prompt. How you format them determines whether the model can use them effectively.

from anthropic import Anthropic

client = Anthropic()

def build_context_prompt(chunks: list[dict]) -> str:
    sections = []
    for i, chunk in enumerate(chunks, 1):
        meta = chunk.get("metadata", {})
        source = meta.get("title", f"Source {i}")
        date = meta.get("date", "")
        header = f"[{i}] {source}" + (f" ({date})" if date else "")
        sections.append(f"{header}\n{chunk['text']}")
    return "\n\n---\n\n".join(sections)

def answer_query(query: str, chunks: list[dict]) -> str:
    context = build_context_prompt(chunks)

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="""You are a precise assistant. Answer using only the provided sources.
Cite sources by their number in brackets, e.g. [1], [2].
If the answer is not in the sources, say: "I don't have enough information to answer this."
Do not speculate or use knowledge beyond what is provided.""",
        messages=[
            {
                "role": "user",
                "content": f"Sources:\n\n{context}\n\n---\n\nQuestion: {query}"
            }
        ]
    )
    return response.content[0].text

Three rules for context construction:

1. Label every source. The model uses citations better when sources are numbered and titled.

2. Put the most relevant chunk first. Models attend more to the beginning and end of context. Don't bury the best match in the middle.

3. Strip irrelevant context aggressively. Every chunk that doesn't help is noise that makes the model less accurate. Re-ranking handles this, but also review your top-N threshold.
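Rules two and three can be applied in one small step before prompt assembly. This is a sketch, not a recommendation for specific values: `select_context` is a hypothetical helper, the scores are invented, and `min_score` is a placeholder. The score scale depends on the re-ranker, so calibrate the floor on your own data.

```python
def select_context(chunks: list, min_score: float, max_chunks: int = 5) -> list:
    """Keep chunks scoring at least `min_score`, best first, capped at `max_chunks`."""
    kept = [c for c in chunks if c["rerank_score"] >= min_score]
    kept.sort(key=lambda c: c["rerank_score"], reverse=True)
    return kept[:max_chunks]


candidates = [
    {"text": "Refunds are issued within 14 days.", "rerank_score": 4.2},
    {"text": "Our office is closed on public holidays.", "rerank_score": -3.1},
    {"text": "Refund requests need an order number.", "rerank_score": 6.8},
]

context = select_context(candidates, min_score=0.0)
for chunk in context:
    print(chunk["rerank_score"], chunk["text"])
```

A score floor means a query with no good matches yields an empty context, which lets the system prompt's "I don't have enough information" path trigger instead of padding the prompt with weak chunks.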

---

Evaluating RAG — Measuring What Matters

RAG quality has three independent failure points. Measure them separately.

from dataclasses import dataclass

@dataclass
class RAGEvalCase:
    query: str
    relevant_chunk_ids: list[str]
    expected_answer_keywords: list[str]

def retrieval_recall(case: RAGEvalCase, retrieved_ids: list[str]) -> float:
    relevant = set(case.relevant_chunk_ids)
    retrieved = set(retrieved_ids)
    if not relevant:
        return 1.0
    return len(relevant & retrieved) / len(relevant)

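def answer_faithfulness(answer: str, source_chunks: list[str]) -> float:
    # Crude lexical proxy: a sentence counts as grounded if any of its words
    # longer than 4 characters appears in the sources. Good for spotting
    # regressions, not for proving an answer is supported.
    combined_sources = " ".join(source_chunks).lower()
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    grounded = sum(
        1 for s in sentences
        if any(word in combined_sources for word in s.lower().split() if len(word) > 4)
    )
    return grounded / len(sentences) if sentences else 0.0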

def answer_relevance(answer: str, case: RAGEvalCase) -> float:
    answer_lower = answer.lower()
    matched = sum(1 for kw in case.expected_answer_keywords if kw.lower() in answer_lower)
    return matched / len(case.expected_answer_keywords) if case.expected_answer_keywords else 0.0

Retrieval recall measures whether the right chunks were retrieved. If this is low, fix chunking or retrieval — not the prompt.

Answer faithfulness measures whether the answer is grounded in the retrieved context. Low faithfulness means the model is hallucinating beyond what was provided.

Answer relevance measures whether the answer addresses the query. Low relevance with high faithfulness means the model stayed grounded in chunks that do not answer the question, so the fix is in retrieval, not the prompt.

Each failure mode has a different fix. Diagnosing which one you have saves weeks of debugging in the wrong layer.
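To diagnose the retrieval layer on its own, run the recall metric over a whole suite and list the queries that miss. The sketch below is self-contained and uses canned results in place of a real retriever; `EvalCase`, `recall_at_k`, and `evaluate_retrieval` are illustrative names, and the queries and IDs are made up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    query: str
    relevant_chunk_ids: list


def recall_at_k(relevant: list, retrieved: list, k: int) -> float:
    """Fraction of relevant IDs found in the first k retrieved IDs."""
    wanted = set(relevant)
    if not wanted:
        return 1.0
    return len(wanted & set(retrieved[:k])) / len(wanted)


def evaluate_retrieval(cases: list, retrieve_ids: Callable, k: int = 5) -> dict:
    """Return mean recall@k and the queries that did not reach full recall."""
    per_query = {
        case.query: recall_at_k(case.relevant_chunk_ids, retrieve_ids(case.query), k)
        for case in cases
    }
    mean_recall = sum(per_query.values()) / len(per_query) if per_query else 0.0
    misses = [query for query, recall in per_query.items() if recall < 1.0]
    return {"mean_recall": mean_recall, "misses": misses}


# Canned results standing in for a real retriever.
fake_index = {"refund window": ["a", "x", "b"], "sso setup": ["c"]}
cases = [EvalCase("refund window", ["a", "b"]), EvalCase("sso setup", ["c"])]

report = evaluate_retrieval(cases, lambda q: fake_index[q], k=2)
print(report)
```

The `misses` list is the useful output. A mean hides which queries fail; the list tells you which documents to open and inspect for a chunking problem.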

---

Production Concerns

Index Freshness

Your documents change. Your index needs to reflect that. Build an incremental indexing pipeline — don't re-embed the entire corpus on every update.

import hashlib

def document_fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def needs_reindexing(doc_id: str, current_text: str, index_registry: dict) -> bool:
    current_fp = document_fingerprint(current_text)
    stored_fp = index_registry.get(doc_id)
    return current_fp != stored_fp

Track a fingerprint of every indexed document. On ingest, check the fingerprint — only re-embed documents that changed. Delete stale chunks by document ID before adding new ones.
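Put together, the ingest step looks roughly like this. It is a sketch under simplifying assumptions: the registry and the chunk store are plain dicts, `sync_document` is a hypothetical name, and the chunker is passed in so any of the strategies above can be used. A real store would delete by a document-ID metadata filter instead of popping a dict key.

```python
import hashlib
from typing import Callable


def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_document(
    doc_id: str,
    text: str,
    registry: dict,
    store: dict,
    chunker: Callable,
) -> bool:
    """Re-index one document if its content changed. Returns True if it was re-indexed."""
    current_fp = fingerprint(text)
    if registry.get(doc_id) == current_fp:
        return False                     # unchanged: skip chunking and embedding

    store.pop(doc_id, None)              # delete stale chunks before adding new ones
    store[doc_id] = chunker(text)        # embedding would happen here in a real pipeline
    registry[doc_id] = current_fp        # record the fingerprint only after the write
    return True


registry, store = {}, {}
split_lines = lambda text: [line for line in text.splitlines() if line.strip()]

first = sync_document("policy", "Refunds: 14 days.\nReturns: 30 days.", registry, store, split_lines)
second = sync_document("policy", "Refunds: 14 days.\nReturns: 30 days.", registry, store, split_lines)
third = sync_document("policy", "Refunds: 21 days.", registry, store, split_lines)
print(first, second, third)
# True False True
```

Updating the registry last matters: if chunking or embedding fails midway, the old fingerprint stays in place and the next run retries the document instead of treating it as done.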

Latency Budget

A RAG request has multiple network hops: embedding the query, querying the vector store, calling the re-ranker, calling the LLM. Profile each stage separately.

import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, log: dict):
    start = time.monotonic()
    try:
        yield
    finally:
        # Record elapsed milliseconds even if the stage raises.
        log[label] = round((time.monotonic() - start) * 1000)

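# Load the embedding model once at startup; constructing it inside the
# request path would reload the weights on every query.
pipeline = EmbeddingPipeline()

def rag_pipeline(query: str) -> tuple[str, dict]:
    timings = {}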

    with timed("embed_query", timings):
        query_vec = pipeline.embed_query(query)

    with timed("vector_search", timings):
        candidates = retrieve_with_filter(query_vec.tolist(), top_k=20)

    with timed("rerank", timings):
        top_chunks = rerank(query, candidates, top_n=5)

    with timed("llm_call", timings):
        answer = answer_query(query, top_chunks)

    return answer, timings

In practice, the LLM call dominates. But the vector search and re-ranking add up at scale. Cache embedding results for repeated queries. Consider an ANN index (HNSW) over exact search as your corpus grows.
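Caching query embeddings can be as small as an `lru_cache` around the embedding call. The sketch below uses a stand-in embedder so it runs without a model; `embed_query_uncached` and its fake vector are placeholders for your real call. Normalizing case and whitespace before the lookup is an assumption that raises the hit rate but is only safe if your embedder is not case-sensitive.

```python
from functools import lru_cache

calls = {"count": 0}


def embed_query_uncached(query: str) -> tuple:
    """Stand-in for a real embedding call; returns a fake 2-D vector."""
    calls["count"] += 1
    return (float(len(query)), float(sum(map(ord, query)) % 97))


def normalize_query(query: str) -> str:
    # Collapse whitespace and case so trivially different queries share a key.
    return " ".join(query.lower().split())


@lru_cache(maxsize=10_000)
def _embed_cached(normalized_query: str) -> tuple:
    # Tuples are immutable, so cached vectors cannot be modified by callers.
    return embed_query_uncached(normalized_query)


def embed_query_cached(query: str) -> tuple:
    return _embed_cached(normalize_query(query))


v1 = embed_query_cached("What is the refund window?")
v2 = embed_query_cached("  what is the   REFUND window? ")
print(calls["count"], v1 == v2)
# 1 True
```

An in-process cache like this is lost on restart and not shared across workers. That is usually fine for a first pass; a shared cache is a later optimization once the hit rate justifies it.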

Guardrails

RAG reduces hallucination but doesn't eliminate it. Add a verification pass for high-stakes outputs: re-run the answer against the retrieved chunks and check whether each factual claim is supported.

import json

def verify_answer(answer: str, context_chunks: list[dict]) -> dict:
    context = build_context_prompt(context_chunks)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        system='Respond only with JSON: {"grounded": true|false, "unsupported_claims": ["list"]}',
        messages=[
            {
                "role": "user",
                "content": f"Sources:\n{context}\n\nAnswer to verify:\n{answer}"
            }
        ]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Treat unparseable verifier output as a failed check rather than crashing.
        return {"grounded": False, "unsupported_claims": ["verifier returned invalid JSON"]}

This costs an extra LLM call. For a medical, legal, or financial use case, it's not optional.

---

The Architecture That Holds

A production RAG system is a pipeline with clear stages and clear ownership.

The indexing pipeline owns document ingestion, chunking, embedding, and storage. It runs asynchronously, triggered by document changes. It maintains a fingerprint registry for incremental updates.

The retrieval pipeline owns query embedding, vector search, hybrid scoring, and re-ranking. It runs synchronously per query. It logs retrieved chunk IDs and scores on every request.

The generation pipeline owns context construction, prompt assembly, LLM calls, and optionally verification. It is the consumer of retrieval outputs — it does not search, it does not embed.

Keep these layers separate. The retrieval pipeline should be testable without the LLM. The indexing pipeline should be runnable without a live query. Most RAG bugs are found by isolating the layer that owns the failure.
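One way to get that isolation is to pass the layers in as plain callables, so a test can swap either side for a stub. The sketch below assumes nothing about the real implementations: `make_pipeline`, the stub retriever, and the stub generator are all hypothetical, and chunks are dicts with `id` and `text` keys.

```python
from typing import Callable

Retriever = Callable[[str], list]        # query -> list of chunk dicts
Generator = Callable[[str, list], str]   # (query, chunks) -> answer text


def make_pipeline(retrieve: Retriever, generate: Generator) -> Callable:
    """Compose the two layers; neither one knows how the other is implemented."""
    def run(query: str) -> dict:
        chunks = retrieve(query)
        return {
            "answer": generate(query, chunks),
            "chunk_ids": [c["id"] for c in chunks],  # logged on every request
        }
    return run


# Stubs: a canned retriever and a generator that never calls an LLM.
def stub_retrieve(query: str) -> list:
    return [{"id": "d1::s0::p0", "text": "Invoices are due in 30 days."}]


def stub_generate(query: str, chunks: list) -> str:
    return f"{len(chunks)} source(s) for: {query}"


pipeline_under_test = make_pipeline(stub_retrieve, stub_generate)
result = pipeline_under_test("When are invoices due?")
print(result)
```

With this shape, a retrieval regression test needs only the real retriever and the stub generator, and it runs without network access or API keys.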

The model is not the system. The retrieval is.

Building RAG Systems That Actually Work in Production