Yudi Nugraha

Every LLM has the same fundamental limitation: its knowledge is frozen at training time. Ask it about your internal documentation, your product catalog, your customer history, or anything that happened after its cutoff date — and it either guesses, or it says it doesn't know.

For most real-world AI applications, a model that only knows what it learned during training is not enough.

Retrieval-Augmented Generation (RAG) is the standard engineering solution to this problem. Instead of relying on what the model memorized in its weights, you retrieve relevant information from your own data sources at query time, inject it into the prompt, and let the model reason over it.

The model's job shifts from remembering to reasoning. That's a job it's much better at.

---

The Problem RAG Solves

Consider what happens when you ask an LLM a question it can't answer from training alone.

If you ask "What is our refund policy for digital products?", the model has no idea. It wasn't trained on your policy. It might generate a plausible-sounding policy based on patterns from other companies — which is worse than saying nothing, because it sounds authoritative.

The naive fix is fine-tuning: retrain the model on your data. This has real problems:

Fine-tuning is expensive and slow — you can't do it every time a document changes

Fine-tuned knowledge is fuzzy — models don't memorize documents verbatim, they absorb patterns

You can't reliably cite where fine-tuned knowledge came from

It doesn't scale to large, frequently changing knowledge bases

RAG solves these problems differently. Instead of baking knowledge into the weights, you retrieve it on demand and hand it to the model as context. The model doesn't need to remember your refund policy — it just needs to read it.

---

How RAG Works — The Core Concept

RAG has two phases that run at different times.

Phase 1 — Indexing (Offline)

Before any queries are answered, you build the index. Every document in your knowledge base gets:

Cleaned and split into smaller chunks

Converted into a numerical vector (an embedding) that captures its meaning

Stored in a vector database alongside the original text

This happens once, then incrementally as documents change.
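One way to picture the incremental part is to store a content hash next to each indexed entry and re-embed only what changed. The sketch below is an illustration of that bookkeeping, not a prescribed design: `embed` is a hypothetical stand-in for a real embedding model, and the document IDs and the in-memory `index` dict are assumptions made for the example.

```python
import hashlib

def embed(text: str) -> list[float]:
    # Hypothetical stand-in for a real embedding model call.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync_index(index: dict, documents: dict[str, str]) -> list[str]:
    """Bring `index` in line with `documents`; return the IDs that were (re)embedded."""
    updated = []
    for doc_id, text in documents.items():
        digest = content_hash(text)
        entry = index.get(doc_id)
        # Embed only new documents or documents whose content changed
        if entry is None or entry["hash"] != digest:
            index[doc_id] = {"hash": digest, "text": text, "vector": embed(text)}
            updated.append(doc_id)
    # Drop entries whose source document no longer exists
    for doc_id in list(index):
        if doc_id not in documents:
            del index[doc_id]
    return updated

index = {}
docs = {"refunds": "Refunds within 14 days.", "returns": "Returns within 30 days."}
first_run = sync_index(index, docs)    # everything is new, so everything is embedded

docs["refunds"] = "Refunds within 30 days."
second_run = sync_index(index, docs)   # only the edited document is re-embedded

print(first_run, second_run)
```

In a real system the same idea would apply per chunk rather than per document, with the hash stored as metadata in the vector database.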

Phase 2 — Retrieval and Generation (At Query Time)

When a user asks a question:

The question is converted into the same kind of embedding vector

The vector database finds chunks whose embeddings are most similar to the question

The most relevant chunks are inserted into the prompt as context

The LLM reads the context and generates a grounded answer

User question
     │
     ▼
[Embed question]
     │
     ▼
[Search vector DB] ──► [Retrieved chunks]
                               │
                               ▼
                    [Build prompt with context]
                               │
                               ▼
                         [LLM generates answer]
                               │
                               ▼
                      Answer with citations

The model never "searches" anything. It only reads what you put in the prompt. Your retrieval system is what does the searching.

---

A Minimal Working RAG System

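Here's the simplest complete RAG implementation: no framework, no abstraction layer, just the core mechanics. It assumes the anthropic, sentence-transformers, and numpy packages are installed and that an ANTHROPIC_API_KEY environment variable is set.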

from anthropic import Anthropic
from sentence_transformers import SentenceTransformer
import numpy as np

client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# --- Indexing phase ---

documents = [
    "Digital product refunds are accepted within 14 days of purchase if the product has not been downloaded.",
    "Physical product returns must be initiated within 30 days. Items must be in original packaging.",
    "Subscription cancellations take effect at the end of the current billing period. No partial refunds.",
    "If a product is defective, a full refund or replacement is offered regardless of time since purchase.",
    "Refund requests can be submitted through the Help Center under Orders > Request Refund.",
]

doc_embeddings = embedder.encode(documents, normalize_embeddings=True)


# --- Retrieval phase ---

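def retrieve(query: str, top_k: int = 3) -> list[str]:
    # Embed the query with the same model used for the documents
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    # Vectors are unit-length, so the dot product equals cosine similarity
    scores = doc_embeddings @ query_vec
    # Indices of the top_k highest scores, best match first
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [documents[i] for i in top_indices]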


# --- Generation phase ---

def answer(query: str) -> str:
    chunks = retrieve(query)
    context = "\n".join(f"- {c}" for c in chunks)

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=(
            "Answer using only the provided context. "
            "If the context does not contain the answer, say so clearly. "
            "Do not use outside knowledge."
        ),
        messages=[
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}"
            }
        ]
    )
    return response.content[0].text


print(answer("Can I get a refund on a subscription I just bought?"))

This is functional RAG in under 50 lines. Every production RAG system is this structure with more sophistication at each stage.
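The diagram earlier ends in "Answer with citations", which the minimal example leaves out. One possible way to get there is to number the retrieved chunks in the prompt and ask the model to cite those numbers. The sketch below only builds the prompt; the `build_prompt` helper, the `source` field, and the file names are hypothetical and chosen for illustration.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    # Number each chunk so the model can refer to it as [1], [2], ...
    context_lines = [
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer using only the numbered context below. "
        "Cite the passages you used, for example [1].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical retrieved chunks, each carrying the document it came from
retrieved = [
    {"source": "refund-policy.md", "text": "Digital product refunds are accepted within 14 days."},
    {"source": "help-center.md", "text": "Refund requests go through Orders > Request Refund."},
]

prompt = build_prompt("How do I request a refund?", retrieved)
print(prompt)
```

Because the numbers map back to entries in `retrieved`, the application can turn a `[2]` in the answer into a link to the original document.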

---

The Key Components Explained

Embeddings — Meaning as Numbers

An embedding is a vector (a list of numbers) that represents the meaning of a piece of text. Texts with similar meanings have vectors that point in similar directions in high-dimensional space. This is what makes semantic search possible — you can find chunks about "cancellation policy" even if the query uses different words like "stop my subscription."

# Reuses the `embedder` defined in the minimal example above
text_a = "How do I cancel my plan?"
text_b = "Steps to terminate a subscription"
text_c = "What is the weather like today?"

vecs = embedder.encode([text_a, text_b, text_c], normalize_embeddings=True)

# a and b are semantically similar → relatively high dot product
print(f"a·b similarity: {vecs[0] @ vecs[1]:.3f}")

# a and c are unrelated → dot product much closer to zero
# (exact values depend on the embedding model)
print(f"a·c similarity: {vecs[0] @ vecs[2]:.3f}")

The embedding model is a separate neural network trained specifically to map text to vectors. It is not the same as the LLM. You use it twice: once at indexing time to embed your documents, and once at query time to embed the user's question. They must be the same model both times — different models produce incompatible vector spaces.
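The snippets in this article pass `normalize_embeddings=True` and then compare vectors with a plain dot product. A small sketch with toy 2-dimensional vectors (an assumption for readability; real embeddings have hundreds of dimensions) shows why that works: once vectors are scaled to unit length, their dot product is the cosine of the angle between them.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalize(v: list[float]) -> list[float]:
    # Scale the vector to length 1 without changing its direction
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

a = [3.0, 4.0]
b = [4.0, 3.0]

cos = cosine_similarity(a, b)
dot_of_units = sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Both values agree: cosine similarity == dot product of unit vectors
print(round(cos, 2), round(dot_of_units, 2))
```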

Vector Database — Similarity Search at Scale

A vector database stores embeddings and searches them by similarity efficiently. With millions of document chunks, comparing the query vector to every stored vector on every query becomes too slow, so the database uses approximate nearest neighbor (ANN) indexes such as HNSW or IVF to find close matches in milliseconds, trading a small amount of recall for speed.
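For comparison, here is what exact (brute-force) search looks like: every stored vector is scored against the query. This is a baseline sketch with toy vectors, not how any particular vector database is implemented; `exact_top_k` is a hypothetical helper name.

```python
import heapq

def exact_top_k(query: list[float], vectors: list[list[float]], k: int) -> list[tuple[float, int]]:
    # Score every stored vector against the query: O(N * d) work per query
    scored = (
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    # Keep the k highest-scoring (score, index) pairs, best first
    return heapq.nlargest(k, scored)

vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
results = exact_top_k([1.0, 0.0], vectors, k=2)
print(results)
```

The cost grows linearly with the number of stored vectors, which is fine for a few thousand chunks and is the reason ANN indexes exist for larger collections.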

Popular options:

| Database | Best for |
| --- | --- |
| Chroma | Local development, small projects |
| Pinecone | Managed cloud, production at scale |
| Weaviate | Open source, self-hosted with rich filtering |
| pgvector | Already on PostgreSQL, don't want another service |
| Qdrant | Open source, high performance, Docker-friendly |

For a new project, start with Chroma locally. For production, choose based on your infrastructure — if you're already on PostgreSQL, pgvector adds the least operational overhead.

Chunking — Breaking Documents Into Retrieval Units

You index chunks, not entire documents. A chunk should be small enough to be relevant to a specific question, and large enough to contain enough context to answer it.

The simplest chunking strategy — fixed size with overlap:

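def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    # Sizes are counted in words. overlap must be smaller than chunk_size,
    # otherwise the window would never advance.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break  # last chunk reached; avoid emitting a redundant tail
        start += chunk_size - overlap
    return chunks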

document = "Your long document text here..."
chunks = chunk_text(document)

The overlap ensures that information near a chunk boundary doesn't get split between two chunks and lost in both. A sentence that starts near the end of chunk 1 appears again at the start of chunk 2.
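To make the overlap visible, the sketch below runs the same sliding-window idea on ten placeholder words with deliberately tiny settings. The `chunk_words` helper and the `w0`...`w9` tokens are assumptions made for this illustration only.

```python
def chunk_words(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    # Same sliding-window idea, kept as word lists for easy inspection
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the window has reached the end of the text
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = chunk_words(words, chunk_size=4, overlap=1)

for c in chunks:
    print(c)
```

Each chunk starts with the last word of the previous one, so text that straddles a boundary is still seen in full by at least one neighbouring chunk as long as it fits within the overlap.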

Chunk size is a tuning parameter. Smaller chunks are more precise but may lack context. Larger chunks are more contextual but may match too broadly. 200–500 words is a reasonable starting range for most document types.

---

What Makes RAG Work Well

The quality of a RAG system is almost entirely determined by retrieval quality, not generation quality. The LLM is a capable reasoner — give it the right information and it will produce a good answer. Give it the wrong information (or no information) and no model is smart enough to compensate.

Three things determine retrieval quality:

Chunking strategy. If your chunks split important information across boundaries, or if they're too large to be specific, retrieval will return chunks that are relevant-adjacent but not actually useful. Poor chunking is the most common root cause of RAG failures.

Embedding model. A general-purpose embedding model works for general text. Domain-specific text — legal, medical, financial, code — often benefits from a domain-specific embedding model. If your similarity scores look right but retrieval is still wrong, the embedding model is the first thing to investigate.

Top-K and filtering. How many chunks you retrieve, and whether you filter by metadata before retrieving, determines what the model sees. Too few chunks and you miss relevant information. Too many and you dilute the signal with noise — the model's attention spreads across irrelevant context and accuracy drops.
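As a rough illustration of top-K combined with metadata filtering, the sketch below narrows the candidate set by metadata first and ranks only what remains. The record layout, the `where` parameter, and the `product` field are hypothetical; real vector databases expose their own filter syntax.

```python
def search(query_vec, records, top_k=3, where=None):
    """Rank records by dot product, keeping only those whose metadata
    matches every key/value pair in `where`."""
    where = where or {}
    candidates = [
        r for r in records
        if all(r["meta"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: sum(q * v for q, v in zip(query_vec, r["vector"])),
        reverse=True,
    )
    return ranked[:top_k]

records = [
    {"id": "a", "vector": [0.9, 0.1], "meta": {"product": "digital"}},
    {"id": "b", "vector": [0.8, 0.2], "meta": {"product": "physical"}},
    {"id": "c", "vector": [0.1, 0.9], "meta": {"product": "digital"}},
]

unfiltered = search([1.0, 0.0], records, top_k=2)
filtered = search([1.0, 0.0], records, top_k=2, where={"product": "digital"})

print([r["id"] for r in unfiltered])
print([r["id"] for r in filtered])
```

Without the filter, the physical-product chunk takes one of the two slots purely on similarity; with it, that slot goes to a chunk from the right category.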

---

What RAG Cannot Do

Understanding the limits of RAG is as important as understanding how to use it.

RAG does not make the model smarter. If the model can't reason correctly over a piece of information, providing that information via RAG doesn't fix it. RAG solves the knowledge access problem, not the reasoning quality problem.

RAG does not guarantee accuracy. If the wrong chunks are retrieved — either because retrieval failed or because the relevant information doesn't exist in the index — the model will either say it doesn't know (good) or hallucinate based on what it did receive (bad). Good retrieval is the guardrail, not the model.
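One common mitigation, sketched here under assumptions, is to refuse before calling the model when nothing retrieved is similar enough to the question. The `min_score` value of 0.3 is an arbitrary placeholder that would need tuning per embedding model and dataset, and the `CALL_LLM_WITH` string merely marks where the real generation call would go.

```python
def select_context(scored_chunks: list[tuple[float, str]], min_score: float = 0.3) -> list[str]:
    # Keep only chunks whose similarity clears the (assumed) threshold
    return [text for score, text in scored_chunks if score >= min_score]

def answer_or_decline(scored_chunks: list[tuple[float, str]]) -> str:
    context = select_context(scored_chunks)
    if not context:
        # Skip the LLM call rather than let it answer from weak context
        return "I couldn't find anything relevant in the knowledge base."
    return "CALL_LLM_WITH:\n" + "\n".join(context)

weak = [(0.12, "Office opening hours."), (0.08, "Parking information.")]
strong = [
    (0.71, "Digital product refunds are accepted within 14 days."),
    (0.10, "Parking information."),
]

print(answer_or_decline(weak))
print(answer_or_decline(strong))
```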

RAG is not a substitute for structured data. If a user asks "what is the current price of product X?", the answer should come from a database query, not from a document retrieved by similarity search. Use RAG for knowledge that is best expressed in natural language. Use your existing data systems for structured, precise, frequently updated facts.
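For the price example, a direct lookup might look like the sketch below. It uses an in-memory SQLite database so it runs on its own; the `products` table, its columns, and the `current_price` helper are hypothetical stand-ins for whatever system already holds your structured data.

```python
import sqlite3

# In-memory database standing in for an existing product catalog
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT PRIMARY KEY, price_cents INTEGER)")
conn.execute("INSERT INTO products VALUES (?, ?)", ("Widget", 1999))
conn.commit()

def current_price(name: str):
    # Exact match on a key: no embeddings, no similarity ranking
    row = conn.execute(
        "SELECT price_cents FROM products WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0] / 100

print(current_price("Widget"))
print(current_price("Gadget"))
```

The lookup either returns the stored value or nothing at all, which is the behavior you want for facts where a near match is a wrong answer.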

RAG adds latency. Every query now involves embedding the question, searching the vector database, optionally re-ranking results, and then calling the LLM. Each step adds time. For latency-sensitive applications, budget for this from the start.
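A simple way to budget for this is to time each stage separately from the first prototype onward. The sketch below uses `time.sleep` as a stand-in for the real embedding, search, and generation calls; the stage names and durations are placeholders, not measurements.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    # Record wall-clock time spent inside the `with` block
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Stand-in stages; replace the sleeps with real calls
with timed("embed"):
    time.sleep(0.01)
with timed("search"):
    time.sleep(0.01)
with timed("generate"):
    time.sleep(0.02)

total = sum(timings.values())
for stage, seconds in timings.items():
    print(f"{stage:<9}{seconds * 1000:7.1f} ms  ({seconds / total:.0%})")
```

Seeing the split per stage makes it clear where optimization effort would pay off before you start tuning anything.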

---

When to Use RAG

Use RAG when:

Your application needs to answer questions over private documents the LLM was not trained on

Your knowledge base changes frequently enough that fine-tuning would be impractical

You need citations — specific sources the user can verify

You need to control what information the model can access (compliance, access control)

Consider alternatives when:

The information is small and stable — just put it in the system prompt

The task requires precise structured data — query your database directly

Latency is critical and the knowledge is narrow — a fine-tuned model may be faster

The question requires reasoning over the entire document, not a retrieved fragment

---

The Bigger Picture

RAG is one answer to a foundational question in AI engineering: how do you connect a general-purpose reasoning engine to specific, private, or current information?

The mental model that helps: the LLM is a reasoning engine, not a knowledge store. It is very good at reading a piece of text and drawing conclusions from it. It is unreliable as a database of facts. Design your system accordingly — store knowledge in systems built for knowledge storage (documents, databases, vector stores), and let the LLM do what it's good at: reading that knowledge and reasoning over it.

RAG is the bridge between those two roles. The retrieval system handles the "find the right information" problem. The LLM handles the "make sense of the information and answer the question" problem. Neither does the other's job.

Build that separation cleanly and RAG works well. Blur it and you'll spend weeks debugging failures that are actually architectural.

That's the foundation. The engineering depth comes next.

Introduction to RAG Systems — Giving LLMs Access to Your Data