Every LLM has the same fundamental limitation: its knowledge is frozen at training time. Ask it about your internal documentation, your product catalog, your customer history, or anything that happened after its cutoff date — and it either guesses, or it says it doesn't know.
For most real-world AI applications, a model that only knows what it learned during training is not enough.
Retrieval-Augmented Generation (RAG) is the standard engineering solution to this problem. Instead of relying on what the model memorized in its weights, you retrieve relevant information from your own data sources at the time of each query, inject that information into the prompt, and let the model reason over it.
The model's job shifts from remembering to reasoning. That's a job it's much better at.
---
The Problem RAG Solves
Consider what happens when you ask an LLM a question it can't answer from training alone.
If you ask "What is our refund policy for digital products?", the model has no idea. It wasn't trained on your policy. It might generate a plausible-sounding policy based on patterns from other companies — which is worse than saying nothing, because it sounds authoritative.
The naive fix is fine-tuning: retrain the model on your data. This has real problems:
RAG solves these problems differently. Instead of baking knowledge into the weights, you retrieve it on demand and hand it to the model as context. The model doesn't need to remember your refund policy — it just needs to read it.
---
How RAG Works — The Core Concept
RAG has two phases that run at different times.
Phase 1 — Indexing (Offline)
Before any queries are answered, you build the index. Every document in your knowledge base gets split into chunks, embedded, and stored in a vector index.
This happens once, then incrementally as documents change.
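The incremental part can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a production indexer: `sync_index`, the layout of the `index` dict, and `fake_embed` are hypothetical names, `fake_embed` is a stand-in for a real embedding model, and chunking is left out to keep the example short. The idea is to store a content hash next to each vector so that only new or changed documents get re-embedded.

```python
import hashlib
from typing import Callable

Vector = list[float]


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_index(
    index: dict[str, dict],
    docs: dict[str, str],
    embed: Callable[[str], Vector],
) -> list[str]:
    """Bring `index` in line with `docs`; return the IDs that were (re-)embedded."""
    updated = []
    for doc_id, text in docs.items():
        digest = content_hash(text)
        entry = index.get(doc_id)
        if entry is None or entry["hash"] != digest:
            # New or changed document: embed it and overwrite the old entry.
            index[doc_id] = {"hash": digest, "text": text, "vector": embed(text)}
            updated.append(doc_id)
    # Drop entries whose source document no longer exists.
    for doc_id in list(index):
        if doc_id not in docs:
            del index[doc_id]
    return updated


def fake_embed(text: str) -> Vector:
    """Placeholder embedding (assumption): a real system would call an embedding model."""
    return [float(len(text)), float(text.count(" "))]


index: dict[str, dict] = {}
docs = {"refunds": "Refunds within 14 days.", "returns": "Returns within 30 days."}

first_run = sync_index(index, docs, fake_embed)    # both documents are new
docs["refunds"] = "Refunds within 30 days."        # one document changes
second_run = sync_index(index, docs, fake_embed)   # only that one is re-embedded

print(first_run)   # ['refunds', 'returns']
print(second_run)  # ['refunds']
```

Hashing the content is one possible change-detection strategy; a last-modified timestamp from the source system would serve the same purpose.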
Phase 2 — Retrieval and Generation (At Query Time)
When a user asks a question:
User question
       │
       ▼
[Embed question]
       │
       ▼
[Search vector DB] ──► [Retrieved chunks]
                               │
                               ▼
                  [Build prompt with context]
                               │
                               ▼
                    [LLM generates answer]
                               │
                               ▼
                     Answer with citations
The model never "searches" anything. It only reads what you put in the prompt. Your retrieval system is what does the searching.
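Since the prompt is the only channel between retrieval and the model, it helps to see how one might be assembled. The sketch below is illustrative only: `build_prompt` and the numbered `[1]`, `[2]` citation format are assumptions, not a fixed convention. Numbering the chunks gives the model something concrete to cite.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the user message: numbered context chunks, then the question."""
    if not chunks:
        context = "(no relevant context was retrieved)"
    else:
        # Number each chunk so the model can cite it as [1], [2], ...
        context = "\n".join(
            f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
        )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Cite the supporting chunk numbers in square brackets after each claim."
    )


prompt = build_prompt(
    "Can I return a physical product after three weeks?",
    [
        "Physical product returns must be initiated within 30 days.",
        "Items must be in original packaging.",
    ],
)
print(prompt)
```

Whatever does not make it into this string does not exist as far as the model is concerned, which is why retrieval mistakes show up directly in the answer.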
---
A Minimal Working RAG System
Here's the simplest complete RAG implementation — no framework, no abstraction layer, just the core mechanics.
from anthropic import Anthropic
from sentence_transformers import SentenceTransformer
import numpy as np

client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# --- Indexing phase ---
documents = [
    "Digital product refunds are accepted within 14 days of purchase if the product has not been downloaded.",
    "Physical product returns must be initiated within 30 days. Items must be in original packaging.",
    "Subscription cancellations take effect at the end of the current billing period. No partial refunds.",
    "If a product is defective, a full refund or replacement is offered regardless of time since purchase.",
    "Refund requests can be submitted through the Help Center under Orders > Request Refund.",
]
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

# --- Retrieval phase ---
def retrieve(query: str, top_k: int = 3) -> list[str]:
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_embeddings @ query_vec
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [documents[i] for i in top_indices]

# --- Generation phase ---
def answer(query: str) -> str:
    chunks = retrieve(query)
    context = "\n".join(f"- {c}" for c in chunks)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=(
            "Answer using only the provided context.\n"
            "If the context does not contain the answer, say so clearly.\n"
            "Do not use outside knowledge."
        ),
        messages=[
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            }
        ],
    )
    return response.content[0].text

print(answer("Can I get a refund on a subscription I just bought?"))
This is functional RAG in under 50 lines. Production RAG systems keep this same structure and add sophistication at each stage.
---
The Key Components Explained
Embeddings — Meaning as Numbers
An embedding is a vector (a list of numbers) that represents the meaning of a piece of text. Texts with similar meanings have vectors that point in similar directions in high-dimensional space. This is what makes semantic search possible — you can find chunks about "cancellation policy" even if the query uses different words like "stop my subscription."
text_a = "How do I cancel my plan?"
text_b = "Steps to terminate a subscription"
text_c = "What is the weather like today?"
vecs = embedder.encode([text_a, text_b, text_c], normalize_embeddings=True)
# a and b are semantically similar → noticeably higher dot product
print(f"a·b similarity: {vecs[0] @ vecs[1]:.3f}")
# a and c are unrelated → dot product close to zero
print(f"a·c similarity: {vecs[0] @ vecs[2]:.3f}")
# Exact values depend on the embedding model; what matters is the gap between them.
The embedding model is a separate neural network trained specifically to map text to vectors. It is not the same as the LLM. You use it twice: once at indexing time to embed your documents, and once at query time to embed the user's question. It must be the same model both times, because different models produce incompatible vector spaces and similarity scores across them are meaningless.
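The snippets in this article normalize embeddings and then compare them with a plain dot product. The reason that works: for unit-length vectors, the dot product equals cosine similarity. A small sketch in plain Python, using made-up three-dimensional vectors rather than real embeddings (the helper names are illustrative):

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors for illustration; real embeddings have hundreds of dimensions.
a = [3.0, 4.0, 0.0]
b = [6.0, 8.0, 0.0]   # same direction as a, twice the length
c = [0.0, 0.0, 5.0]   # orthogonal to a

print(cosine_similarity(a, b))           # same direction: similarity of 1
print(dot(normalize(a), normalize(b)))   # same value, up to floating-point rounding
print(cosine_similarity(a, c))           # orthogonal: similarity of 0
```

Normalizing once at indexing time means each query only needs dot products, which is cheaper than recomputing vector lengths for every comparison.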
Vector Database — Similarity Search at Scale
A vector database stores embeddings and lets you search them by similarity efficiently. When you have millions of document chunks, comparing the query vector to every single one is too slow for interactive use, so the database uses approximate nearest neighbor (ANN) algorithms such as HNSW and IVF to find close matches in milliseconds, trading a small amount of recall for speed.
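For contrast, here is the exact search that an ANN index approximates: score every stored vector, keep the best k. This is a sketch for intuition only; `exact_top_k` is a hypothetical helper and the vectors are made up. Its cost grows linearly with the number of stored chunks, which is what ANN indexes are built to avoid.

```python
import heapq


def exact_top_k(
    query: list[float],
    vectors: list[list[float]],
    k: int,
) -> list[tuple[int, float]]:
    """Return (index, score) for the k highest dot products, best first.

    Every stored vector is scored, so cost grows linearly with the index size.
    """
    scored = (
        (i, sum(q * x for q, x in zip(query, vec)))
        for i, vec in enumerate(vectors)
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Tiny made-up unit vectors; a real index holds far more, in far more dimensions.
vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.8, 0.6],
]
query = [1.0, 0.0]

print(exact_top_k(query, vectors, k=2))  # [(0, 1.0), (3, 0.8)]
```

For a few thousand chunks this brute-force approach is perfectly adequate, which is why the minimal example earlier gets away with a single matrix multiplication.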
Popular options:
| Database | Best for |
|---|---|
| Chroma | Local development, small projects |
| Pinecone | Managed cloud, production at scale |
| Weaviate | Open source, self-hosted with rich filtering |
| pgvector | Already on PostgreSQL, don't want another service |
| Qdrant | Open source, high performance, Docker-friendly |
Chunking — Breaking Documents Into Retrieval Units
You index chunks, not entire documents. A chunk should be small enough to be relevant to a specific question, and large enough to contain enough context to answer it.
The simplest chunking strategy — fixed size with overlap:
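def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        # Otherwise the window would never advance and the loop would not end.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break  # last chunk reached; avoid emitting a redundant tail chunk
        start += chunk_size - overlap  # step forward, keeping `overlap` words
    return chunks

document = "Your long document text here..."
chunks = chunk_text(document)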
The overlap reduces the chance that information near a chunk boundary gets split between two chunks and lost in both. With an overlap of 50, the last 50 words of chunk 1 appear again at the start of chunk 2, so a sentence that straddles the boundary is likely to survive intact in one of them.
Chunk size is a tuning parameter. Smaller chunks are more precise but may lack context. Larger chunks are more contextual but may match too broadly. 200–500 words is a reasonable starting range for most document types.
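To make the trade-off concrete, the sketch below computes where fixed-size windows fall for a hypothetical 1,000-word document. `chunk_spans` is an illustrative helper, and it assumes the chunker stops as soon as a window reaches the end of the text. Smaller chunks mean more vectors to store and search, and the overlap means some words are indexed more than once.

```python
def chunk_spans(n_words: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) word offsets for fixed-size chunks with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    spans = []
    start = 0
    while start < n_words:
        end = min(start + chunk_size, n_words)
        spans.append((start, end))
        if end == n_words:
            break  # the final window already reaches the end of the text
        start += chunk_size - overlap
    return spans


n_words = 1000  # hypothetical document length
for chunk_size in (200, 500):
    spans = chunk_spans(n_words, chunk_size, overlap=50)
    indexed_words = sum(end - start for start, end in spans)
    print(chunk_size, len(spans), indexed_words)
# 200 7 1300
# 500 3 1100
```

With 200-word chunks the document becomes 7 vectors and 1,300 indexed words; with 500-word chunks it becomes 3 vectors and 1,100 indexed words.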
---
What Makes RAG Work Well
The quality of a RAG system is almost entirely determined by retrieval quality, not generation quality. The LLM is a capable reasoner — give it the right information and it will produce a good answer. Give it the wrong information (or no information) and no model is smart enough to compensate.
Three things determine retrieval quality:
Chunking strategy. If your chunks split important information across boundaries, or if they're too large to be specific, retrieval will return chunks that are relevant-adjacent but not actually useful. Poor chunking is the most common root cause of RAG failures.
Embedding model. A general-purpose embedding model works for general text. Domain-specific text — legal, medical, financial, code — often benefits from a domain-specific embedding model. If your similarity scores look right but retrieval is still wrong, the embedding model is the first thing to investigate.
Top-K and filtering. How many chunks you retrieve, and whether you filter by metadata before retrieving, determines what the model sees. Too few chunks and you miss relevant information. Too many and you dilute the signal with noise — the model's attention spreads across irrelevant context and accuracy drops.
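The third point can be sketched in a few lines. The code below is an illustration under assumptions: the `Chunk` dataclass, the `search` function, its `where` and `min_score` parameters, and the two-dimensional vectors are all made up for the example, and real vector databases expose their own filter syntax. It filters by metadata first, drops weak matches, and only then applies top-k.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    text: str
    vector: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def search(
    query_vec: list[float],
    chunks: list[Chunk],
    top_k: int = 3,
    min_score: float = 0.0,
    where: Optional[dict[str, str]] = None,
) -> list[tuple[float, Chunk]]:
    """Filter by metadata first, then rank the survivors by dot product."""
    where = where or {}
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in where.items())
    ]
    scored = [
        (sum(q * x for q, x in zip(query_vec, c.vector)), c) for c in candidates
    ]
    # Drop weak matches so they never reach the prompt as noise.
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


chunks = [
    Chunk("Digital refunds within 14 days.", [0.9, 0.1], {"product": "digital"}),
    Chunk("Physical returns within 30 days.", [0.8, 0.2], {"product": "physical"}),
    Chunk("Office opening hours.", [0.1, 0.9], {"product": "digital"}),
]
query_vec = [1.0, 0.0]  # made-up query embedding

hits = search(query_vec, chunks, top_k=2, min_score=0.5, where={"product": "digital"})
for score, chunk in hits:
    print(f"{score:.2f}  {chunk.text}")
```

Here the metadata filter removes the physical-product chunk and the score threshold removes the unrelated one, so a single chunk reaches the prompt even though `top_k` is 2. Returning fewer chunks than requested is often preferable to padding the context with noise.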
---
What RAG Cannot Do
Understanding the limits of RAG is as important as understanding how to use it.
RAG does not make the model smarter. If the model can't reason correctly over a piece of information, providing that information via RAG doesn't fix it. RAG solves the knowledge access problem, not the reasoning quality problem.
RAG does not guarantee accuracy. If the wrong chunks are retrieved — either because retrieval failed or because the relevant information doesn't exist in the index — the model will either say it doesn't know (good) or hallucinate based on what it did receive (bad). Good retrieval is the guardrail, not the model.
RAG is not a substitute for structured data. If a user asks "what is the current price of product X?", the answer should come from a database query, not from a document retrieved by similarity search. Use RAG for knowledge that is best expressed in natural language. Use your existing data systems for structured, precise, frequently updated facts.
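One way to act on this is to route questions before retrieval. The sketch below is deliberately crude and entirely hypothetical: the `products` table, the `lookup_price` and `route` functions, the keyword check, and the `ROUTE_TO_RAG` placeholder are assumptions made for illustration. A real system might use a classifier or tool calling instead of keywords.

```python
import sqlite3
from typing import Optional

# Hypothetical product table standing in for an existing data system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT PRIMARY KEY, price_cents INTEGER)")
conn.execute("INSERT INTO products VALUES (?, ?)", ("Widget Pro", 4999))


def lookup_price(product_name: str) -> Optional[str]:
    """Exact lookup: the answer comes from the database, not from similarity search."""
    row = conn.execute(
        "SELECT price_cents FROM products WHERE name = ?", (product_name,)
    ).fetchone()
    if row is None:
        return None
    return f"{product_name} costs ${row[0] / 100:.2f}."


def route(question: str, product_names: list[str]) -> str:
    """Crude keyword router (assumption): price questions go to SQL, the rest to RAG."""
    lowered = question.lower()
    if "price" in lowered or "cost" in lowered:
        for name in product_names:
            if name.lower() in lowered:
                result = lookup_price(name)
                if result is not None:
                    return result
    return "ROUTE_TO_RAG"  # placeholder for calling the retrieval pipeline


print(route("What is the current price of Widget Pro?", ["Widget Pro"]))
print(route("What is your refund policy for digital products?", ["Widget Pro"]))
```

The first question is answered from the table with the exact stored price; the second falls through to the retrieval pipeline.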
RAG adds latency. Every query now involves embedding the question, searching the vector database, optionally re-ranking results, and then calling the LLM. Each step adds time. For latency-sensitive applications, budget for this from the start.
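Budgeting starts with measuring. A minimal sketch of per-stage timing follows; the `timed` helper and `answer_with_timings` are hypothetical, and the three stages are stand-ins for the real embedding, search, and LLM calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict[str, float]):
    """Record the wall-clock duration of a pipeline stage in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


def answer_with_timings(question: str) -> tuple[str, dict[str, float]]:
    timings: dict[str, float] = {}
    with timed("embed", timings):
        query_vec = [float(len(question))]     # stand-in for the embedding call
    with timed("search", timings):
        chunks = ["Refunds within 14 days."]   # stand-in for the vector search
    with timed("generate", timings):
        # Stand-in for the LLM call.
        reply = f"Based on {len(chunks)} chunk(s): {chunks[0]}"
    return reply, timings


reply, timings = answer_with_timings("Can I get a refund?")
for stage, seconds in timings.items():
    print(f"{stage:>8}: {seconds * 1000:.3f} ms")
```

Logging these numbers per request shows which stage to optimize first, rather than guessing.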
---
When to Use RAG
Use RAG when:
Consider alternatives when:
---
The Bigger Picture
RAG is one answer to a foundational question in AI engineering: how do you connect a general-purpose reasoning engine to specific, private, or current information?
The mental model that helps: the LLM is a reasoning engine, not a knowledge store. It is very good at reading a piece of text and drawing conclusions from it. It is unreliable as a database of facts. Design your system accordingly — store knowledge in systems built for knowledge storage (documents, databases, vector stores), and let the LLM do what it's good at: reading that knowledge and reasoning over it.
RAG is the bridge between those two roles. The retrieval system handles the "find the right information" problem. The LLM handles the "make sense of the information and answer the question" problem. Neither does the other's job.
Build that separation cleanly and RAG works well. Blur it and you'll spend weeks debugging failures that are actually architectural.
That's the foundation. The engineering depth comes next.