Before you integrate an LLM into a system, before you write a prompt, before you pick a model — you need a working mental model of what an LLM actually is. Not the marketing version. Not the "it's like a brain" version. The engineering version.
This post is that foundation. If you understand what's in here, you'll debug faster, prompt better, and build more reliable systems than engineers who treat LLMs as black boxes.
---
What an LLM Actually Does
A Large Language Model does one thing: given a sequence of tokens, it predicts the probability distribution of the next token.
That's it. Everything else — reasoning, summarization, code generation, question answering — is an emergent consequence of doing that one thing well, at scale, over a massive and diverse training corpus.
A token is the unit of text the model works with. Tokens are roughly 3–4 characters on average in English, though the exact mapping depends on the tokenizer. "unbelievable" might be two tokens: "unbeli" and "evable". A space before a word often becomes part of the token. Punctuation usually gets its own token.
import anthropic

# Tokens are not words. Ask the API how many tokens a message uses.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "unbelievable engineering"}],
)
# The count includes a few tokens of message framing, not just the two words.
# Always budget for tokens, not characters or words.
print(f"Input tokens used: {count.input_tokens}")
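When an exact count is not available (offline, or before you have picked a model), the characters-per-token rule of thumb above is often enough for rough budgeting. The sketch below is a heuristic only: `estimate_tokens` and its `chars_per_token` default are hypothetical names chosen for illustration, and real counts depend on the tokenizer and the language.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English-oriented heuristic)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)

sample = "unbelievable engineering"
print(f"{len(sample)} characters, about {estimate_tokens(sample)} tokens")
```

Treat the result as an upper-level planning number, and switch to a real count before enforcing hard limits.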
The model doesn't "read" your prompt the way a human reads text. It converts your text to a sequence of token IDs, processes them through its layers, and produces a probability distribution over its entire vocabulary for what token should come next. It samples from that distribution. Then it repeats — appending the new token and predicting the next — until it hits a stop condition.
This process is called autoregressive generation. Every token the model generates becomes part of the context for the next token. The model is always completing what came before it.
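The loop itself is simple enough to sketch. The example below is a toy, not a real model: `NEXT_TOKEN_PROBS` is a hand-written table (an assumption for illustration) that only looks at the last token, whereas an LLM conditions on the whole context. The predict, sample, append, repeat structure is the part that carries over.

```python
import random

# Toy "model": a hand-written next-token distribution keyed by the last token.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 1.0},
    "the": {"server": 0.6, "model": 0.4},
    "server": {"crashed": 0.7, "restarted": 0.3},
    "model": {"predicts": 1.0},
    "crashed": {"<stop>": 1.0},
    "restarted": {"<stop>": 1.0},
    "predicts": {"<stop>": 1.0},
}

def generate(max_new_tokens: int = 10, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    context = ["<start>"]
    for _ in range(max_new_tokens):
        # 1. Get a distribution over the next token given the context
        dist = NEXT_TOKEN_PROBS[context[-1]]
        tokens, probs = zip(*dist.items())
        # 2. Sample one token from that distribution
        next_token = rng.choices(tokens, weights=probs, k=1)[0]
        # 3. Stop on a stop condition, otherwise append and repeat
        if next_token == "<stop>":
            break
        context.append(next_token)
    return context[1:]

print(" ".join(generate()))
```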
---
How LLMs Are Built
Understanding the training pipeline demystifies a lot of model behavior that otherwise feels random.
Stage 1 — Pre-training
The model is trained on an enormous corpus of text — books, code, web pages, scientific papers, forums — using a simple objective: predict the next token. This is called self-supervised learning because the labels (the next token) come from the data itself. No human annotation is needed at this stage.
The model learns to predict tokens by compressing statistical patterns from the entire training corpus into its weights. It learns grammar, facts, code syntax, reasoning patterns, and the structure of arguments — all as a side effect of learning to predict text well.
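To make "the labels come from the data itself" concrete, here is a minimal sketch of how one token sequence turns into training examples, plus the per-example loss. The function names are hypothetical, and real training batches this over billions of tokens; cross-entropy is the standard loss for next-token prediction.

```python
import math

def next_token_pairs(token_ids: list[int]) -> list[tuple[list[int], int]]:
    """Turn one token sequence into (context, target) pairs by shifting by one."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def cross_entropy(predicted_probs: dict[int, float], target: int) -> float:
    """Loss for one prediction: low when the model gave the true next token high probability."""
    return -math.log(predicted_probs[target])

tokens = [12, 7, 99, 3]
for context, target in next_token_pairs(tokens):
    print(context, "->", target)

# A model that splits its probability 50/50 between tokens 7 and 99
print(cross_entropy({7: 0.5, 99: 0.5}, target=7))
```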
The result is called a base model or foundation model. It knows a lot, but it's not particularly useful in conversation. Ask it a question and it might continue it as if it's part of a longer document, rather than answering it.
Stage 2 — Supervised Fine-Tuning (SFT)
The base model is fine-tuned on curated examples of the behavior you want: good question-answer pairs, helpful conversations, well-structured code. Human annotators write the ideal responses. The model learns to follow the format and style of an assistant.
This is where the model learns to respond rather than just continue.
Stage 3 — Reinforcement Learning from Human Feedback (RLHF)
The model generates multiple responses to the same prompt. Human raters rank them. A separate model (the reward model) is trained to predict human preference scores. Then the LLM is fine-tuned to maximize those scores using reinforcement learning.
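One common way to train a reward model from rankings is a pairwise (Bradley-Terry style) loss over a preferred and a rejected response. That formulation is an assumption here, not something this post specifies, and the function name is hypothetical. The sketch shows only the loss, not the model that produces the scores.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

# The loss shrinks as the reward model scores the human-preferred response higher
print(pairwise_preference_loss(2.0, 0.0))
print(pairwise_preference_loss(0.0, 2.0))
```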
This stage does much of the work of making modern LLMs helpful, harmless, and honest: the reward model encodes human preferences, and the LLM is trained to satisfy them.
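| Stage | Pre-training | SFT | RLHF |
|---|---|---|---|
| Training data | Raw text corpus | Curated Q&A | Human preference |
| What it learns | Predicts tokens | Learns format | Learns alignment |
| Resulting model | Base model | Instruct model | Chat model |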
Each stage builds on the last. The capabilities come from pre-training. The behavior comes from fine-tuning. The safety properties come from RLHF.
---
The Transformer — What's Inside
Every modern LLM is built on the Transformer architecture. You don't need to implement one, but you should understand what each component does to reason about model behavior.
Tokenization
Before any processing, text is split into tokens using a vocabulary-based tokenizer (BPE, WordPiece, or SentencePiece). Each token maps to an integer ID. This ID is looked up in an embedding table — a learned matrix that maps each token to a dense vector representation.
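import re

# Conceptual sketch only: real tokenizers (BPE, WordPiece, SentencePiece) learn subword units
vocab = {"hello": 0, "world": 1, "!": 2}
text = "hello world!"
# Split into words and punctuation, then map each piece to its integer ID (-1 = unknown)
pieces = re.findall(r"\w+|[^\w\s]", text.lower())
token_ids = [vocab.get(p, -1) for p in pieces]  # [0, 1, 2]
# Embeddings are learned vectors for each ID, not hand-crafted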
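Looking up an ID in the embedding table is just row indexing. The sketch below uses a randomly initialized table as a stand-in for learned weights; the sizes, the initialization scale, and the function names are all illustrative assumptions.

```python
import random

def make_embedding_table(vocab_size: int, dim: int, seed: int = 0) -> list[list[float]]:
    """One row per token ID. Random here; in a real model these values are learned."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids: list[int], table: list[list[float]]) -> list[list[float]]:
    """Map each token ID to its row in the table."""
    return [table[i] for i in token_ids]

table = make_embedding_table(vocab_size=3, dim=4)
vectors = embed([0, 1, 2], table)
print(len(vectors), "tokens, each a vector of length", len(vectors[0]))
```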
Attention — How the Model Relates Tokens
The core mechanism in a Transformer is self-attention. For each token, attention computes a weighted sum over the tokens it is allowed to see (in decoder-only LLMs, the tokens up to and including its own position), asking: "which tokens are most relevant to understanding this one?"
Conceptually:
For each token position:
- Compute a Query vector (what I'm looking for)
- Compute Key vectors for all tokens (what each token offers)
- Score = Query · Key / sqrt(d_k) (scaled dot product, where d_k is the key dimension)
- Softmax(scores) → attention weights
- Output = weighted sum of Value vectors
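The steps above can be sketched in plain Python for a single attention head. This version includes the usual 1/sqrt(d_k) scaling of the scores and a causal mask, so each position only sees itself and earlier positions. It takes queries, keys, and values directly; in a real model they come from learned projections of the token representations, which are omitted here.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def causal_self_attention(queries, keys, values):
    d_k = len(keys[0])
    outputs = []
    for pos, q in enumerate(queries):
        visible = range(pos + 1)  # causal mask: positions 0..pos only
        scores = [dot(q, keys[j]) / math.sqrt(d_k) for j in visible]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors, one output dimension at a time
        out = [sum(w * values[j][c] for w, j in zip(weights, visible))
               for c in range(len(values[0]))]
        outputs.append(out)
    return outputs

x = [[1.0, 0.0], [0.0, 1.0]]
print(causal_self_attention(x, x, x))
```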
Attention is what allows the model to understand that "it" in "the server crashed because it ran out of memory" refers to "the server" — not "memory." The token "it" attends heavily to "server" through the learned attention weights.
Multi-head attention runs this process in parallel with different learned projections (heads), letting the model attend to different relationships simultaneously — syntactic, semantic, positional.
Feed-Forward Layers and Residual Connections
After attention, each token's representation passes through a feed-forward network. This is where most of the model's "knowledge storage" happens — the weights encode factual associations learned during pre-training.
Residual connections (skip connections) add each sub-layer's input back to its output. This stabilizes training and lets information and gradients flow across many layers without vanishing.
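A residual connection is a one-line idea: output = input + sublayer(input). The sketch below uses a fixed ReLU transform as a hypothetical stand-in for the learned feed-forward network, and leaves out the layer normalization that real Transformer blocks also apply.

```python
def feed_forward(x: list[float]) -> list[float]:
    # Stand-in for a learned MLP: a fixed scale-and-shift followed by ReLU
    return [max(0.0, 0.5 * v - 0.1) for v in x]

def residual_block(x: list[float], sublayer) -> list[float]:
    # The input is added back to the transformed output
    return [xi + fi for xi, fi in zip(x, sublayer(x))]

x = [1.0, -1.0]
print(residual_block(x, feed_forward))
```

Even where the sub-layer outputs zero (the second element here), the original value passes through unchanged, which is what keeps information flowing across deep stacks.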
Layer Depth and Emergent Capability
Modern LLMs stack dozens to hundreds of these attention + feed-forward blocks. GPT-3 (the 175B-parameter version) has 96 layers. The depth is not cosmetic: capabilities that are absent or unreliable in smaller models show up as models get deeper and wider. Multi-step reasoning, analogy, and code generation have been observed to appear fairly abruptly with scale, in ways that are hard to predict from smaller models.
This is one reason benchmark numbers can jump non-linearly with parameter count. You're not just adding more of the same; you're crossing capability thresholds.
---
The Context Window
The context window is the maximum number of tokens the model can process at once — both the input (your prompt) and the output (the generated response) combined.
Modern models have context windows from 8K to 1M+ tokens. The limit is a hard one, so measure token usage on every call rather than guessing:
# Always track token usage — don't assume you're within limits
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
messages=[{"role": "user", "content": your_prompt}]
)
print(f"Input: {response.usage.input_tokens} tokens")
print(f"Output: {response.usage.output_tokens} tokens")
print(f"Total: {response.usage.input_tokens + response.usage.output_tokens} tokens")
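Because input and output share the same window, a pre-flight check is cheap insurance. The helpers below are a minimal sketch with hypothetical names; the 8,192-token window in the usage lines is only an example, so substitute your model's documented limit.

```python
def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the requested output budget fits in the window."""
    return input_tokens + max_output_tokens <= context_window

def max_output_budget(input_tokens: int, context_window: int) -> int:
    """Tokens left for the response after the prompt is accounted for."""
    return max(0, context_window - input_tokens)

WINDOW = 8192  # example value only
print(fits_context(6000, 2048, WINDOW))
print(fits_context(7000, 2048, WINDOW))
print(max_output_budget(7000, WINDOW))
```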
A critical property: models don't have uniform attention across the context window. Content at the very beginning and very end tends to receive more attention than content buried in the middle. For very long prompts, important instructions belong at the top or bottom — not sandwiched between thousands of tokens of document text.
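One way to act on this is to assemble long prompts so the instructions appear both before and after the bulk material. The sketch below is one possible layout, not a required format: the function name, the `<document>` tags, and the "Reminder:" wording are illustrative assumptions.

```python
def build_long_prompt(instructions: str, documents: list[str], question: str) -> str:
    """Put instructions at the top, documents in the middle, and a reminder at the bottom."""
    doc_block = "\n\n".join(
        f'<document index="{i}">\n{doc}\n</document>'
        for i, doc in enumerate(documents, start=1)
    )
    return (
        f"{instructions}\n\n"
        f"{doc_block}\n\n"
        f"Reminder: {instructions}\n\n"
        f"Question: {question}"
    )

prompt = build_long_prompt(
    "Answer concisely.",
    ["doc one", "doc two"],
    "What changed?",
)
print(prompt)
```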
---
Temperature and Sampling
When the model generates the next token, it has a probability distribution over its entire vocabulary. How it samples from that distribution is controlled by parameters you set.
Temperature scales the distribution before sampling:
- temperature = 0: Always pick the highest-probability token (greedy decoding). Close to deterministic, conservative, good for structured outputs.
- temperature = 1: Sample in proportion to the model's raw probabilities. Balanced creativity and coherence.
- temperature > 1: Flatten the distribution, so lower-probability tokens get sampled more often. More random, often incoherent at extremes. Not every API allows this; the Anthropic API accepts values from 0 to 1.
# Classification: deterministic
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=10,
    temperature=0,
    messages=[{"role": "user", "content": classify_prompt}],
)
# Creative writing: more variation
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    temperature=1.0,
    messages=[{"role": "user", "content": creative_prompt}],
)
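Under the hood, temperature divides the model's raw scores (logits) before the softmax that turns them into probabilities. The sketch below shows the effect on three made-up logits; the function name is hypothetical, and temperature 0 is handled as the greedy limit since dividing by zero is undefined.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    if temperature <= 0:
        # Limit case: all probability mass on the highest-scoring token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures sharpen the distribution toward the top token; higher ones spread probability toward the tail.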
Top-P (nucleus sampling) restricts sampling to the smallest set of tokens whose cumulative probability exceeds P. top_p=0.9 means only sample from the tokens that together account for 90% of the probability mass, ignoring the long tail of unlikely tokens.
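A minimal sketch of that filter, assuming the probabilities are already computed: sort tokens by probability, keep them until the running total reaches P, then renormalize what is left. The function name and the example distribution are made up for illustration.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches p, then renormalize."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}
print(top_p_filter(probs, 0.9))
```

With `p=0.9`, the unlikely "zebra" is dropped and sampling happens over the remaining three tokens.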
Most production use cases should set temperature to 0 for structured outputs and keep defaults for conversational tasks. Don't tune these parameters without an evaluation suite to measure the effect.
---
What LLMs Know — and Don't Know
LLM knowledge is frozen at the training cutoff. The model doesn't browse the internet. It doesn't know what happened last week. It has no access to your database, your documents, or your users' data unless you put that information in the prompt.
This creates a few categories of unreliable output:
Hallucination — the model produces plausible-sounding but factually incorrect content. This happens because the model is optimized to produce likely text, not accurate text. A confidently wrong answer is, statistically, often a good completion of the conversation so far.
Outdated knowledge — anything after the training cutoff is unknown to the model. It may extrapolate, but it can't reliably distinguish what it knows from what it's guessing.
Knowledge gaps — low-frequency information in the training corpus is poorly learned. Niche topics, private organizations, and highly specialized domains are where hallucination rates spike.
# Force the model to acknowledge uncertainty
system = """Answer based only on the provided context.
If the context does not contain enough information to answer,
respond with: "I don't have enough information to answer this reliably."
Do not use knowledge beyond what is provided."""
The engineering response to these limitations is to inject verified, up-to-date information into the prompt — either via RAG for document retrieval or via tool use for live data. The model then reasons over what you give it, rather than relying on memorized weights.
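The retrieve-then-prompt pattern can be sketched in a few lines. Everything here is a simplification: `retrieve` ranks documents by naive keyword overlap as a stand-in for a real retriever (typically embedding search), and both function names are hypothetical.

```python
def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by shared lowercase words with the question (toy retriever)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_grounded_prompt(question: str, context_chunks: list[str]) -> str:
    """Number the retrieved chunks and place them ahead of the question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1))
    return f"Context:\n{context}\n\nQuestion: {question}"

docs = [
    "The refund window is 30 days.",
    "Shipping takes 5 days.",
    "Our office is in Berlin.",
]
question = "What is the refund window?"
print(build_grounded_prompt(question, retrieve(question, docs, k=1)))
```

Pair the resulting prompt with a system instruction that restricts answers to the supplied context, so the model has both the facts and a reason to stay within them.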
---
Capabilities That Emerged at Scale
Some LLM capabilities weren't designed — they emerged as models got larger.
In-context learning: the ability to learn a new task from examples in the prompt, without any weight updates. Show the model three examples of a classification task, and it performs the task on the fourth input. Small models do this poorly or not at all.
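In practice, in-context learning means formatting a handful of labeled examples ahead of the new input and letting the model complete the pattern. The sketch below builds such a prompt; the "Input:/Label:" layout, the function name, and the example tickets are illustrative choices, not a required format.

```python
def few_shot_prompt(examples: list[tuple[str, str]], new_input: str) -> str:
    """Format labeled examples followed by an unlabeled input for the model to complete."""
    shots = "\n\n".join(f"Input: {text}\nLabel: {label}" for text, label in examples)
    return f"{shots}\n\nInput: {new_input}\nLabel:"

examples = [
    ("The app crashes on login", "bug"),
    ("Please add dark mode", "feature_request"),
    ("How do I reset my password?", "question"),
]
prompt = few_shot_prompt(examples, "Export to CSV would be great")
print(prompt)
```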
Chain-of-thought reasoning — when prompted to reason step by step, large models produce correct answers to multi-step problems that they answer incorrectly when prompted for direct answers. The reasoning tokens are not cosmetic — they materially improve accuracy.
Instruction following — the ability to parse and execute complex, multi-step instructions from natural language. Early language models couldn't do this reliably; it required both scale and fine-tuning.
Code generation — generating syntactically and semantically correct code across dozens of programming languages. This emerges from training on code, combined with scale.
These capabilities are why modern LLMs are useful. They're also why you can't reliably predict LLM performance by extrapolating from small models. Emergent capabilities often appear abruptly around scale thresholds rather than improving smoothly.
---
Model Families and How to Choose
Different models exist on a capability-cost-latency frontier. Choosing the right model for the task is a real engineering decision.
| Model tier | Use case | Tradeoff |
|---|---|---|
| Large (Opus-class) | Complex reasoning, code review, nuanced analysis | Highest capability, highest cost and latency |
| Mid (Sonnet-class) | Most production tasks — summarization, Q&A, structured output | Best capability-to-cost ratio for general use |
| Small (Haiku-class) | Classification, intent detection, simple extraction | Fastest and cheapest, capability ceiling is real |
def route_model(task_type: str) -> str:
    routing = {
        "classification": "claude-haiku-4-5-20251001",
        "intent_detection": "claude-haiku-4-5-20251001",
        "summarization": "claude-sonnet-4-6",
        "structured_extraction": "claude-sonnet-4-6",
        "code_review": "claude-opus-4-7",
        "complex_reasoning": "claude-opus-4-7",
    }
    return routing.get(task_type, "claude-sonnet-4-6")
Don't default to the largest model for everything. It's a cost and latency decision that compounds at scale.
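A quick back-of-the-envelope calculation shows how the difference compounds. The prices below are hypothetical placeholders, not real rates for any model, and the tier names and function are made up for illustration; plug in your provider's current pricing.

```python
# HYPOTHETICAL prices in USD per million tokens, for illustration only
PRICE_PER_MTOK = {
    "small": {"input": 1.0, "output": 5.0},
    "large": {"input": 15.0, "output": 75.0},
}

def monthly_cost(tier: str, requests_per_day: int, input_tokens: int,
                 output_tokens: int, days: int = 30) -> float:
    """Estimated monthly spend for one workload on one model tier."""
    price = PRICE_PER_MTOK[tier]
    per_request = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return per_request * requests_per_day * days

# Same classification workload on two tiers: 100k requests/day, 500 tokens in, 50 out
for tier in ("small", "large"):
    print(tier, round(monthly_cost(tier, 100_000, 500, 50), 2))
```

Under these assumed prices the same workload differs by more than an order of magnitude per month, which is the argument for routing simple tasks to the smallest model that passes your evaluations.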
---
The Mental Model That Matters
Here is the frame that makes everything else click:
An LLM is a very good next-token predictor trained on the statistical structure of human language and thought. It doesn't "know" things the way a database does. It doesn't "reason" the way a theorem prover does. It produces text that, given its training, is statistically likely to be the right continuation of what came before.
When it works well, that statistical likelihood aligns with correctness. When it fails, it's because likely text and correct text diverged.
Your job as an AI engineer is to structure the context — the prompt, the retrieved documents, the tool outputs, the examples — so that the statistically likely continuation is also the correct one.
Everything in prompt engineering, RAG, and agent design is in service of that goal.
Understanding that is the prerequisite for building AI systems that work.