The phrase "prompt engineering" has a marketing problem. It sounds like a trick — like you need to know the magic words to unlock a better answer. That framing is wrong, and it produces engineers who iterate randomly instead of reasoning about what the model needs.
Prompt engineering is closer to writing a tight technical specification. The model is a capable contractor who will do exactly what you describe — including the parts you forgot to describe, which it will fill in with its own best guess. Your job is to close those gaps.
This is a field guide to doing that systematically.
---
How to Think About What a Prompt Does
Before touching a prompt, understand what the model is actually doing when it receives one. Transformers predict tokens based on prior context. Everything in your prompt is context that biases what comes next. This has practical implications:
None of this is mystical. It's a consequence of how the model was trained.
---
The Anatomy of a Good Prompt
A well-structured prompt has distinct layers, each doing a specific job.
Role and Context
Give the model a stable identity and the context it needs to do the task. Not "you are a helpful assistant" — that's the default. Give it domain-specific expertise and relevant situational context.
system_prompt = """You are a senior backend engineer at a fintech company.
You review pull requests and give precise, actionable feedback focused on
correctness, security, and performance. You do not comment on style unless
it causes confusion. You cite specific line numbers."""
The role shapes the prior. A prompt prefixed with "you are a senior backend engineer" will generate responses with different vocabulary, assumptions, and depth than one without it. Use that.
Task Definition
Be exact. Ambiguity in the task description gets filled with the model's prior — which may not match yours.
Weak:
Summarize this document.
Strong:
Summarize this document in exactly three bullet points. Each bullet should
capture one distinct finding. Use plain language a non-technical stakeholder
can follow. Do not include recommendations — findings only.
Every vague word is a place where the output can vary in ways you don't want. "Concise" means different things to different people. "Three bullet points of at most 20 words each" does not.
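A side benefit of an exact specification is that it can be checked mechanically. The following is a minimal sketch, not a definitive validator: `check_summary` and its defaults are hypothetical names, and it assumes each bullet starts with "- ", "* " or "• ".

```python
def check_summary(text: str, bullets: int = 3, max_words: int = 20) -> list[str]:
    """Return a list of spec violations; an empty list means the summary passes."""
    markers = ("- ", "* ", "• ")  # assumed bullet styles
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    items = [ln for ln in lines if ln.startswith(markers)]

    problems = []
    if len(items) != len(lines):
        problems.append("contains non-bullet lines")
    if len(items) != bullets:
        problems.append(f"expected {bullets} bullets, found {len(items)}")
    for i, item in enumerate(items, start=1):
        words = len(item[2:].split())  # drop the two-character marker
        if words > max_words:
            problems.append(f"bullet {i} has {words} words (limit {max_words})")
    return problems


summary = "- Revenue grew in Q3\n- Churn fell slightly\n- Support tickets doubled"
print(check_summary(summary))
```

A check like this cannot judge whether the findings are right, only whether the output has the shape the prompt asked for.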
Output Format
Specify the format before the model starts generating. Models commit to a format early, so telling the model the desired format after it has already started writing prose is unreliable.
user_message = """Analyze the following customer feedback and respond in this
exact JSON format:
{
  "sentiment": "positive | neutral | negative",
  "topics": ["list", "of", "topics"],
  "urgency": "low | medium | high",
  "suggested_action": "one sentence"
}
Feedback: {feedback_text}"""
If you need prose, describe the structure: number of paragraphs, what each covers, max length. If you need code, specify the language, imports allowed, whether tests are included.
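A format specification is most useful when the calling code enforces it. The sketch below is one possible way to fill a template like the one above and validate the reply; `build_prompt`, `parse_feedback`, and `ALLOWED` are hypothetical names, not part of any library. It substitutes the placeholder with `str.replace` because `str.format` would fail on the literal JSON braces in the template.

```python
import json

# Allowed values for the enumerated fields, taken from the prompt's format spec.
ALLOWED = {
    "sentiment": {"positive", "neutral", "negative"},
    "urgency": {"low", "medium", "high"},
}

TEMPLATE = """Analyze the following customer feedback and respond in this
exact JSON format:
{
  "sentiment": "positive | neutral | negative",
  "topics": ["list", "of", "topics"],
  "urgency": "low | medium | high",
  "suggested_action": "one sentence"
}
Feedback: {feedback_text}"""


def build_prompt(feedback_text: str) -> str:
    # replace() leaves the literal JSON braces in the template untouched.
    return TEMPLATE.replace("{feedback_text}", feedback_text)


def parse_feedback(raw: str) -> dict:
    """Parse the model's reply; raise ValueError if it breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, allowed in ALLOWED.items():
        value = data.get(field)
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"bad {field!r}: {value!r}")
    topics = data.get("topics")
    if not isinstance(topics, list) or not all(isinstance(t, str) for t in topics):
        raise ValueError("topics must be a list of strings")
    if not isinstance(data.get("suggested_action"), str):
        raise ValueError("suggested_action must be a string")
    return data
```

Raising on any deviation keeps malformed replies out of downstream code; whether to retry or fall back is a separate decision for the caller.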
Constraints and Guardrails
Say what you don't want, not just what you do. Models are pattern completers — they'll add what feels natural unless you close the door.
constraints = """
Do NOT:
- Add caveats like "however" or "keep in mind that"
- Repeat information from the input
- Use passive voice
- Add a concluding "in summary" paragraph
DO:
- Use present tense
- Cite specific examples from the provided data
- If you are uncertain, say so explicitly rather than hedging in prose
"""
This sounds verbose. It is. That's appropriate — the model will do exactly what the context suggests, and your constraints are the authoritative signal that overrides its defaults.
---
Few-Shot Prompting — The Most Reliable Tool You Have
If you take one thing from this post, make it this: examples are more reliable than instructions.
When you describe a behavior in prose, the model interprets the description. When you show the behavior in examples, the model matches the pattern. Pattern matching is what it was trained on. It's what it's good at.
few_shot_prompt = """Classify the intent of the following support messages.
Respond with only the intent label.
Message: "I can't log in to my account"
Intent: authentication_issue
Message: "When will my order arrive?"
Intent: order_status
Message: "I want to cancel my subscription"
Intent: cancellation_request
Message: "The app crashes every time I open it"
Intent: bug_report
Message: {user_message}
Intent:"""
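Keeping the examples as data rather than as one hand-edited string makes them easier to review, reorder, and extend. Here is a minimal sketch of that idea; `build_few_shot_prompt` is a hypothetical helper, and it assumes every example is a (message, intent) pair.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], user_message: str) -> str:
    """Render labeled examples plus the new message in one consistent layout."""
    header = (
        "Classify the intent of the following support messages.\n"
        "Respond with only the intent label.\n"
    )
    # Every shot uses exactly the same two-line shape the model should continue.
    shots = "\n".join(
        f'Message: "{message}"\nIntent: {intent}\n' for message, intent in examples
    )
    return f'{header}\n{shots}\nMessage: "{user_message}"\nIntent:'


examples = [
    ("I can't log in to my account", "authentication_issue"),
    ("When will my order arrive?", "order_status"),
]
print(build_few_shot_prompt(examples, "Where is my refund?"))
```

Ending the prompt on `Intent:` leaves the label as the most natural continuation.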
Rules for good few-shot examples:
---
Chain of Thought — When to Use It
Chain-of-thought (CoT) prompting asks the model to reason step by step before producing a final answer. It consistently improves performance on tasks that require multi-step reasoning: math, logic, complex classification, code debugging.
cot_prompt = """You are debugging a production issue. Think through this
step by step before giving your conclusion.
Error log:
{error_log}
Step 1: Identify what component threw the error.
Step 2: Identify what condition triggered it.
Step 3: Trace back to the root cause.
Step 4: State your conclusion and recommended fix."""
Why it works: forcing the model to produce intermediate reasoning steps creates a context in which the final answer is more likely to be correct. The reasoning tokens are scaffolding — they prime the generation of a better answer.
When not to use it:
For production use, you often want the conclusion without the visible reasoning. Use thinking-enabled models (or a separate reasoning step that you discard) rather than making the chain of thought visible in the final output.
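One way to implement a reasoning step that you discard is to ask the model to wrap its conclusion in a delimiter and keep only that part. The sketch below assumes the prompt instructs the model to put the final conclusion inside `<answer>` tags; that convention and the `extract_answer` helper are illustrative assumptions, not something the prompt above already does.

```python
import re

# Non-greedy match so only the first <answer>...</answer> block is captured.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(raw: str) -> str:
    """Return the text inside the first <answer> block, dropping the reasoning."""
    match = ANSWER_RE.search(raw)
    if match is None:
        raise ValueError("no <answer> block found in model output")
    return match.group(1).strip()


raw_output = (
    "Step 1: The error comes from the payment worker.\n"
    "Step 2: It fires when the retry queue is full.\n"
    "<answer>Raise the retry queue limit and add backpressure.</answer>"
)
print(extract_answer(raw_output))
```

Raising when the tags are missing keeps a malformed reply from being shown to users as if it were a conclusion.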
---
Prompt Decomposition — Break Complex Tasks Apart
When a single prompt tries to do too much, quality degrades. The model is attempting to optimize across competing objectives simultaneously.
The fix is decomposition: break the task into a pipeline of focused prompts, each responsible for one thing.
import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def extract_claims(document: str) -> list[str]:
    # Stage 1: pull out factual claims, one per line.
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="Extract factual claims from the document. Return one claim per line. Claims only, no commentary.",
        messages=[{"role": "user", "content": document}],
    )
    return response.content[0].text.strip().split("\n")


def verify_claim(claim: str, sources: str) -> dict:
    # Stage 2: check a single claim against the sources and return a JSON verdict.
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        system='Respond only with JSON: {"verdict": "supported|unsupported|inconclusive", "reason": "one sentence"}',
        messages=[{"role": "user", "content": f"Claim: {claim}\n\nSources:\n{sources}"}],
    )
    return json.loads(response.content[0].text)


def fact_check_document(document: str, sources: str) -> list[dict]:
    # Pipeline: extract claims, then verify each non-empty claim independently.
    claims = extract_claims(document)
    return [
        {"claim": claim, **verify_claim(claim, sources)}
        for claim in claims
        if claim.strip()
    ]
Each prompt is testable in isolation. Each prompt can be improved without affecting the others. The pipeline is debuggable — you can inspect intermediate outputs to see exactly where it breaks down.
This is just good engineering applied to prompts.
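Testing a stage in isolation is easier when the parsing logic is separated from the API call, because the parser can then be exercised offline with canned replies. The following is a sketch under that assumption; `parse_claims` and `parse_verdict` are hypothetical helpers, and falling back to "inconclusive" on malformed output is one possible policy rather than the only reasonable one.

```python
import json

VERDICTS = {"supported", "unsupported", "inconclusive"}


def parse_claims(raw: str) -> list[str]:
    """Split the extraction stage's reply into clean, non-empty claims."""
    claims = []
    for line in raw.splitlines():
        # Heuristic: strip leading bullet characters the model may have added.
        cleaned = line.strip().lstrip("-*• ").strip()
        if cleaned:
            claims.append(cleaned)
    return claims


def parse_verdict(raw: str) -> dict:
    """Parse the verification stage's reply, falling back to 'inconclusive'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"verdict": "inconclusive", "reason": "unparseable model output"}
    verdict = data.get("verdict") if isinstance(data, dict) else None
    if not isinstance(verdict, str) or verdict not in VERDICTS:
        return {"verdict": "inconclusive", "reason": "unexpected verdict value"}
    return {"verdict": verdict, "reason": str(data.get("reason", ""))}


print(parse_claims("- The API was released in 2021\n\n* It supports streaming\n"))
print(parse_verdict("Sure! Here is the JSON you asked for"))
```

With this split, a regression in the prompt and a regression in the parsing code show up in different tests.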
---
Controlling Output Quality
Temperature
Temperature controls the randomness of token sampling. Lower temperature = more deterministic, more conservative. Higher = more varied, more creative.
Practical defaults:
0.0: structured outputs, classification, data extraction. You want the same answer every time.
0.3–0.5: analysis, summarization. Some variation is fine; accuracy matters more.
0.7–1.0: creative writing, brainstorming, idea generation. Variation is the point.
Don't guess temperature. Set it based on the task type and measure output quality at different values with your eval suite.
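To make the effect of temperature concrete, here is a toy illustration of temperature-scaled softmax over three made-up logits. It is a sketch of the standard formula, not of any provider's sampler; `sampling_probs` is a hypothetical name, and treating temperature 0 as greedy selection is the usual convention assumed here.

```python
import math


def sampling_probs(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to sampling probabilities at the given temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Limit case: all probability mass on the highest logit (greedy).
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.0, 0.5, 1.0):
    print(t, [round(p, 3) for p in sampling_probs(logits, t)])
```

Lower temperatures concentrate probability on the top token; higher temperatures spread it across the alternatives.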
Max Tokens
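Always set max_tokens deliberately. Some APIs require it (the Anthropic Messages API used in these examples does); others fall back to a default that varies by model and is often too high for constrained outputs. A classification endpoint that should return one word doesn't need 4096 tokens of headroom, and a tight cap bounds the cost and latency of a reply that rambles. The cap truncates output rather than making the model more concise, so pair it with a prompt that asks for a short answer.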
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=50,  # intent classification: one label
    temperature=0,
    messages=[{"role": "user", "content": classification_prompt}],
)
Stop Sequences
Use stop sequences to halt generation at a predictable boundary — especially useful when the model should produce structured text that you'll parse.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    stop_sequences=["###END###"],
    messages=[{"role": "user", "content": f"{prompt}\n\nEnd your response with ###END###"}],
)
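Both limits change how a reply can end, so it helps to check why generation stopped before parsing the text. The sketch below assumes the response object exposes a `stop_reason` field whose values include "max_tokens" and "stop_sequence", as the Anthropic Messages API does; `read_reply` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the example runs without a network call.

```python
from types import SimpleNamespace


def read_reply(response) -> str:
    """Return the reply text, or raise if max_tokens cut it off."""
    if response.stop_reason == "max_tokens":
        raise RuntimeError("reply truncated by max_tokens")
    return response.content[0].text


# Stand-in objects shaped like an API response, so the sketch runs offline.
complete = SimpleNamespace(
    stop_reason="stop_sequence",
    content=[SimpleNamespace(text="intent: bug_report")],
)
cut_off = SimpleNamespace(
    stop_reason="max_tokens",
    content=[SimpleNamespace(text="intent: bug_rep")],
)

print(read_reply(complete))
```

Treating truncation as an error is a design choice: a half-finished label or half-written JSON object is usually worse than no answer.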
---
Testing and Evaluating Prompts
A prompt without an eval is a guess. You need a systematic way to know whether a prompt change is an improvement.
Build a Regression Suite First
Before shipping any prompt, assemble 20–50 representative inputs with expected outputs. Diverse inputs matter: clean cases, edge cases, adversarial inputs, off-domain inputs.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class PromptEvalCase:
    input: str
    expected: Any
    description: str


def evaluate_prompt(
    cases: list[PromptEvalCase],
    prompt_fn: Callable[[str], str],
    score_fn: Callable[[Any, str], float],
) -> dict:
    if not cases:
        raise ValueError("evaluate_prompt needs at least one case")

    results = []
    for case in cases:
        output = prompt_fn(case.input)
        score = score_fn(case.expected, output)
        results.append({
            "description": case.description,
            "score": score,
            "output": output,
            "passed": score >= 0.8,  # pass threshold
        })

    passed = sum(1 for r in results if r["passed"])
    return {
        "pass_rate": passed / len(results),
        "mean_score": sum(r["score"] for r in results) / len(results),
        "failures": [r for r in results if not r["passed"]],
    }
Run this on every prompt change, including wording changes that seem trivial. Small prompt changes have non-local effects.
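The harness leaves the scoring function open. As a starting point, here are two simple scorers that fit the `score_fn(expected, output)` shape used above. Both are illustrative sketches with hypothetical names: `exact_match` suits single-label tasks, and `keyword_recall` is a crude proxy for free-text tasks, not a measure of correctness.

```python
def exact_match(expected: str, output: str) -> float:
    """1.0 if the output equals the expected label, ignoring case and padding."""
    return 1.0 if output.strip().lower() == str(expected).strip().lower() else 0.0


def keyword_recall(expected: list[str], output: str) -> float:
    """Fraction of expected keywords that appear somewhere in the output."""
    if not expected:
        return 1.0
    text = output.lower()
    hits = sum(1 for keyword in expected if keyword.lower() in text)
    return hits / len(expected)


print(exact_match("bug_report", " Bug_Report\n"))
print(keyword_recall(["refund", "14 days"], "Refunds are issued within 14 days."))
```

Start with the strictest scorer the task allows; loose scorers hide regressions.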
Compare Prompts Systematically
Never compare two prompt variants by eyeballing a few outputs. Run both on the full eval suite and compare metrics.
def compare_prompts(cases, prompt_a, prompt_b, score_fn):
    result_a = evaluate_prompt(cases, prompt_a, score_fn)
    result_b = evaluate_prompt(cases, prompt_b, score_fn)
    print(f"Prompt A — pass rate: {result_a['pass_rate']:.0%}, mean score: {result_a['mean_score']:.2f}")
    print(f"Prompt B — pass rate: {result_b['pass_rate']:.0%}, mean score: {result_b['mean_score']:.2f}")
    if result_b["pass_rate"] > result_a["pass_rate"]:
        print("→ Prompt B wins")
    else:
        print("→ Prompt A wins (or no meaningful difference)")
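With 20–50 cases, a small gap in pass rate can be noise. One way to gauge that is a paired bootstrap over per-case scores, sketched below. It assumes you have kept the per-case scores for both variants in the same case order (the harness above would need a small change to return them), and `paired_bootstrap` is a hypothetical helper rather than a library function.

```python
import random


def paired_bootstrap(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Estimate how often B's total score beats A's when cases are resampled."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    wins = 0
    for _ in range(n_resamples):
        # Resample cases with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) > 0:
            wins += 1
    return wins / n_resamples


scores_a = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
scores_b = [1.0, 1.0, 1.0, 1.0, 0.0, 1.0]
print(paired_bootstrap(scores_a, scores_b))
```

A value near 1.0 suggests B is reliably better on this suite; a value near 0.5 suggests the suite cannot tell the two prompts apart.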
If you don't have ground truth labels, use an LLM-as-judge: a separate model call that scores each output against a rubric. It's imperfect, and the judge has biases of its own, but it is usually directionally useful and far better than intuition.
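A judge needs a fixed rubric and a reply format you can parse. The following sketch shows one possible shape; the template wording, the 1-5 scale, and the helper names `build_judge_prompt` and `parse_judge_score` are all assumptions made for illustration.

```python
import re
from typing import Optional

JUDGE_TEMPLATE = """You are grading a model output against a rubric.

Rubric:
{rubric}

Input:
{task_input}

Output to grade:
{output}

Reply with a line of the form "SCORE: <integer 1-5>" followed by a
one-sentence justification."""


def build_judge_prompt(rubric: str, task_input: str, output: str) -> str:
    return JUDGE_TEMPLATE.format(rubric=rubric, task_input=task_input, output=output)


def parse_judge_score(reply: str) -> Optional[float]:
    """Map a 'SCORE: n' line onto 0.0-1.0, or return None if it is missing."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        return None
    return (int(match.group(1)) - 1) / 4  # 1 -> 0.0, 5 -> 1.0


print(parse_judge_score("SCORE: 4\nCovers all findings but is slightly long."))
```

Returning `None` for an unparseable reply lets the caller count judge failures separately instead of silently scoring them as zero.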
---
Common Mistakes That Cost You Weeks
Prompt drift. You iterate a prompt in production until it works, but nobody writes down why it's structured the way it is. Six months later, someone "cleans it up" and breaks everything. Treat prompts like code — they belong in version control with comments explaining non-obvious decisions.
Testing on clean inputs only. Your eval suite reflects inputs you imagined, not inputs users will actually send. Real inputs are messy, ambiguous, and off-topic. Build adversarial cases deliberately.
Ignoring model version. Prompt behavior varies between model versions. A prompt tuned on one version may degrade on the next. Pin model versions in production and run your eval suite before upgrading.
Conflating prompt quality with model capability. A bad output might be a bad prompt, not a model limitation. Before concluding the model can't do something, try five different prompt formulations. The failure mode and the fix are often not obvious from the first attempt.
No budget for failure. Every prompt will occasionally produce bad output. Design downstream systems to handle it — validation, fallbacks, human review queues — rather than assuming the model is always right.
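The last point can be sketched as a small wrapper that validates every reply, retries a bounded number of times, and hands off to a fallback when validation keeps failing. This is an illustrative pattern, not a prescribed design: `generate_validated` is a hypothetical name, `generate` stands in for whatever function calls the model, and `parse` is assumed to raise `ValueError` on bad output.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def generate_validated(
    generate: Callable[[], str],
    parse: Callable[[str], T],
    max_attempts: int = 2,
) -> Optional[T]:
    """Return the first reply that parse() accepts, or None after max_attempts."""
    for _ in range(max_attempts):
        raw = generate()
        try:
            return parse(raw)
        except ValueError:
            continue  # invalid reply: try again while attempts remain
    return None  # caller should route this to a fallback or human review


# Offline demo: the first canned reply is invalid, the second parses.
replies = iter(["not a number", "42"])
result = generate_validated(lambda: next(replies), int)
print(result)
```

Returning `None` instead of raising keeps the failure path explicit at the call site, where the fallback or review queue lives.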
---
The Discipline
Prompt engineering is not about finding clever phrasing. It's about reducing ambiguity, providing relevant context, demonstrating desired behavior, and measuring outcomes.
Every good prompt is precise about what it wants, shows the model what that looks like, closes the doors to outputs you don't want, and is tested against real inputs before it ships.
That's the whole discipline. The rest is just practice.