Yudi Nugraha

The phrase "prompt engineering" has a marketing problem. It sounds like a trick — like you need to know the magic words to unlock a better answer. That framing is wrong, and it produces engineers who iterate randomly instead of reasoning about what the model needs.

Prompt engineering is closer to writing a tight technical specification. The model is a capable contractor who will do exactly what you describe — including the parts you forgot to describe, which it will fill in with its own best guess. Your job is to close those gaps.

This is a field guide to doing that systematically.

---

How to Think About What a Prompt Does

Before touching a prompt, understand what the model is actually doing when it receives one. Transformers predict tokens based on prior context. Everything in your prompt is context that biases what comes next. This has practical implications:

Position matters. Instructions placed near the end of a long prompt tend to carry more weight than instructions buried in the middle, so put the instruction that matters most close to where generation begins.

Contradictions are silently resolved. If your system prompt says "be concise" and your user prompt says "give a detailed breakdown," the model will pick one. It won't tell you there's a conflict.

Demonstrations outweigh instructions. Showing the model what you want in examples is more reliable than describing what you want in prose.

The model completes patterns. If your prompt looks like a certain type of text, the model will produce what would naturally come next in that type of text.

None of this is mystical. It's a consequence of how the model was trained.
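As a minimal sketch of the position point, the hypothetical helper below assembles a long-context prompt with the reference material first and the instruction last, so the instruction sits closest to where generation starts. The function name, the `<document>` tag, and the layout are illustrative assumptions, not a required format.

```python
def build_long_context_prompt(document: str, instruction: str) -> str:
    """Place bulky reference material first and the instruction last.

    Hypothetical helper: the <document> tag is only a visible delimiter.
    """
    return (
        "<document>\n"
        f"{document.strip()}\n"
        "</document>\n\n"
        f"{instruction.strip()}"
    )


# Made-up inputs, purely for illustration.
prompt = build_long_context_prompt(
    document="Q3 revenue rose 4%. Churn fell to 2.1%.",
    instruction="List the two metrics mentioned above, one per line.",
)
print(prompt)
```

Whether this ordering helps on your task is something to confirm with measurements, not assume.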

---

The Anatomy of a Good Prompt

A well-structured prompt has distinct layers, each doing a specific job.

Role and Context

Give the model a stable identity and the context it needs to do the task. Not "you are a helpful assistant" — that's the default. Give it domain-specific expertise and relevant situational context.

system_prompt = """You are a senior backend engineer at a fintech company. 
You review pull requests and give precise, actionable feedback focused on 
correctness, security, and performance. You do not comment on style unless 
it causes confusion. You cite specific line numbers."""

The role shapes the prior. A prompt prefixed with "you are a senior backend engineer" will generate responses with different vocabulary, assumptions, and depth than one without it. Use that.

Task Definition

Be exact. Ambiguity in the task description gets filled with the model's prior — which may not match yours.

Weak:

Summarize this document.

Strong:

Summarize this document in exactly three bullet points. Each bullet should 
capture one distinct finding. Use plain language a non-technical stakeholder 
can follow. Do not include recommendations — findings only.

Every vague word is a place where the output can vary in ways you don't want. "Concise" means different things to different people. "Three bullet points of at most 20 words each" does not.

Output Format

Specify the format before the model starts generating. Models commit to a format early, so asking for a different format once the response has started as prose is unreliable.

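# Literal JSON braces are doubled so that str.format() fills only {feedback_text}.
user_message_template = """Analyze the following customer feedback and respond in this 
exact JSON format:
{{
  "sentiment": "positive | neutral | negative",
  "topics": ["list", "of", "topics"],
  "urgency": "low | medium | high",
  "suggested_action": "one sentence"
}}

Feedback: {feedback_text}"""

feedback_text = "The app logged me out twice during checkout."  # sample input for illustration
user_message = user_message_template.format(feedback_text=feedback_text)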

If you need prose, describe the structure: number of paragraphs, what each covers, max length. If you need code, specify the language, imports allowed, whether tests are included.
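One way to make a format requirement enforceable is to validate the reply before anything downstream touches it. The sketch below assumes the four-field JSON shape from the customer feedback example above; `parse_feedback_analysis` and its error messages are hypothetical names chosen for illustration.

```python
import json

# Allowed values, taken from the JSON format requested in the prompt.
ALLOWED_SENTIMENT = ("positive", "neutral", "negative")
ALLOWED_URGENCY = ("low", "medium", "high")


def parse_feedback_analysis(raw: str) -> dict:
    """Parse a model reply and check it against the expected JSON shape.

    Raises ValueError with a specific message so callers can log or retry.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if data.get("sentiment") not in ALLOWED_SENTIMENT:
        raise ValueError(f"unexpected sentiment: {data.get('sentiment')!r}")
    if data.get("urgency") not in ALLOWED_URGENCY:
        raise ValueError(f"unexpected urgency: {data.get('urgency')!r}")

    topics = data.get("topics")
    if not isinstance(topics, list) or not all(isinstance(t, str) for t in topics):
        raise ValueError("topics must be a list of strings")
    if not isinstance(data.get("suggested_action"), str):
        raise ValueError("suggested_action must be a string")
    return data


# A made-up reply in the requested format.
sample_reply = (
    '{"sentiment": "negative", "topics": ["billing"], '
    '"urgency": "high", "suggested_action": "Refund the duplicate charge."}'
)
analysis = parse_feedback_analysis(sample_reply)
print(analysis["urgency"])
```

A reply that wraps the JSON in prose fails at the first step, which is usually what you want: the failure is visible instead of being silently parsed around.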

Constraints and Guardrails

Say what you don't want, not just what you do. Models are pattern completers — they'll add what feels natural unless you close the door.

constraints = """
Do NOT:
- Add caveats like "however" or "keep in mind that"
- Repeat information from the input
- Use passive voice
- Add a concluding "in summary" paragraph

DO:
- Use present tense
- Cite specific examples from the provided data
- If you are uncertain, say so explicitly rather than hedging in prose
"""

This sounds verbose. It is. That's appropriate — the model will do exactly what the context suggests, and your constraints are the authoritative signal that overrides its defaults.
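Constraints written into a prompt can also be checked after the fact. The sketch below is a hypothetical post-check for two of the "Do NOT" rules above (caveat phrases and an "in summary" ending). The phrase list is an assumption, and a plain substring match like this will miss paraphrases, so treat it as a cheap first filter.

```python
# Assumed phrase list, derived from the "Do NOT" constraints in the prompt.
BANNED_PHRASES = ("however", "keep in mind that", "in summary")


def find_constraint_violations(output: str) -> list[str]:
    """Return the banned phrases that appear in a model output (case-insensitive)."""
    lowered = output.lower()
    return [phrase for phrase in BANNED_PHRASES if phrase in lowered]


violations = find_constraint_violations(
    "Latency rises 12% under load. However, keep in mind that caching helps."
)
print(violations)  # ['however', 'keep in mind that']
```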

---

Few-Shot Prompting — The Most Reliable Tool You Have

If you take one thing from this post, make it this: examples are more reliable than instructions.

When you describe a behavior in prose, the model interprets the description. When you show the behavior in examples, the model matches the pattern. Pattern matching is what it was trained on. It's what it's good at.

few_shot_prompt = """Classify the intent of the following support messages.
Respond with only the intent label.

Message: "I can't log in to my account"
Intent: authentication_issue

Message: "When will my order arrive?"
Intent: order_status

Message: "I want to cancel my subscription"
Intent: cancellation_request

Message: "The app crashes every time I open it"
Intent: bug_report

Message: {user_message}
Intent:"""

Rules for good few-shot examples:

Cover the distribution. Your examples should represent the range of inputs you'll see in production, including edge cases. If all your examples are clean inputs, the model won't know what to do with messy ones.

Be consistent. Format, tone, and structure in every example should match what you want in the output. The model learns from all of it.

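More is usually better, up to a point. Moving from 1–2 examples to 5–10 often gives a clear improvement. Beyond roughly 20, the gains tend to flatten and you're burning context window. Treat these numbers as starting points and confirm them with your eval suite.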

Order matters. The last example before the live input has the most influence. Put a strong representative example there.
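The consistency and ordering rules above are easier to keep when the prompt is generated from data instead of edited by hand. The sketch below is one way to do that; `build_few_shot_prompt` is a hypothetical helper that mirrors the Message/Intent layout used earlier, and the example pairs are taken from that prompt.

```python
def build_few_shot_prompt(
    instruction: str,
    examples: list[tuple[str, str]],
    user_message: str,
) -> str:
    """Render (message, intent) pairs in one fixed format, then the live input.

    The caller controls ordering: the last pair in `examples` lands directly
    before the live message.
    """
    blocks = [instruction.strip()]
    for message, intent in examples:
        blocks.append(f'Message: "{message}"\nIntent: {intent}')
    # Leave the final label open so the model completes the pattern.
    blocks.append(f'Message: "{user_message}"\nIntent:')
    return "\n\n".join(blocks)


examples = [
    ("When will my order arrive?", "order_status"),
    ("The app crashes every time I open it", "bug_report"),
    ("I can't log in to my account", "authentication_issue"),  # strongest example last
]
prompt = build_few_shot_prompt(
    "Classify the intent of the following support messages.\n"
    "Respond with only the intent label.",
    examples,
    "My password reset email never arrived",
)
print(prompt)
```

Because every example passes through the same f-string, a formatting change applies to all of them at once.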

---

Chain of Thought — When to Use It

Chain-of-thought (CoT) prompting asks the model to reason step by step before producing a final answer. It often improves performance on tasks that require multi-step reasoning: math, logic, complex classification, code debugging. Measure the gain on your own task rather than assuming it.

cot_prompt = """You are debugging a production issue. Think through this 
step by step before giving your conclusion.

Error log:
{error_log}

Step 1: Identify what component threw the error.
Step 2: Identify what condition triggered it.
Step 3: Trace back to the root cause.
Step 4: State your conclusion and recommended fix."""

Why it works: forcing the model to produce intermediate reasoning steps creates a context in which the final answer is more likely to be correct. The reasoning tokens are scaffolding — they prime the generation of a better answer.

When not to use it:

Simple classification tasks, where CoT adds latency and usually little or no accuracy benefit

High-volume endpoints where latency matters — reasoning tokens cost time and money

Tasks where the intermediate reasoning leaks sensitive information

For production use, you often want the conclusion without the visible reasoning. Use thinking-enabled models (or a separate reasoning step that you discard) rather than making the chain of thought visible in the final output.
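One way to keep the reasoning out of what you ship is to ask the model to wrap its conclusion in a delimiter and keep only that part. The sketch below assumes the prompt told the model to put its conclusion inside `<answer>` tags; the tag name and `extract_final_answer` are hypothetical choices, not a convention any API enforces.

```python
import re
from typing import Optional

ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_final_answer(raw: str) -> Optional[str]:
    """Return the text inside the last <answer> block, or None if there is none."""
    matches = ANSWER_PATTERN.findall(raw)
    return matches[-1].strip() if matches else None


# A made-up model reply: visible reasoning followed by a tagged conclusion.
raw_reply = (
    "Step 1: The error comes from the payment worker.\n"
    "Step 2: It fires when the retry queue is full.\n"
    "<answer>Raise the retry queue limit and add backpressure.</answer>"
)
print(extract_final_answer(raw_reply))
```

Returning `None` when the tag is missing lets the caller treat an unformatted reply as a failure instead of passing raw reasoning downstream.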

---

Prompt Decomposition — Break Complex Tasks Apart

When a single prompt tries to do too much, quality degrades. The model is attempting to optimize across competing objectives simultaneously.

The fix is decomposition: break the task into a pipeline of focused prompts, each responsible for one thing.

import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_claims(document: str) -> list[str]:
    """Stage 1: pull factual claims out of the document, one per line."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="Extract factual claims from the document. Return one claim per line. Claims only, no commentary.",
        messages=[{"role": "user", "content": document}]
    )
    # Drop blank lines and stray whitespace so later stages see clean claims.
    lines = response.content[0].text.splitlines()
    return [line.strip() for line in lines if line.strip()]

def verify_claim(claim: str, sources: str) -> dict:
    """Stage 2: check one claim against the sources and return a JSON verdict."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        system='Respond only with JSON: {"verdict": "supported|unsupported|inconclusive", "reason": "one sentence"}',
        messages=[{"role": "user", "content": f"Claim: {claim}\n\nSources:\n{sources}"}]
    )
    # Raises json.JSONDecodeError if the reply is anything other than bare JSON.
    return json.loads(response.content[0].text)

def fact_check_document(document: str, sources: str) -> list[dict]:
    """Pipeline: extract claims, then verify each one independently."""
    claims = extract_claims(document)
    return [{"claim": claim, **verify_claim(claim, sources)} for claim in claims]

Each prompt is testable in isolation. Each prompt can be improved without affecting the others. The pipeline is debuggable — you can inspect intermediate outputs to see exactly where it breaks down.

This is just good engineering applied to prompts.

---

Controlling Output Quality

Temperature

Temperature controls the randomness of token sampling. Lower temperature = more deterministic, more conservative. Higher = more varied, more creative.
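To make the effect concrete, the sketch below applies the standard softmax-with-temperature formula to a toy set of scores. The three logits are made up for illustration, and real APIs may combine temperature with other sampling settings, so read this as a picture of the mechanism and not of any particular model's sampler.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to sampling probabilities at a given temperature (> 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy (argmax) decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all of the probability mass sits on the top-scoring token; as temperature rises the distribution flattens and lower-ranked tokens get sampled more often.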

Practical defaults:

0.0: structured outputs, classification, data extraction. You want the output to be as repeatable as possible; even at 0, identical outputs across calls are not guaranteed.

0.3–0.5: analysis, summarization. Some variation is fine; accuracy matters more.

0.7–1.0: creative writing, brainstorming, idea generation. Variation is the point.

Don't guess temperature. Set it based on the task type and measure output quality at different values with your eval suite.

Max Tokens

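Set max_tokens deliberately. The Anthropic Messages API used in these examples requires it; other APIs fall back to a default that varies by model and is often too high for constrained outputs. A classification endpoint that should return one word doesn't need 4096 tokens of headroom, and a tight cap bounds the cost and latency of a response that rambles. The cap truncates output rather than making the model concise, so also check whether the response's stop_reason is "max_tokens".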

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=50,   # intent classification — one label
    temperature=0,
    messages=[{"role": "user", "content": classification_prompt}]
)

Stop Sequences

Use stop sequences to halt generation at a predictable boundary — especially useful when the model should produce structured text that you'll parse.

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    stop_sequences=["###END###"],
    messages=[{"role": "user", "content": f"{prompt}\n\nEnd your response with ###END###"}]
)

---

Testing and Evaluating Prompts

A prompt without an eval is a guess. You need a systematic way to know whether a prompt change is an improvement.

Build a Regression Suite First

Before shipping any prompt, assemble 20–50 representative inputs with expected outputs. Diverse inputs matter: clean cases, edge cases, adversarial inputs, off-domain inputs.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class PromptEvalCase:
    input: str          # raw input passed to the prompt
    expected: Any       # ground-truth output (label, dict, etc.)
    description: str    # short name shown in failure reports

def evaluate_prompt(
    cases: list[PromptEvalCase],
    prompt_fn: Callable[[str], str],
    score_fn: Callable[[Any, str], float],
    pass_threshold: float = 0.8
) -> dict:
    """Run prompt_fn over every case and summarize the scores from score_fn."""
    if not cases:
        raise ValueError("evaluate_prompt needs at least one eval case")

    results = []
    for case in cases:
        output = prompt_fn(case.input)
        score = score_fn(case.expected, output)
        results.append({
            "description": case.description,
            "score": score,
            "output": output,
            "passed": score >= pass_threshold
        })

    passed = sum(1 for r in results if r["passed"])
    return {
        "pass_rate": passed / len(results),
        "mean_score": sum(r["score"] for r in results) / len(results),
        "failures": [r for r in results if not r["passed"]]
    }

Run this on every prompt change, including wording changes that seem trivial. Small prompt changes have non-local effects.
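The harness above leaves `score_fn` open. As a sketch of what could be plugged in, here are two hypothetical scorers that follow the same `(expected, output) -> float` signature: an exact match for label tasks and a keyword coverage score for free-text outputs. Both are illustrative assumptions, and neither is the right metric for every task.

```python
def exact_match_score(expected: str, output: str) -> float:
    """1.0 if the output equals the expected label after trimming and lowercasing."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def keyword_coverage_score(expected: list[str], output: str) -> float:
    """Fraction of expected keywords that appear in the output (case-insensitive)."""
    if not expected:
        return 1.0  # nothing required, so nothing can be missing
    lowered = output.lower()
    hits = sum(1 for keyword in expected if keyword.lower() in lowered)
    return hits / len(expected)


print(exact_match_score("bug_report", " Bug_Report\n"))
print(keyword_coverage_score(["refund", "invoice"], "Please refund my last order."))
```

Exact match suits classification, where any deviation is a failure. Keyword coverage is a rough proxy for summaries and explanations, and it rewards mentioning a term even when the surrounding claim is wrong.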

Compare Prompts Systematically

Never compare two prompt variants by eyeballing a few outputs. Run both on the full eval suite and compare metrics.

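def compare_prompts(cases, prompt_a, prompt_b, score_fn, min_delta=0.0):
    """Run both prompts on the same cases and report which has the higher pass rate.

    min_delta is the smallest pass-rate gap that counts as a win. Raise it on
    small suites, where a one-case difference is likely noise.
    """
    result_a = evaluate_prompt(cases, prompt_a, score_fn)
    result_b = evaluate_prompt(cases, prompt_b, score_fn)

    print(f"Prompt A: pass rate {result_a['pass_rate']:.0%}, mean score {result_a['mean_score']:.2f}")
    print(f"Prompt B: pass rate {result_b['pass_rate']:.0%}, mean score {result_b['mean_score']:.2f}")

    delta = result_b["pass_rate"] - result_a["pass_rate"]
    if delta > min_delta:
        print("→ Prompt B wins")
    elif delta < -min_delta:
        print("→ Prompt A wins")
    else:
        print("→ No meaningful difference")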

If you don't have ground truth labels, use an LLM-as-judge: a separate model call that scores each output against a rubric. It's imperfect, and judge models have biases of their own, but it is usually directionally useful and far better than intuition.
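A judge can be sketched without committing to a particular client. In the example below, `call_judge` stands in for whatever function sends a prompt to your judge model and returns its text; the template wording, the 1 to 5 scale, and the `SCORE:` reply format are all assumptions made for illustration.

```python
import re
from typing import Callable

JUDGE_TEMPLATE = """You are grading a model output against a rubric.

Rubric:
{rubric}

Output to grade:
{output}

Reply with a single line in the form "SCORE: <integer from 1 to 5>"."""


def judge_output(rubric: str, output: str, call_judge: Callable[[str], str]) -> float:
    """Ask a judge model for a 1-5 score and map it onto the range 0.0-1.0."""
    reply = call_judge(JUDGE_TEMPLATE.format(rubric=rubric, output=output))
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        # Fail loudly instead of guessing a score from free text.
        raise ValueError(f"could not parse judge reply: {reply!r}")
    return (int(match.group(1)) - 1) / 4


# Stub judge so the sketch runs offline; swap in a real model call in practice.
def fake_judge(prompt: str) -> str:
    return "SCORE: 4"


score = judge_output("Summary must contain exactly three findings.", "1. A 2. B 3. C", fake_judge)
print(score)  # 0.75
```

Mapping the score onto 0.0 to 1.0 keeps it compatible with a pass threshold such as the 0.8 used in the regression harness.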

---

Common Mistakes That Cost You Weeks

Prompt drift. You iterate a prompt in production until it works, but nobody writes down why it's structured the way it is. Six months later, someone "cleans it up" and breaks everything. Treat prompts like code — they belong in version control with comments explaining non-obvious decisions.

Testing on clean inputs only. Your eval suite reflects inputs you imagined, not inputs users will actually send. Real inputs are messy, ambiguous, and off-topic. Build adversarial cases deliberately.

Ignoring model version. Prompt behavior varies between model versions. A prompt tuned on one version may degrade on the next. Pin model versions in production and run your eval suite before upgrading.

Conflating prompt quality with model capability. A bad output might be a bad prompt, not a model limitation. Before concluding the model can't do something, try five different prompt formulations. The failure mode and the fix are often not obvious from the first attempt.

No budget for failure. Every prompt will occasionally produce bad output. Design downstream systems to handle it — validation, fallbacks, human review queues — rather than assuming the model is always right.
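The last point, budgeting for failure, can be sketched as a thin wrapper around any classification call. `call_model`, the retry count, and the `needs_human_review` fallback label are hypothetical; the idea is only that an unvalidated label never reaches downstream code.

```python
from typing import Callable


def classify_with_fallback(
    message: str,
    call_model: Callable[[str], str],
    allowed_labels: set[str],
    max_attempts: int = 2,
    fallback_label: str = "needs_human_review",
) -> str:
    """Return a validated label, or a fallback that routes the message to a human."""
    for _ in range(max_attempts):
        try:
            label = call_model(message).strip().lower()
        except Exception:
            continue  # treat transport or API errors as a failed attempt
        if label in allowed_labels:
            return label
    return fallback_label


ALLOWED = {"authentication_issue", "order_status", "cancellation_request", "bug_report"}

# Stub models so the sketch runs offline.
print(classify_with_fallback("I can't log in", lambda m: " Authentication_Issue\n", ALLOWED))
print(classify_with_fallback("asdf", lambda m: "I'm not sure what this is.", ALLOWED))
```

The first call returns a normalized valid label; the second exhausts its attempts and returns the fallback, which a queue consumer can route to human review.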

---

The Discipline

Prompt engineering is not about finding clever phrasing. It's about reducing ambiguity, providing relevant context, demonstrating desired behavior, and measuring outcomes.

Every good prompt is precise about what it wants, shows the model what that looks like, closes the doors to outputs you don't want, and is tested against real inputs before it ships.

That's the whole discipline. The rest is just practice.

Prompt Engineering in Practice — A Field Guide for AI Engineers