Yudi Nugraha

When engineers first work with LLMs, they tend to treat prompts the way they treat documentation — something you write quickly and move on from. Then the outputs are inconsistent, the model ignores half the instructions, and the results change inexplicably between runs.

The problem is almost always the prompt.

Prompt engineering is the practice of designing inputs to language models so that the outputs are reliable, accurate, and useful. It is not about learning secret phrases. It's about understanding how models process and respond to text, and structuring your inputs in ways that align with that process.

This post covers the foundation. If you're new to working with LLMs, start here.

---

Why Prompts Matter More Than You Think

An LLM's weights are fixed after training, so you cannot change the model at runtime. Apart from a few sampling parameters such as temperature, the only lever you have is the input: the prompt. Within a given inference call, the model's output is shaped by what you put in.

This makes the prompt the primary engineering surface in any AI system. A well-structured prompt can make a mid-tier model perform like a stronger one. A poorly structured prompt can make even the strongest model produce garbage.

Consider these two prompts sent to the same model:

Prompt A:

What are the risks of this code?

Prompt B:

You are a senior security engineer reviewing Python backend code. 
Identify security vulnerabilities in the following code. For each 
vulnerability, state: the vulnerability type, the affected line range, 
the potential impact, and a recommended fix. Focus on OWASP Top 10 risks.

Same model. Completely different outputs. Prompt A tends to produce a vague, generic list. Prompt B tends to produce a structured security analysis. The gap is not capability; it's specification.

---

How the Model Reads Your Prompt

Before writing prompts, understand what the model actually does with them.

The model converts your text into tokens, processes them through its layers, and generates a response that is statistically likely to follow from what you provided. It has no intent and no understanding in the human sense. It produces text that, given its training, is a probable continuation of your input.

This has three practical consequences.

Vague input produces vague output. If you send "summarize this," the model infers what a summary should look like from the surrounding context and its training distribution. That inference may not match your expectation.

The model fills every gap. Anything you leave unspecified, the model fills with its best guess. Format, length, tone, perspective, level of detail — if you don't specify them, the model decides. You may not like the decision.

Demonstrations beat descriptions. Telling the model what to do in prose is often less reliable than showing it. The model learned from vast amounts of example text, so an example pins down format and tone in a way a prose description cannot. Natural-language instructions leave room for interpretation.

These three facts are the engine behind every prompt engineering technique.

---

The Basic Prompt Structure

A well-constructed prompt has four components. You won't always need all four, but knowing them gives you a framework to reason from.

1. Role

Assign the model an identity appropriate to the task. This shapes the prior — the background knowledge and vocabulary the model draws from when generating.

You are a senior data engineer specializing in building ETL pipelines 
on AWS. You have ten years of experience with Python, Airflow, and 
Redshift. You write precise, production-ready code with error handling.

"Helpful assistant" is the default. It's generic. A specific role produces a response with different assumptions, vocabulary, and depth.

2. Task

State exactly what you want. Be specific about scope, depth, and form. Every adjective you leave vague is a decision the model makes for you.

Weak:

Explain database indexing.

Strong:

Explain database indexing to a backend engineer who understands SQL but 
has not worked with query optimization before. Cover: what an index is, 
how B-tree indexes work, when to add an index, and when not to. Limit 
the explanation to four sections, one for each topic. Avoid jargon not 
already defined in the explanation.

The strong version specifies the audience, the structure, the scope, and the constraints. The weak version leaves all of those to the model.

3. Context

Provide the information the model needs that it wouldn't otherwise have — your data, the document to analyze, the code to review, the conversation history, the database schema.

from anthropic import Anthropic

client = Anthropic()

schema = """
Table: orders
Columns: id (int), user_id (int), status (varchar), total (decimal), 
         created_at (timestamp), shipped_at (timestamp, nullable)
"""

user_question = "Which users placed more than 3 orders in the last 30 days?"

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system="You are a SQL expert. Write efficient, readable queries. Always alias columns clearly.",
    messages=[
        {
            "role": "user",
            "content": f"Database schema:\n{schema}\n\nQuestion: {user_question}"
        }
    ]
)

print(response.content[0].text)

The model's output quality is bounded by the quality of the context you provide. Garbage in, garbage out — this applies to prompts as much as to data pipelines.

4. Output Format

Tell the model exactly how to format the response. Generation is sequential: once the first tokens of a reply follow a format, the rest tends to stay in it. State the format explicitly, and if adherence is unreliable, test whether moving the specification earlier or later in the prompt helps.

Respond in the following JSON format:
{
  "summary": "one sentence",
  "risks": ["risk 1", "risk 2"],
  "recommendation": "one paragraph"
}
Do not include any text outside the JSON object.
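A reply requested in this format can be checked before it reaches downstream code. The following is a minimal sketch, assuming the three keys from the example above; `parse_review` and `REQUIRED_KEYS` are hypothetical names, and the sample reply is written by hand rather than taken from a model.

```python
import json

# Keys and types taken from the example format above.
REQUIRED_KEYS = {"summary": str, "risks": list, "recommendation": str}


def parse_review(reply: str) -> dict:
    """Parse a reply that should contain only the requested JSON object."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is valid JSON but not an object")
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"{key} should be a {expected_type.__name__}")
    return data


# Hand-written sample reply for illustration (not real model output).
sample_reply = (
    '{"summary": "Input is not validated.", '
    '"risks": ["SQL injection"], '
    '"recommendation": "Use parameterized queries."}'
)
review = parse_review(sample_reply)
print(review["risks"])
```

A reply that wraps the JSON in extra prose fails at `json.loads`, which is a useful signal that the format instruction is not being followed.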

If you need prose, describe the structure: number of sections, what each covers, and any length constraints. If you need a list, say so. If you need code with no explanation, say that too.
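The four components can be assembled in code so that each one stays visible and separately editable. This is a rough sketch, not a prescribed layout: `assemble_prompt` is a hypothetical helper, and the section labels and their order are arbitrary choices made for illustration.

```python
def assemble_prompt(role: str, task: str, context: str = "", output_format: str = "") -> dict:
    """Return keyword arguments for a chat-style request.

    The role goes into the system prompt; the other components are
    joined into a single user message under plain-text labels.
    """
    sections = [
        ("Task", task),
        ("Context", context),
        ("Output format", output_format),
    ]
    # Skip components that were left empty; not every prompt needs all four.
    body = "\n\n".join(
        f"{label}:\n{text.strip()}" for label, text in sections if text.strip()
    )
    request = {"messages": [{"role": "user", "content": body}]}
    if role.strip():
        request["system"] = role.strip()
    return request


request = assemble_prompt(
    role="You are a SQL expert. Write efficient, readable queries.",
    task="Which users placed more than 3 orders in the last 30 days?",
    context="Table: orders\nColumns: id (int), user_id (int), created_at (timestamp)",
    output_format="Return only the SQL statement, with no explanation.",
)
print(request["messages"][0]["content"])
```

The returned dictionary holds only `system` and `messages`; the model name and `max_tokens` would still need to be supplied alongside it, as in the calls shown in this post.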

---

The Four Foundational Techniques

These are the techniques every prompt engineer uses daily. Master these before exploring anything more advanced.

Zero-Shot Prompting

The simplest form: give the model a task with no examples. Works well for tasks that are common in the training data — translation, summarization, basic classification, grammar correction.

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    messages=[
        {
            "role": "user",
            "content": "Classify the sentiment of this review as positive, neutral, or negative.\n\nReview: 'Delivery was two days late, but the product quality exceeded my expectations.'"
        }
    ]
)
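The reply to a call like this is free text, so it helps to map it onto the allowed labels before using it. The sketch below assumes the three labels from the prompt above; `parse_label` is a hypothetical helper, and treating a reply that mentions several labels as unclear is one possible policy among others.

```python
import re
from typing import Optional

ALLOWED_LABELS = ("positive", "neutral", "negative")


def parse_label(reply: str, allowed: tuple = ALLOWED_LABELS) -> Optional[str]:
    """Return the single allowed label found in the reply, or None if unclear."""
    words = re.findall(r"[a-z]+", reply.lower())
    found = {word for word in words if word in allowed}
    # Exactly one label is a clear answer; zero or several is ambiguous.
    return found.pop() if len(found) == 1 else None


print(parse_label("Positive."))
print(parse_label("It could be positive or negative"))
```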

Zero-shot fails on niche tasks, domain-specific classification, or anything where "what good looks like" is not obvious from the task description alone. That's when you add examples.

Few-Shot Prompting

Provide examples of the task before asking the model to perform it. This is the most reliable way to communicate format, tone, and decision criteria that are hard to describe in prose.

few_shot_prompt = """Classify the priority of the following support tickets.
Priority levels: low, medium, high, critical.

Ticket: "App is a bit slow today"
Priority: low

Ticket: "Cannot log in — getting 500 error every time"
Priority: high

Ticket: "Payment processing is completely down"
Priority: critical

Ticket: "Can I change my profile picture?"
Priority: low

Ticket: "{ticket_text}"
Priority:"""

Rules for good examples:

Cover the range of inputs, not just the easy cases

Keep format identical across every example

Put the most representative example last, since the examples closest to the end of the prompt tend to have the most influence on the output
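Building the prompt from a list of examples is one way to keep the format identical across every example, because each one passes through the same template. A minimal sketch, assuming the ticket format used above; `build_few_shot_prompt` is a hypothetical helper.

```python
def build_few_shot_prompt(instruction: str, examples: list, ticket_text: str) -> str:
    """Render the instruction, each (ticket, priority) pair, then the new ticket."""
    blocks = [instruction.strip()]
    for ticket, priority in examples:
        # Every example goes through the same template, so the format cannot drift.
        blocks.append(f'Ticket: "{ticket}"\nPriority: {priority}')
    # The prompt ends on an open "Priority:" for the model to complete.
    blocks.append(f'Ticket: "{ticket_text}"\nPriority:')
    return "\n\n".join(blocks)


instruction = (
    "Classify the priority of the following support tickets.\n"
    "Priority levels: low, medium, high, critical."
)
examples = [
    ("App is a bit slow today", "low"),
    ("Cannot log in, getting 500 error every time", "high"),
    ("Payment processing is completely down", "critical"),
    ("Can I change my profile picture?", "low"),
]
prompt = build_few_shot_prompt(instruction, examples, "Checkout page returns 404")
print(prompt)
```

Reordering or swapping examples then becomes a change to the list rather than to a long string, which makes it easier to test one change at a time.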

Role Prompting

Assign the model a specific persona with domain expertise. This is more than a formatting trick — it shifts the vocabulary, assumptions, and analytical frame the model applies.

system_prompts = {
    "security_reviewer": """You are a security engineer with expertise in 
application security and OWASP. You review code for vulnerabilities, 
never for style. You cite specific attack vectors and CVEs where relevant.""",

    "technical_writer": """You are a technical writer with ten years of 
experience writing API documentation. You write for developers. You are 
precise, avoid adjectives that don't add information, and always include 
concrete examples.""",

    "sql_expert": """You are a database engineer specializing in query 
optimization. You write efficient SQL, explain query plans when relevant, 
and always consider index usage."""
}

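# Placeholder snippet for illustration; replace with the code you want reviewed.
code_snippet = """
def get_user(conn, name):
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'")
"""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=system_prompts["security_reviewer"],
    messages=[{"role": "user", "content": f"Review this code:\n\n{code_snippet}"}]
)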

Use role prompting any time the output should reflect a specific domain or perspective. Don't use it when the generic assistant default is sufficient.

Chain-of-Thought Prompting

Ask the model to reason through the problem step by step before producing its answer. This often improves accuracy on tasks that require multi-step reasoning: logic, math, debugging, complex classification.

cot_prompt = """You are analyzing a production incident. 
Think through this step by step.

Error: DatabaseConnectionPool exhausted — all 50 connections in use
Timestamp: 2026-05-06 14:32:18 UTC
Recent deployments: None in the last 72 hours
Traffic: Normal — no spike detected

Step 1: What component is failing?
Step 2: What could cause all connections to be held?
Step 3: What evidence do we have or lack?
Step 4: What is the most likely root cause?
Step 5: What immediate mitigation would you recommend?"""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": cot_prompt}]
)

Why it works: the reasoning steps the model produces create context that makes the final conclusion more likely to be correct. The model is not just answering — it's building scaffolding that constrains the answer.

Add "Let's think step by step" to any complex prompt and measure the accuracy difference. It's one of the highest-leverage prompt modifications you can make.
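Step-by-step replies are longer, so it helps to agree on where the conclusion will sit. One possible convention, which is an assumption here and not something the technique requires, is to ask for a last line that starts with "Final answer:" and read only that line. `with_cot` and `extract_final_answer` are hypothetical helpers, and the sample reply is written by hand.

```python
import re
from typing import Optional

COT_SUFFIX = (
    "Think through this step by step. "
    "Then give your conclusion on a final line that starts with 'Final answer:'."
)


def with_cot(prompt: str) -> str:
    """Append the step-by-step instruction to an existing prompt."""
    return f"{prompt.strip()}\n\n{COT_SUFFIX}"


def extract_final_answer(reply: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' marker, or None if absent."""
    matches = re.findall(
        r"^final answer:\s*(.+)$", reply, flags=re.IGNORECASE | re.MULTILINE
    )
    return matches[-1].strip() if matches else None


# Hand-written sample reply for illustration (not real model output).
sample_reply = """Step 1: The connection pool is the failing component.
Step 2: Connections are probably held by long-running queries or a leak.
Final answer: A connection leak is the most likely root cause."""

print(extract_final_answer(sample_reply))
```

A reply with no marker returns `None`, which can be counted as a failure when measuring the accuracy difference.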

---

What to Specify and What to Leave Open

A common mistake is under-specifying some things and over-specifying others.

Always specify:

Output format (JSON, list, prose, code)

Output length (maximum words, number of bullet points, single sentence)

Scope (what to include, what to exclude)

Audience (who the response is for)

What to do when the task can't be completed ("say you don't know" vs. "give your best guess")

Leave open:

The exact phrasing of the response — constraining word choice rarely helps

Stylistic details beyond what you genuinely care about — every constraint narrows the search space and may trade one quality for another

Things the model will infer correctly from context — over-specifying adds noise

The goal is to remove ambiguity about outcomes, not to control every word.
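The "always specify" list can be turned into a small pre-flight check. This is only an illustrative sketch: `PromptSpec` and `missing_fields` are hypothetical names, and the fields simply mirror the five items listed above.

```python
from dataclasses import dataclass, fields


@dataclass
class PromptSpec:
    output_format: str = ""
    output_length: str = ""
    scope: str = ""
    audience: str = ""
    fallback: str = ""  # what to do when the task can't be completed


def missing_fields(spec: PromptSpec) -> list:
    """List the 'always specify' items that are still empty."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


spec = PromptSpec(
    output_format="JSON",
    scope="security issues only",
    audience="backend engineers",
)
print(missing_fields(spec))  # ['output_length', 'fallback']
```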

---

Prompt Iteration — How to Improve Systematically

Prompt engineering is iterative. First attempts are rarely optimal. The discipline is in how you iterate.

Step 1 — Define the success criteria first. Before writing any prompt, know what a good output looks like. Without that, you'll iterate by feeling instead of by signal.

Step 2 — Write the simplest prompt that could work. Start with zero-shot. If it's good enough, stop. Add complexity only when you need it.

Step 3 — Identify specific failure modes. When the output is wrong, name exactly what's wrong. "The response is too long," "It uses bullet points when I need prose," "It misclassifies edge cases." Each failure points to a specific fix.

Step 4 — Change one thing at a time. Multiple simultaneous changes make it impossible to know what worked. Treat prompt iteration like A/B testing.

Step 5 — Test on diverse inputs. A prompt that works on three clean examples may fail on messy real-world input. Test with edge cases, unusual phrasing, and off-topic inputs before shipping.

test_cases = [
    {"input": "Great product, fast shipping",     "expected": "positive"},
    {"input": "Works okay I guess",               "expected": "neutral"},
    {"input": "Complete waste of money",          "expected": "negative"},
    {"input": "Fast shipping but product broke",  "expected": "negative"},  # edge case
    {"input": "...",                              "expected": "neutral"},    # ambiguous
]

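def evaluate_prompt(prompt_template: str, cases: list[dict]) -> float:
    """Return the fraction of test cases the prompt classifies correctly.

    prompt_template must contain a {text} placeholder for the input.
    """
    correct = 0
    for case in cases:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=10,  # a single label needs only a few tokens
            messages=[{"role": "user", "content": prompt_template.format(text=case["input"])}]
        )
        # Normalize before comparing so "Positive." still counts as "positive".
        output = response.content[0].text.strip().lower()
        if case["expected"] in output:
            correct += 1
    return correct / len(cases)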

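Run this on every prompt variant against the same test cases. The variant with the higher pass rate ships, provided the test set is large enough that the difference is not noise.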
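Comparing variants is easier when the model call is passed in as a function, because the harness can then be exercised offline. The sketch below makes that assumption: `pass_rate`, `pick_best`, and `fake_complete` are hypothetical names, and `fake_complete` is a hard-coded stand-in for a real model call, so its scores say nothing about real model behavior.

```python
from typing import Callable


def pass_rate(prompt_template: str, cases: list, complete: Callable[[str], str]) -> float:
    """Fraction of cases where the completion equals the expected label."""
    correct = 0
    for case in cases:
        output = complete(prompt_template.format(text=case["input"]))
        if output.strip().lower() == case["expected"]:
            correct += 1
    return correct / len(cases)


def pick_best(variants: dict, cases: list, complete: Callable[[str], str]):
    """Score every variant on the same cases and return the best name plus all scores."""
    scores = {name: pass_rate(tpl, cases, complete) for name, tpl in variants.items()}
    return max(scores, key=scores.get), scores


def fake_complete(prompt: str) -> str:
    """Stand-in for a model call so the harness runs offline (illustrative only)."""
    label = "negative" if "broke" in prompt or "waste" in prompt else "positive"
    # Pretend the model only answers tersely when the prompt asks for one word.
    return label if "one word" in prompt else f"The sentiment is {label}."


cases = [
    {"input": "Great product, fast shipping", "expected": "positive"},
    {"input": "Complete waste of money", "expected": "negative"},
    {"input": "Fast shipping but product broke", "expected": "negative"},
]
variants = {
    "v1": "Classify the sentiment: {text}",
    "v2": "Classify the sentiment in one word (positive, neutral, or negative): {text}",
}
best, scores = pick_best(variants, cases, fake_complete)
print(best, scores)
```

In real use, `complete` would wrap the API call, and the two variants would differ in exactly one respect so the score change can be attributed to it.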

---

Common Mistakes to Avoid Early

Prompt creep — adding more instructions every time the model misbehaves until the prompt is 2,000 words of contradictory rules. Instead, restructure. A long prompt with many constraints is often worse than a shorter, cleaner one.

Assuming the model remembers: the model has no memory across API calls. Every request is stateless. If you need context from a previous turn, you must include it explicitly in the messages you send.

Testing on your own examples — you write examples that represent inputs you imagine. Real users send inputs you didn't imagine. Evaluate on real traffic as early as possible.

Treating a working prompt as permanent — model updates, changes in usage patterns, and new edge cases all erode prompt quality over time. Treat prompts as living artifacts that need maintenance, not configuration that you set once.

Ignoring temperature: the default temperature works for most tasks, but classification and structured output usually benefit from temperature=0, which makes outputs far more repeatable (though not guaranteed identical across runs). Many engineers never set it explicitly and get unnecessary variance.
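Two of the mistakes above, statelessness and temperature, come down to what goes into each request. The following sketch shows one way to carry conversation history explicitly and set the temperature on every call; `record_turn` and `build_request` are hypothetical helpers, and the returned dictionary would still need a model name and `max_tokens` before being passed to a call like the ones in this post.

```python
def record_turn(history: list, user_message: str, assistant_reply: str) -> list:
    """Return a new history with one completed user/assistant exchange appended."""
    return history + [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_reply},
    ]


def build_request(history: list, user_message: str, temperature: float = 0.0) -> dict:
    """Return request arguments that resend the whole conversation so far."""
    messages = history + [{"role": "user", "content": user_message}]
    # Temperature is set explicitly instead of relying on the default.
    return {"messages": messages, "temperature": temperature}


history = []
history = record_turn(history, "Classify: 'Great product'", "positive")
request = build_request(history, "Now classify: 'It broke in a week'")
print([m["role"] for m in request["messages"]])  # ['user', 'assistant', 'user']
```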

---

The Prompt Engineering Mindset

The most useful shift in thinking: you are writing a specification, not a request.

A request is conversational — you describe what you want and hope the other party fills in the gaps reasonably. A specification is exhaustive — it defines the output, the constraints, the edge cases, and the failure behavior.

LLMs respond to specifications better than requests. Not because they're machines — they are, technically — but because every ambiguity in your specification is a degree of freedom the model uses in ways you can't predict.

The goal of prompt engineering is to close those degrees of freedom deliberately, so the model's outputs are predictable, consistent, and correct across the full distribution of inputs your system will encounter.

That's the whole discipline. Structure your input so the correct output is also the most likely output. Once that clicks, everything else in prompt engineering is detail.

Introduction to Prompt Engineering — The Engineer's Starting Point