Glossary

Prompt Engineering

Prompt engineering is the practice of designing and refining the instructions you give to an AI model to get more accurate, consistent, and useful outputs. It includes techniques like providing examples, setting roles, specifying formats, and breaking complex tasks into steps.

How It Works

The way you ask an AI model to do something changes what you get back. Prompt engineering is the discipline of figuring out what works. It's the cheapest and fastest way to improve AI output quality before reaching for more complex techniques like fine-tuning or RAG.

Basic techniques include: giving the model a role ("You are a senior compliance analyst"), providing examples of desired output (few-shot prompting, typically 2-5 examples), specifying the exact format you want (JSON with a schema, bullet points, a specific template), and asking the model to think step by step before answering (chain-of-thought prompting).
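
As a rough sketch, here is how those basics combine into a single call using the Anthropic Python SDK. The model id, role, and few-shot examples are placeholders for illustration, not a recommended configuration:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Role, step-by-step instruction, and output format live in the system prompt.
system = (
    "You are a senior compliance analyst. "
    "Classify each transaction note as LOW, MEDIUM, or HIGH risk. "
    "Think through the signals step by step, then answer with JSON: "
    '{"risk": "...", "reason": "..."}'
)

# Few-shot examples (2-5 is typical) go in as prior user/assistant turns.
few_shot = [
    {"role": "user", "content": "Wire transfer of $9,900 split across two days."},
    {"role": "assistant", "content": '{"risk": "HIGH", "reason": "Possible structuring below the reporting threshold."}'},
    {"role": "user", "content": "Monthly payroll batch, same amounts as prior months."},
    {"role": "assistant", "content": '{"risk": "LOW", "reason": "Recurring, consistent payroll activity."}'},
]

response = client.messages.create(
    model="claude-3-5-haiku-latest",  # placeholder; use whichever model you have evaluated
    max_tokens=300,
    system=system,
    messages=few_shot + [{"role": "user", "content": "Three refunds issued to the same card in one hour."}],
)
print(response.content[0].text)
```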

More advanced techniques include breaking a complex task into smaller sub-tasks handled by separate prompts, using structured prompts with clearly labeled sections (XML tags work particularly well with Claude), and adding constraints that prevent common failure modes. Telling the model "if you're not sure, say so instead of guessing" measurably reduces hallucination rates in production RAG systems. Self-consistency prompting, where the same question is run multiple times and the majority answer is returned, improves reliability on reasoning tasks at 3-5x the cost.
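
Self-consistency is simple to sketch: sample the same question several times at a nonzero temperature and return the majority answer. The helper below assumes the Anthropic SDK and a placeholder model id; any client would work the same way:

```python
import anthropic
from collections import Counter

client = anthropic.Anthropic()

def ask_model(question: str) -> str:
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model id
        max_tokens=200,
        temperature=0.7,                  # nonzero so samples can disagree
        messages=[{"role": "user", "content": question}],
    )
    return resp.content[0].text.strip()

def self_consistent_answer(question: str, samples: int = 5) -> str:
    # Ask the same question several times and return the answer that appears most often.
    answers = [ask_model(question) for _ in range(samples)]
    return Counter(answers).most_common(1)[0][0]
```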

In production systems, prompts are treated as code. They get version-controlled, tested, and reviewed. A small change in wording can significantly affect output quality, so teams maintain prompt libraries, use prompt-management tools like PromptLayer, LangSmith, or Humanloop, and run offline evaluations before deploying prompt changes. Treating prompts as untested strings embedded in application code is how teams end up with regressions nobody noticed.
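
One lightweight version of "prompts as code" is a Git-tracked YAML file per prompt, loaded at runtime so the prompt and the code that uses it ship and roll back together. The file layout and path below are illustrative, not a standard:

```python
# prompts/extract_attrs.yaml (hypothetical, committed alongside the application code):
#
#   name: extract_attrs
#   version: 4
#   model: claude-3-5-haiku-latest
#   temperature: 0.0
#   system: |
#     You are a product data specialist...
#   eval_baseline:
#     field_accuracy: 0.91

import yaml

def load_prompt(path: str) -> dict:
    """Load a versioned prompt definition from a Git-tracked YAML file."""
    with open(path) as f:
        return yaml.safe_load(f)

prompt = load_prompt("prompts/extract_attrs.yaml")
print(prompt["version"], prompt["model"])
```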

Prompt engineering has limits. When you need the model to have specific knowledge it wasn't trained on, you need RAG. When you need it to behave consistently in a way that prompting alone can't achieve, you need fine-tuning. But for most use cases, a well-engineered prompt gets you 80% of the way there.

The other limit: prompt engineering is model-specific. A prompt tuned for GPT-4o may behave differently on Claude Sonnet, and both will differ from Gemini 2.0. Switching base models usually requires re-evaluating prompts against your test set. DSPy and automated prompt optimization tools (like OPRO or APE) try to make this less painful, but the tax is real. Budget for it in migration plans.

In Practice

Prompt engineering in production centers on a few workflows. Prompt management uses tools like LangSmith, PromptLayer, Humanloop, or a simple Git-tracked YAML directory. Evaluation is handled via LangSmith evals, Promptfoo, or custom test runners that run prompts against labeled datasets and score outputs on accuracy, format compliance, and safety. Automated optimization frameworks like DSPy (from Stanford) treat prompts as programs and optimize them against a metric.
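
A custom test runner can be as small as the sketch below. It assumes each labeled case is a dict with "input" and "expected" keys and that prompt_fn wraps whatever model call you use; it scores exact-match accuracy and JSON parseability as a stand-in for format compliance:

```python
import json

def run_eval(prompt_fn, test_set: list[dict]) -> dict:
    """Score a prompt against a labeled dataset on accuracy and format compliance."""
    correct, parseable = 0, 0
    for case in test_set:
        raw = prompt_fn(case["input"])
        try:
            parsed = json.loads(raw)
            parseable += 1
        except json.JSONDecodeError:
            continue  # unparseable output counts against both metrics
        if parsed == case["expected"]:
            correct += 1
    n = len(test_set)
    return {"accuracy": correct / n, "format_compliance": parseable / n}
```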

Typical configuration: system prompts between 200 and 1500 tokens, 2-5 few-shot examples for structured tasks, XML-tagged sections for Claude (<instructions>, <context>, <example>), Markdown sections for GPT and Gemini. Temperature settings commonly sit at 0 for deterministic tasks (extraction, classification) and 0.3-0.7 for generation tasks. Output format is increasingly enforced via structured outputs (OpenAI response_format, Anthropic tool-use patterns) rather than free-text instructions.
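
To make the Claude-style configuration concrete, here is a sketch of an extraction prompt with XML-tagged sections and temperature 0; the tag contents, example, and model id are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM = """<instructions>
Extract the fields listed below from the supplier description.
Return JSON only. Use null for any field not stated explicitly.
</instructions>

<context>
Fields: color, size, material, brand.
</context>

<example>
Input: "Slim-fit cotton tee, navy, sizes S-XL, by Acme."
Output: {"color": "navy", "size": "S-XL", "material": "cotton", "brand": "Acme"}
</example>"""

def extract(description: str) -> str:
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model id
        max_tokens=200,
        temperature=0,                    # extraction task: pin temperature at 0
        system=SYSTEM,
        messages=[{"role": "user", "content": f"<input>{description}</input>"}],
    )
    return resp.content[0].text
```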

A working prompt-iteration workflow looks like this. Start with a labeled test set of 50-200 examples representative of production traffic. Write the first prompt. Run it against the test set and measure accuracy, format compliance, and any task-specific metric. Commit the prompt and its eval results together. Iterate by changing one variable at a time: role, examples, constraints, output format. Never ship a prompt change that regresses on the test set, even if it looks better on anecdotal cases. Prompt changes that improve on one benchmark often regress on others, which is why regression testing matters as much for prompts as for code.
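
The "never ship a regression" rule is easy to automate. A minimal gate, assuming eval scores are committed as a JSON baseline at a hypothetical path like evals/baseline.json, can run in CI and block the change:

```python
import json
import sys

def gate_on_regression(new_scores: dict, baseline_path: str = "evals/baseline.json") -> None:
    """Fail the build if any metric in the committed baseline got worse."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    regressions = {
        metric: (baseline[metric], new_scores.get(metric, 0.0))
        for metric in baseline
        if new_scores.get(metric, 0.0) < baseline[metric]
    }
    if regressions:
        print(f"Regression detected: {regressions}", file=sys.stderr)
        sys.exit(1)
    print("No regressions; safe to update the baseline and ship.")
```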

Worked Example

An e-commerce retailer uses an LLM to extract structured product attributes (color, size, material, brand) from free-text supplier descriptions. The first prompt is a simple "Extract color, size, material, and brand from this description and return JSON." Claude Haiku gets 78% field-level accuracy on a labeled test set of 400 descriptions.

The team iterates. Version 2 adds a role ("You are a product data specialist for a fashion retailer") and raises accuracy to 81%. Version 3 adds 4 few-shot examples covering tricky cases (multi-color garments, size ranges, rare materials). Accuracy jumps to 87%. Version 4 wraps the input in XML tags and explicitly instructs the model to return null for missing fields instead of guessing, which cuts the false-positive rate on "material" from 14% to 3%. Accuracy is now 91%.

Version 5 moves from free-text JSON to structured outputs using Anthropic's tool-use pattern with a typed schema. This eliminates format errors entirely and pushes parseable output to 100%, with accuracy stable at 91%. The team stops iterating and deploys version 5 to production, saving each prompt version plus its eval scores in their prompt repo. When they later test Claude Sonnet for a quality lift, the same evaluation pipeline measures it against version 5 on the same test set. Sonnet hits 94% accuracy. The team evaluates whether the quality bump justifies the 4x cost increase for that workload.
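
For illustration, version 5 might look roughly like the sketch below, using Anthropic's tool-use pattern to force a schema-conforming call. The tool name, model id, and nullable-field schema are assumptions based on the example above, not the retailer's actual code:

```python
import anthropic

client = anthropic.Anthropic()

# JSON Schema for the attributes; nullable types let the model return null instead of guessing.
ATTR_TOOL = {
    "name": "record_product_attributes",
    "description": "Record the product attributes found in a supplier description.",
    "input_schema": {
        "type": "object",
        "properties": {
            "color":    {"type": ["string", "null"]},
            "size":     {"type": ["string", "null"]},
            "material": {"type": ["string", "null"]},
            "brand":    {"type": ["string", "null"]},
        },
        "required": ["color", "size", "material", "brand"],
    },
}

def extract_attributes(description: str) -> dict:
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model id
        max_tokens=300,
        tools=[ATTR_TOOL],
        tool_choice={"type": "tool", "name": "record_product_attributes"},  # force the structured call
        messages=[{"role": "user", "content": description}],
    )
    # With a forced tool call, the tool_use block carries the already-parsed arguments.
    for block in resp.content:
        if block.type == "tool_use":
            return block.input
    raise ValueError("No structured output returned")
```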

What People Get Wrong

Myth

Better prompts can fix a model that's fundamentally wrong for the task.

Reality

Prompt engineering improves a capable model's output. It doesn't make an incapable model capable. If Claude Haiku can't reliably do multi-step math reasoning, no amount of prompting will fix that. Switching to a stronger model or adding a code-execution tool will. Know the ceiling of each model and stop polishing prompts when you've hit it.

Myth

Chain-of-thought prompting always improves results.

Reality

It helps on tasks that genuinely need multi-step reasoning (math, logic, planning). For simple classification or extraction tasks, chain-of-thought adds tokens and latency without quality gains. Measure both. Don't apply CoT by default. Also, reasoning models (o1, o3, Claude with extended thinking) do internal CoT already, so explicit prompting may be redundant or counterproductive with them.

Myth

Longer, more detailed prompts produce better outputs.

Reality

Up to a point. Beyond that, long prompts dilute key instructions, suffer from the lost-in-the-middle effect, and cost more per call. The best prompts are often surprisingly short: clear role, clear task, 2-5 examples, clear output format. If your prompt is 3,000 tokens and performance is poor, shorter usually beats longer. Test it.

Related Solutions

Generative AI Applications
AI Agent Development

Need help implementing this?

We build production AI systems for enterprises. Tell us what you are working on and we will scope it in 30 minutes.