Sunday, April 5

Most developers treat few-shot and zero-shot prompting as a simple toggle: add examples when it doesn't work, remove them to save tokens. That reflex leaves both quality and money on the table. The reality is more precise, and getting it right matters when you're running thousands of inferences daily. Few-shot vs. zero-shot decisions compound: a wrong call on a high-volume pipeline costs real money and produces measurably worse outputs.

This article gives you the actual decision framework: when examples statistically move the needle, when they’re noise, and how to instrument your own tests rather than trusting generic advice. I’ll cover Claude specifically, but the core mechanics apply across modern frontier models.

What Zero-Shot and Few-Shot Actually Mean at the Inference Level

Zero-shot: you describe the task in natural language, Claude infers the required behavior from training alone. Few-shot: you prepend labeled input-output examples before your actual request, steering the model toward a specific output distribution.

The mechanism matters. Few-shot examples don’t “teach” the model anything — weights don’t update. What they do is shift the probability distribution of the next token by providing in-context evidence of what the desired output looks like. This is why the quality and relevance of your examples matter far more than the raw count.

Claude 3 Sonnet and Haiku handle few-shot prompts well, but Claude 3.5 Sonnet and Claude 3 Opus are particularly good at generalizing from even 1-2 high-quality examples. The flip side: they’re also better at following zero-shot instructions precisely, which is why the “just add examples” reflex is often wrong with newer models.

Token Cost of Few-Shot on Claude

Before going further, here’s the practical cost reality. Using Claude 3.5 Haiku (currently $0.80/MTok input, $4/MTok output), a three-shot prompt with ~300 tokens of examples adds roughly $0.00024 per call. At 10,000 calls/day, that’s $2.40/day, $876/year — just for the examples. On Sonnet 3.5 at $3/MTok input, the same examples cost $0.0009 per call, or ~$3,285/year at that volume. Not existential, but worth measuring against actual quality gains.
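That arithmetic is easy to redo for your own model and volume. Here is a small helper; the function name is mine, and the pricing figures are the ones quoted above, which will drift, so check current rates before relying on the numbers:

```python
def few_shot_overhead_usd(
    example_tokens: int,
    input_price_per_mtok: float,
    calls_per_day: int,
) -> dict:
    """Estimate the input-token cost of static few-shot examples
    at a given daily call volume."""
    per_call = (example_tokens / 1_000_000) * input_price_per_mtok
    per_day = per_call * calls_per_day
    return {
        "per_call": per_call,
        "per_day": per_day,
        "per_year": per_day * 365,
    }

# 300 example tokens on Haiku 3.5 ($0.80/MTok input) at 10,000 calls/day
costs = few_shot_overhead_usd(300, 0.80, 10_000)
```

Run it against your own traffic numbers before deciding whether the overhead is worth measuring further.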

When Few-Shot Prompting Genuinely Helps

There are four categories where examples produce measurable improvement over zero-shot:

1. Custom Output Formats the Model Hasn’t Seen

If you need output in a proprietary schema — say, a specific JSON structure with unusual field names, a custom markdown dialect, or a company-specific report template — zero-shot often produces structurally correct but semantically wrong output. The model knows JSON; it doesn’t know your JSON.

In my testing on structured data extraction tasks (pulling fields from unstructured B2B emails into a CRM schema), three well-chosen examples reduced format errors from ~18% to ~3% on Claude 3 Haiku. Zero-shot with a detailed schema description got it to ~9%. Few-shot won decisively here.

This aligns with the broader challenge of reducing hallucinations in production — examples anchor the model to a known-correct output shape, which reduces the surface area for drift.
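To make the pattern concrete, here is a sketch of embedding examples that carry a custom schema. The CRM field names and the helper below are invented for illustration; they are not the schema from my tests:

```python
import json

# Hypothetical CRM schema -- field names invented for illustration
EXAMPLES = [
    {
        "email": "Hi, this is Dana Wu from Acme Corp. We'd like a demo of the Pro tier.",
        "record": {"contact_name": "Dana Wu", "account": "Acme Corp",
                   "intent": "demo-request", "tier_interest": "pro"},
    },
    {
        "email": "Please cancel our Starter subscription next month. -- J. Ortiz, Blue Sky LLC",
        "record": {"contact_name": "J. Ortiz", "account": "Blue Sky LLC",
                   "intent": "cancellation", "tier_interest": "starter"},
    },
]

def build_extraction_system(examples: list[dict]) -> str:
    """Embed known-good input/output pairs so the model sees the exact
    target schema, not just a description of it."""
    blocks = [
        f"Email: {ex['email']}\nRecord: {json.dumps(ex['record'])}"
        for ex in examples
    ]
    return (
        "Extract a CRM record from each email. "
        "Respond with ONLY a JSON object matching the schema in the examples.\n\n"
        + "\n\n".join(blocks)
    )

system = build_extraction_system(EXAMPLES)
```

The examples do double duty: they pin the field names and implicitly show formatting decisions (casing, hyphenated intent labels) that a prose schema description tends to leave ambiguous.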

2. Tone and Style Calibration

Describing a tone in words (“professional but casual, avoid corporate jargon”) is ambiguous — Claude will interpret that differently than you intend. Showing three examples of your actual brand voice removes the ambiguity. For content pipelines where consistency matters at scale, this is the strongest use case for few-shot.
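One way to supply voice examples is as prior conversational turns rather than a block inside the system prompt: each (request, on-brand reply) pair becomes a user/assistant exchange before the real request. The helper and the brand-voice pairs below are hypothetical placeholders:

```python
def tone_messages(user_request: str,
                  voice_examples: list[tuple[str, str]]) -> list[dict]:
    """Build a messages list where each (request, on-brand reply) pair
    appears as a prior turn, ending with the real request."""
    messages: list[dict] = []
    for request, on_brand_reply in voice_examples:
        messages.append({"role": "user", "content": request})
        messages.append({"role": "assistant", "content": on_brand_reply})
    messages.append({"role": "user", "content": user_request})
    return messages

# Invented brand-voice examples
examples = [
    ("Write a launch tweet for our export feature.",
     "CSV export is live. Your data, your spreadsheet, two clicks."),
    ("Announce scheduled maintenance.",
     "Heads up: we're tuning the engines Saturday 02:00-04:00 UTC. Brief downtime, big payoff."),
]
messages = tone_messages("Write a tweet about our new dark mode.", examples)
```

The resulting list drops straight into the `messages` parameter of `client.messages.create`, with your normal instructions in the `system` parameter.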

3. Classification with Non-Standard Labels or Edge Cases

Standard sentiment analysis? Zero-shot is fine. A custom support ticket triage with categories like “billing-dispute-latent-churn-risk” vs. “billing-dispute-resolved”? You need examples to demonstrate where the decision boundary lives. Even 2-3 edge case examples can shift accuracy meaningfully on ambiguous classes.

4. Complex Reasoning with Domain-Specific Constraints

Tasks involving legal, financial, or scientific reasoning that require domain-specific heuristics benefit from examples that demonstrate the reasoning chain, not just the output. This is related to chain-of-thought prompting but distinct — you’re showing what the right answer looks like, not just how to think. For more on combining these techniques, the comparison of chain-of-thought vs role prompting vs constitutional AI is worth reading.

When Zero-Shot Outperforms (or Ties) Few-Shot

This is the part most tutorials skip, which is why it’s worth spending time on.

Well-Defined Tasks Claude Already Knows Well

Summarization, translation, basic code explanation, grammar correction — Claude 3.5 Sonnet handles these with precise zero-shot instructions. Adding examples often doesn’t move accuracy but does add tokens and occasionally introduces anchoring bias (the model mimics the format of your example even when a better format exists for the specific input).

I ran 200 summarization tasks on Claude 3.5 Haiku comparing zero-shot with a detailed instruction ("summarize in 3 bullet points, each under 20 words, focus on actionable insights") vs. three-shot with example summaries. Zero-shot: 91% met the quality bar. Three-shot: 89%. The examples added cost and zero measurable benefit; the model was already calibrated for this task.

When Your Examples Are Low Quality

This sounds obvious but it’s the most common failure mode I see. Rushed few-shot examples that don’t actually represent the ideal output teach the model the wrong distribution. You get consistent outputs, consistently wrong. Bad examples are worse than no examples. Zero-shot with a clear system prompt beats few-shot with mediocre examples every time.

Highly Variable Inputs

If your inputs span a wide domain (e.g., a general-purpose assistant), fixed examples can actually narrow the model’s response range in ways you don’t want. The examples create an implicit prior that the model tries to match, even when the input is structurally different from your examples.

A Practical Testing Framework

Don’t guess — measure. Here’s the setup I use when evaluating whether few-shot helps on a specific task:

import anthropic
import json
from typing import Callable

client = anthropic.Anthropic()

def run_prompt_variant(
    system: str,
    user_template: str,
    inputs: list[dict],
    evaluator: Callable[[str, dict], float],
    model: str = "claude-3-5-haiku-20241022"
) -> dict:
    """
    Run a prompt variant across a set of inputs and score each output.
    evaluator: function(output_text, input_dict) -> float [0, 1]
    """
    results = []
    total_input_tokens = 0
    total_output_tokens = 0

    for inp in inputs:
        user_content = user_template.format(**inp)
        response = client.messages.create(
            model=model,
            max_tokens=512,
            system=system,
            messages=[{"role": "user", "content": user_content}]
        )
        output = response.content[0].text
        score = evaluator(output, inp)

        # Track token usage for cost comparison
        total_input_tokens += response.usage.input_tokens
        total_output_tokens += response.usage.output_tokens
        results.append({"output": output, "score": score, "input": inp})

    avg_score = sum(r["score"] for r in results) / len(results)
    # Haiku pricing: $0.80/MTok input, $4/MTok output
    input_cost = (total_input_tokens / 1_000_000) * 0.80
    output_cost = (total_output_tokens / 1_000_000) * 4.00

    return {
        "avg_score": avg_score,
        "total_cost_usd": round(input_cost + output_cost, 6),
        "cost_per_call": round((input_cost + output_cost) / len(inputs), 6),
        "results": results
    }

# Zero-shot system prompt
zero_shot_system = """You are a support ticket classifier. Classify each ticket into one of:
billing, technical, feature-request, churn-risk, other.
Respond with ONLY the category label."""

# Few-shot system prompt — examples embedded directly
few_shot_system = """You are a support ticket classifier. Classify into:
billing, technical, feature-request, churn-risk, other.

Examples:
Ticket: "I've been charged twice this month and I'm thinking of canceling"
Category: churn-risk

Ticket: "How do I export my data to CSV?"
Category: technical

Ticket: "It would be great if you could add dark mode"
Category: feature-request

Respond with ONLY the category label."""

# Simple exact-match evaluator
def exact_match_evaluator(output: str, inp: dict) -> float:
    return 1.0 if output.strip().lower() == inp["label"].lower() else 0.0

# Run comparison
test_inputs = [
    {"text": "I'm going to cancel if this isn't fixed today", "label": "churn-risk"},
    {"text": "Can't log in after password reset", "label": "technical"},
    # ... add more
]

zero_shot_result = run_prompt_variant(
    zero_shot_system,
    "Ticket: {text}",
    test_inputs,
    exact_match_evaluator
)

few_shot_result = run_prompt_variant(
    few_shot_system,
    "Ticket: {text}",
    test_inputs,
    exact_match_evaluator
)

print(f"Zero-shot: score={zero_shot_result['avg_score']:.2f}, cost/call=${zero_shot_result['cost_per_call']}")
print(f"Few-shot:  score={few_shot_result['avg_score']:.2f}, cost/call=${few_shot_result['cost_per_call']}")

Run this across at least 50 representative inputs before committing to either approach. The evaluator is the hard part — for subjective tasks, you’ll need human ratings or a judge LLM. For classification and structured output, exact-match or schema-validation works fine.
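For structured output, a schema-validation evaluator slots straight into the `evaluator` parameter above. The required fields here are hypothetical; adapt the dict to your own schema:

```python
import json

# Hypothetical required fields and their expected JSON types
REQUIRED_FIELDS = {"contact_name": str, "intent": str}

def schema_evaluator(output: str, inp: dict) -> float:
    """Score 1.0 if the output parses as a JSON object and every
    required field is present with the right type, else 0.0."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), ftype):
            return 0.0
    return 1.0
```

A binary pass/fail keeps the aggregate score interpretable as "fraction of outputs that would survive your downstream parser," which is usually the number you actually care about.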

Three Misconceptions Worth Addressing Directly

Misconception 1: “More examples always help”

The research on this is clear: returns diminish quickly. For Claude 3.5 models, 3-5 high-quality examples typically capture most of the available gain. Going to 10+ examples costs tokens linearly but rarely improves performance — and sometimes hurts it by creating context clutter. I’ve seen 8-shot prompts perform worse than 3-shot on the same task because the extra examples introduced subtle inconsistencies.

Misconception 2: “Few-shot examples replace a good system prompt”

Examples and instructions are complementary, not substitutes. The best setup is usually: a precise system prompt that defines the task, constraints, and output format, plus 2-4 examples that demonstrate edge cases or non-obvious decisions. The system prompt handles the general case; examples handle the specific ambiguities. This is directly relevant to writing high-performing system prompts for Claude agents — the same principles apply.

Misconception 3: “Few-shot is only for complex tasks”

Sometimes the simplest tasks benefit most from examples. A one-line output (like a category label or a boolean) with a non-obvious definition can be pinned precisely with a single example. A complex essay task may need zero examples if the instructions are good enough. Task complexity and few-shot benefit are only loosely correlated — format novelty and decision-boundary ambiguity are better predictors.

Dynamic Few-Shot: The Production Pattern Worth Knowing

Static examples embedded in your prompt are the beginner move. In production, you often want dynamic few-shot selection: retrieve the most relevant examples for each specific input from a pool, rather than using fixed examples for all inputs.

from anthropic import Anthropic
# Assumes you have a vector store set up for example retrieval
# See: https://www.unpromptedmind.com/semantic-search-embeddings-agents/

def build_dynamic_few_shot_prompt(
    task_description: str,
    user_input: str,
    example_store,  # your vector store / retriever
    n_examples: int = 3
) -> tuple[list[dict], str]:
    """
    Retrieve semantically similar examples for the current input,
    then construct a few-shot prompt dynamically.
    """
    # Retrieve top-n similar examples from your example bank
    similar_examples = example_store.search(user_input, top_k=n_examples)

    # Build the few-shot block
    example_block = "\n\n".join([
        f"Input: {ex['input']}\nOutput: {ex['output']}"
        for ex in similar_examples
    ])

    system = f"""{task_description}

Here are relevant examples:
{example_block}

Apply the same pattern to the user's input."""

    return [{"role": "user", "content": user_input}], system

client = Anthropic()
messages, system = build_dynamic_few_shot_prompt(
    task_description="Classify support tickets into: billing, technical, feature-request, churn-risk, other.",
    user_input="I've been waiting 3 days for a refund and I'm losing patience",
    example_store=my_example_store  # your implementation
)

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=50,
    system=system,
    messages=messages
)

Dynamic few-shot consistently outperforms static few-shot on tasks with diverse input distributions. The tradeoff: latency from the retrieval step and complexity of maintaining an example bank. For high-value classifications or extractions, it’s worth it. For simple, uniform tasks, static examples or zero-shot are faster and nearly as good.
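The `example_store.search` interface assumed above can be stubbed for local experiments. In this sketch, bag-of-words cosine similarity stands in for real embeddings, which is a deliberate toy assumption; swap in an embedding model before relying on retrieval quality:

```python
from collections import Counter
import math

class InMemoryExampleStore:
    """Minimal stand-in for a vector store, exposing the same
    search(query, top_k) shape as the retriever assumed above.
    Uses bag-of-words cosine similarity instead of embeddings."""

    def __init__(self, examples: list[dict]):
        # each example: {"input": str, "output": str}
        self.examples = examples
        self.vectors = [Counter(ex["input"].lower().split()) for ex in examples]

    @staticmethod
    def _cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def search(self, query: str, top_k: int = 3) -> list[dict]:
        q = Counter(query.lower().split())
        scored = sorted(
            zip(self.examples, self.vectors),
            key=lambda pair: self._cosine(q, pair[1]),
            reverse=True,
        )
        return [ex for ex, _ in scored[:top_k]]

store = InMemoryExampleStore([
    {"input": "charged twice and thinking of leaving", "output": "churn-risk"},
    {"input": "cannot log in after reset", "output": "technical"},
    {"input": "please add dark mode", "output": "feature-request"},
])
top = store.search("I was charged twice this month", top_k=1)
```

Because the interface matches, you can develop and test the prompt-assembly logic against this stub, then swap in a real embedding-backed store without touching the calling code.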

The Decision Matrix

Here’s the practical guide I actually use:

  • Standard task + clear instructions → zero-shot first. Test it. If accuracy meets your threshold, ship it.
  • Custom output format → 2-3 examples, minimum. Format is the hardest thing to specify in natural language.
  • Non-standard classification / edge-heavy task → 3-5 examples covering the ambiguous cases.
  • Style/tone matching → 2-4 examples of ideal output. No amount of description beats showing the actual thing.
  • High-volume + diverse inputs → dynamic few-shot if you can afford the retrieval overhead.
  • Cost-sensitive + acceptable zero-shot quality → don’t add examples. The 3-15% quality gain rarely justifies the token cost if zero-shot already meets the bar.

If you’re running complex multi-step agents where prompt quality at each node compounds, the same logic applies — and it pairs well with role prompting best practices to get consistent behavior across the pipeline.
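If it helps to make the matrix executable, here is one reading of it as a first-pass heuristic. The flags and their precedence are my interpretation for illustration, not a rule:

```python
def prompt_strategy(
    custom_format: bool,
    ambiguous_boundaries: bool,
    style_matching: bool,
    diverse_inputs: bool,
    zero_shot_meets_bar: bool,
) -> str:
    """Encode the decision matrix as a first-pass heuristic.
    Ordering reflects one reasonable reading of the guidance above."""
    if zero_shot_meets_bar and not (custom_format or style_matching):
        return "zero-shot"
    if diverse_inputs and (custom_format or ambiguous_boundaries):
        return "dynamic few-shot"
    if custom_format or style_matching:
        return "static few-shot (2-4 examples)"
    if ambiguous_boundaries:
        return "static few-shot (3-5 edge-case examples)"
    return "zero-shot"
```

Treat the return value as a starting hypothesis to benchmark, not a final answer; the testing framework earlier in the article is what actually settles it.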

Frequently Asked Questions

How many examples should I use in a few-shot prompt for Claude?

For most tasks, 2-5 high-quality examples is the sweet spot. Claude 3.5 models generalize well from fewer examples than older models needed. Beyond 5 examples, gains are typically marginal and you’re paying for tokens that don’t move the quality needle — unless you’re doing dynamic retrieval and matching examples to input semantics.

Does the order of few-shot examples matter?

Yes, somewhat. Research across frontier models shows a mild recency bias — examples closer to the actual query have slightly more influence. Put your most representative or edge-case-covering example last, immediately before the user’s input. Don’t agonize over order, but don’t randomize it either.

Can I mix zero-shot instructions and few-shot examples in the same prompt?

Absolutely — this is often the best approach. Use the system prompt for explicit instructions (what to do, constraints, output format) and use examples to demonstrate the non-obvious cases or calibrate tone. They’re complementary, not competing. The instructions handle the general case; examples cover the ambiguities your words can’t capture precisely.

When does zero-shot outperform few-shot with Claude?

Zero-shot wins when the task is well within Claude’s training distribution (summarization, translation, basic classification), when your examples are mediocre or inconsistent, or when your inputs are so varied that fixed examples create anchoring bias. Always benchmark both before committing — don’t assume examples help just because they add specificity.

How do I evaluate whether my few-shot examples are actually improving output quality?

For structured tasks (classification, extraction), use exact-match or schema validation against a ground-truth test set of at least 50 inputs. For open-ended tasks, either use human raters on a sample or set up an LLM judge with a rubric. Log token counts for both variants so you can calculate quality-per-dollar, not just raw accuracy. The code example in this article shows the basic framework.

Bottom Line: Match the Technique to the Task Complexity

The right call on few-shot vs. zero-shot prompting comes down to three variables: how novel your output format is, how ambiguous your decision boundaries are, and how much volume you’re running. Start zero-shot with a precise system prompt. If you’re missing your quality bar, add 2-3 targeted examples covering the failure cases you actually observed. Test both against real data before deciding. Don’t add examples because it feels safer — test them because you measured that they help.

For teams running at scale: instrument your prompts from day one. Log token counts, track quality metrics, and treat prompt configuration as a parameter you tune like any other. The teams getting the most out of Claude aren’t the ones with the cleverest prompts — they’re the ones measuring what actually works.

Put this into practice

Try the Context Manager agent — ready to use, no setup required.

Browse Agents →

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
