Sunday, April 5

Most developers trying to improve their LLM outputs reach for the same tools in the wrong order. They add a system prompt, see marginal improvement, then pile on role instructions, then chain-of-thought, then wonder why their prompts are 800 tokens long and still hallucinating. The reality is that each of the major prompt engineering techniques has a specific problem domain where it earns its token cost — and domains where it actively hurts. This article breaks down chain-of-thought, role prompting, and constitutional AI with concrete examples, real failure modes, and code you can drop into your workflow today.

The Three Techniques, Briefly Defined

Before comparing them, a fast definition of each so we’re using the same language:

  • Chain-of-Thought (CoT): You instruct the model to reason step-by-step before producing a final answer. Either via few-shot examples (“Q: … A: Let’s think step by step…”) or zero-shot (“Think through this carefully before answering”).
  • Role Prompting: You assign the model a persona or expert identity. “You are a senior DevOps engineer with 10 years of Kubernetes experience.” The hypothesis is that the framing shapes the output distribution toward expert-quality responses.
  • Constitutional AI (CAI): Developed by Anthropic, this technique has the model critique and revise its own outputs against a set of principles before finalizing. In prompt engineering practice, this means asking the model to evaluate its draft against rules you specify, then rewrite.

You’ve probably tried all three. The question is whether you’re using them where they actually move the needle.

Chain-of-Thought: When Reasoning Chains Actually Help

CoT delivers the most measurable gains on tasks that have verifiable intermediate steps — math, logic puzzles, multi-hop reasoning, code debugging. The original Google paper (Wei et al., 2022) reported accuracy gains of roughly 40 percentage points on grade-school math benchmarks for the largest models tested. That’s real. But most production use cases aren’t grade-school math.

Where CoT earns its tokens

In my experience, CoT consistently helps with:

  • Multi-condition business logic (“If the user is on the free tier AND their trial expired AND they haven’t seen the upgrade modal in 7 days, then…”)
  • Debugging tasks where the model needs to trace state through code
  • Classification tasks with overlapping categories where edge cases matter
  • Any task where you need the model to show its work so you can catch errors downstream

Here’s a zero-shot CoT pattern I use for classification tasks in production:

import anthropic

client = anthropic.Anthropic()

def classify_with_cot(text: str, categories: list[str]) -> dict:
    categories_str = ", ".join(categories)
    
    prompt = f"""Classify the following support ticket into one of these categories: {categories_str}

Ticket: {text}

Work through this step by step:
1. Identify the core problem the user is describing
2. Note any secondary issues mentioned
3. Consider which category best fits the PRIMARY issue
4. State your final classification and confidence (high/medium/low)

Format your response as:
REASONING: <your step by step analysis>
CATEGORY: <single category from the list>
CONFIDENCE: <high|medium|low>"""

    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # Haiku is fine for structured classification
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}]
    )
    
    # Parse structured output; REASONING may span multiple lines,
    # so accumulate lines until the next marker appears
    content = response.content[0].text
    result = {}
    reasoning_lines = []
    in_reasoning = False
    for line in content.strip().split('\n'):
        if line.startswith('REASONING:'):
            in_reasoning = True
            reasoning_lines.append(line[len('REASONING:'):].strip())
        elif line.startswith('CATEGORY:'):
            in_reasoning = False
            result['category'] = line[len('CATEGORY:'):].strip()
        elif line.startswith('CONFIDENCE:'):
            in_reasoning = False
            result['confidence'] = line[len('CONFIDENCE:'):].strip()
        elif in_reasoning:
            reasoning_lines.append(line)
    result['reasoning'] = '\n'.join(reasoning_lines).strip()

    return result

This runs on Claude Haiku at roughly $0.001–0.002 per classification at current pricing. The structured output makes it easy to log the reasoning separately and flag low-confidence cases for human review.
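The low-confidence flag is only useful if something acts on it. A minimal escalation gate over the parsed result (the threshold policy here is my own; tune it to your review capacity):

```python
def needs_human_review(result: dict) -> bool:
    # Escalate when the model said it wasn't confident, or when
    # parsing failed to find a category at all.
    confidence = result.get("confidence", "").lower()
    return confidence == "low" or "category" not in result
```

In practice you'd also escalate when the category isn't in your allowed list, but that check belongs wherever you validate model output generally.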

Where CoT wastes tokens

CoT adds almost nothing to simple retrieval tasks, format conversion, or single-step transformations. Asking a model to “think step by step” before converting a CSV to JSON is burning tokens for theater. On Haiku, that’s cheap. On GPT-4o, you’re paying for nothing. If the task doesn’t require intermediate reasoning, skip CoT.
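By contrast, a single-step transformation needs only the instruction itself. A sketch of the no-CoT prompt (wording is mine), sent as the sole user message to a small model:

```python
def build_conversion_prompt(csv_text: str) -> str:
    # One direct instruction, no "think step by step". For a
    # single-step transformation, a reasoning scaffold only adds
    # latency and output tokens.
    return (
        "Convert this CSV to a JSON array of objects, using the first "
        "row as keys. Output only the JSON.\n\n" + csv_text
    )
```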

Role Prompting: Expert Persona vs Empty Costume

Role prompting is the most overused technique in the toolkit. It’s the first thing people try and the last thing they examine critically. Telling Claude “you are a world-class Python engineer” does have a real effect — but it’s not magic, and it degrades fast as the task gets more specific.

When role prompting actually changes the output

Role prompting works best when the role carries implicit stylistic and structural expectations that align with what you want. “You are a technical writer producing API documentation” shapes output toward conciseness, proper code block formatting, and consistent terminology — because that’s what good API docs look like in the training data. The role is a shortcut for a bunch of style instructions you’d otherwise have to write explicitly.

Where it genuinely helps:

  • Setting tone and register (formal analyst vs. casual Slack summary)
  • Triggering domain-specific output patterns (legal language, medical documentation format)
  • Suppressing unwanted tendencies (a “blunt code reviewer” persona reduces hedging)

Here’s a persona system prompt worth its tokens:

SYSTEM_PROMPT = """You are a senior backend engineer conducting a code review.
Your reviews are direct and specific. You:
- Flag security issues first, always
- Point to the exact line or pattern that's problematic
- Suggest the fix, not just the problem
- Skip praise unless something is genuinely non-obvious and well done
- Keep total review under 300 words"""

# This persona does real work: it shapes length, priorities, and tone
# in ways that would take 5-6 explicit instructions to match
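Wiring the persona in: the Messages API takes role text via the system parameter, not as a user turn. A minimal sketch (the review-request wording is my own; pass the SYSTEM_PROMPT above as the system_prompt argument):

```python
def build_review_request(diff: str) -> str:
    # Keep the user turn about the artifact under review;
    # the persona lives entirely in the system parameter.
    return f"Review this diff:\n\n{diff}"

def run_review(diff: str, system_prompt: str) -> str:
    import anthropic  # imported lazily so the helper above stays testable without the SDK

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        system=system_prompt,  # e.g. the SYSTEM_PROMPT defined above
        messages=[{"role": "user", "content": build_review_request(diff)}],
    )
    return response.content[0].text
```

Usage: run_review(diff, SYSTEM_PROMPT).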

Where role prompting fails

The persona breaks down on specialized factual knowledge. “You are a tax attorney” doesn’t give Claude knowledge it doesn’t have about your jurisdiction’s current tax code. It’ll confidently produce attorney-flavored output that’s factually wrong. Role prompting shapes style and structure; it does not inject knowledge. I’ve watched founders ship customer-facing legal summaries under “you are a legal expert” prompts and end up with polished-sounding hallucinations. Don’t do this.

Also: nested or contradictory roles (“you are a friendly but firm, casual but professional, creative but precise assistant”) produce mush. Pick one dominant role.

Constitutional AI in Prompt Practice: Self-Critique That Actually Runs

Anthropic’s Constitutional AI is a training methodology, not a prompting trick — but the core idea is fully reproducible at inference time. You give the model a set of principles, have it generate a draft, critique the draft against those principles, and revise. This is expensive (you’re doing 2–3 LLM calls per output) but it’s the most reliable technique for outputs where correctness and safety constraints matter more than speed.

Implementing CAI-style self-critique in your prompts

import anthropic

client = anthropic.Anthropic()

PRINCIPLES = """
1. Do not make claims about specific ROI or revenue impact without a source
2. Do not present speculation as fact — qualify uncertain statements
3. Do not recommend tools you have no information about in this context
4. Prefer concrete examples over abstract descriptions
"""

def constitutional_generate(user_request: str) -> str:
    # Step 1: Generate initial draft
    draft_response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=800,
        messages=[{"role": "user", "content": user_request}]
    )
    draft = draft_response.content[0].text

    # Step 2: Self-critique against principles
    critique_prompt = f"""Here is a draft response:

<draft>
{draft}
</draft>

Evaluate this draft against each of these principles:
{PRINCIPLES}

For each principle, state: PASS or FAIL, and if FAIL, explain specifically what's wrong.
Then write a revised version that fixes all failures."""

    final_response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1200,
        messages=[{"role": "user", "content": critique_prompt}]
    )
    
    # In production, parse out just the revised version
    # Here returning full critique + revision for visibility
    return final_response.content[0].text

# Cost note: Two Sonnet calls per run ≈ $0.015–0.025 depending on length
# Only justified when output quality has direct business consequences

The cost is real: two Sonnet-class calls per output runs roughly $0.015–0.025 at current pricing for moderate-length content. I’d only use this pattern for outputs going directly to customers, legal documents, or medical-adjacent content where a hallucination has actual consequences.

Where CAI-style prompting breaks down

The model critiquing its own output has a well-documented blind spot: it tends to agree with its own draft. If the original response confidently stated something wrong, the critique step often passes it anyway. This is model sycophancy applied to self-review. Mitigation: use a different model for the critique step (generate with Haiku, critique with Sonnet), or use structured rubrics with specific yes/no questions rather than open-ended evaluation.
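The second mitigation, a structured rubric, looks like this in practice. The question wording and tag layout are my own conventions; the point is that closed yes/no questions give the critic less room to rubber-stamp the draft:

```python
def build_rubric_prompt(draft: str, questions: list[str]) -> str:
    # Closed yes/no questions, each demanding evidence, rather than
    # one open-ended "evaluate this draft" invitation to agree.
    numbered = "\n".join(
        f"{i}. {q} Answer YES or NO, then one sentence of evidence."
        for i, q in enumerate(questions, 1)
    )
    return (
        f"<draft>\n{draft}\n</draft>\n\n"
        f"Answer each question about the draft:\n\n{numbered}"
    )

def cross_model_critique(draft: str, questions: list[str]) -> str:
    import anthropic  # lazy import keeps the prompt builder testable without the SDK

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5",  # a different model than the one that drafted
        max_tokens=400,
        messages=[{"role": "user", "content": build_rubric_prompt(draft, questions)}],
    )
    return response.content[0].text
```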

Head-to-Head: Which Technique for Which Task

Here’s how I actually assign techniques in production workflows:

  • Math, logic, multi-step reasoning: CoT, zero-shot or few-shot. Nothing else comes close.
  • Content generation (blog posts, emails, summaries): Role prompting for style calibration, skip CoT entirely.
  • Classification and routing: CoT with structured output parsing. Role adds little.
  • Customer-facing factual content: CAI-style self-critique. The extra call cost is worth the error reduction.
  • Code generation: Role prompting (sets language idioms and code style) + CoT for complex logic (not boilerplate).
  • Agent tool-use decisions: CoT is critical here — you want the model’s reasoning visible before it fires an action.
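The assignments above are simple enough to encode as a lookup. The task labels and technique names here are my shorthand, not a standard taxonomy:

```python
# Map task types to the technique stack worth paying for.
TECHNIQUE_ROUTER: dict[str, list[str]] = {
    "math":            ["cot"],
    "content":         ["role"],
    "classification":  ["cot"],
    "customer_facing": ["cai"],
    "codegen":         ["role", "cot"],
    "agent_tool_use":  ["cot"],
}

def techniques_for(task_type: str) -> list[str]:
    # Unknown task types default to a plain single-shot prompt:
    # no extra machinery until the task proves it needs some.
    return TECHNIQUE_ROUTER.get(task_type, [])
```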

The techniques aren’t mutually exclusive. A code review agent might use role prompting in the system prompt, CoT for analyzing a complex function, and a lightweight CAI pass before posting the review comment. But adding all three to every prompt is how you end up with expensive, slow pipelines that don’t outperform a well-written single-shot prompt.

The Bottom Line: Picking the Right Tool

If you’re a solo founder or indie developer shipping fast and watching costs: default to role prompting for style, add CoT only for reasoning-heavy tasks, and skip CAI until you have content that genuinely needs it. You’ll cover 80% of use cases at minimal cost.

If you’re a team building production pipelines with customer-facing outputs: instrument your prompts to measure output quality by technique. CoT adds latency and tokens — quantify whether it’s buying you accuracy in your specific domain. CAI is worth the cost for regulated or high-stakes outputs; build it as a separate review stage rather than embedded in every call.

If you’re building AI agents with tool access: CoT is non-negotiable. An agent that doesn’t reason visibly before acting is an agent you can’t debug. Role prompting helps keep agents on-task. CAI is useful as a final safety check before irreversible actions.

The honest verdict on prompt engineering techniques: CoT has the clearest empirical support for reasoning tasks. Role prompting is overused but genuinely useful for style control. CAI is the most underused for high-stakes outputs despite being straightforward to implement. Stop adding all three by default and start matching technique to task type — your token costs and output quality will both improve.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
