Most developers first encounter hallucinations as an annoyance — the model invents an API parameter, fabricates a citation, or confidently states the wrong version number. In production, that annoyance becomes a liability. If you’re trying to reduce LLM hallucinations in production, the standard advice (“improve your prompts”) barely scratches the surface. The real fix is architectural: you need to stop relying on the model’s confidence and start building systems where correctness is structurally enforced rather than hoped for.
This article is about the patterns that actually work at scale — structured outputs, verification loops, confidence gating, and grounding strategies. I’ll include working code, realistic latency/cost estimates, and honest coverage of where each approach breaks down.
Why Prompting Alone Won’t Solve Hallucinations
The most common misconception I see is that hallucinations are primarily a prompting problem. Add “only answer from the provided context” to your system prompt and you’re done, right? No. The model will still generate plausible-sounding text even when it shouldn’t, because that’s literally what it’s trained to do — predict the next most likely token given context. Telling it not to hallucinate in a system prompt is like telling a calculator not to make arithmetic errors by adding a sticky note to the screen.
The second misconception is that hallucinations are random. They’re not. They cluster around predictable failure modes:
- Recall under specificity pressure — when you ask for specific numbers, dates, or citations the model doesn’t have cleanly in weights, it interpolates
- Knowledge cutoff confusion — the model confidently states things that were true at training time but have since changed
- Cross-contamination — details from similar-but-different entities get merged (two companies in the same industry, two people with similar names)
- Plausible structure filling — when given a partial schema to fill out, the model completes fields with plausible values even when the source data doesn’t support them
Understanding these clusters tells you where to apply structural controls. Our piece on grounding strategies that actually work covers the RAG and retrieval side of this problem in depth. Here I’m focusing on the output and verification layer — what happens after retrieval, when the model is generating.
Structured Outputs: Force the Model Into a Verifiable Form
The most effective single change you can make is switching from free-form text generation to structured, schema-constrained outputs. When you force the model to fill a Pydantic model or a JSON schema, you gain two things: fields you can validate programmatically, and a separation between “extracted value” and “model confidence” that you can exploit.
Basic Schema-Constrained Extraction
Here’s a pattern I use for document extraction tasks where hallucination risk is high — things like pulling contract terms, invoice line items, or medical record fields:
```python
import anthropic
from pydantic import BaseModel, Field
from typing import Optional
import json


class ContractExtraction(BaseModel):
    party_a: str = Field(description="First contracting party, exactly as named in document")
    party_b: str = Field(description="Second contracting party, exactly as named in document")
    effective_date: Optional[str] = Field(
        default=None,
        description="Effective date in ISO 8601 format. None if not explicitly stated."
    )
    termination_clause: Optional[str] = Field(
        default=None,
        description="Direct quote of termination clause. None if absent."
    )
    # Key: explicit confidence fields force the model to self-assess
    confidence_effective_date: float = Field(
        ge=0.0, le=1.0,
        description="0.0-1.0 confidence that effective_date is correctly extracted"
    )
    source_excerpts: list[str] = Field(
        description="List of exact quotes from source document supporting each extraction"
    )


client = anthropic.Anthropic()


def extract_contract_fields(document_text: str) -> ContractExtraction:
    schema = ContractExtraction.model_json_schema()
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        system="""You are a contract analysis system. Extract only what is explicitly stated
in the document. If a field is not present, return null. Never infer or assume values.
For source_excerpts, you MUST include the exact text from the document that supports
each extraction — if you cannot find supporting text, set the field to null.""",
        messages=[{
            "role": "user",
            "content": f"""Extract contract fields from this document according to the schema.

Document:
{document_text}

Return ONLY valid JSON matching this schema:
{json.dumps(schema, indent=2)}"""
        }]
    )
    raw = response.content[0].text
    # Strip any markdown fencing if present
    if raw.startswith("```"):
        raw = raw.split("```")[1]
        if raw.startswith("json"):
            raw = raw[4:]
    return ContractExtraction.model_validate_json(raw.strip())
```
A few non-obvious things are happening here: the source_excerpts field forces the model to ground each extraction in actual document text. If it can't find supporting text, it's supposed to return null — and in practice this dramatically reduces invented values, because the model has to commit to a verifiable claim. The confidence score fields are imperfect but useful as a triage signal; anything below your calibrated threshold (0.75 in the routing example later in this article) should route to human review.
At Claude Haiku 3.5 pricing (~$0.0008/1K input tokens, ~$0.004/1K output tokens), a typical 2,000-word contract extraction runs around $0.002-0.003 per document. Opus 4 is roughly 15x more expensive but substantially better at complex multi-party contracts with ambiguous clauses.
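Those per-document figures are easy to sanity-check yourself. Here's a rough cost estimator using the per-1K-token rates quoted above; the token counts are assumptions (a 2,000-word contract is roughly 2,700 input tokens, and a structured extraction might produce ~150 output tokens), so treat the result as an order-of-magnitude estimate:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_per_1k: float, output_per_1k: float) -> float:
    """Rough per-request cost from token counts and per-1K-token rates."""
    return (input_tokens / 1000) * input_per_1k + (output_tokens / 1000) * output_per_1k


# Assumed token counts for a 2,000-word contract at the Haiku 3.5
# rates quoted in the text above. Verify current pricing before relying on this.
cost = estimate_cost(2700, 150, 0.0008, 0.004)
print(f"${cost:.4f} per document")  # prints $0.0028
```

Re-run the arithmetic whenever you switch models; the ~15x jump to Opus changes the break-even point for adding verification passes.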
Verification Loops: Don’t Trust, Check
Structured output gets you verifiable structure. Verification loops get you verified content. The pattern is simple: after the primary generation, run a second, cheaper pass that checks specific claims against source material.
Two-Pass Verification Pattern
```python
def verify_extraction(
    original_document: str,
    extraction: ContractExtraction
) -> dict:
    """
    Second-pass verification: asks Claude to check specific claims
    against source text. Cheaper model, targeted queries.
    """
    claims_to_verify = []
    if extraction.effective_date:
        claims_to_verify.append(
            f"The effective date is {extraction.effective_date}"
        )
    if extraction.party_a:
        claims_to_verify.append(
            f"The first party is named '{extraction.party_a}'"
        )
    if not claims_to_verify:
        return {"verified": True, "issues": []}

    claims_text = "\n".join(f"- {c}" for c in claims_to_verify)
    response = client.messages.create(
        model="claude-haiku-4-5",  # cheaper model for verification
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"""For each claim below, state whether it is SUPPORTED, CONTRADICTED,
or NOT_FOUND in the document. Be terse. Format: "claim: STATUS [brief reason]"

Document excerpt:
{original_document[:3000]}

Claims:
{claims_text}"""
        }]
    )
    verification_text = response.content[0].text
    issues = [
        line for line in verification_text.split("\n")
        if "CONTRADICTED" in line or "NOT_FOUND" in line
    ]
    return {
        "verified": len(issues) == 0,
        "issues": issues,
        "raw_verification": verification_text
    }
```
The two-pass approach costs roughly 40-50% more per request but catches a meaningful portion of extraction errors. In my testing on 500 contract documents, single-pass extraction had a ~12% field-level error rate; adding verification reduced that to ~4%. The remaining 4% mostly required domain expert judgment — no automated system catches everything.
One thing the documentation doesn’t make clear: the verifier model should be given less context than the extractor. If you feed the verifier the same giant prompt, it tends to defer to the extraction rather than check it independently. Truncate or chunk the source document differently for the verification pass.
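One way to build that independent view is to select document windows around terms from the claims rather than reusing the extractor's prompt. This is a sketch, not a tested heuristic; the anchor-term selection and window size are arbitrary choices you'd tune:

```python
def verification_context(document: str, claims: list[str],
                         window: int = 3000) -> str:
    """Select document windows that mention terms from the claims,
    instead of feeding the verifier the extractor's full prompt.
    Falls back to the document head if nothing matches."""
    lowered = document.lower()
    spans: list[tuple[int, int]] = []
    for claim in claims:
        # Use the longest words in the claim as crude anchor terms
        terms = sorted(claim.lower().split(), key=len, reverse=True)
        for term in terms[:2]:
            idx = lowered.find(term)
            if idx != -1:
                start = max(0, idx - window // 2)
                spans.append((start, start + window))
                break
    if not spans:
        return document[:window]
    # Merge overlapping windows, then concatenate the excerpts
    spans.sort()
    merged = [spans[0]]
    for s, e in spans[1:]:
        if s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return "\n...\n".join(document[s:e] for s, e in merged)
```

Because the verifier sees a different slice of the source than the extractor did, agreement between the two passes carries more signal.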
Confidence Gating and Routing
Not every request needs the same level of scrutiny. A hallucination in a customer-facing medical recommendation is catastrophic. A hallucination in an internal summary draft is annoying but recoverable. Build confidence gates that route requests to different handling paths based on field-level confidence scores.
```python
from enum import Enum


class ReviewRoute(Enum):
    AUTO_ACCEPT = "auto_accept"
    SOFT_REVIEW = "soft_review"  # flag for async human review
    HARD_BLOCK = "hard_block"    # block until reviewed


def route_extraction(
    extraction: ContractExtraction,
    verification: dict,
    field_criticality: dict  # e.g., {"effective_date": "critical", "party_a": "required"}
) -> ReviewRoute:
    # Any verification failure on a critical field = hard block.
    # Claims use natural phrasing ("the effective date is..."), so match
    # on the field name with underscores replaced by spaces.
    for issue in verification.get("issues", []):
        for field, level in field_criticality.items():
            if field.replace("_", " ") in issue.lower() and level == "critical":
                return ReviewRoute.HARD_BLOCK
    # Low confidence on required fields = soft review
    if extraction.confidence_effective_date < 0.75:
        return ReviewRoute.SOFT_REVIEW
    # Any remaining verification failure = soft review
    if not verification["verified"]:
        return ReviewRoute.SOFT_REVIEW
    return ReviewRoute.AUTO_ACCEPT
```
This kind of routing is where you actually control production hallucination rates rather than just measuring them. Pair this with an observability layer — Langfuse is my recommendation for most teams for tracking field-level error rates and confidence distributions over time.
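Even before wiring up a dedicated tool, the minimum viable version of that observability layer is a running tally of routing outcomes and per-field confidence. A sketch (the class and method names are my own, not from any library):

```python
from collections import Counter, defaultdict


class ExtractionMetrics:
    """In-memory tally of routing outcomes and per-field confidence.
    In production you'd ship these to your tracing/metrics backend
    instead of holding them in process memory."""

    def __init__(self):
        self.routes = Counter()
        self.confidences = defaultdict(list)

    def record(self, route: str, field_confidences: dict[str, float]):
        self.routes[route] += 1
        for field, conf in field_confidences.items():
            self.confidences[field].append(conf)

    def auto_accept_rate(self) -> float:
        total = sum(self.routes.values())
        return self.routes["auto_accept"] / total if total else 0.0

    def mean_confidence(self, field: str) -> float:
        vals = self.confidences[field]
        return sum(vals) / len(vals) if vals else 0.0


metrics = ExtractionMetrics()
metrics.record("auto_accept", {"effective_date": 0.9})
metrics.record("soft_review", {"effective_date": 0.6})
print(metrics.auto_accept_rate())  # 0.5
```

A falling auto-accept rate or a drifting confidence distribution is usually the first visible symptom of a document-mix change upstream.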
System Prompt Architecture for Hallucination Resistance
There’s a specific system prompt structure that materially reduces hallucination rates for extraction and QA tasks. The key elements, in order of impact:
- Explicit null/absent value handling — “If a value is not present in the source, return null. Do not infer, estimate, or extrapolate.”
- Scope bounding — “Your answers must be grounded only in the provided document. Do not use your training knowledge about this company, person, or topic.”
- Quote requirements — “For every factual claim, include the exact source text that supports it.”
- Uncertainty expression — “If you are uncertain about a value, express that uncertainty in the confidence field rather than guessing.”
Note what’s missing: “be accurate” and “don’t hallucinate” are not on this list. Those instructions don’t help because the model genuinely believes its hallucinations are accurate. Structural constraints beat exhortations every time. For a deeper look at system prompt construction, our framework for consistent agent behavior at scale covers the broader architecture.
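Because the four elements are structural rather than task-specific, they can be composed mechanically so every extraction task gets the same skeleton. A sketch; the exact wording here is illustrative, not a tested incantation:

```python
def build_extraction_system_prompt(task_description: str) -> str:
    """Assemble a hallucination-resistant system prompt from the four
    structural elements, in order of impact."""
    elements = [
        task_description,
        # 1. Explicit null/absent value handling
        "If a value is not present in the source, return null. "
        "Do not infer, estimate, or extrapolate.",
        # 2. Scope bounding
        "Your answers must be grounded only in the provided document. "
        "Do not use training knowledge about this company, person, or topic.",
        # 3. Quote requirements
        "For every factual claim, include the exact source text that supports it.",
        # 4. Uncertainty expression
        "If you are uncertain about a value, express that uncertainty in "
        "the confidence field rather than guessing.",
    ]
    return "\n\n".join(elements)


prompt = build_extraction_system_prompt("You are a contract analysis system.")
```

Keeping the skeleton in code rather than copy-pasted strings also means a wording improvement propagates to every pipeline at once.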
A Real Production Case Study: Invoice Processing
A client processes ~3,000 supplier invoices per month. Before architectural changes, their Claude-based extraction pipeline had a ~15% error rate on line-item totals — not because the model was bad at math, but because it was filling in amounts it expected to see rather than what was actually on the page.
Changes made:
- Added a mandatory source_text field to every extracted value in the schema
- Added a numeric validation pass: extracted totals were summed programmatically and compared against the stated invoice total (pure arithmetic, no LLM)
- Added a second-pass Claude verification only for invoices where arithmetic validation failed
- Routed anything still failing to a human review queue
Result: error rate dropped from 15% to 2.3%. Human review queue shrank from ~450/month to ~70/month. The arithmetic validation step — zero LLM cost — caught more errors than the second LLM pass did. Sometimes the best verification isn’t another model call.
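That arithmetic check is worth showing because it's a few lines of code with zero inference cost. A minimal sketch, assuming the extraction schema exposes line items as dicts with an "amount" key and a stated invoice total (both names are illustrative):

```python
def validate_invoice_totals(line_items: list[dict], stated_total: float,
                            tolerance: float = 0.01) -> bool:
    """Pure-arithmetic check: do the extracted line items sum to the
    stated invoice total? No LLM involved. A mismatch routes the
    invoice to the second-pass verification."""
    computed = sum(item["amount"] for item in line_items)
    return abs(computed - stated_total) <= tolerance


items = [{"amount": 120.00}, {"amount": 45.50}, {"amount": 9.99}]
print(validate_invoice_totals(items, 175.49))  # True
print(validate_invoice_totals(items, 180.00))  # False
```

The tolerance absorbs floating-point noise and rounding conventions; in currency-sensitive pipelines you'd use Decimal instead of float.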
Total cost increase for the dual-pass architecture: approximately $0.004 per invoice versus $0.0025 for single-pass. The human review savings more than covered it. For batch processing at this scale, see our guide on handling 10,000+ documents efficiently with the Claude batch API — you can submit async batches at a 50% price discount, which changes the economics meaningfully.
What Breaks at Scale
A few failure modes I’ve hit in production that I don’t see documented elsewhere:
Schema complexity backfire. Beyond ~15-20 fields, structured output quality actually degrades. The model starts making more errors on later fields in the schema, likely due to attention dilution. Fix: split complex schemas into multiple focused extraction calls rather than one giant schema.
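The splitting fix is mechanical: partition the schema's properties into focused groups and run one extraction call per group, then merge the results. A sketch over a flat JSON schema (the group size is an empirical knob, not a magic number):

```python
def split_schema(schema: dict, max_fields: int = 8) -> list[dict]:
    """Split a flat JSON schema into sub-schemas of at most max_fields
    properties each, preserving required/optional status. Run one
    extraction call per sub-schema, then merge the resulting dicts."""
    props = list(schema.get("properties", {}).items())
    required = set(schema.get("required", []))
    sub_schemas = []
    for i in range(0, len(props), max_fields):
        chunk = dict(props[i:i + max_fields])
        sub_schemas.append({
            "type": "object",
            "properties": chunk,
            "required": [name for name in chunk if name in required],
        })
    return sub_schemas


big = {
    "type": "object",
    "properties": {f"field_{n}": {"type": "string"} for n in range(20)},
    "required": ["field_0", "field_19"],
}
parts = split_schema(big, max_fields=8)
print(len(parts))  # 3 sub-schemas: 8 + 8 + 4 fields
```

Group related fields together rather than splitting arbitrarily; the model does better when each call has a coherent theme (parties in one call, dates and terms in another).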
Confidence score miscalibration. The self-reported confidence scores are directionally useful but not well-calibrated probabilities. A confidence of 0.9 does not mean 90% accuracy. Treat them as ordinal signals (high/medium/low) rather than precise probabilities. Calibrate your thresholds empirically on your actual data.
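Empirical calibration is a few lines of bucketing once you have ground-truth labels. A sketch over hypothetical (confidence, was_correct) pairs:

```python
from collections import defaultdict


def calibration_table(samples: list[tuple[float, bool]]) -> dict[float, float]:
    """Bucket (confidence, was_correct) pairs into 0.1-wide bands and
    report observed accuracy per band. If the 0.9 band shows 70%
    accuracy, the self-reported scores are overconfident there."""
    buckets: dict[float, list[bool]] = defaultdict(list)
    for conf, correct in samples:
        # Tiny epsilon guards against float artifacts like 0.7 * 10 == 6.999...
        band = min(int(conf * 10 + 1e-9), 9) / 10
        buckets[band].append(correct)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}


samples = [(0.95, True), (0.92, True), (0.91, False),  # high band: 2/3 correct
           (0.55, True), (0.52, False)]                # mid band: 1/2 correct
print(calibration_table(samples))
```

With 100-200 labeled documents per document type, the table usually makes the right threshold obvious; anything fancier (isotonic regression, Platt scaling) is rarely worth it at this scale.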
Verification model echo. When the verifier is shown the extractor’s output prominently, it tends to confirm rather than challenge it — anchoring bias. Structure your verification prompt so the extraction result appears after the source text, not before, and use neutral framing (“evaluate these claims” rather than “confirm these findings”).
Long document truncation artifacts. If your document exceeds the context window and you truncate, the model doesn't know it's seeing a partial document. Add an explicit instruction to the prompt: "This is an excerpt. Only extract fields you can see direct evidence for in this excerpt."
When to Apply Each Pattern
Not every production use case needs all of this. Here’s the routing logic I’d apply:
- Low stakes, high volume (summaries, drafts, internal tools): structured output with confidence fields, no verification loop. The overhead isn’t worth it.
- Medium stakes (customer-facing content, CRM data entry): structured output + programmatic validation (arithmetic checks, regex, date parsing). Reserve second-pass LLM for failures only.
- High stakes (legal, medical, financial, compliance): full two-pass architecture with confidence gating and human review queue for anything below threshold. Budget for it — the cost of a hallucination in these domains exceeds the API bill by several orders of magnitude.
If you’re a solo founder building on a budget, start with structured outputs and programmatic validation — that combination handles 80% of hallucination risk at minimal cost. Add LLM verification only for the document types where errors are genuinely expensive. If you’re on a team shipping into a regulated domain, invest in the full architecture from the start; retrofitting confidence gating after an incident is painful.
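For that "structured outputs plus programmatic validation" starting point, the validators are mostly standard-library work. A sketch (the validation rules are illustrative; tighten them to your own data):

```python
import re
from datetime import date


def validate_iso_date(value: str) -> bool:
    """Reject anything that isn't a real ISO 8601 calendar date,
    including plausible-looking fabrications like Feb 29 in a non-leap year."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def validate_plain_amount(value: str) -> bool:
    """Accept bare decimal amounts like '1234.56'; reject currency
    symbols, thousands separators, and other formatting the model
    might invent."""
    return bool(re.fullmatch(r"\d+(\.\d{1,2})?", value))


print(validate_iso_date("2024-02-29"))    # True  (real leap day)
print(validate_iso_date("2023-02-29"))    # False (not a real date)
print(validate_plain_amount("175.49"))    # True
print(validate_plain_amount("$175.49"))   # False
```

These checks cost microseconds and catch a surprising share of hallucinated values, because fabricated specifics often fail format or calendar rules that real extracted values pass.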
The core insight that makes it all work: treat every LLM output as a hypothesis to be tested, not a fact to be trusted. Build systems that verify before committing, and your ability to reduce LLM hallucinations in production becomes a matter of engineering discipline rather than model luck.
Frequently Asked Questions
How do structured outputs actually reduce hallucinations?
Structured outputs reduce hallucinations by forcing the model to commit to discrete, verifiable values rather than generating continuous prose where errors can hide. When you require a JSON schema with explicit null handling for missing fields, the model can’t bury an invented value inside fluent-sounding text — it either populates the field or it doesn’t. The schema also creates checkpoints you can validate programmatically without a second LLM call.
What’s the performance cost of two-pass verification loops?
A second verification pass using Claude Haiku typically adds 40-60% to your per-request cost and 0.5-1.5 seconds of latency depending on document length. You can minimize this by using a cheaper model for verification (Haiku vs Sonnet/Opus), truncating the document for the verification pass, and only running verification on fields where the primary extraction returned low confidence scores. Async verification via the batch API at 50% discount is a good option for non-realtime workflows.
Can I use these patterns with other LLMs or just Claude?
All of these patterns are model-agnostic. Structured outputs with JSON schema work with GPT-4o (which has native structured output mode via the API), Gemini, and most open-source models via constrained decoding libraries like Outlines or LM Format Enforcer. The confidence gating and verification loop patterns are pure Python and work with any model that can return JSON. The specific prompt phrasing may need adjustment per model.
How do I know what confidence threshold to use for routing to human review?
Don’t guess — calibrate empirically on your actual domain data. Run your extraction pipeline on 100-200 documents where you know the ground truth, record the model’s confidence scores for each field, and look at where errors cluster. In my experience, thresholds are highly domain-specific: 0.75 works for contract dates but you might need 0.9 for medical dosage fields. Check calibration quarterly as your document types evolve.
Does asking Claude to “not hallucinate” in the system prompt actually help?
Marginally, but not enough to rely on. The model isn’t intentionally hallucinating — it genuinely believes its outputs are correct at generation time. Instructing it to be accurate doesn’t change the underlying generation process. What does help are structural constraints: requiring source quotes, explicit null handling for absent values, and scope-bounding instructions that prohibit using training knowledge outside the provided document. Structural constraints outperform exhortations consistently.
What’s the difference between hallucination and confabulation in LLMs?
In practice these terms are used interchangeably, but confabulation technically refers to the model filling gaps in knowledge with plausible-sounding content without any “intent” to deceive — it’s a feature of how generative models work, not a bug in the traditional sense. The distinction matters for mitigation strategy: confabulation is best addressed by reducing the gaps the model needs to fill (via RAG, structured schemas, and source-quoting requirements), while adversarial hallucination — a much rarer edge case — requires different controls.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

