Sunday, April 5

By the end of this tutorial, you’ll have a production-ready Python module that extracts consistent JSON LLM output from Claude, validates it against a Pydantic schema, and automatically repairs malformed responses — without a single manual fix in your pipeline.

Getting JSON out of an LLM sounds trivial until it isn’t. Claude returns markdown-wrapped JSON. GPT-4 adds trailing commas. Your open-source model hallucinates keys that don’t exist in your schema. In production, any of these will silently corrupt your downstream data or crash your pipeline at 2am. The patterns here eliminate that entire class of problem.

  1. Install dependencies — Set up Anthropic SDK, Pydantic v2, and json-repair
  2. Define your output schema — Use Pydantic models as the source of truth
  3. Craft the prompt — Structured instructions that pre-empt common failure modes
  4. Use Claude’s native JSON mode — Force structured output at the API level
  5. Validate with Pydantic — Schema enforcement with meaningful error messages
  6. Implement auto-repair — Handle partial or malformed JSON without failing
  7. Wire it into a retry loop — Full pipeline with exponential backoff

Step 1: Install Dependencies

You need four packages. Keep versions pinned — the Pydantic v1/v2 split has burned too many people.

pip install anthropic==0.30.0 pydantic==2.7.1 json-repair==0.6.1 tenacity==8.3.0

json-repair is the unsung hero here. It handles the most common LLM JSON failure modes: unclosed brackets, trailing commas, single-quoted strings, and truncated responses. tenacity gives you retry logic without writing it by hand — something I covered in depth in the article on LLM fallback and retry logic for production.

Step 2: Define Your Output Schema

Your Pydantic model is the contract. Define it first, derive everything else from it. Don’t write prompt instructions about fields separately — they’ll drift.

from pydantic import BaseModel, Field, field_validator
from typing import Optional
from enum import Enum

class Sentiment(str, Enum):
    positive = "positive"
    negative = "negative"
    neutral = "neutral"

class ReviewAnalysis(BaseModel):
    sentiment: Sentiment
    score: float = Field(ge=0.0, le=1.0, description="Confidence score 0-1")
    key_topics: list[str] = Field(min_length=1, max_length=10)
    summary: str = Field(max_length=200)
    requires_response: bool
    priority: Optional[int] = Field(None, ge=1, le=5)

    @field_validator("summary")
    @classmethod
    def strip_summary(cls, v: str) -> str:
        return v.strip()

def schema_to_prompt_spec(model: type[BaseModel]) -> str:
    """Generate a JSON schema string for use in prompts."""
    import json
    schema = model.model_json_schema()
    return json.dumps(schema, indent=2)

The schema_to_prompt_spec function generates the schema string you’ll embed directly into the prompt. This keeps the prompt and model in sync automatically — change the model, the prompt spec changes too.

Step 3: Craft the Prompt

Most JSON extraction failures happen at the prompt level. The LLM adds prose, wraps output in markdown fences, or makes up fields it thinks you probably want. The fix is being explicit about every constraint.

def build_extraction_prompt(text: str, model: type[BaseModel]) -> str:
    schema_spec = schema_to_prompt_spec(model)
    return f"""Analyze the following customer review and return a JSON object.

CRITICAL RULES:
- Return ONLY valid JSON. No markdown, no code fences, no explanation.
- Do not include any text before or after the JSON object.
- All fields are required unless marked Optional.
- The JSON must exactly match this schema:

{schema_spec}

Customer review to analyze:
<review>
{text}
</review>

Return the JSON object now:"""

The <review> tags are deliberate — they visually separate input data from instructions and keep the model from confusing the review content with the output format. I’d also recommend placing the schema injection at the end of the instructions block, right before the data, so it’s fresh in the model’s context window.

Step 4: Use Claude’s Native JSON Mode

Claude’s API supports a prefill technique — you pre-populate the assistant turn with { to force the response to begin as a JSON object. This alone eliminates about 80% of markdown-wrapping failures.

import anthropic
import json

client = anthropic.Anthropic()

def call_claude_json(prompt: str) -> str:
    """
    Call Claude with assistant prefill to force JSON output.
    Using claude-3-haiku-20240307 for cost (~$0.00025 per 1K input tokens).
    Upgrade to claude-3-5-sonnet for complex schemas.
    """
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": prompt},
            # Prefill the assistant turn — Claude must continue from here
            {"role": "assistant", "content": "{"}
        ]
    )
    # Prepend the prefill back — Claude's response excludes it
    raw = "{" + response.content[0].text
    return raw

Important: Claude’s response text does not include the prefilled content. You must prepend the opening brace yourself. Miss this and every response will arrive without its leading brace, so the JSON fails to parse every time. This trips up everyone the first time.

Step 5: Validate with Pydantic

Parsing is not validation. json.loads() will happily accept {"score": 999} even if your schema says score must be between 0 and 1. Always run the parsed dict through your Pydantic model.

from pydantic import ValidationError

def parse_and_validate(raw_json: str, model: type[BaseModel]) -> tuple[BaseModel | None, str | None]:
    """
    Returns (validated_model, None) on success.
    Returns (None, error_message) on failure.
    """
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as e:
        return None, f"JSON parse error: {e}"

    try:
        validated = model.model_validate(data)
        return validated, None
    except ValidationError as e:
        # Return structured error info — useful for the repair prompt
        errors = e.errors(include_url=False)
        error_summary = "; ".join(
            f"{'.'.join(str(loc) for loc in err['loc'])}: {err['msg']}"
            for err in errors
        )
        return None, f"Schema validation failed: {error_summary}"

The error summary format matters here. You’ll pass it back into Claude in the repair step, so it needs to be readable by the model, not just by you.

Step 6: Implement Auto-Repair

Two-stage repair: first try syntactic repair with json-repair (fixes brackets, commas, quoting). If that fails validation, send Claude a targeted repair prompt with the specific error. This covers ~95% of real-world failures without burning extra tokens on every call.

from json_repair import repair_json

def attempt_repair(
    raw: str,
    error: str,
    model: type[BaseModel],
    original_prompt: str
) -> tuple[BaseModel | None, str | None]:
    
    # Stage 1: syntactic repair
    repaired = repair_json(raw)
    result, err = parse_and_validate(repaired, model)
    if result:
        return result, None

    # Stage 2: ask Claude to fix its own output, including the original
    # task so it has context for any missing or malformed fields
    repair_prompt = f"""The original task was:

{original_prompt}

You returned this JSON, which failed validation:

{raw}

The error was: {error}

The required schema is:
{schema_to_prompt_spec(model)}

Return ONLY the corrected JSON object. No explanation."""

    try:
        fixed_raw = call_claude_json(repair_prompt)
        return parse_and_validate(fixed_raw, model)
    except Exception as e:
        return None, f"Repair attempt failed: {e}"

Stage 2 costs roughly one extra API call — about $0.00025 at Haiku pricing for a typical 500-token repair prompt. Run it only when syntactic repair fails. Over 10,000 calls, your repair rate will probably be under 5%, so the cost impact is negligible.

This pattern is closely related to broader structured output and verification patterns for reducing hallucinations — if you’re building anything that processes real-world documents at scale, that article is worth reading alongside this one.

Step 7: Wire It Into a Retry Loop

Now assemble the full pipeline with tenacity for retry logic. The key design decision: retry on network failures automatically, but surface validation errors explicitly so you can decide whether to re-prompt or fail hard.

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import anthropic

@retry(
    retry=retry_if_exception_type(anthropic.APIConnectionError),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def extract_structured(text: str, output_model: type[BaseModel]) -> BaseModel:
    """
    Full pipeline: prompt -> call -> validate -> repair -> return.
    Raises ValueError if all repair attempts fail.
    """
    prompt = build_extraction_prompt(text, output_model)
    raw = call_claude_json(prompt)

    result, error = parse_and_validate(raw, output_model)
    if result:
        return result

    # Attempt repair before giving up
    result, repair_error = attempt_repair(raw, error, output_model, prompt)
    if result:
        return result

    raise ValueError(
        f"Failed to extract valid {output_model.__name__} after repair. "
        f"Last error: {repair_error}\nRaw output: {raw[:500]}"
    )


# Usage
if __name__ == "__main__":
    review_text = """
    Absolutely terrible experience. The product broke after two days, 
    customer service ignored my emails, and the packaging was damaged.
    Demanding a full refund immediately.
    """

    result = extract_structured(review_text, ReviewAnalysis)
    print(result.model_dump_json(indent=2))

This gives you a clean interface: pass in text and a model class, get back a validated Pydantic object. The caller never sees raw JSON. If you need to handle failures gracefully downstream rather than raising, wrap the call in a try/except and return a sentinel value — don’t swallow the error silently.
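If you do want the sentinel variant, a small wrapper keeps the failure visible without propagating the exception. This is a sketch; safe_extract is a name I'm introducing here, and the stubbed extract_structured stands in for the real pipeline above:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def extract_structured(text: str, output_model: type) -> object:
    # Stand-in for the full pipeline above; raises when repair fails.
    raise ValueError("extraction failed after repair")

def safe_extract(text: str, output_model: type) -> Optional[object]:
    """Return None as an explicit sentinel instead of raising --
    and always log, so the failure is never silent."""
    try:
        return extract_structured(text, output_model)
    except ValueError:
        logger.exception("structured extraction failed for input: %.80s", text)
        return None
```

The logging call is the important part of the design: a None sentinel without a log line is exactly the silent failure the paragraph above warns against.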

Common Errors and How to Fix Them

1. Claude returns JSON wrapped in markdown fences

You’ll see ```json\n{...}\n``` even when you explicitly say not to. The prefill technique from Step 4 prevents this at the API level. If you’re not using prefill, add a pre-processing strip:

import re

def strip_markdown_fences(text: str) -> str:
    # Strips ```json ... ``` and ``` ... ``` blocks
    pattern = r"```(?:json)?\s*([\s\S]*?)\s*```"
    match = re.search(pattern, text)
    return match.group(1) if match else text

2. ValidationError on enum fields

Claude often returns "Positive" (capitalized) when your enum expects "positive". The cleanest fix is case-insensitive validation at the enum level, or pre-processing the dict to lowercase known enum fields before Pydantic sees them.

class Sentiment(str, Enum):
    positive = "positive"
    negative = "negative"
    neutral = "neutral"

    @classmethod
    def _missing_(cls, value):
        # Case-insensitive fallback
        if isinstance(value, str):
            for member in cls:
                if member.value == value.lower():
                    return member
        return None

3. Missing optional fields vs. null vs. absent keys

Claude sometimes omits optional fields entirely instead of returning null. Pydantic v2 handles absent optional fields with defaults, but if you’re passing the dict directly to something that expects explicit null, you’ll get KeyErrors. Always use model.model_dump() rather than the raw dict — it respects defaults and None handling correctly.
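A quick demonstration of that behavior, under the same Pydantic v2 assumptions as the rest of this article (Item is a throwaway model for illustration):

```python
from typing import Optional
from pydantic import BaseModel, Field

class Item(BaseModel):
    name: str
    priority: Optional[int] = Field(None, ge=1, le=5)

# Claude omitted "priority" entirely -- validation still succeeds
item = Item.model_validate({"name": "refund request"})

# The dumped dict carries the key with an explicit None, so
# downstream code that expects the field never hits a KeyError.
dumped = item.model_dump()
```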

What to Build Next

The natural extension is making this schema-aware at the workflow level — generating the Pydantic model itself from a JSON Schema or OpenAPI spec at runtime. This lets you define output contracts in your API layer and auto-derive the extraction prompt and validator without writing Python models by hand. Pair this with Claude’s document processing capabilities for invoice and form extraction, and you have a fully typed, self-validating document processing pipeline.

If you’re routing extraction across multiple models based on complexity (Haiku for simple schemas, Sonnet for nested structures), check out the Claude vs GPT-4 reliability benchmark — the structured output reliability numbers there will inform which model to use at each tier.

Bottom line by reader type: If you’re a solo founder or running a small automation, the prefill + Pydantic pattern from Steps 4-5 gets you 90% of the reliability with minimal code. If you’re building a production pipeline processing thousands of documents, implement the full repair loop and add logging around which requests hit Stage 2 repair — that data will tell you where to tighten your prompts. For consistent JSON LLM output at scale, treating schema validation as a first-class concern — not an afterthought — is what separates pipelines that work from pipelines that need babysitting.

Frequently Asked Questions

Does Claude’s API have a native JSON mode like OpenAI?

Not in exactly the same form. Claude doesn’t have a response_format: {"type": "json_object"} parameter like OpenAI’s API. The closest equivalent is the assistant prefill technique — pre-populating the assistant turn with { — combined with explicit schema instructions in your prompt. Anthropic’s tool use feature can also enforce structured output by defining a tool with a JSON schema and requiring Claude to call it.
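As a hedged sketch of the tool-use route (the tool name record_analysis and this request-building helper are mine, not part of the SDK, and the model here is a trimmed two-field stand-in for the article's ReviewAnalysis), you can derive the tool's input schema straight from the Pydantic model:

```python
from pydantic import BaseModel

class ReviewAnalysis(BaseModel):
    sentiment: str
    summary: str

def build_tool_request(model: type[BaseModel], prompt: str) -> dict:
    """Kwargs for client.messages.create() that force a schema-bound tool call."""
    tool_name = "record_analysis"
    return {
        "model": "claude-3-haiku-20240307",
        "max_tokens": 1024,
        "tools": [{
            "name": tool_name,
            "description": "Record the structured review analysis.",
            "input_schema": model.model_json_schema(),
        }],
        # Require Claude to call this specific tool
        "tool_choice": {"type": "tool", "name": tool_name},
        "messages": [{"role": "user", "content": prompt}],
    }
```

The structured arguments then come back in the response's tool_use content block, already shaped to your schema — though you should still run them through model_validate before trusting them.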

How do I handle JSON that’s too large and gets truncated mid-response?

Set max_tokens high enough that truncation isn’t a risk — calculate your worst-case output size and add 20% headroom. If truncation does happen, json-repair can often close open structures, but the content will be incomplete. A better strategy is to break large extractions into smaller, bounded schemas and make multiple focused calls rather than one giant extraction.
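Two small helpers make both halves of that advice concrete; these are illustrative names of my own, not part of the Anthropic SDK:

```python
def max_tokens_with_headroom(worst_case_output_tokens: int, headroom: float = 0.2) -> int:
    """Worst-case output size plus 20% headroom, per the guidance above."""
    return int(worst_case_output_tokens * (1 + headroom))

def was_truncated(response) -> bool:
    """Anthropic responses report why generation stopped;
    stop_reason == 'max_tokens' means the JSON was likely cut off."""
    return response.stop_reason == "max_tokens"
```

Checking was_truncated before parsing lets you treat a cut-off response as a retry case rather than handing json-repair a half-built object and hoping.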

What’s the difference between using Pydantic for validation vs. just checking json.loads()?

json.loads() only verifies syntax — it’ll accept any valid JSON regardless of whether it matches your expected structure. Pydantic validates types, required fields, value ranges, string lengths, enum membership, and custom rules. Without it, you’ll get runtime AttributeErrors or silent data corruption when Claude returns a field with the wrong type or unexpected values.
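A concrete illustration of the gap, using a throwaway model with the same score constraint as ReviewAnalysis:

```python
import json
from pydantic import BaseModel, Field, ValidationError

class Analysis(BaseModel):
    score: float = Field(ge=0.0, le=1.0)

raw = '{"score": 999}'

data = json.loads(raw)              # syntax check only: passes happily
try:
    Analysis.model_validate(data)
    passed_validation = True
except ValidationError:
    passed_validation = False       # Pydantic catches the out-of-range value
```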

Can I use this same pattern with GPT-4 or open-source models?

Yes — the Pydantic validation and json-repair layers are model-agnostic. The prefill technique is Claude-specific; for OpenAI you’d use response_format: {"type": "json_object"} instead. For open-source models like Llama or Mistral via Ollama, JSON output reliability varies significantly — you’ll want to lean more heavily on the repair stage and consider few-shot examples in your prompt to demonstrate the expected format.
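For reference, the OpenAI-side equivalent is a request parameter rather than a prefill. A sketch, with the helper and model name as illustrative choices of mine (verify the current API before relying on it):

```python
def openai_json_request(prompt: str) -> dict:
    """Kwargs for OpenAI's chat.completions.create() in JSON mode."""
    return {
        "model": "gpt-4o-mini",
        "response_format": {"type": "json_object"},
        "messages": [
            # OpenAI's JSON mode requires the word "JSON" to appear
            # somewhere in the messages, or the API rejects the request.
            {"role": "system", "content": "Return only a valid JSON object."},
            {"role": "user", "content": prompt},
        ],
    }
```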

How much does the repair step cost in practice?

At claude-3-haiku pricing (~$0.00025 per 1K input tokens, $0.00125 per 1K output tokens), a typical repair call with a 500-token prompt costs under $0.001. If your repair rate is 5% across 10,000 calls, that’s 500 extra calls adding roughly $0.50 in repair costs. The real cost is engineering time lost debugging silent failures — the repair loop pays for itself immediately.

Should I use Claude’s tool use feature instead of prompt-based JSON extraction?

Tool use is worth considering for complex schemas because it enforces the schema at the API level — Claude must conform to the tool’s input schema. The tradeoff is slightly higher prompt overhead and more complex setup. For simple, flat schemas, the prefill approach is faster to implement and cheaper. For nested schemas with strict validation requirements, tool use gives you stronger guarantees with less prompt engineering.


Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

