If you’re running Claude at any meaningful scale, you’ve probably opened your Anthropic billing dashboard and felt a small jolt of panic. A few agents, a document pipeline, a customer-facing feature — and suddenly you’re looking at hundreds of dollars a month with a clear upward trajectory. The good news: most teams are leaving 40–60% savings on the table because they haven’t applied the three core techniques that actually move the needle. This article is a precise walkthrough of how to reduce Claude API costs using prompt caching, batch processing, and intelligent model routing — with real numbers attached to each technique so you can prioritize what to implement first.
I’ll rank these by impact, cover the implementation details, and show you how stacking all three can cut your bill by more than half without touching quality in any meaningful way.
| Technique | Typical Savings | Best For | Effort |
|---|---|---|---|
| Prompt Caching | 40–90% on input tokens | Repeated system prompts, RAG context | Low |
| Batch Processing | 50% flat on all tokens | Async document pipelines | Medium |
| Model Downgrading | 80–95% on eligible tasks | Classification, extraction, routing | Medium |
| Output Compression | 20–40% on output tokens | Any generation-heavy workflow | Low |
## 1. Prompt Caching: The Highest-Leverage Change You Can Make Today
Prompt caching is Anthropic’s mechanism for reusing the KV cache from a previous request. When you mark a block of your prompt with cache_control: {"type": "ephemeral"}, the API stores the processed tokens for up to 5 minutes (extended with repeated use). Subsequent requests that hit that cache pay 10% of the normal input token price — a 90% discount on those tokens.
Cache writes cost 25% more than the standard input price, but the premium is small: one write plus one hit costs 1.35× the per-request price (1.25 + 0.10) versus 2× uncached, so a single cache hit already puts you ahead. In practice, most agents cross that threshold within the first few requests of a session.
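That break-even is easy to sanity-check with a few lines, using the pricing multipliers above (Sonnet-class input pricing assumed for illustration):

```python
STANDARD = 3.00                  # $/M input tokens (Sonnet-class pricing)
CACHE_WRITE = STANDARD * 1.25    # 25% premium on the initial cache write
CACHE_READ = STANDARD * 0.10     # 90% discount on every subsequent hit

def cost_per_prompt(requests: int, cached: bool) -> float:
    """Input cost (in $/M-token units) for `requests` uses of one prompt prefix."""
    if not cached:
        return STANDARD * requests
    # First request writes the cache; every later request reads it
    return CACHE_WRITE + CACHE_READ * (requests - 1)

# With just 2 requests, caching already wins: 3.75 + 0.30 vs 6.00 uncached
assert cost_per_prompt(2, cached=True) < cost_per_prompt(2, cached=False)
```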
### Where cache hits actually accumulate
The biggest wins come from three places: long system prompts, RAG document chunks injected into every request, and few-shot examples. If your system prompt is 2,000 tokens and you’re making 1,000 calls per day on Sonnet 4.5 (which costs $3/million input tokens), that’s $6/day in uncached system prompt cost. With caching and a realistic 80% hit rate, you pay about $1.98/day on that component (20% of calls at the $3.75/M write price, 80% at the $0.30/M read price), saving roughly $4/day, or about $120/month, from one prompt block.
```python
import anthropic

client = anthropic.Anthropic()

# Your long system prompt or document context — cache this
SYSTEM_PROMPT = """You are a senior financial analyst specialising in SaaS metrics...
[2000+ tokens of instructions, context, and examples here]
"""

def call_with_cache(user_message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # This tells the API: cache everything up to this point
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {"role": "user", "content": user_message}
        ]
    )
    # Check cache performance in the response
    usage = response.usage
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
    return response.content[0].text
```
One important caveat: the cache is per-model and per-account, and it expires after 5 minutes of inactivity. For long-running pipelines with gaps, implement a keep-alive pattern — a lightweight dummy request every 4 minutes to reset the TTL. It costs ~$0.0001 per ping and saves the full cache re-write cost.
For RAG workflows, cache your retrieved document chunks before the user query, not after. The cache breakpoint must come before the dynamic content. If you’re building document processing pipelines, our RAG pipeline implementation guide covers how to structure your context windows for both quality and cost efficiency.
## 2. Batch Processing: A Guaranteed 50% Discount With One Architectural Change
The Anthropic Batch API gives you a flat 50% discount on all tokens — input and output — in exchange for async processing with up to 24-hour turnaround. This isn’t a trick or an edge case. It’s a first-class API designed for exactly this use case.
The constraint is that you cannot use it for anything requiring a synchronous response. User-facing chat: no. Background document processing: yes. Nightly report generation: yes. Bulk classification of last month’s support tickets: absolutely yes.
### What the numbers look like at scale
Processing 10,000 documents with an average of 500 input tokens and 200 output tokens each on Claude Sonnet 4.5:
- Standard API: 5M input tokens × $3 + 2M output × $15 = $15 + $30 = $45
- Batch API: Same token counts at 50% = $22.50
That’s $22.50 saved on a single batch job. Run that pipeline daily and you’re saving $675/month from one change.
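That arithmetic generalizes to any job size; a tiny helper using per-million prices makes it easy to estimate before committing to a migration:

```python
def job_cost(n_docs: int, in_tokens: int, out_tokens: int,
             in_price: float, out_price: float, batch: bool = False) -> float:
    """Estimated job cost in dollars; prices are $/million tokens."""
    total = ((n_docs * in_tokens / 1e6) * in_price
             + (n_docs * out_tokens / 1e6) * out_price)
    # Batch API applies a flat 50% discount to all tokens
    return total * 0.5 if batch else total

# The 10,000-document example from above
standard = job_cost(10_000, 500, 200, 3.00, 15.00)                # $45.00
discounted = job_cost(10_000, 500, 200, 3.00, 15.00, batch=True)  # $22.50
```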
```python
import anthropic
import time

client = anthropic.Anthropic()

def submit_batch(documents: list[dict]) -> str:
    """Submit documents for batch processing. Returns batch_id."""
    requests = []
    for i, doc in enumerate(documents):
        requests.append({
            "custom_id": f"doc-{i}-{doc['id']}",  # Use for matching results later
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 256,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Classify this support ticket. Return JSON with 'category' and 'priority'.\n\n{doc['text']}"
                    }
                ]
            }
        })
    batch = client.messages.batches.create(requests=requests)
    print(f"Batch submitted: {batch.id} — {len(requests)} requests")
    return batch.id

def poll_and_collect(batch_id: str) -> list[dict]:
    """Poll until complete, then collect results."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        print(f"Status: {batch.processing_status} — "
              f"{batch.request_counts.processing} processing, "
              f"{batch.request_counts.succeeded} done")
        time.sleep(60)  # Poll every minute — don't hammer the API
    # Collect results
    results = []
    for result in client.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            results.append({
                "id": result.custom_id,
                "output": result.result.message.content[0].text
            })
    return results
```
We’ve written a deeper implementation guide covering error handling, retry logic for failed batch items, and throughput optimization in our article on batch processing workflows with Claude API. If you’re handling 10,000+ documents regularly, that’s required reading.
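One building block for that retry logic is partitioning results by outcome before resubmitting. The sketch below operates on results as plain dicts for clarity; the real SDK returns typed objects with the same `custom_id` and `result.type` fields, and the set of retryable types is my assumption:

```python
def partition_results(results: list[dict]) -> tuple[list[dict], list[str]]:
    """Split batch results into successes and custom_ids worth retrying.

    Treats 'errored' and 'expired' items as retryable; 'canceled' is not.
    Each result is assumed to look like {"custom_id": ..., "result": {"type": ...}}.
    """
    succeeded, retry_ids = [], []
    for r in results:
        rtype = r["result"]["type"]
        if rtype == "succeeded":
            succeeded.append(r)
        elif rtype in {"errored", "expired"}:
            retry_ids.append(r["custom_id"])
    return succeeded, retry_ids
```

Feed the returned `retry_ids` back into a fresh batch submission, ideally with a capped retry count.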
## 3. Model Routing: Haiku vs Sonnet vs Opus — Pick the Right Tool
This is where the biggest absolute dollar savings live, but it requires the most discipline. The price gap between models is enormous:
- Claude Haiku 3.5: $0.80/M input, $4/M output
- Claude Sonnet 4.5: $3/M input, $15/M output
- Claude Opus 4: $15/M input, $75/M output
Sonnet costs 3.75× more than Haiku. Opus costs nearly 19× more than Haiku. If you’re running Sonnet for tasks that Haiku handles identically — and there are many — you are paying a 3.75× premium for nothing.
### Tasks where Haiku matches Sonnet in practice
After running both models on production workloads across multiple projects, I’d confidently route these to Haiku without hesitation:
- Binary or multi-class classification (sentiment, intent, category)
- Structured data extraction from well-formed documents
- Entity extraction from short texts
- Routing/triage decisions in agent pipelines
- Simple template-based generation with constrained outputs
- JSON formatting of already-extracted data
Keep Sonnet for tasks requiring reasoning, nuanced writing, complex code generation, or multi-step problem solving. See our Claude vs GPT-4 code generation benchmarks for a sense of where capability gaps actually matter in practice.
### Implementing a model router
```python
from enum import Enum
from dataclasses import dataclass

class TaskComplexity(Enum):
    SIMPLE = "simple"        # → Haiku
    STANDARD = "standard"    # → Sonnet
    COMPLEX = "complex"      # → Opus, reserved for critical reasoning

@dataclass
class ModelConfig:
    model_id: str
    cost_per_m_input: float
    cost_per_m_output: float

MODEL_CONFIGS = {
    TaskComplexity.SIMPLE: ModelConfig(
        "claude-3-5-haiku-latest", 0.80, 4.00
    ),
    TaskComplexity.STANDARD: ModelConfig(
        "claude-sonnet-4-5", 3.00, 15.00
    ),
    TaskComplexity.COMPLEX: ModelConfig(
        "claude-opus-4-1", 15.00, 75.00
    ),
}

def classify_task(task_type: str, context_length: int, requires_reasoning: bool) -> TaskComplexity:
    """Route tasks to the appropriate model tier."""
    # Simple tasks: short context, no reasoning, structured output
    if (task_type in {"classify", "extract", "route", "format"}
            and context_length < 2000
            and not requires_reasoning):
        return TaskComplexity.SIMPLE
    # Everything else defaults to Sonnet; promote to COMPLEX (Opus)
    # sparingly, only where reasoning failures are genuinely expensive
    return TaskComplexity.STANDARD

def get_model_for_task(task_type: str, context_length: int, requires_reasoning: bool = False) -> str:
    complexity = classify_task(task_type, context_length, requires_reasoning)
    return MODEL_CONFIGS[complexity].model_id
```
The practical approach: audit your current API logs, group calls by task type, and estimate what percentage could drop to Haiku. In most document processing pipelines I’ve seen, 60–70% of calls are eligible. At that ratio, your blended input cost drops from $3/M to around $1.56/M — a 48% reduction on input alone, without touching output costs or adding any caching.
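You can reproduce that blended-rate estimate for your own eligibility ratio; the 65% figure below is simply the midpoint of the 60–70% range above:

```python
HAIKU_IN, SONNET_IN = 0.80, 3.00   # $/M input tokens

def blended_input_price(haiku_fraction: float) -> float:
    """Average $/M input price when a fraction of calls drop to Haiku."""
    return haiku_fraction * HAIKU_IN + (1 - haiku_fraction) * SONNET_IN

# At 65% Haiku eligibility the blended price is about $1.57/M,
# roughly a 48% reduction from all-Sonnet input pricing
```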
## 4. Output Token Compression: The Underrated Lever
Output tokens cost 5× more than input tokens on Sonnet. A response that runs 500 tokens when 150 would suffice is costing you 3.3× what it needs to. The fix is explicit prompt instruction — models comply reliably when told directly.
Add this to your system prompt for any task where you don’t need verbose prose:
“Be concise. Return only what is asked. Do not add preamble, explanation, or closing remarks unless explicitly requested. For structured tasks, return only the requested format.”
Pair this with max_tokens enforcement. Setting max_tokens=256 on a classification task doesn’t just cap runaway outputs — it signals to the model that brevity is expected, which tends to tighten the whole response.
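In practice that means baking both levers into every request. A minimal sketch; the helper function and its defaults are illustrative conventions, not SDK features:

```python
CONCISENESS_SUFFIX = (
    "\n\nBe concise. Return only what is asked. Do not add preamble, "
    "explanation, or closing remarks unless explicitly requested."
)

def build_concise_request(model: str, system: str, user: str,
                          max_tokens: int = 256) -> dict:
    """Assemble kwargs for client.messages.create() with brevity enforced."""
    return {
        "model": model,
        "max_tokens": max_tokens,   # hard cap on output spend
        "system": system + CONCISENESS_SUFFIX,
        "messages": [{"role": "user", "content": user}],
    }

# Usage: client.messages.create(**build_concise_request(
#     "claude-sonnet-4-5", "You classify support tickets.", ticket_text))
```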
For JSON extraction tasks, specifying the exact schema in your prompt (rather than describing it loosely) reduces both token count and the likelihood of hallucinated fields. If you’re building extraction pipelines that need to stay grounded, our guide on reducing LLM hallucinations in production covers structured output patterns that work at scale.
## Stacking All Four: What the Combined Savings Look Like
Let’s take a realistic production pipeline: 50,000 document classification calls per day, each with a 1,500-token system prompt and 300-token document, returning a 50-token JSON classification. Currently running on Sonnet with no caching.
Baseline daily cost (Sonnet, no optimizations):
- Input: 50,000 × 1,800 tokens = 90M tokens × $3/M = $270
- Output: 50,000 × 50 tokens = 2.5M tokens × $15/M = $37.50
- Total: $307.50/day (~$9,225/month)
After all four optimizations:
- Switch to Haiku (classification task, qualifies): input now $0.80/M, output $4/M
- Apply batch API: 50% discount across the board
- Cache the 1,500-token system prompt at 85% hit rate: ~$0.08/M for cache reads
- Tighten output prompt: reduce average output to 30 tokens
- Input (dynamic, 300 tokens): 50,000 × 300 = 15M × $0.80/M × 0.5 (batch) = $6
- Input (cached, 1,500 tokens, 75M total): 15% at the $1/M write price × 0.5 (batch) + 85% at the $0.08/M read price × 0.5 = $5.63 + $2.55 = $8.18
- Output: 50,000 × 30 tokens = 1.5M × $4/M × 0.5 (batch) = $3
- Total: ~$17.18/day (~$515/month)
That’s a 94% cost reduction. Even at half that efficiency — if only some tasks qualify for Haiku, or your cache hit rate is lower — you’re still looking at 80%+ savings. The math is brutal in your favor once you stack the techniques.
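If you want to rerun this estimate with your own volumes, hit rates, and prices, the whole stack fits in one function (a sketch using the pricing multipliers assumed throughout this article):

```python
def stacked_daily_cost(calls: int, cached_tokens: int, dynamic_tokens: int,
                       out_tokens: int, hit_rate: float,
                       in_price: float, out_price: float,
                       batch_discount: float = 0.5) -> float:
    """Daily cost with caching + batch + concise outputs; prices in $/M tokens."""
    m = calls / 1e6
    write = m * cached_tokens * (1 - hit_rate) * in_price * 1.25  # cache writes
    read = m * cached_tokens * hit_rate * in_price * 0.10         # cache reads
    dynamic = m * dynamic_tokens * in_price                       # uncached input
    output = m * out_tokens * out_price
    return (write + read + dynamic + output) * batch_discount

# e.g. stacked_daily_cost(50_000, 1500, 300, 30, 0.85, 0.80, 4.00)
```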
## When Each Strategy Fails (And What to Do About It)
Prompt caching breaks down when your prompts are highly dynamic — if your “system prompt” changes meaningfully per user, there’s nothing stable to cache. The fix is to separate the static instructions from the dynamic personalization and only cache the static block.
Batch processing is useless for latency-sensitive workloads. Don’t batch anything a user is waiting on. And be aware that the 24-hour SLA is a ceiling, not a typical time — most batches complete in under 2 hours, but you can’t promise that to a user.
Model downgrading fails when you under-estimate task complexity. A classification task that seems simple but requires understanding subtle domain context might degrade noticeably on Haiku. Always A/B test on a representative sample of real production data before switching models wholesale. For complex multi-step agent workflows where model failures cascade, see our patterns for LLM fallback and retry logic — model routing and fallback logic work well together.
## Bottom Line: Who Should Do What First
Solo founder or small team with a tight budget: Start with model routing immediately — no infrastructure changes required, just change the model string and run an eval. Then add prompt caching to your largest system prompt. You can realistically cut costs by 60–70% in a single afternoon.
Team running document/data pipelines: Batch API first. It’s a guaranteed 50% discount with minimal code changes, and it’s designed exactly for your use case. Then layer in Haiku for your simpler pipeline stages.
Enterprise with high-volume, latency-sensitive workloads: Invest in a proper model routing layer. Classify task complexity programmatically, route to Haiku/Sonnet/Opus based on requirements, and cache aggressively. The engineering effort pays back within a few billing cycles.
The techniques to reduce Claude API costs aren’t experimental — they’re documented, production-ready, and Anthropic actively wants you to use them. The only reason most teams haven’t applied them is they haven’t prioritized the audit. Start with your largest cost line item from your billing dashboard, apply the relevant technique, and measure the delta. It compounds fast.
## Frequently Asked Questions
### How does Claude prompt caching work and what’s the minimum token threshold?
Claude’s prompt caching requires a minimum of 1,024 tokens in the cacheable block (2,048 for Haiku models). You mark the cache boundary with cache_control: {"type": "ephemeral"} at the end of the content block you want cached. The cache persists for 5 minutes from last use, so repeated calls within that window pay only 10% of the normal input token price for those tokens.
### Can I use the Batch API for real-time user-facing features?
No — the Batch API is strictly asynchronous with up to 24 hours turnaround, making it unsuitable for anything a user is actively waiting on. It’s designed for background jobs: nightly processing, bulk analysis, document ingestion pipelines, and offline enrichment tasks. For user-facing workflows, use the standard synchronous API with caching applied.
### What tasks can I safely move from Claude Sonnet to Haiku without quality loss?
Classification, intent detection, structured data extraction from well-formatted documents, entity extraction from short texts, JSON formatting, and simple routing decisions all perform near-identically on Haiku vs Sonnet in my testing. The tasks where Sonnet clearly outperforms include complex multi-step reasoning, nuanced writing, intricate code generation, and anything requiring deep inference across long contexts. Always eval on real production samples before switching.
### Can I combine prompt caching and the Batch API at the same time?
Yes, and this is one of the highest-leverage combinations available. Batch API requests support the same cache_control syntax as standard API requests. Within a batch, if multiple requests share the same cacheable prefix (e.g., the same system prompt), cache hits apply. This means you get the 50% batch discount plus the 90% input discount on cached tokens simultaneously.
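Concretely, each batch entry carries the same system block with cache_control so the shared prefix is cached across the whole batch. A sketch of the request shape; the classifier prompt and naming are illustrative:

```python
SHARED_SYSTEM = [{
    "type": "text",
    "text": "You are a support-ticket classifier. [long shared instructions here]",
    "cache_control": {"type": "ephemeral"},  # shared prefix, cached across the batch
}]

def batch_request(i: int, ticket_text: str) -> dict:
    """One batch entry; every entry shares the cacheable system prefix."""
    return {
        "custom_id": f"ticket-{i}",
        "params": {
            "model": "claude-sonnet-4-5",
            "max_tokens": 128,
            "system": SHARED_SYSTEM,
            "messages": [{"role": "user", "content": ticket_text}],
        },
    }

# requests = [batch_request(i, t) for i, t in enumerate(tickets)]
# client.messages.batches.create(requests=requests)
```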
### How do I measure whether my prompt caching is actually working?
Check the usage object in the API response — it includes cache_creation_input_tokens (tokens written to cache, billed at 125% of standard) and cache_read_input_tokens (tokens served from cache, billed at 10% of standard). If you’re seeing zero cache reads after the first request, your cache boundary is likely positioned incorrectly — it must be placed after the stable content, before the dynamic content.
### Does switching models affect my existing prompts and outputs?
Yes, and this is the main risk. Haiku follows instructions well but can be more literal and less forgiving of ambiguous prompts. If your Sonnet prompts rely on the model inferring unstated context or handling edge cases gracefully, you may need to make them more explicit for Haiku. Run a structured eval on 100–200 real examples before migrating any production workload to a smaller model.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

