If you’re running Claude at any meaningful scale, you’ve probably opened your Anthropic billing dashboard and felt a small jolt of panic. A few agents, a document pipeline, a customer-facing feature — and suddenly you’re looking at hundreds of dollars a month with a clear upward trajectory. The good news: most teams are leaving 40–60% savings on the table because they haven’t applied the three core techniques that actually move the needle. This article is a precise walkthrough of how to reduce Claude API costs using prompt caching, batch processing, and intelligent model routing — with real numbers attached to each technique so you can prioritize what to implement first.
I’ll rank these by impact, cover the implementation details, and show you how stacking all three can cut your bill by more than half without touching quality in any meaningful way.
| Technique | Typical Savings | Best For | Effort |
|---|---|---|---|
| Prompt Caching | 40–90% on input tokens | Repeated system prompts, RAG context | Low |
| Batch Processing | 50% flat on all tokens | Async document pipelines | Medium |
| Model Downgrading | 80–95% on eligible tasks | Classification, extraction, routing | Medium |
| Output Compression | 20–40% on output tokens | Any generation-heavy workflow | Low |
## 1. Prompt Caching: The Highest-Leverage Change You Can Make Today
Prompt caching is Anthropic’s mechanism for reusing the KV cache from a previous request. When you mark a block of your prompt with cache_control: {"type": "ephemeral"}, the API stores the processed tokens for up to 5 minutes (extended with repeated use). Subsequent requests that hit that cache pay 10% of the normal input token price — a 90% discount on those tokens.
Cache writes cost 25% more than the standard input price, but the premium is small: one write plus one hit costs 1.35× the per-request price (1.25 + 0.10) versus 2× uncached, so a single cache hit already puts you ahead. In practice, most agents cross that threshold within the first few requests of a session.
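That break-even is easy to sanity-check with a few lines, using the pricing multipliers above (Sonnet-class input pricing assumed for illustration):

```python
STANDARD = 3.00                  # $/M input tokens (Sonnet-class pricing)
CACHE_WRITE = STANDARD * 1.25    # 25% premium on the initial cache write
CACHE_READ = STANDARD * 0.10     # 90% discount on every subsequent hit

def cost_per_prompt(requests: int, cached: bool) -> float:
    """Input cost (in $/M-token units) for `requests` uses of one prompt prefix."""
    if not cached:
        return STANDARD * requests
    # First request writes the cache; every later request reads it
    return CACHE_WRITE + CACHE_READ * (requests - 1)

# With just 2 requests, caching already wins: 3.75 + 0.30 vs 6.00 uncached
assert cost_per_prompt(2, cached=True) < cost_per_prompt(2, cached=False)
```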
### Where cache hits actually accumulate
The biggest wins come from three places: long system prompts, RAG document chunks injected into every request, and few-shot examples. If your system prompt is 2,000 tokens and you’re making 1,000 calls per day on Sonnet 4.5 (which costs $3/million input tokens), that’s $6/day in uncached system prompt cost. With caching and a realistic 80% hit rate, you pay about $1.98/day on that component (20% of calls at the $3.75/M write price, 80% at the $0.30/M read price), saving roughly $4/day, or about $120/month, from one prompt block.
```python
import anthropic

client = anthropic.Anthropic()

# Your long system prompt or document context — cache this
SYSTEM_PROMPT = """You are a senior financial analyst specialising in SaaS metrics...
[2000+ tokens of instructions, context, and examples here]
"""

def call_with_cache(user_message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # This tells the API: cache everything up to this point
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {"role": "user", "content": user_message}
        ]
    )
    # Check cache performance in the response
    usage = response.usage
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
    return response.content[0].text
```
One important caveat: the cache is per-model and per-account, and it expires after 5 minutes of inactivity. For long-running pipelines with gaps, implement a keep-alive pattern — a lightweight dummy request every 4 minutes to reset the TTL. It costs ~$0.0001 per ping and saves the full cache re-write cost.
For RAG workflows, cache your retrieved document chunks before the user query, not after. The cache breakpoint must come before the dynamic content. If you’re building document processing pipelines, our RAG pipeline implementation guide covers how to structure your context windows for both quality and cost efficiency.
## 2. Batch Processing: A Guaranteed 50% Discount With One Architectural Change
The Anthropic Batch API gives you a flat 50% discount on all tokens — input and output — in exchange for async processing with up to 24-hour turnaround. This isn’t a trick or an edge case. It’s a first-class API designed for exactly this use case.
The constraint is that you cannot use it for anything requiring a synchronous response. User-facing chat: no. Background document processing: yes. Nightly report generation: yes. Bulk classification of last month’s support tickets: absolutely yes.
### What the numbers look like at scale
Processing 10,000 documents with an average of 500 input tokens and 200 output tokens each on Claude Sonnet 4.5:
- Standard API: 5M input tokens × $3 + 2M output × $15 = $15 + $30 = $45
- Batch API: Same token counts at 50% = $22.50
That’s $22.50 saved on a single batch job. Run that pipeline daily and you’re saving $675/month from one change.
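That arithmetic generalizes to any job size; a tiny helper using per-million prices makes it easy to estimate before committing to a migration:

```python
def job_cost(n_docs: int, in_tokens: int, out_tokens: int,
             in_price: float, out_price: float, batch: bool = False) -> float:
    """Estimated job cost in dollars; prices are $/million tokens."""
    total = ((n_docs * in_tokens / 1e6) * in_price
             + (n_docs * out_tokens / 1e6) * out_price)
    # Batch API applies a flat 50% discount to all tokens
    return total * 0.5 if batch else total

# The 10,000-document example from above
standard = job_cost(10_000, 500, 200, 3.00, 15.00)                # $45.00
discounted = job_cost(10_000, 500, 200, 3.00, 15.00, batch=True)  # $22.50
```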
```python
import anthropic
import time

client = anthropic.Anthropic()

def submit_batch(documents: list[dict]) -> str:
    """Submit documents for batch processing. Returns batch_id."""
    requests = []
    for i, doc in enumerate(documents):
        requests.append({
            "custom_id": f"doc-{i}-{doc['id']}",  # Use for matching results later
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 256,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Classify this support ticket. Return JSON with 'category' and 'priority'.\n\n{doc['text']}"
                    }
                ]
            }
        })
    batch = client.messages.batches.create(requests=requests)
    print(f"Batch submitted: {batch.id} — {len(requests)} requests")
    return batch.id

def poll_and_collect(batch_id: str) -> list[dict]:
    """Poll until complete, then collect results."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        print(f"Status: {batch.processing_status} — "
              f"{batch.request_counts.processing} processing, "
              f"{batch.request_counts.succeeded} done")
        time.sleep(60)  # Poll every minute — don't hammer the API
    # Collect results
    results = []
    for result in client.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            results.append({
                "id": result.custom_id,
                "output": result.result.message.content[0].text
            })
    return results
```
We’ve written a deeper implementation guide covering error handling, retry logic for failed batch items, and throughput optimization in our article on batch processing workflows with Claude API. If you’re handling 10,000+ documents regularly, that’s required reading.
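One building block for that retry logic is partitioning results by outcome before resubmitting. The sketch below operates on results as plain dicts for clarity; the real SDK returns typed objects with the same `custom_id` and `result.type` fields, and the set of retryable types is my assumption:

```python
def partition_results(results: list[dict]) -> tuple[list[dict], list[str]]:
    """Split batch results into successes and custom_ids worth retrying.

    Treats 'errored' and 'expired' items as retryable; 'canceled' is not.
    Each result is assumed to look like {"custom_id": ..., "result": {"type": ...}}.
    """
    succeeded, retry_ids = [], []
    for r in results:
        rtype = r["result"]["type"]
        if rtype == "succeeded":
            succeeded.append(r)
        elif rtype in {"errored", "expired"}:
            retry_ids.append(r["custom_id"])
    return succeeded, retry_ids
```

Feed the returned `retry_ids` back into a fresh batch submission, ideally with a capped retry count.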
## 3. Model Routing: Haiku vs Sonnet vs Opus — Pick the Right Tool
This is where the biggest absolute dollar savings live, but it requires the most discipline. The price gap between models is enormous:
- Claude Haiku 3.5: $0.80/M input, $4/M output
- Claude Sonnet 4.5: $3/M input, $15/M output
- Claude Opus 4: $15/M input, $75/M output
Sonnet costs 3.75× more than Haiku. Opus costs nearly 19× more than Haiku. If you’re running Sonnet for tasks that Haiku handles identically — and there are many — you are paying a 3.75× premium for nothing.
### Tasks where Haiku matches Sonnet in practice
After running both models on production workloads across multiple projects, I’d confidently route these to Haiku without hesitation:
- Binary or multi-class classification (sentiment, intent, category)
- Structured data extraction from well-formed documents
- Entity extraction from short texts
- Routing/triage decisions in agent pipelines
- Simple template-based generation with constrained outputs
- JSON formatting of already-extracted data
Keep Sonnet for tasks requiring reasoning, nuanced writing, complex code generation, or multi-step problem solving. See our Claude vs GPT-4 code generation benchmarks for a sense of where capability gaps actually matter in practice.
### Implementing a model router
```python
from enum import Enum
from dataclasses import dataclass

class TaskComplexity(Enum):
    SIMPLE = "simple"        # → Haiku
    STANDARD = "standard"    # → Sonnet
    COMPLEX = "complex"      # → Opus, reserved for critical reasoning

@dataclass
class ModelConfig:
    model_id: str
    cost_per_m_input: float
    cost_per_m_output: float

MODEL_CONFIGS = {
    TaskComplexity.SIMPLE: ModelConfig(
        "claude-3-5-haiku-latest", 0.80, 4.00
    ),
    TaskComplexity.STANDARD: ModelConfig(
        "claude-sonnet-4-5", 3.00, 15.00
    ),
    TaskComplexity.COMPLEX: ModelConfig(
        "claude-opus-4-1", 15.00, 75.00
    ),
}

def classify_task(task_type: str, context_length: int, requires_reasoning: bool) -> TaskComplexity:
    """Route tasks to the appropriate model tier."""
    # Simple tasks: short context, no reasoning, structured output
    if (task_type in {"classify", "extract", "route", "format"}
            and context_length < 2000
            and not requires_reasoning):
        return TaskComplexity.SIMPLE
    # Everything else defaults to Sonnet; promote to COMPLEX (Opus)
    # sparingly, only where reasoning failures are genuinely expensive
    return TaskComplexity.STANDARD

def get_model_for_task(task_type: str, context_length: int, requires_reasoning: bool = False) -> str:
    complexity = classify_task(task_type, context_length, requires_reasoning)
    return MODEL_CONFIGS[complexity].model_id
```
The practical approach: audit your current API logs, group calls by task type, and estimate what percentage could drop to Haiku. In most document processing pipelines I’ve seen, 60–70% of calls are eligible. At that ratio, your blended input cost drops from $3/M to around $1.56/M — a 48% reduction on input alone, without touching output costs or adding any caching.
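You can reproduce that blended-rate estimate for your own eligibility ratio; the 65% figure below is simply the midpoint of the 60–70% range above:

```python
HAIKU_IN, SONNET_IN = 0.80, 3.00   # $/M input tokens

def blended_input_price(haiku_fraction: float) -> float:
    """Average $/M input price when a fraction of calls drop to Haiku."""
    return haiku_fraction * HAIKU_IN + (1 - haiku_fraction) * SONNET_IN

# At 65% Haiku eligibility the blended price is about $1.57/M,
# roughly a 48% reduction from all-Sonnet input pricing
```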
## 4. Output Token Compression: The Underrated Lever
Output tokens cost 5× more than input tokens on Sonnet. A response that runs 500 tokens when 150 would suffice is costing you 3.3× what it needs to. The fix is explicit prompt instruction — models comply reliably when told directly.
Add this to your system prompt for any task where you don’t need verbose prose:
“Be concise. Return only what is asked. Do not add preamble, explanation, or closing remarks unless explicitly requested. For structured tasks, return only the requested format.”
Pair this with max_tokens enforcement. Setting max_tokens=256 on a classification task doesn’t just cap runaway outputs — it signals to the model that brevity is expected, which tends to tighten the whole response.
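In practice that means baking both levers into every request. A minimal sketch; the helper function and its defaults are illustrative conventions, not SDK features:

```python
CONCISENESS_SUFFIX = (
    "\n\nBe concise. Return only what is asked. Do not add preamble, "
    "explanation, or closing remarks unless explicitly requested."
)

def build_concise_request(model: str, system: str, user: str,
                          max_tokens: int = 256) -> dict:
    """Assemble kwargs for client.messages.create() with brevity enforced."""
    return {
        "model": model,
        "max_tokens": max_tokens,   # hard cap on output spend
        "system": system + CONCISENESS_SUFFIX,
        "messages": [{"role": "user", "content": user}],
    }

# Usage: client.messages.create(**build_concise_request(
#     "claude-sonnet-4-5", "You classify support tickets.", ticket_text))
```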
For JSON extraction tasks, specifying the exact schema in your prompt (rather than describing it loosely) reduces both token count and the likelihood of hallucinated fields. If you’re building extraction pipelines that need to stay grounded, our guide on reducing LLM hallucinations in production covers structured output patterns that work at scale.
## Stacking All Four: What the Combined Savings Look Like
Let’s take a realistic production pipeline: 50,000 document classification calls per day, each with a 1,500-token system prompt and 300-token document, returning a 50-token JSON classification. Currently running on Sonnet with no caching.
Baseline daily cost (Sonnet, no optimizations):
- Input: 50,000 × 1,800 tokens = 90M tokens × $3/M = $270
- Output: 50,000 × 50 tokens = 2.5M tokens × $15/M = $37.50
- Total: $307.50/day (~$9,225/month)
After all four optimizations:
- Switch to Haiku (classification task, qualifies): input now $0.80/M, output $4/M
- Apply batch API: 50% discount across the board
- Cache the 1,500-token system prompt at 85% hit rate: ~$0.08/M for cache reads
- Tighten output prompt: reduce average output to 30 tokens
- Input (dynamic, 300 tokens): 50,000 × 300 = 15M × $0.80/M × 0.5 (batch) = $6
- Input (cached, 1,500 tokens, 75M total): 15% at the $1/M write price × 0.5 (batch) + 85% at the $0.08/M read price × 0.5 = $5.63 + $2.55 = $8.18
- Output: 50,000 × 30 tokens = 1.5M × $4/M × 0.5 (batch) = $3
- Total: ~$17.18/day (~$515/month)
That’s a 94% cost reduction. Even at half that efficiency — if only some tasks qualify for Haiku, or your cache hit rate is lower — you’re still looking at 80%+ savings. The math is brutal in your favor once you stack the techniques.
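If you want to rerun this estimate with your own volumes, hit rates, and prices, the whole stack fits in one function (a sketch using the pricing multipliers assumed throughout this article):

```python
def stacked_daily_cost(calls: int, cached_tokens: int, dynamic_tokens: int,
                       out_tokens: int, hit_rate: float,
                       in_price: float, out_price: float,
                       batch_discount: float = 0.5) -> float:
    """Daily cost with caching + batch + concise outputs; prices in $/M tokens."""
    m = calls / 1e6
    write = m * cached_tokens * (1 - hit_rate) * in_price * 1.25  # cache writes
    read = m * cached_tokens * hit_rate * in_price * 0.10         # cache reads
    dynamic = m * dynamic_tokens * in_price                       # uncached input
    output = m * out_tokens * out_price
    return (write + read + dynamic + output) * batch_discount

# e.g. stacked_daily_cost(50_000, 1500, 300, 30, 0.85, 0.80, 4.00)
```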
## When Each Strategy Fails (And What to Do About It)
Prompt caching breaks down when your prompts are highly dynamic — if your “system prompt” changes meaningfully per user, there’s nothing stable to cache. The fix is to separate the static instructions from the dynamic personalization and only cache the static block.
Batch processing is useless for latency-sensitive workloads. Don’t batch anything a user is waiting on. And be aware that the 24-hour SLA is a ceiling, not a typical time — most batches complete in under 2 hours, but you can’t promise that to a user.
Model downgrading fails when you under-estimate task complexity. A classification task that seems simple but requires understanding subtle domain context might degrade noticeably on Haiku. Always A/B test on a representative sample of real production data before switching models wholesale. For complex multi-step agent workflows where model failures cascade, see our patterns for LLM fallback and retry logic — model routing and fallback logic work well together.
## Bottom Line: Who Should Do What First
Solo founder or small team with a tight budget: Start with model routing immediately — no infrastructure changes required, just change the model string and run an eval. Then add prompt caching to your largest system prompt. You can realistically cut costs by 60–70% in a single afternoon.
Team running document/data pipelines: Batch API first. It’s a guaranteed 50% discount with minimal code changes, and it’s designed exactly for your use case. Then layer in Haiku for your simpler pipeline stages.
Enterprise with high-volume, latency-sensitive workloads: Invest in a proper model routing layer. Classify task complexity programmatically, route to Haiku/Sonnet/Opus based on requirements, and cache aggressively. The engineering effort pays back within a few billing cycles.
The techniques to reduce Claude API costs aren’t experimental — they’re documented, production-ready, and Anthropic actively wants you to use them. The only reason most teams haven’t applied them is they haven’t prioritized the audit. Start with your largest cost line item from your billing dashboard, apply the relevant technique, and measure the delta. It compounds fast.
## Frequently Asked Questions
### How does Claude prompt caching work and what’s the minimum token threshold?
Claude’s prompt caching requires a minimum of 1,024 tokens in the cacheable block (2,048 for Haiku models). You mark the cache boundary with cache_control: {"type": "ephemeral"} at the end of the content block you want cached. The cache persists for 5 minutes from last use, so repeated calls within that window pay only 10% of the normal input token price for those tokens.
### Can I use the Batch API for real-time user-facing features?
No — the Batch API is strictly asynchronous with up to 24 hours turnaround, making it unsuitable for anything a user is actively waiting on. It’s designed for background jobs: nightly processing, bulk analysis, document ingestion pipelines, and offline enrichment tasks. For user-facing workflows, use the standard synchronous API with caching applied.
### What tasks can I safely move from Claude Sonnet to Haiku without quality loss?
Classification, intent detection, structured data extraction from well-formatted documents, entity extraction from short texts, JSON formatting, and simple routing decisions all perform near-identically on Haiku vs Sonnet in my testing. The tasks where Sonnet clearly outperforms include complex multi-step reasoning, nuanced writing, intricate code generation, and anything requiring deep inference across long contexts. Always eval on real production samples before switching.
### Can I combine prompt caching and the Batch API at the same time?
Yes, and this is one of the highest-leverage combinations available. Batch API requests support the same cache_control syntax as standard API requests. Within a batch, if multiple requests share the same cacheable prefix (e.g., the same system prompt), cache hits apply. This means you get the 50% batch discount plus the 90% input discount on cached tokens simultaneously.
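Concretely, each batch entry carries the same system block with cache_control so the shared prefix is cached across the whole batch. A sketch of the request shape; the classifier prompt and naming are illustrative:

```python
SHARED_SYSTEM = [{
    "type": "text",
    "text": "You are a support-ticket classifier. [long shared instructions here]",
    "cache_control": {"type": "ephemeral"},  # shared prefix, cached across the batch
}]

def batch_request(i: int, ticket_text: str) -> dict:
    """One batch entry; every entry shares the cacheable system prefix."""
    return {
        "custom_id": f"ticket-{i}",
        "params": {
            "model": "claude-sonnet-4-5",
            "max_tokens": 128,
            "system": SHARED_SYSTEM,
            "messages": [{"role": "user", "content": ticket_text}],
        },
    }

# requests = [batch_request(i, t) for i, t in enumerate(tickets)]
# client.messages.batches.create(requests=requests)
```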
### How do I measure whether my prompt caching is actually working?
Check the usage object in the API response — it includes cache_creation_input_tokens (tokens written to cache, billed at 125% of standard) and cache_read_input_tokens (tokens served from cache, billed at 10% of standard). If you’re seeing zero cache reads after the first request, your cache boundary is likely positioned incorrectly — it must be placed after the stable content, before the dynamic content.
### Does switching models affect my existing prompts and outputs?
Yes, and this is the main risk. Haiku follows instructions well but can be more literal and less forgiving of ambiguous prompts. If your Sonnet prompts rely on the model inferring unstated context or handling edge cases gracefully, you may need to make them more explicit for Haiku. Run a structured eval on 100–200 real examples before migrating any production workload to a smaller model.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

