If you’re running LLM-powered features in production and haven’t looked at your token spend recently, you’re probably leaving real money on the table. LLM prompt caching costs — or rather, the lack of caching — are responsible for a disproportionate chunk of most teams’ API bills. I’ve seen production RAG pipelines cut their monthly spend by 40% in a single afternoon by implementing two of the patterns below. This article covers the three approaches that actually move the needle: Claude’s native prompt prefix caching, response memoization for repeated queries, and vector-layer caching for RAG workflows.
Before diving in, let’s frame the economics. At Claude 3.5 Sonnet pricing, you’re paying $3 per million input tokens and $15 per million output tokens. If your system prompt is 2,000 tokens and you’re handling 10,000 requests per day, that’s 20 million system prompt tokens daily — roughly $60/day or ~$1,800/month on system prompts alone. With cache hits, that same traffic costs a fraction. These aren’t rounding errors.
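Those numbers are worth sanity-checking in code. A quick back-of-envelope sketch, using the Sonnet input rate quoted above:

```python
# Back-of-envelope check of the system-prompt spend described above.
SONNET_INPUT_PER_TOKEN = 3.00 / 1_000_000  # $3 per million input tokens

system_prompt_tokens = 2_000
requests_per_day = 10_000

daily_system_tokens = system_prompt_tokens * requests_per_day  # 20 million per day
daily_cost = daily_system_tokens * SONNET_INPUT_PER_TOKEN      # ~$60/day
monthly_cost = daily_cost * 30                                 # ~$1,800/month

print(f"${daily_cost:,.0f}/day, ${monthly_cost:,.0f}/month on the system prompt alone")
```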
How Claude’s Native Prompt Prefix Caching Works
Anthropic released prompt caching for Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Haiku. The mechanism is straightforward: you mark specific blocks in your prompt with a cache_control parameter, and Anthropic’s infrastructure stores the processed KV cache for those blocks for 5 minutes (with a sliding window that resets on each hit). Subsequent requests that share the same cached prefix skip recomputation entirely.
Cached tokens cost $0.30 per million to read (vs $3.00 standard) — a 90% reduction. Writing to the cache costs $3.75 per million tokens, a 25% premium over the standard input rate, so a single cache hit already puts you ahead: the $0.75/M write premium is paid back by the $2.70/M you save on the first hit. After that, you’re saving on every request.
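The break-even point falls out directly from those three rates (all figures in dollars per million tokens):

```python
# Prompt-cache break-even at the Sonnet rates above (dollars per million tokens).
STANDARD_INPUT = 3.00
CACHE_WRITE = 3.75
CACHE_READ = 0.30

write_premium = CACHE_WRITE - STANDARD_INPUT   # extra cost on the first (cache-write) request
saving_per_hit = STANDARD_INPUT - CACHE_READ   # saved on every subsequent cache hit

hits_to_break_even = write_premium / saving_per_hit
print(f"Break even after {hits_to_break_even:.2f} cache hits")
```

In other words, the write premium is recovered well before the first cache hit completes paying for itself; any prefix reused even once is worth caching.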
The Implementation
Here’s a minimal working example using the Python SDK:
```python
import anthropic

client = anthropic.Anthropic()

# This system prompt will be cached after the first request
SYSTEM_PROMPT = """You are a senior code reviewer specializing in Python.
Your job is to review pull requests, identify bugs, suggest improvements,
and flag security vulnerabilities. Always structure your response as:
1. Critical issues (must fix)
2. Suggestions (should fix)
3. Nitpicks (optional)
[... imagine 1500 more tokens of detailed instructions and examples ...]
"""

def review_code(code_diff: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # Mark for caching
            }
        ],
        messages=[
            {"role": "user", "content": f"Review this diff:\n\n{code_diff}"}
        ]
    )

    # Check cache performance in the response
    usage = response.usage
    print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    print(f"Regular input tokens: {usage.input_tokens}")

    return response.content[0].text
```
A few things the documentation undersells: the 5-minute TTL resets on every cache hit, so high-traffic endpoints will effectively keep the cache warm indefinitely. The minimum cacheable block is 1,024 tokens for Sonnet and Opus, 2,048 for Haiku — don’t bother caching short system prompts. Also, the cache is scoped to your organization and to the exact model version, so separate accounts (or the same prompt sent to a different model) share nothing, even with identical prompts.
What You Can Cache (and What You Can’t)
You can cache multiple blocks in a single request — system prompt, few-shot examples, and even large document context. But caching always applies to a prefix: everything up to a cache_control breakpoint is cached, and anything that varies per request breaks caching for everything after it. This matters for RAG: if you’re injecting retrieved documents before your instructions, you need to restructure your prompt to put the static content first.
```python
# Bad: retrieved docs come before static instructions (can't cache the instructions)
messages = [
    {"role": "user", "content": f"{retrieved_docs}\n\nGiven the above, answer: {question}"}
]

# Better: structure allows caching the large static system prompt
system = [
    {
        "type": "text",
        "text": DETAILED_INSTRUCTIONS,  # 2000+ tokens, cacheable
        "cache_control": {"type": "ephemeral"}
    }
]
messages = [
    {"role": "user", "content": f"Context:\n{retrieved_docs}\n\nQuestion: {question}"}
]
```
Response Memoization: Cache the Output, Not Just the Input
Native prompt caching handles the compute side. Response memoization handles the redundancy side. If you’re building anything with user queries — support bots, Q&A systems, code generators — a surprising percentage of queries are semantically identical or near-identical. Memoizing responses means you skip the API call entirely.
Exact-match memoization is trivial with Redis. The interesting version is fuzzy memoization using embeddings, which I’ll cover in the next section. For exact match:
```python
import hashlib
import json

import redis
from anthropic import Anthropic

client = Anthropic()
cache = redis.Redis(host='localhost', port=6379, decode_responses=True)

def get_cache_key(model: str, messages: list, system: str = "") -> str:
    """Generate a deterministic cache key from request parameters."""
    payload = json.dumps({
        "model": model,
        "messages": messages,
        "system": system
    }, sort_keys=True)
    return f"llm:response:{hashlib.sha256(payload.encode()).hexdigest()}"

def cached_completion(
    messages: list,
    model: str = "claude-3-5-sonnet-20241022",
    system: str = "",
    ttl: int = 3600,  # 1 hour default
    max_tokens: int = 1024
) -> dict:
    cache_key = get_cache_key(model, messages, system)

    # Try cache first
    cached = cache.get(cache_key)
    if cached:
        result = json.loads(cached)
        result["cache_hit"] = True
        return result

    # Miss — call the API
    kwargs = {"model": model, "max_tokens": max_tokens, "messages": messages}
    if system:
        kwargs["system"] = system
    response = client.messages.create(**kwargs)

    result = {
        "content": response.content[0].text,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "cache_hit": False
    }

    # Store with TTL
    cache.setex(cache_key, ttl, json.dumps(result))
    return result
```
The TTL question is where most people get it wrong. For deterministic tasks (code formatting, data extraction with fixed schemas), you can cache aggressively — hours or days. For anything that needs current information, keep TTLs short or skip memoization. I’ve seen teams accidentally serve stale responses for weeks because they set TTL to 86400 and forgot about it.
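One way to keep that decision from being scattered across call sites is a single per-task TTL policy. The task names and values below are illustrative assumptions to tune against your own workload, not vendor recommendations:

```python
# Illustrative TTL policy (seconds) per task type — all values are assumptions to tune.
TTL_POLICY = {
    "code_format": 7 * 86_400,   # deterministic output: cache for a week
    "schema_extract": 86_400,    # fixed schema: a day is usually safe
    "support_answer": 3_600,     # docs change: keep it short
    "news_summary": 0,           # needs current information: skip memoization
}

def ttl_for(task: str) -> int:
    """Return the memoization TTL for a task, defaulting to one hour when unsure."""
    return TTL_POLICY.get(task, 3_600)
```

A zero TTL doubles as an explicit "never memoize" marker the calling code can check before writing to the cache.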
Realistic cache hit rates: For a support chatbot with ~500 distinct question patterns, exact-match hit rates hover around 15–25% depending on how you normalize queries (lowercase, strip punctuation, etc.). That alone can cut costs by 15–25% with zero quality tradeoff.
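Normalization is where that hit rate is won or lost. A minimal sketch of the kind of normalizer meant above — the exact rules are assumptions; tune them against your own query logs:

```python
import re
import string

def normalize_query(query: str) -> str:
    """Normalize a user query before computing its exact-match cache key."""
    q = query.strip().lower()
    q = q.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    q = re.sub(r"\s+", " ", q)                                  # collapse whitespace
    return q

normalize_query("  What's your  REFUND policy??  ")  # -> "whats your refund policy"
```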
Semantic Caching for RAG: The Vector Layer
Exact-match memoization misses “What’s your refund policy?” and “How do I get a refund?” — semantically identical, textually different. Semantic caching fixes this by storing past queries as embeddings and doing a similarity lookup before hitting the LLM.
```python
import hashlib
import json

import numpy as np
import redis
from anthropic import Anthropic

client = Anthropic()
cache = redis.Redis(host='localhost', port=6379, decode_responses=True)

def get_embedding(text: str) -> list[float]:
    """Get embeddings — using OpenAI here, swap for your preferred provider."""
    import openai
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_cached_query(
    query: str,
    system: str,
    similarity_threshold: float = 0.92,  # tune this carefully
    ttl: int = 7200
) -> dict:
    query_embedding = get_embedding(query)

    # Scan cached queries for semantic matches.
    # In production, use a vector DB (Pinecone, Qdrant) instead of a Redis scan.
    best_match = None
    best_score = 0.0
    for key in cache.scan_iter("semantic:*"):
        cached_data = json.loads(cache.get(key))
        score = cosine_similarity(query_embedding, cached_data["embedding"])
        if score > best_score:
            best_score = score
            best_match = cached_data

    if best_match and best_score >= similarity_threshold:
        return {
            "content": best_match["response"],
            "cache_hit": True,
            "similarity": best_score,
            "matched_query": best_match["query"]
        }

    # Cache miss — call Claude
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # cheaper model for the cache-miss path
        max_tokens=512,
        system=system,
        messages=[{"role": "user", "content": query}]
    )
    result_text = response.content[0].text

    # Store in the semantic cache
    cache_entry = {
        "query": query,
        "embedding": query_embedding,
        "response": result_text
    }
    key = f"semantic:{hashlib.md5(query.encode()).hexdigest()}"
    cache.setex(key, ttl, json.dumps(cache_entry))

    return {"content": result_text, "cache_hit": False, "similarity": 0.0}
```
The similarity threshold is critical. At 0.95, you’ll have high precision but miss many valid cache opportunities. At 0.88, you’ll get higher hit rates but start returning slightly mismatched responses. For factual Q&A, I’d start at 0.92 and measure. For creative tasks, don’t use semantic caching at all — users will notice repeated responses.
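Tuning the threshold is a measurement exercise. If you log each semantic match’s similarity score alongside whether a spot check judged the served answer correct, a small helper can show the hit-rate/precision tradeoff at any candidate threshold (a sketch; the logged data here is hypothetical):

```python
def threshold_stats(labeled_matches: list[tuple[float, bool]], threshold: float) -> dict:
    """labeled_matches: (similarity_score, served_answer_was_correct) pairs."""
    served = [(s, ok) for s, ok in labeled_matches if s >= threshold]
    hit_rate = len(served) / len(labeled_matches) if labeled_matches else 0.0
    precision = sum(ok for _, ok in served) / len(served) if served else 1.0
    return {"hit_rate": round(hit_rate, 3), "precision": round(precision, 3)}

# Hypothetical logged data: scores near the boundary are where the errors live.
log = [(0.99, True), (0.96, True), (0.93, True), (0.91, False), (0.89, False), (0.80, False)]
threshold_stats(log, 0.92)  # serves 3 of 6, all correct
threshold_stats(log, 0.88)  # serves 5 of 6, but two are mismatches
```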
In production, don’t use Redis scan for this. It’s O(n) and will kill performance at scale. Use Qdrant, Pinecone, or pgvector with proper ANN indexing. Qdrant has a free tier and a clean Python client — it’s my default for this pattern.
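Even before adopting a vector DB, you can replace the per-key scan with a single vectorized NumPy lookup over an in-memory embedding matrix — still O(n), but one matrix-vector product instead of thousands of Python-level cosine calls. A sketch with random data; a real ANN index remains the production answer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the embeddings of previously cached queries (n x d), L2-normalized.
cached = rng.normal(size=(10_000, 256))
cached /= np.linalg.norm(cached, axis=1, keepdims=True)

def best_match(query_embedding: np.ndarray, threshold: float = 0.92):
    """One matrix-vector product scores every cached query at once."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = cached @ q                # cosine similarity against all cached embeddings
    idx = int(np.argmax(scores))
    return (idx, float(scores[idx])) if scores[idx] >= threshold else None

# A query identical to cached entry 42 matches itself with similarity ~1.0.
idx, score = best_match(cached[42])
```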
Calculating Your Actual Savings
Here’s a concrete calculator. Plug in your numbers:
```python
def calculate_monthly_savings(
    daily_requests: int,
    avg_system_prompt_tokens: int,
    avg_user_tokens: int,
    avg_output_tokens: int,
    exact_match_hit_rate: float,   # e.g. 0.20 for 20%
    semantic_hit_rate: float,      # additional hits from the semantic cache
    prompt_cache_hit_rate: float,  # e.g. 0.85 for a warm system prompt
    model: str = "sonnet"
) -> dict:
    PRICING = {
        "sonnet": {
            "input": 3.00 / 1e6,
            "output": 15.00 / 1e6,
            "cache_write": 3.75 / 1e6,
            "cache_read": 0.30 / 1e6,
        }
    }
    p = PRICING[model]
    monthly_requests = daily_requests * 30

    # Baseline cost (no caching)
    total_input = monthly_requests * (avg_system_prompt_tokens + avg_user_tokens)
    total_output = monthly_requests * avg_output_tokens
    baseline_cost = (total_input * p["input"]) + (total_output * p["output"])

    # After caching: memoized requests (exact + semantic) never reach the LLM
    llm_requests = monthly_requests * (1 - exact_match_hit_rate - semantic_hit_rate)

    # For requests that do reach the LLM, apply prompt caching to the system prompt
    cache_miss_system = llm_requests * (1 - prompt_cache_hit_rate) * avg_system_prompt_tokens
    cache_hit_system = llm_requests * prompt_cache_hit_rate * avg_system_prompt_tokens
    user_tokens = llm_requests * avg_user_tokens
    output_tokens = llm_requests * avg_output_tokens

    cached_cost = (
        (cache_miss_system * p["cache_write"]) +
        (cache_hit_system * p["cache_read"]) +
        (user_tokens * p["input"]) +
        (output_tokens * p["output"])
    )

    savings = baseline_cost - cached_cost
    return {
        "baseline_monthly": round(baseline_cost, 2),
        "cached_monthly": round(cached_cost, 2),
        "monthly_savings": round(savings, 2),
        "reduction_pct": round((savings / baseline_cost) * 100, 1)
    }

# Example: support bot, 5000 req/day
result = calculate_monthly_savings(
    daily_requests=5000,
    avg_system_prompt_tokens=1500,
    avg_user_tokens=100,
    avg_output_tokens=200,
    exact_match_hit_rate=0.20,
    semantic_hit_rate=0.15,
    prompt_cache_hit_rate=0.85
)
print(result)
# {'baseline_monthly': 1170.0, 'cached_monthly': 441.31,
#  'monthly_savings': 728.69, 'reduction_pct': 62.3}
```
What Actually Breaks in Production
A few failure modes worth knowing before you ship this:
- Cache poisoning via injection: If user input can influence cached content, an adversarial user could poison your cache. Always cache on normalized, validated inputs only.
- Stale semantic matches: Your product docs change; your cache doesn’t know. Build a cache invalidation strategy tied to your content update pipeline, not just TTL.
- Cold start problem: The first request after deployment always misses. For low-traffic apps, you may never warm the cache enough to see savings. This strategy pays off at 100+ requests/day minimum.
- Claude prompt caching and streaming: As of writing, prompt caching works with streaming, but the cache usage fields (cache_creation_input_tokens, cache_read_input_tokens) are reported on the initial message_start event’s usage object, not in every streamed chunk — don’t expect them in each delta, and verify the behavior against the SDK version you’re running.
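For the stale-match failure mode above, one simple invalidation hook is to bake a content version into the cache key, so each docs deploy orphans the old entries and lets TTL sweep them away. A sketch — DOCS_VERSION is an assumed value your content pipeline would bump:

```python
import hashlib

DOCS_VERSION = "2024-11-03"  # assumption: bumped by your content deploy pipeline

def versioned_cache_key(query: str) -> str:
    """Include the content version so a docs deploy invalidates every stale entry."""
    digest = hashlib.sha256(f"{DOCS_VERSION}:{query}".encode()).hexdigest()
    return f"llm:response:{digest}"
```

Bumping DOCS_VERSION changes every key at once; the stale entries are never read again and expire naturally via their TTL.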
Which Caching Strategy Should You Implement First?
Solo founder or small team with a single product: Start with Claude’s native prompt prefix caching — it’s two lines of code and immediately cuts costs on any request sharing a large system prompt. You’ll see results this week.
Teams running support or Q&A bots: Add exact-match Redis memoization on top. The hit rate is higher than you’d expect, and the implementation is dead simple. Semantic caching is worth adding once you’ve measured exact-match hit rates and found the ceiling.
Teams running high-volume RAG pipelines: All three strategies apply. Restructure your prompts to cache static instructions, memoize exact queries, and layer in vector caching with Qdrant. This is where you’ll hit 40–50% cost reduction and it’s worth the engineering investment.
Managing LLM prompt caching costs isn’t a one-time optimization — it’s an ongoing practice. Instrument your cache hit rates from day one, set alerts when they drop unexpectedly, and revisit your TTL strategy as your prompt templates evolve. The teams spending the least on API calls aren’t using cheaper models; they’re using smarter caching.
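Instrumenting hit rates doesn’t need to be elaborate; a counter per cache layer that you export to your metrics system is enough to make drops visible. A minimal sketch — the layer names are illustrative:

```python
from collections import Counter

class CacheMetrics:
    """Track hits and misses per cache layer so hit-rate drops are visible."""

    def __init__(self):
        self.counts = Counter()

    def record(self, layer: str, hit: bool) -> None:
        self.counts[(layer, "hit" if hit else "miss")] += 1

    def hit_rate(self, layer: str) -> float:
        hits = self.counts[(layer, "hit")]
        total = hits + self.counts[(layer, "miss")]
        return hits / total if total else 0.0

metrics = CacheMetrics()
for hit in [True, True, False, True]:
    metrics.record("exact_match", hit)
metrics.hit_rate("exact_match")  # 0.75
```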
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

