Sunday, April 5

If you’re running LLM-powered features in production and haven’t looked at your token spend recently, you’re probably leaving real money on the table. LLM prompt caching costs — or rather, the lack of caching — are responsible for a disproportionate chunk of most teams’ API bills. I’ve seen production RAG pipelines cut their monthly spend by 40% in a single afternoon by implementing two of the patterns below. This article covers the three approaches that actually move the needle: Claude’s native prompt prefix caching, response memoization for repeated queries, and vector-layer caching for RAG workflows.

Before diving in, let’s frame the economics. At Claude 3.5 Sonnet pricing, you’re paying $3 per million input tokens and $15 per million output tokens. If your system prompt is 2,000 tokens and you’re handling 10,000 requests per day, that’s 20 million system prompt tokens daily — roughly $60/day or ~$1,800/month on system prompts alone. With cache hits, that same traffic costs a fraction. These aren’t rounding errors.
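The arithmetic above is worth sanity-checking yourself; a minimal sketch using the same figures:

```python
# Back-of-envelope cost of re-sending a 2,000-token system prompt
# at Claude 3.5 Sonnet input pricing ($3 per million tokens).
INPUT_PRICE_PER_TOKEN = 3.00 / 1_000_000

system_prompt_tokens = 2_000
requests_per_day = 10_000

daily_tokens = system_prompt_tokens * requests_per_day   # 20M tokens/day
daily_cost = daily_tokens * INPUT_PRICE_PER_TOKEN        # ~$60/day
monthly_cost = daily_cost * 30                           # ~$1,800/month

print(f"${daily_cost:.0f}/day, ${monthly_cost:.0f}/month")
```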

How Claude’s Native Prompt Prefix Caching Works

Anthropic released prompt caching for Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Haiku. The mechanism is straightforward: you mark specific blocks in your prompt with a cache_control parameter, and Anthropic’s infrastructure stores the processed KV cache for those blocks for 5 minutes (with a sliding window that resets on each hit). Subsequent requests that share the same cached prefix skip recomputation entirely.

Cached tokens cost $0.30 per million to read (vs $3.00 standard) — a 90% reduction. Writing to cache costs $3.75 per million tokens, a 25% premium over standard input, so break-even arrives before the first full cache hit: each hit saves $2.70 per million tokens against a $0.75 write premium. After that, you’re saving on every request.
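The break-even math is simple enough to write down (per million prefix tokens, using the Sonnet prices above):

```python
# Break-even analysis for prompt caching, per million prefix tokens.
STANDARD = 3.00      # $/M input tokens, no caching
CACHE_WRITE = 3.75   # $/M tokens, first request (writes the cache)
CACHE_READ = 0.30    # $/M tokens, subsequent cache hits

def cost_with_cache(hits: int) -> float:
    """One cache write followed by `hits` cached reads."""
    return CACHE_WRITE + hits * CACHE_READ

def cost_without_cache(hits: int) -> float:
    """The same 1 + hits requests, all at standard pricing."""
    return (1 + hits) * STANDARD

# The $0.75 write premium is recovered by a fraction of one hit
# ($2.70 saved per hit), so caching wins from the first hit onward.
break_even_hits = (CACHE_WRITE - STANDARD) / (STANDARD - CACHE_READ)
print(round(break_even_hits, 2))  # 0.28
print(cost_with_cache(1), cost_without_cache(1))  # caching already cheaper after one hit
```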

The Implementation

Here’s a minimal working example using the Python SDK:

import anthropic

client = anthropic.Anthropic()

# This system prompt will be cached after the first request
SYSTEM_PROMPT = """You are a senior code reviewer specializing in Python.
Your job is to review pull requests, identify bugs, suggest improvements,
and flag security vulnerabilities. Always structure your response as:
1. Critical issues (must fix)
2. Suggestions (should fix)
3. Nitpicks (optional)

[... imagine 1500 more tokens of detailed instructions and examples ...]
"""

def review_code(code_diff: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # Mark for caching
            }
        ],
        messages=[
            {"role": "user", "content": f"Review this diff:\n\n{code_diff}"}
        ]
    )
    
    # Check cache performance in the response
    usage = response.usage
    print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    print(f"Regular input tokens: {usage.input_tokens}")
    
    return response.content[0].text

A few things the documentation undersells: the 5-minute TTL resets on every cache hit, so high-traffic endpoints will effectively keep the cache warm indefinitely. The minimum cacheable block is 1,024 tokens for Sonnet and Opus, 2,048 for Haiku — don’t bother caching short system prompts. Also, the cache is scoped per organization and per model, so identical prompts sent from a different organization, or to a different model, won’t hit it.
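Given those minimums, it can be worth guarding against marking under-sized prompts for caching. A sketch, using a crude chars/4 token estimate rather than the official tokenizer (the helper names here are hypothetical, not part of the SDK):

```python
# Per-model minimum cacheable block sizes, as described above.
MIN_CACHEABLE_TOKENS = {
    "claude-3-5-sonnet-20241022": 1024,
    "claude-3-opus-20240229": 1024,
    "claude-3-haiku-20240307": 2048,
}

def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English text."""
    return len(text) // 4

def system_block(text: str, model: str) -> dict:
    """Attach cache_control only when the prompt plausibly exceeds the minimum."""
    block = {"type": "text", "text": text}
    if estimate_tokens(text) >= MIN_CACHEABLE_TOKENS.get(model, 1024):
        block["cache_control"] = {"type": "ephemeral"}
    return block

short = system_block("You are helpful.", "claude-3-5-sonnet-20241022")
print("cache_control" in short)  # False — too short to cache
```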

What You Can Cache (and What You Can’t)

You can cache multiple blocks in a single request — system prompt, few-shot examples, and even large document context. The cacheable content must be at the beginning of the prompt; anything after the first non-cached content won’t be cached. This matters for RAG: if you’re injecting retrieved documents before your instructions, you need to restructure your prompt to put static content first.

# Bad: retrieved docs come before static instructions (can't cache the instructions)
messages = [
    {"role": "user", "content": f"{retrieved_docs}\n\nGiven the above, answer: {question}"}
]

# Better: structure allows caching the large static system prompt
system = [
    {
        "type": "text", 
        "text": DETAILED_INSTRUCTIONS,  # 2000+ tokens, cacheable
        "cache_control": {"type": "ephemeral"}
    }
]
messages = [
    {"role": "user", "content": f"Context:\n{retrieved_docs}\n\nQuestion: {question}"}
]

Response Memoization: Cache the Output, Not Just the Input

Native prompt caching handles the compute side. Response memoization handles the redundancy side. If you’re building anything with user queries — support bots, Q&A systems, code generators — a surprising percentage of queries are semantically identical or near-identical. Memoizing responses means you skip the API call entirely.

Exact-match memoization is trivial with Redis. The interesting version is fuzzy memoization using embeddings, which I’ll cover in the next section. For exact match:

import hashlib
import json
import redis
from anthropic import Anthropic

client = Anthropic()
cache = redis.Redis(host='localhost', port=6379, decode_responses=True)

def get_cache_key(model: str, messages: list, system: str = "") -> str:
    """Generate a deterministic cache key from request parameters."""
    payload = json.dumps({
        "model": model,
        "messages": messages,
        "system": system
    }, sort_keys=True)
    return f"llm:response:{hashlib.sha256(payload.encode()).hexdigest()}"

def cached_completion(
    messages: list,
    model: str = "claude-3-5-sonnet-20241022",
    system: str = "",
    ttl: int = 3600,  # 1 hour default
    max_tokens: int = 1024
) -> dict:
    cache_key = get_cache_key(model, messages, system)
    
    # Try cache first
    cached = cache.get(cache_key)
    if cached:
        result = json.loads(cached)
        result["cache_hit"] = True
        return result
    
    # Miss — call the API
    kwargs = {"model": model, "max_tokens": max_tokens, "messages": messages}
    if system:
        kwargs["system"] = system
        
    response = client.messages.create(**kwargs)
    
    result = {
        "content": response.content[0].text,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "cache_hit": False
    }
    
    # Store with TTL
    cache.setex(cache_key, ttl, json.dumps(result))
    return result

The TTL question is where most people get it wrong. For deterministic tasks (code formatting, data extraction with fixed schemas), you can cache aggressively — hours or days. For anything that needs current information, keep TTLs short or skip memoization. I’ve seen teams accidentally serve stale responses for weeks because they set TTL to 86400 and forgot about it.
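One way to keep TTLs from being set once and forgotten is to make them an explicit policy keyed by task type. The categories and numbers below are illustrative, not prescriptive:

```python
# Illustrative TTL policy: deterministic tasks cache long, time-sensitive ones short.
TTL_POLICY = {
    "code_formatting": 7 * 24 * 3600,  # deterministic output: days is fine
    "data_extraction": 24 * 3600,      # fixed schema: a day
    "support_answer": 3600,            # docs change: an hour
    "current_events": 0,               # needs fresh info: don't memoize at all
}

def ttl_for(task_type: str) -> int:
    """Fall back to a conservative short TTL for unknown task types."""
    return TTL_POLICY.get(task_type, 300)

print(ttl_for("support_answer"))  # 3600
```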

Realistic cache hit rates: For a support chatbot with ~500 distinct question patterns, exact-match hit rates hover around 15–25% depending on how you normalize queries (lowercase, strip punctuation, etc.). That alone can cut costs by 15–25% with zero quality tradeoff.
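The normalization mentioned above can be as simple as a canonicalization function applied before computing the cache key — a sketch, to be tuned against your actual traffic:

```python
import re
import string

def normalize_query(query: str) -> str:
    """Canonicalize a user query before hashing for the exact-match cache:
    lowercase, strip punctuation, collapse whitespace."""
    q = query.lower().strip()
    q = q.translate(str.maketrans("", "", string.punctuation))
    q = re.sub(r"\s+", " ", q)
    return q

# These variants now share one cache entry:
print(normalize_query("What's Your Refund Policy??"))  # whats your refund policy
print(normalize_query("whats  your refund policy"))    # whats your refund policy
```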

Semantic Caching for RAG: The Vector Layer

Exact-match memoization misses “What’s your refund policy?” and “How do I get a refund?” — semantically identical, textually different. Semantic caching fixes this by storing past queries as embeddings and doing a similarity lookup before hitting the LLM.

import numpy as np
from anthropic import Anthropic
import redis
import json

client = Anthropic()
cache = redis.Redis(host='localhost', port=6379, decode_responses=True)

def get_embedding(text: str) -> list[float]:
    """Get embeddings — using OpenAI here, swap for your preferred provider."""
    import openai
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_cached_query(
    query: str,
    system: str,
    similarity_threshold: float = 0.92,  # tune this carefully
    ttl: int = 7200
) -> dict:
    query_embedding = get_embedding(query)
    
    # Scan cached queries for semantic matches
    # In production, use a vector DB (Pinecone, Qdrant) instead of Redis scan
    best_match = None
    best_score = 0.0
    
    for key in cache.scan_iter("semantic:*"):
        cached_data = json.loads(cache.get(key))
        score = cosine_similarity(query_embedding, cached_data["embedding"])
        if score > best_score:
            best_score = score
            best_match = cached_data
    
    if best_score >= similarity_threshold:
        return {
            "content": best_match["response"],
            "cache_hit": True,
            "similarity": best_score,
            "matched_query": best_match["query"]
        }
    
    # Cache miss — call Claude
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # cheaper model for the cache-miss path
        max_tokens=512,
        system=system,
        messages=[{"role": "user", "content": query}]
    )
    
    result_text = response.content[0].text
    
    # Store in semantic cache
    cache_entry = {
        "query": query,
        "embedding": query_embedding,
        "response": result_text
    }
    import hashlib
    key = f"semantic:{hashlib.md5(query.encode()).hexdigest()}"
    cache.setex(key, ttl, json.dumps(cache_entry))
    
    return {"content": result_text, "cache_hit": False, "similarity": 0.0}

The similarity threshold is critical. At 0.95, you’ll have high precision but miss many valid cache opportunities. At 0.88, you’ll get higher hit rates but start returning slightly mismatched responses. For factual Q&A, I’d start at 0.92 and measure. For creative tasks, don’t use semantic caching at all — users will notice repeated responses.
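Rather than guessing, you can sweep candidate thresholds against a small labeled sample of logged (similarity score, human-judged-correct) pairs. The data below is synthetic, purely to show the shape of the measurement:

```python
def sweep_thresholds(samples: list[tuple[float, bool]], thresholds: list[float]) -> dict:
    """For each candidate threshold, compute hit rate (fraction of queries served
    from cache) and precision (fraction of those hits that were correct matches)."""
    results = {}
    for t in thresholds:
        hits = [(score, ok) for score, ok in samples if score >= t]
        hit_rate = len(hits) / len(samples)
        precision = (sum(ok for _, ok in hits) / len(hits)) if hits else 1.0
        results[t] = {"hit_rate": round(hit_rate, 2), "precision": round(precision, 2)}
    return results

# Synthetic log: (best similarity score, did a human judge the match correct?)
samples = [(0.97, True), (0.94, True), (0.93, True), (0.91, False),
           (0.90, True), (0.89, False), (0.85, False), (0.80, False)]
print(sweep_thresholds(samples, [0.88, 0.92, 0.95]))
```

The output makes the precision/hit-rate tradeoff concrete: lowering the threshold raises hit rate but lets mismatches through.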

In production, don’t use Redis scan for this. It’s O(n) and will kill performance at scale. Use Qdrant, Pinecone, or pgvector with proper ANN indexing. Qdrant has a free tier and a clean Python client — it’s my default for this pattern.
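Short of adopting a vector DB, you can at least vectorize the scan: keep all cached embeddings in one NumPy matrix and do a single matrix-vector product instead of a Python loop. It’s still O(n), so this is a stopgap, not a substitute for ANN indexing — a sketch assuming embeddings are stored alongside their responses:

```python
import numpy as np

class VectorizedCache:
    """In-memory semantic cache with one batched cosine-similarity lookup.
    A stand-in for the Redis scan loop; production should use ANN indexing."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.entries: list[dict] = []

    def add(self, embedding: list[float], query: str, response: str) -> None:
        v = np.asarray(embedding, dtype=np.float32)
        v /= np.linalg.norm(v)  # pre-normalize so lookup is a plain dot product
        self.vectors = np.vstack([self.vectors, v])
        self.entries.append({"query": query, "response": response})

    def lookup(self, embedding: list[float], threshold: float = 0.92):
        if not self.entries:
            return None
        q = np.asarray(embedding, dtype=np.float32)
        q /= np.linalg.norm(q)
        scores = self.vectors @ q  # cosine similarity against all entries at once
        best = int(np.argmax(scores))
        if scores[best] >= threshold:
            return {**self.entries[best], "similarity": float(scores[best])}
        return None
```

In `semantic_cached_query`, this would replace the `scan_iter` loop: call `lookup` before hitting Claude and `add` on a miss.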

Calculating Your Actual Savings

Here’s a concrete calculator. Plug in your numbers:

def calculate_monthly_savings(
    daily_requests: int,
    avg_system_prompt_tokens: int,
    avg_user_tokens: int,
    avg_output_tokens: int,
    exact_match_hit_rate: float,   # e.g. 0.20 for 20%
    semantic_hit_rate: float,       # additional hits from semantic cache
    prompt_cache_hit_rate: float,   # e.g. 0.85 for warm system prompt
    model: str = "sonnet"
) -> dict:
    PRICING = {
        "sonnet": {
            "input": 3.00 / 1e6,
            "output": 15.00 / 1e6,
            "cache_write": 3.75 / 1e6,
            "cache_read": 0.30 / 1e6,
        }
    }
    p = PRICING[model]
    monthly_requests = daily_requests * 30
    
    # Baseline cost (no caching)
    total_input = monthly_requests * (avg_system_prompt_tokens + avg_user_tokens)
    total_output = monthly_requests * avg_output_tokens
    baseline_cost = (total_input * p["input"]) + (total_output * p["output"])
    
    # After caching
    served_from_exact = monthly_requests * exact_match_hit_rate
    served_from_semantic = monthly_requests * semantic_hit_rate
    llm_requests = monthly_requests * (1 - exact_match_hit_rate - semantic_hit_rate)
    
    # For LLM requests, apply prompt caching
    cache_miss_system = llm_requests * (1 - prompt_cache_hit_rate) * avg_system_prompt_tokens
    cache_hit_system = llm_requests * prompt_cache_hit_rate * avg_system_prompt_tokens
    user_tokens = llm_requests * avg_user_tokens
    output_tokens = llm_requests * avg_output_tokens
    
    cached_cost = (
        (cache_miss_system * p["cache_write"]) +
        (cache_hit_system * p["cache_read"]) +
        (user_tokens * p["input"]) +
        (output_tokens * p["output"])
    )
    
    savings = baseline_cost - cached_cost
    return {
        "baseline_monthly": round(baseline_cost, 2),
        "cached_monthly": round(cached_cost, 2),
        "monthly_savings": round(savings, 2),
        "reduction_pct": round((savings / baseline_cost) * 100, 1)
    }

# Example: support bot, 5000 req/day
result = calculate_monthly_savings(
    daily_requests=5000,
    avg_system_prompt_tokens=1500,
    avg_user_tokens=100,
    avg_output_tokens=200,
    exact_match_hit_rate=0.20,
    semantic_hit_rate=0.15,
    prompt_cache_hit_rate=0.85
)
print(result)
# {'baseline_monthly': 1170.0, 'cached_monthly': 441.31,
#  'monthly_savings': 728.69, 'reduction_pct': 62.3}

What Actually Breaks in Production

A few failure modes worth knowing before you ship this:

  • Cache poisoning via injection: If user input can influence cached content, an adversarial user could poison your cache. Always cache on normalized, validated inputs only.
  • Stale semantic matches: Your product docs change; your cache doesn’t know. Build a cache invalidation strategy tied to your content update pipeline, not just TTL.
  • Cold start problem: The first request after deployment always misses. For low-traffic apps, you may never warm the cache enough to see savings. This strategy pays off at 100+ requests/day minimum.
  • Claude prompt caching and streaming: As of writing, prompt caching works with streaming but the cache_creation_input_tokens field may not appear in streaming usage chunks — you need to check the final message_delta event.

Which Caching Strategy Should You Implement First?

Solo founder or small team with a single product: Start with Claude’s native prompt prefix caching — it’s two lines of code and immediately cuts costs on any request sharing a large system prompt. You’ll see results this week.

Teams running support or Q&A bots: Add exact-match Redis memoization on top. The hit rate is higher than you’d expect, and the implementation is dead simple. Semantic caching is worth adding once you’ve measured exact-match hit rates and found the ceiling.

Teams running high-volume RAG pipelines: All three strategies apply. Restructure your prompts to cache static instructions, memoize exact queries, and layer in vector caching with Qdrant. This is where you’ll hit 40–50% cost reduction and it’s worth the engineering investment.

Managing LLM prompt caching costs isn’t a one-time optimization — it’s an ongoing practice. Instrument your cache hit rates from day one, set alerts when they drop unexpectedly, and revisit your TTL strategy as your prompt templates evolve. The teams spending the least on API calls aren’t using cheaper models; they’re using smarter caching.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
