If you’ve shipped an LLM-powered feature to production and watched your API bill climb in ways you didn’t anticipate, you already understand why LLM API cost management needs to be a first-class concern — not an afterthought you bolt on after things go sideways. A single runaway agent loop, a prompt template that’s 800 tokens fatter than it needs to be, or a model tier mismatch can multiply your costs by 5–10x overnight. I’ve seen it happen, and the fix is almost always architectural: cost awareness baked in from the start, not patched in after.
This article covers how to actually do that — with working code, real numbers, and the tradeoffs that matter when you’re running LLM workloads at any meaningful scale.
Understanding Where Your Token Budget Actually Goes
Before you can optimize anything, you need visibility. Most teams discover they have a cost problem through their invoice, which is the worst possible time. The breakdown is usually surprising the first time you see it clearly.
For a typical agent workflow with Claude Sonnet or GPT-4o, costs break down roughly like this:
- System prompt: repeated on every call, often 300–800 tokens you’re paying for constantly
- Conversation history: grows linearly with turns; a 10-turn conversation can carry 3,000+ tokens of context you’ve already paid for
- Tool definitions: if you’re using function calling with 8–10 tools, you’re adding 500–1,000 tokens per call just for the schema
- Output tokens: typically 20–30% of total cost but 3–5x more expensive per token on most providers
Output tokens cost more than input tokens almost everywhere. On Claude 3.5 Sonnet, input is $3/M tokens and output is $15/M. On GPT-4o, it’s $2.50 input vs $10 output. If your agent is generating verbose responses you’re then truncating client-side, you’re paying for tokens you’re throwing away.
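To make the asymmetry concrete, here's a back-of-envelope helper using the Claude 3.5 Sonnet prices quoted above; the function name and defaults are illustrative, and current pricing should be verified against the vendor's site:

```python
# Rough per-call cost at the Claude 3.5 Sonnet prices above
# ($3/M input, $15/M output). Illustrative only -- verify current pricing.
def call_cost(input_tokens: int, output_tokens: int,
              input_per_m: float = 3.00, output_per_m: float = 15.00) -> float:
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# A 2,000-token prompt with a 1,000-token response costs $0.021 total:
# the output is half as many tokens but over 70% of the bill.
```

Run the numbers on your own typical prompt/response shape before deciding where to optimize; if output dominates, tighter `max_tokens` and terser output instructions beat prompt trimming.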
Instrumenting Your LLM Calls Before You Deploy
The single most useful thing you can do before going to production is wrap your LLM client with a cost-tracking layer. Here’s a minimal version that works for both Anthropic and OpenAI:
```python
import anthropic
import json
from datetime import datetime
from dataclasses import dataclass, field
from typing import Optional

# Pricing per million tokens (verify current prices before using)
PRICING = {
    "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
    "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

@dataclass
class CallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    workflow: str
    timestamp: str = field(default_factory=lambda: datetime.utcnow().isoformat())

class TrackedAnthropicClient:
    def __init__(self, budget_usd: Optional[float] = None, workflow: str = "default"):
        self.client = anthropic.Anthropic()
        self.budget_usd = budget_usd
        self.workflow = workflow
        self.total_spent = 0.0
        self.call_log: list[CallRecord] = []

    def _calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        pricing = PRICING.get(model, {"input": 3.00, "output": 15.00})
        return (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000

    def create(self, **kwargs) -> anthropic.types.Message:
        if self.budget_usd and self.total_spent >= self.budget_usd:
            raise RuntimeError(
                f"Budget cap reached: ${self.total_spent:.4f} spent of ${self.budget_usd:.2f} limit"
            )
        response = self.client.messages.create(**kwargs)
        model = kwargs.get("model", "unknown")
        input_tokens = response.usage.input_tokens
        output_tokens = response.usage.output_tokens
        cost = self._calculate_cost(model, input_tokens, output_tokens)
        self.total_spent += cost
        self.call_log.append(CallRecord(
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=cost,
            workflow=self.workflow
        ))
        # Warn at 80% of budget
        if self.budget_usd and self.total_spent >= self.budget_usd * 0.8:
            print(f"[COST WARN] ${self.total_spent:.4f} / ${self.budget_usd:.2f} budget used")
        return response

    def summary(self) -> dict:
        return {
            "total_calls": len(self.call_log),
            "total_input_tokens": sum(r.input_tokens for r in self.call_log),
            "total_output_tokens": sum(r.output_tokens for r in self.call_log),
            "total_cost_usd": round(self.total_spent, 6),
            "calls": [vars(r) for r in self.call_log]
        }
```
Usage is straightforward — swap out your bare client for the tracked one:
```python
# Set a $0.50 budget for this workflow run
client = TrackedAnthropicClient(budget_usd=0.50, workflow="email_triage")

response = client.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarise this email thread..."}]
)

print(json.dumps(client.summary(), indent=2))
```
This costs nothing to run and gives you per-call cost attribution that you can push to your observability stack, a database, or just log to stdout while you’re figuring out your baseline.
Model Selection as a Cost Lever
The biggest cost lever you have isn’t prompt compression or caching — it’s model selection. The difference between using Claude Opus and Claude Haiku on the same task is roughly 60x in cost. Most tasks don’t need the most capable model.
A practical tiering approach
I’d structure model selection like this for a typical agent system:
- Haiku / GPT-4o-mini / Gemini Flash: classification, routing, summarization of short content, structured data extraction from well-formatted input. Around $0.0002–0.0005 per typical call.
- Sonnet / GPT-4o: default for most reasoning tasks, code generation, complex summarization, anything where Haiku noticeably fails. Around $0.003–0.010 per typical call.
- Opus / GPT-4-turbo: only when you need maximum reasoning depth, complex multi-step problems, or you’re pre-generating cached outputs. Around $0.015–0.030+ per typical call.
The failure mode to avoid is defaulting to the most capable model “just to be safe.” That impulse is expensive. Run your task through Haiku first; if accuracy is acceptable, ship it. If not, step up. A routing classifier that costs $0.0002 per call to direct tasks to the right model tier pays for itself in a few hundred calls.
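Here's a sketch of that routing pattern, assuming an Anthropic-style client; the tier mapping, labels, and classification prompt are all illustrative choices, not a fixed recipe:

```python
# Two-stage router: a cheap classification call picks the tier, then the real
# request runs on the selected model. Tier names and labels are illustrative.
TIERS = {
    "simple": "claude-3-haiku-20240307",       # classification, extraction, routing
    "standard": "claude-3-5-sonnet-20241022",  # reasoning, code generation
}

def pick_model(label: str) -> str:
    """Map a classifier label to a model ID; unknown labels fall back to cheap."""
    return TIERS.get(label.strip().lower(), TIERS["simple"])

def route(client, user_prompt: str) -> str:
    # Stage 1: ~$0.0002 classification call on the cheapest tier
    label = client.messages.create(
        model=TIERS["simple"],
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": "Classify this task as 'simple' or 'standard'. "
                       "Reply with one word only.\n\n" + user_prompt,
        }],
    ).content[0].text
    return pick_model(label)
```

Defaulting unknown labels to the cheap tier is deliberate: a misroute to Haiku costs you a retry, while a misroute to Opus costs you money on every call.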
Prompt Compression and Caching Strategies
Once you’ve got the right model, focus on token reduction. The two highest-ROI techniques are prompt caching and context trimming.
Prompt caching (Anthropic-specific)
Anthropic’s prompt caching lets you mark static parts of your prompt with a cache_control parameter. Cached input tokens are billed at 10% of normal input cost on re-use, with a one-time 25% surcharge to write the cache entry. If your system prompt is 2,000 tokens and gets called 1,000 times, uncached that’s $6.00 in input costs; cached it drops to ~$0.65.
```python
response = client.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert code reviewer...[your 2000-token system prompt here]",
            "cache_control": {"type": "ephemeral"}  # Cache this block
        }
    ],
    messages=[{"role": "user", "content": user_code}]
)
```
Cache lifetime is 5 minutes (ephemeral), refreshed each time the cached prefix is read, which works well for burst workloads. Caches are per-model and scoped to your organization; they are never shared between organizations. Check `response.usage.cache_read_input_tokens` to confirm hits — if you’re seeing zeros, your cache isn’t being hit and you need to verify the cache block is byte-identical across calls.
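A small helper for that sanity check, written defensively since the usage field names follow the Anthropic Python SDK at time of writing and may differ across SDK versions; the function name is my own:

```python
# Summarize cache activity from a response's usage object. Field names match
# the Anthropic Python SDK at time of writing; getattr guards their absence.
def report_cache_usage(usage) -> str:
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read == 0 and written == 0:
        return "no cache activity"
    return f"cache_read={read} cache_write={written}"
```

If this keeps reporting no cache activity across repeated calls, diff the cached block between two requests; a single changed byte invalidates the prefix.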
Context window management for agents
Multi-turn agent conversations get expensive fast. A conversation that runs 20 turns with 500 tokens per turn accumulates 10,000 tokens of history you’re resending every single call. Three approaches that work in production:
- Summarization checkpoints: every N turns, ask the model to summarize the conversation so far, then replace the full history with the summary. Loses some fidelity, but usually acceptable.
- Sliding window: keep only the last K turns in context. Simple, predictable cost, but the model loses early context.
- Selective retention: tag important messages (decisions, outputs, tool results) and only include those in context, not every conversational turn.
For most agents I’d start with a sliding window of 6–8 turns plus a summarized header for earlier context. It’s not perfect but it’s controllable.
Setting Up Budget Alerts and Hard Limits
Anthropic, OpenAI, and most providers let you set spend limits and email alerts in their dashboard. Set them. But dashboard limits are coarse — they operate at the account level, not the workflow or customer level. If you’re building a multi-tenant product, you need per-customer budget enforcement in your code, not just at the API account level.
The tracked client above handles per-run budgets. For per-customer or per-day limits, push your call records to a database and check aggregates before each call:
```python
def check_customer_budget(customer_id: str, db, daily_limit_usd: float = 1.00) -> bool:
    """Returns True if customer is within daily budget."""
    today = datetime.utcnow().date().isoformat()
    spent = db.query(
        "SELECT COALESCE(SUM(cost_usd), 0) FROM llm_calls "
        "WHERE customer_id = ? AND DATE(timestamp) = ?",
        (customer_id, today)
    ).fetchone()[0]
    return spent < daily_limit_usd
```
This query is fast with a composite index on (customer_id, timestamp). At scale, move to Redis counters with TTL — a single INCRBYFLOAT per call, key expires at midnight.
What Actually Moves the Needle in Production
After running these systems in production, here’s the honest priority order for cost reduction efforts:
- Model tiering — biggest impact, often 5–20x reduction on the right tasks
- Prompt caching — 60–90% reduction on repeated system prompts, near-zero effort to implement
- Output length control — set an explicit `max_tokens` appropriate to the task; default outputs are often verbose
- Async batching — if latency doesn’t matter, batch requests; some providers offer batch pricing (Anthropic’s batch API is 50% cheaper)
- Context trimming — high ROI for long-running agents, more engineering work
- Response caching — cache identical or near-identical prompts at the application layer with Redis; works well for FAQ-type queries
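For the response-caching item above, a minimal sketch of the application-layer pattern, keyed on a hash of the exact request; the dict store stands in for Redis and all names are illustrative:

```python
# Cache identical requests at the application layer. The dict stands in for
# Redis (swap in a client with a TTL); key scheme and names are illustrative.
import hashlib
import json

def cache_key(model: str, messages: list[dict]) -> str:
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return "llm_cache:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_call(store: dict, model: str, messages: list[dict], call_fn):
    key = cache_key(model, messages)
    if key in store:
        return store[key]              # cache hit: zero API spend
    result = call_fn(model, messages)  # cache miss: pay once, store the answer
    store[key] = result
    return result
```

Exact-match caching only pays off when identical prompts recur (FAQ bots, fixed extraction templates); for paraphrased queries you'd need semantic caching, which is considerably more work.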
What I’ve seen teams waste time on: over-engineering prompt compression with embedding-based retrieval when a sliding window would do, and premature investment in custom inference infrastructure before validating the product works.
Who Should Prioritize What
Solo founders and early-stage products: Start with the tracked client wrapper above, set hard budget caps, and default to Haiku or GPT-4o-mini. You can always step up model tier when you hit accuracy limits. Don’t spend engineering cycles on sophisticated caching infrastructure until you have predictable traffic.
Teams shipping multi-tenant products: Per-customer budget enforcement is non-negotiable. The Redis counter approach is what I’d reach for. Add prompt caching immediately for any system prompt over 500 tokens — the ROI is immediate and the implementation is one parameter.
High-volume automation builders (n8n, Make, etc.): Most workflow automation tasks — classification, extraction, routing — are Haiku-tier work. Audit your workflows and find every place you’re running a capable model on a simple task. Each swap pays for itself in days at volume.
Enterprise teams: Invest in centralized cost observability before you scale. You need per-workflow, per-team, and per-customer attribution in a single dashboard. Build or buy that instrumentation layer early. The tooling overhead is worth it — LLM API cost management at enterprise scale is fundamentally an observability problem.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

