If you’re running LLMs in production and you don’t have cost tracking in place, you’re flying blind. I’ve seen founders get hit with $800 API bills from a single runaway agent loop that nobody noticed for three days. Proper LLM cost tracking isn’t a nice-to-have — it’s the difference between a sustainable product and a financial surprise that kills your runway. This article walks you through building a real instrumentation layer: one that captures token usage per model, aggregates spend across endpoints, fires alerts before costs spiral, and gives you enough data to actually optimize.
Why Off-the-Shelf Monitoring Isn’t Enough
OpenAI’s dashboard shows you monthly spend. Anthropic’s console gives you token totals. That’s roughly where the built-in tooling ends. Neither tells you which feature is expensive, which user is hammering the API, or whether your summarization prompt is somehow using 3x more tokens than it did last week.
The gap matters a lot once you’re past the prototype stage. You need per-request cost attribution — tagged by user, workflow, feature, or agent — and you need it in a system you control. What I’m going to show you is a lightweight wrapper that works with Claude, GPT-4o, and any other OpenAI-compatible endpoint, stores cost data in SQLite (swappable for Postgres), and exposes simple dashboard queries.
The Cost Model: What You’re Actually Paying For
Every major provider bills on tokens, but the rates vary enough to matter. Here are the numbers as of mid-2025 — verify them before building your pricing table; they move:
- Claude 3.5 Haiku: ~$0.80 / 1M input tokens, ~$4.00 / 1M output tokens
- Claude 3.5 Sonnet: ~$3.00 / 1M input, ~$15.00 / 1M output
- GPT-4o: ~$2.50 / 1M input, ~$10.00 / 1M output
- GPT-4o mini: ~$0.15 / 1M input, ~$0.60 / 1M output
- Gemini 1.5 Flash: ~$0.075 / 1M input, ~$0.30 / 1M output
The input/output split is where people get burned. Output tokens are consistently 3–5x more expensive than input tokens. If your prompt engineering is generating verbose responses when you don’t need them, you’re paying a real premium. That asymmetry is also why few-shot examples in your system prompt cost less than you think — they’re input tokens at the cheaper rate.
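To make the premium concrete, here is the arithmetic for a single hypothetical call at the Claude 3.5 Sonnet rates listed above (2,000 input tokens, 500 output tokens):

```python
# One call at Claude 3.5 Sonnet rates: $3 / 1M input, $15 / 1M output.
input_cost = (2_000 / 1_000_000) * 3.00    # $0.0060
output_cost = (500 / 1_000_000) * 15.00    # $0.0075
total = input_cost + output_cost           # $0.0135
# The 500 output tokens cost more than the 2,000 input tokens did.
```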
Building the Tracker: Core Architecture
The design is simple: a wrapper class that intercepts API calls, calculates cost from the token counts in the response, and writes a record to a local database. You tag each call with metadata (user ID, feature name, workflow) at call time. Everything else — dashboards, alerts, rollups — reads from that table.
The Cost Calculator Module
```python
import sqlite3
import time
from dataclasses import dataclass
from typing import Optional

# Pricing in USD per 1M tokens — update this dict when rates change
PRICING = {
    "claude-3-5-haiku-20241022": {"input": 0.80, "output": 4.00},
    "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gemini-1.5-flash": {"input": 0.075, "output": 0.30},
}

@dataclass
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    feature: str
    user_id: Optional[str]
    latency_ms: int
    timestamp: float

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Returns cost in USD. Falls back to gpt-4o pricing if model unknown."""
    rates = PRICING.get(model, PRICING["gpt-4o"])
    input_cost = (input_tokens / 1_000_000) * rates["input"]
    output_cost = (output_tokens / 1_000_000) * rates["output"]
    return round(input_cost + output_cost, 8)
```
The fallback to GPT-4o pricing for unknown models is intentional — it’s conservative (more expensive), so your estimates skew high rather than low. You’d rather over-estimate than be surprised.
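You can sanity-check that fallback in isolation; this sketch re-declares a two-entry pricing table so it runs on its own:

```python
# Self-contained check of the unknown-model fallback.
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Unknown models are priced at gpt-4o rates, so estimates skew high
    rates = PRICING.get(model, PRICING["gpt-4o"])
    return round((input_tokens / 1_000_000) * rates["input"]
                 + (output_tokens / 1_000_000) * rates["output"], 8)

known = calculate_cost("gpt-4o-mini", 10_000, 2_000)        # 0.0027
unknown = calculate_cost("brand-new-model", 10_000, 2_000)  # 0.045, priced as gpt-4o
```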
SQLite Schema and Writer
```python
DB_PATH = "llm_costs.db"

def init_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS llm_calls (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp REAL NOT NULL,
            model TEXT NOT NULL,
            feature TEXT NOT NULL,
            user_id TEXT,
            input_tokens INTEGER NOT NULL,
            output_tokens INTEGER NOT NULL,
            cost_usd REAL NOT NULL,
            latency_ms INTEGER NOT NULL
        )
    """)
    # Index for the queries you'll actually run
    conn.execute("CREATE INDEX IF NOT EXISTS idx_timestamp ON llm_calls(timestamp)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_feature ON llm_calls(feature)")
    conn.commit()
    conn.close()

def write_record(record: CostRecord):
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        INSERT INTO llm_calls
            (timestamp, model, feature, user_id, input_tokens, output_tokens, cost_usd, latency_ms)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        record.timestamp, record.model, record.feature, record.user_id,
        record.input_tokens, record.output_tokens, record.cost_usd, record.latency_ms,
    ))
    conn.commit()
    conn.close()
```
The Instrumented Wrapper
```python
import anthropic
import openai

anthropic_client = anthropic.Anthropic()
openai_client = openai.OpenAI()

def call_claude(
    messages: list,
    model: str = "claude-3-5-haiku-20241022",
    feature: str = "default",
    user_id: Optional[str] = None,
    **kwargs
) -> str:
    start = time.time()
    response = anthropic_client.messages.create(
        model=model,
        messages=messages,
        max_tokens=kwargs.get("max_tokens", 1024),
    )
    latency_ms = int((time.time() - start) * 1000)
    input_tokens = response.usage.input_tokens
    output_tokens = response.usage.output_tokens
    cost = calculate_cost(model, input_tokens, output_tokens)
    write_record(CostRecord(
        model=model, input_tokens=input_tokens, output_tokens=output_tokens,
        cost_usd=cost, feature=feature, user_id=user_id,
        latency_ms=latency_ms, timestamp=time.time(),
    ))
    return response.content[0].text

def call_openai(
    messages: list,
    model: str = "gpt-4o-mini",
    feature: str = "default",
    user_id: Optional[str] = None,
    **kwargs
) -> str:
    start = time.time()
    response = openai_client.chat.completions.create(
        model=model, messages=messages,
        max_tokens=kwargs.get("max_tokens", 1024),
    )
    latency_ms = int((time.time() - start) * 1000)
    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens
    cost = calculate_cost(model, input_tokens, output_tokens)
    write_record(CostRecord(
        model=model, input_tokens=input_tokens, output_tokens=output_tokens,
        cost_usd=cost, feature=feature, user_id=user_id,
        latency_ms=latency_ms, timestamp=time.time(),
    ))
    return response.choices[0].message.content
```
Usage is just a drop-in replacement for your existing API calls. Pass `feature="email_summarizer"` and `user_id=current_user.id` and every call is attributed with no extra work downstream.
Querying the Data: Finding Where Your Money Goes
Raw inserts are worthless without queries you’ll actually run. Here are the three I check weekly on every production system:
```python
def spend_by_feature(days: int = 7) -> list:
    """Which features are costing the most?"""
    since = time.time() - (days * 86400)
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute("""
        SELECT feature,
               COUNT(*) AS calls,
               SUM(cost_usd) AS total_cost,
               AVG(cost_usd) AS avg_cost,
               SUM(output_tokens) AS total_output_tokens
        FROM llm_calls
        WHERE timestamp > ?
        GROUP BY feature
        ORDER BY total_cost DESC
    """, (since,)).fetchall()
    conn.close()
    return rows

def spend_by_model(days: int = 30) -> list:
    """Are you using expensive models where cheap ones would do?"""
    since = time.time() - (days * 86400)
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute("""
        SELECT model,
               COUNT(*) AS calls,
               SUM(cost_usd) AS total_cost,
               AVG(latency_ms) AS avg_latency
        FROM llm_calls
        WHERE timestamp > ?
        GROUP BY model
        ORDER BY total_cost DESC
    """, (since,)).fetchall()
    conn.close()
    return rows

def top_spending_users(days: int = 7, limit: int = 10) -> list:
    """Who's burning budget? Could be a bad actor or a power user worth serving."""
    since = time.time() - (days * 86400)
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute("""
        SELECT user_id, SUM(cost_usd) AS total_cost, COUNT(*) AS calls
        FROM llm_calls
        WHERE timestamp > ? AND user_id IS NOT NULL
        GROUP BY user_id
        ORDER BY total_cost DESC
        LIMIT ?
    """, (since, limit)).fetchall()
    conn.close()
    return rows
```
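Before trusting these rollups in production, it’s easy to check the GROUP BY logic against an in-memory database; the rows below are made-up numbers:

```python
import sqlite3
import time

# Same schema as the tracker table, built in memory for a quick check.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE llm_calls (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp REAL NOT NULL, model TEXT NOT NULL, feature TEXT NOT NULL,
        user_id TEXT, input_tokens INTEGER NOT NULL,
        output_tokens INTEGER NOT NULL, cost_usd REAL NOT NULL,
        latency_ms INTEGER NOT NULL
    )
""")
now = time.time()
fake_rows = [  # (timestamp, model, feature, user_id, in, out, cost, latency)
    (now, "gpt-4o-mini", "email_summarizer", "u1", 1200, 300, 0.00036, 420),
    (now, "gpt-4o-mini", "email_summarizer", "u2", 900, 250, 0.000285, 390),
    (now, "gpt-4o", "report_generator", "u1", 4000, 1500, 0.025, 2100),
]
conn.executemany("""
    INSERT INTO llm_calls
        (timestamp, model, feature, user_id, input_tokens, output_tokens, cost_usd, latency_ms)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", fake_rows)
top = conn.execute("""
    SELECT feature, COUNT(*) AS calls, SUM(cost_usd) AS total_cost
    FROM llm_calls GROUP BY feature ORDER BY total_cost DESC
""").fetchall()
# report_generator tops the list despite having fewer calls
```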
Budget Alerts Before the Bill Arrives
Querying is reactive. You also want a proactive alert that fires when daily spend crosses a threshold. This runs as a cron job or as a background check in your application:
```python
import smtplib
from email.mime.text import MIMEText

DAILY_BUDGET_USD = 20.0  # set this to something that would actually hurt

def check_daily_budget_alert():
    since = time.time() - 86400  # last 24 hours
    conn = sqlite3.connect(DB_PATH)
    result = conn.execute(
        "SELECT SUM(cost_usd) FROM llm_calls WHERE timestamp > ?", (since,)
    ).fetchone()
    conn.close()
    daily_spend = result[0] or 0.0
    if daily_spend > DAILY_BUDGET_USD:
        send_alert(
            subject=f"LLM spend alert: ${daily_spend:.2f} in last 24h",
            body=f"Daily budget is ${DAILY_BUDGET_USD}. Current spend: ${daily_spend:.2f}.\n\n"
                 f"Top features by cost:\n{spend_by_feature(days=1)}"
        )

def send_alert(subject: str, body: str):
    # Swap this for Slack webhook, PagerDuty, etc.
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = "alerts@yourdomain.com"
    msg["To"] = "you@yourdomain.com"
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)
```
In practice I’d replace the email with a Slack webhook — it’s two lines and you’ll actually see it. The logic is the same either way.
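For reference, a Slack version of send_alert looks roughly like this; the webhook URL is a placeholder you would generate in Slack’s app settings:

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def build_slack_payload(subject: str, body: str) -> bytes:
    # Incoming webhooks accept a JSON payload with a "text" field
    return json.dumps({"text": f"*{subject}*\n{body}"}).encode("utf-8")

def send_slack_alert(subject: str, body: str) -> None:
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=build_slack_payload(subject, body),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # wrap in try/except in production
```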
Common Optimization Wins the Data Will Surface
Once you have a week of data, three patterns almost always show up:
Model overkill: Someone wired GPT-4o to a feature that just extracts structured data from short text. That’s a GPT-4o-mini or Claude Haiku job. Switching one feature like this typically cuts 60–80% of its cost with no quality difference. Your spend_by_feature query will point you directly at the candidate.
Verbose output tokens: If average output tokens on a feature are high, check whether your prompt specifies a format and length. Adding “respond in under 100 words” or “return valid JSON only, no explanation” to a system prompt costs you nothing and can halve your output token count.
Redundant calls: Agents often call the same LLM twice for things that could be one call. If you see high call counts on a feature relative to user actions, that’s a red flag. Add a simple cache layer for deterministic lookups — even a TTL cache for identical prompts cuts costs meaningfully in high-volume flows.
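As a sketch of that last point, a TTL cache for identical prompts is only a few lines; the five-minute window and the function name here are arbitrary choices, not part of the tracker above:

```python
import hashlib
import time
from typing import Callable

_cache: dict[str, tuple[float, str]] = {}

def cached_llm_call(prompt: str, llm_fn: Callable[[str], str],
                    ttl_seconds: float = 300) -> str:
    # Key on a hash of the prompt so identical prompts share one entry
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    hit = _cache.get(key)
    if hit is not None and time.time() - hit[0] < ttl_seconds:
        return hit[1]  # cache hit: no API call, no tokens, no cost
    result = llm_fn(prompt)
    _cache[key] = (time.time(), result)
    return result
```

Only use this for deterministic lookups (temperature 0, same prompt means same acceptable answer); caching creative generations frustrates users.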
Scaling Up: When SQLite Isn’t Enough
SQLite handles roughly 10K writes/day without any performance issues on a single server. Beyond that, or if you have multiple workers, swap the connection for a Postgres pool — the schema and queries are identical. For high-throughput systems, batch inserts into a queue (Redis list, or even a local buffer flushed every 5 seconds) and write asynchronously. Don’t let cost tracking add latency to user-facing calls.
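A buffered writer along those lines might look like this sketch (the class name is my own); rows accumulate in memory and land in llm_calls in one transaction:

```python
import sqlite3
import threading

class BufferedCostWriter:
    """Accumulates cost rows in memory and writes them in one transaction."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self._buffer: list[tuple] = []
        self._lock = threading.Lock()

    def add(self, row: tuple) -> None:
        # row matches the llm_calls column order used above
        with self._lock:
            self._buffer.append(row)

    def flush(self) -> int:
        with self._lock:
            rows, self._buffer = self._buffer, []
        if not rows:
            return 0
        conn = sqlite3.connect(self.db_path)
        conn.executemany("""
            INSERT INTO llm_calls
                (timestamp, model, feature, user_id, input_tokens,
                 output_tokens, cost_usd, latency_ms)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
        """, rows)
        conn.commit()
        conn.close()
        return len(rows)
```

Call flush() on a timer from a daemon thread, and once at shutdown, so records aren’t lost.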
If you’re on n8n or Make for your automation workflows, you can POST cost data to this SQLite backend via a webhook node after each LLM call. The same schema works — just add a source column to differentiate API calls from workflow runs.
Who Should Build What
Solo founders building their first LLM feature: Start with the SQLite version above. It takes 30 minutes to wire in, costs nothing to run, and will immediately show you where money is going. Don’t over-engineer it until you have data that justifies something more complex.
Small teams with multiple services: Add a service tag alongside feature and push to a shared Postgres instance. Wire the daily budget alert into Slack. This is still three hours of work and buys you a lot of visibility.
Scaling products with high API volume: At this point you want time-series storage (InfluxDB or TimescaleDB), a proper dashboard in Grafana or Metabase, and per-user spend limits enforced at the application layer before calls are made. The tracker above is the data layer — the rest is infrastructure you probably already have.
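The per-user limit is the same SUM query scoped to one user. A minimal sketch, assuming the llm_calls schema above (the $5 daily cap is an arbitrary example):

```python
import sqlite3
import time

def user_over_daily_limit(db_path: str, user_id: str,
                          limit_usd: float = 5.00) -> bool:
    # Refuse new calls once a user's trailing-24h spend hits the cap
    since = time.time() - 86400
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT SUM(cost_usd) FROM llm_calls WHERE user_id = ? AND timestamp > ?",
        (user_id, since),
    ).fetchone()
    conn.close()
    return (row[0] or 0.0) >= limit_usd
```

Run the check in the wrapper before the API call and return a quota message instead of spending tokens.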
The core principle doesn’t change at any scale: tag every call, store the token counts and cost, and query it regularly. Good LLM cost tracking isn’t about fancy tooling — it’s about having attribution data you trust, close enough to real-time that you can act on it before the bill does.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
