If you’re running LLMs in production and you don’t have cost tracking in place, you’re flying blind. I’ve seen founders get hit with $800 API bills from a single runaway agent loop that nobody noticed for three days. Proper LLM cost tracking isn’t a nice-to-have — it’s the difference between a sustainable product and a financial surprise that kills your runway. This article walks you through building a real instrumentation layer: one that captures token usage per model, aggregates spend across endpoints, fires alerts before costs spiral, and gives you enough data to actually optimize.
Why Off-the-Shelf Monitoring Isn’t Enough
OpenAI’s dashboard shows you monthly spend. Anthropic’s console gives you token totals. That’s roughly where the built-in tooling ends. Neither tells you which feature is expensive, which user is hammering the API, or whether your summarization prompt is somehow using 3x more tokens than it did last week.
The gap matters a lot once you’re past the prototype stage. You need per-request cost attribution — tagged by user, workflow, feature, or agent — and you need it in a system you control. What I’m going to show you is a lightweight wrapper that works with Claude, GPT-4o, and any other OpenAI-compatible endpoint, stores cost data in SQLite (swappable for Postgres), and exposes simple dashboard queries.
The Cost Model: What You’re Actually Paying For
Every major provider bills on tokens, but the rates vary enough to matter. Here are the numbers as of mid-2025 — verify them before building your pricing table; they move:
- Claude 3.5 Haiku: ~$0.80 / 1M input tokens, ~$4.00 / 1M output tokens
- Claude 3.5 Sonnet: ~$3.00 / 1M input, ~$15.00 / 1M output
- GPT-4o: ~$2.50 / 1M input, ~$10.00 / 1M output
- GPT-4o mini: ~$0.15 / 1M input, ~$0.60 / 1M output
- Gemini 1.5 Flash: ~$0.075 / 1M input, ~$0.30 / 1M output
The input/output split is where people get burned. Output tokens are consistently 3–5x more expensive than input tokens. If your prompt engineering is generating verbose responses when you don’t need them, you’re paying a real premium. That asymmetry is also why few-shot examples in your system prompt cost less than you think — they’re input tokens at the cheaper rate.
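To make the premium concrete, here is the arithmetic for a single hypothetical call at the Claude 3.5 Sonnet rates listed above (2,000 input tokens, 500 output tokens):

```python
# One call at Claude 3.5 Sonnet rates: $3 / 1M input, $15 / 1M output.
input_cost = (2_000 / 1_000_000) * 3.00    # $0.0060
output_cost = (500 / 1_000_000) * 15.00    # $0.0075
total = input_cost + output_cost           # $0.0135
# The 500 output tokens cost more than the 2,000 input tokens did.
```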
Building the Tracker: Core Architecture
The design is simple: a wrapper class that intercepts API calls, calculates cost from the token counts in the response, and writes a record to a local database. You tag each call with metadata (user ID, feature name, workflow) at call time. Everything else — dashboards, alerts, rollups — reads from that table.
The Cost Calculator Module
```python
import sqlite3
import time
from dataclasses import dataclass
from typing import Optional

# Pricing in USD per 1M tokens — update this dict when rates change
PRICING = {
    "claude-3-5-haiku-20241022": {"input": 0.80, "output": 4.00},
    "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gemini-1.5-flash": {"input": 0.075, "output": 0.30},
}

@dataclass
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    feature: str
    user_id: Optional[str]
    latency_ms: int
    timestamp: float

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Returns cost in USD. Falls back to gpt-4o pricing if model unknown."""
    rates = PRICING.get(model, PRICING["gpt-4o"])
    input_cost = (input_tokens / 1_000_000) * rates["input"]
    output_cost = (output_tokens / 1_000_000) * rates["output"]
    return round(input_cost + output_cost, 8)
```
The fallback to GPT-4o pricing for unknown models is intentional — it’s conservative (more expensive), so your estimates skew high rather than low. You’d rather over-estimate than be surprised.
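You can sanity-check that fallback in isolation; this sketch re-declares a two-entry pricing table so it runs on its own:

```python
# Self-contained check of the unknown-model fallback.
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Unknown models are priced at gpt-4o rates, so estimates skew high
    rates = PRICING.get(model, PRICING["gpt-4o"])
    return round((input_tokens / 1_000_000) * rates["input"]
                 + (output_tokens / 1_000_000) * rates["output"], 8)

known = calculate_cost("gpt-4o-mini", 10_000, 2_000)        # 0.0027
unknown = calculate_cost("brand-new-model", 10_000, 2_000)  # 0.045, priced as gpt-4o
```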
SQLite Schema and Writer
```python
DB_PATH = "llm_costs.db"

def init_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS llm_calls (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp REAL NOT NULL,
            model TEXT NOT NULL,
            feature TEXT NOT NULL,
            user_id TEXT,
            input_tokens INTEGER NOT NULL,
            output_tokens INTEGER NOT NULL,
            cost_usd REAL NOT NULL,
            latency_ms INTEGER NOT NULL
        )
    """)
    # Index for the queries you'll actually run
    conn.execute("CREATE INDEX IF NOT EXISTS idx_timestamp ON llm_calls(timestamp)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_feature ON llm_calls(feature)")
    conn.commit()
    conn.close()

def write_record(record: CostRecord):
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        INSERT INTO llm_calls
            (timestamp, model, feature, user_id, input_tokens, output_tokens, cost_usd, latency_ms)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        record.timestamp, record.model, record.feature, record.user_id,
        record.input_tokens, record.output_tokens, record.cost_usd, record.latency_ms,
    ))
    conn.commit()
    conn.close()
```
The Instrumented Wrapper
```python
import anthropic
import openai

anthropic_client = anthropic.Anthropic()
openai_client = openai.OpenAI()

def call_claude(
    messages: list,
    model: str = "claude-3-5-haiku-20241022",
    feature: str = "default",
    user_id: Optional[str] = None,
    **kwargs
) -> str:
    start = time.time()
    response = anthropic_client.messages.create(
        model=model,
        messages=messages,
        max_tokens=kwargs.get("max_tokens", 1024),
    )
    latency_ms = int((time.time() - start) * 1000)
    input_tokens = response.usage.input_tokens
    output_tokens = response.usage.output_tokens
    cost = calculate_cost(model, input_tokens, output_tokens)
    write_record(CostRecord(
        model=model, input_tokens=input_tokens, output_tokens=output_tokens,
        cost_usd=cost, feature=feature, user_id=user_id,
        latency_ms=latency_ms, timestamp=time.time(),
    ))
    return response.content[0].text

def call_openai(
    messages: list,
    model: str = "gpt-4o-mini",
    feature: str = "default",
    user_id: Optional[str] = None,
    **kwargs
) -> str:
    start = time.time()
    response = openai_client.chat.completions.create(
        model=model, messages=messages,
        max_tokens=kwargs.get("max_tokens", 1024),
    )
    latency_ms = int((time.time() - start) * 1000)
    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens
    cost = calculate_cost(model, input_tokens, output_tokens)
    write_record(CostRecord(
        model=model, input_tokens=input_tokens, output_tokens=output_tokens,
        cost_usd=cost, feature=feature, user_id=user_id,
        latency_ms=latency_ms, timestamp=time.time(),
    ))
    return response.choices[0].message.content
```
Usage is just a drop-in replacement for your existing API calls. Pass `feature="email_summarizer"` and `user_id=current_user.id` and every call is attributed with no extra work downstream.
Querying the Data: Finding Where Your Money Goes
Raw inserts are worthless without queries you’ll actually run. Here are the three I check weekly on every production system:
```python
def spend_by_feature(days: int = 7) -> list:
    """Which features are costing the most?"""
    since = time.time() - (days * 86400)
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute("""
        SELECT feature,
               COUNT(*) AS calls,
               SUM(cost_usd) AS total_cost,
               AVG(cost_usd) AS avg_cost,
               SUM(output_tokens) AS total_output_tokens
        FROM llm_calls
        WHERE timestamp > ?
        GROUP BY feature
        ORDER BY total_cost DESC
    """, (since,)).fetchall()
    conn.close()
    return rows

def spend_by_model(days: int = 30) -> list:
    """Are you using expensive models where cheap ones would do?"""
    since = time.time() - (days * 86400)
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute("""
        SELECT model,
               COUNT(*) AS calls,
               SUM(cost_usd) AS total_cost,
               AVG(latency_ms) AS avg_latency
        FROM llm_calls
        WHERE timestamp > ?
        GROUP BY model
        ORDER BY total_cost DESC
    """, (since,)).fetchall()
    conn.close()
    return rows

def top_spending_users(days: int = 7, limit: int = 10) -> list:
    """Who's burning budget? Could be a bad actor or a power user worth serving."""
    since = time.time() - (days * 86400)
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute("""
        SELECT user_id, SUM(cost_usd) AS total_cost, COUNT(*) AS calls
        FROM llm_calls
        WHERE timestamp > ? AND user_id IS NOT NULL
        GROUP BY user_id
        ORDER BY total_cost DESC
        LIMIT ?
    """, (since, limit)).fetchall()
    conn.close()
    return rows
```
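Before trusting these rollups in production, it’s easy to check the GROUP BY logic against an in-memory database; the rows below are made-up numbers:

```python
import sqlite3
import time

# Same schema as the tracker table, built in memory for a quick check.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE llm_calls (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp REAL NOT NULL, model TEXT NOT NULL, feature TEXT NOT NULL,
        user_id TEXT, input_tokens INTEGER NOT NULL,
        output_tokens INTEGER NOT NULL, cost_usd REAL NOT NULL,
        latency_ms INTEGER NOT NULL
    )
""")
now = time.time()
fake_rows = [  # (timestamp, model, feature, user_id, in, out, cost, latency)
    (now, "gpt-4o-mini", "email_summarizer", "u1", 1200, 300, 0.00036, 420),
    (now, "gpt-4o-mini", "email_summarizer", "u2", 900, 250, 0.000285, 390),
    (now, "gpt-4o", "report_generator", "u1", 4000, 1500, 0.025, 2100),
]
conn.executemany("""
    INSERT INTO llm_calls
        (timestamp, model, feature, user_id, input_tokens, output_tokens, cost_usd, latency_ms)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", fake_rows)
top = conn.execute("""
    SELECT feature, COUNT(*) AS calls, SUM(cost_usd) AS total_cost
    FROM llm_calls GROUP BY feature ORDER BY total_cost DESC
""").fetchall()
# report_generator tops the list despite having fewer calls
```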
Budget Alerts Before the Bill Arrives
Querying is reactive. You also want a proactive alert that fires when daily spend crosses a threshold. This runs as a cron job or as a background check in your application:
```python
import smtplib
from email.mime.text import MIMEText

DAILY_BUDGET_USD = 20.0  # set this to something that would actually hurt

def check_daily_budget_alert():
    since = time.time() - 86400  # last 24 hours
    conn = sqlite3.connect(DB_PATH)
    result = conn.execute(
        "SELECT SUM(cost_usd) FROM llm_calls WHERE timestamp > ?", (since,)
    ).fetchone()
    conn.close()
    daily_spend = result[0] or 0.0
    if daily_spend > DAILY_BUDGET_USD:
        send_alert(
            subject=f"LLM spend alert: ${daily_spend:.2f} in last 24h",
            body=f"Daily budget is ${DAILY_BUDGET_USD}. Current spend: ${daily_spend:.2f}.\n\n"
                 f"Top features by cost:\n{spend_by_feature(days=1)}"
        )

def send_alert(subject: str, body: str):
    # Swap this for Slack webhook, PagerDuty, etc.
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = "alerts@yourdomain.com"
    msg["To"] = "you@yourdomain.com"
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)
```
In practice I’d replace the email with a Slack webhook — it’s two lines and you’ll actually see it. The logic is the same either way.
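For reference, a Slack version of send_alert looks roughly like this; the webhook URL is a placeholder you would generate in Slack’s app settings:

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def build_slack_payload(subject: str, body: str) -> bytes:
    # Incoming webhooks accept a JSON payload with a "text" field
    return json.dumps({"text": f"*{subject}*\n{body}"}).encode("utf-8")

def send_slack_alert(subject: str, body: str) -> None:
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=build_slack_payload(subject, body),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # wrap in try/except in production
```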
Common Optimization Wins the Data Will Surface
Once you have a week of data, three patterns almost always show up:
Model overkill: Someone wired GPT-4o to a feature that just extracts structured data from short text. That’s a GPT-4o-mini or Claude Haiku job. Switching one feature like this typically cuts 60–80% of its cost with no quality difference. Your spend_by_feature query will point you directly at the candidate.
Verbose output tokens: If average output tokens on a feature are high, check whether your prompt specifies a format and length. Adding “respond in under 100 words” or “return valid JSON only, no explanation” to a system prompt costs you nothing and can halve your output token count.
Redundant calls: Agents often call the same LLM twice for things that could be one call. If you see high call counts on a feature relative to user actions, that’s a red flag. Add a simple cache layer for deterministic lookups — even a TTL cache for identical prompts cuts costs meaningfully in high-volume flows.
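As a sketch of that last point, a TTL cache for identical prompts is only a few lines; the five-minute window and the function name here are arbitrary choices, not part of the tracker above:

```python
import hashlib
import time
from typing import Callable

_cache: dict[str, tuple[float, str]] = {}

def cached_llm_call(prompt: str, llm_fn: Callable[[str], str],
                    ttl_seconds: float = 300) -> str:
    # Key on a hash of the prompt so identical prompts share one entry
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    hit = _cache.get(key)
    if hit is not None and time.time() - hit[0] < ttl_seconds:
        return hit[1]  # cache hit: no API call, no tokens, no cost
    result = llm_fn(prompt)
    _cache[key] = (time.time(), result)
    return result
```

Only use this for deterministic lookups (temperature 0, same prompt means same acceptable answer); caching creative generations frustrates users.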
Scaling Up: When SQLite Isn’t Enough
SQLite handles roughly 10K writes/day without any performance issues on a single server. Beyond that, or if you have multiple workers, swap the connection for a Postgres pool — the schema and queries are identical. For high-throughput systems, batch inserts into a queue (Redis list, or even a local buffer flushed every 5 seconds) and write asynchronously. Don’t let cost tracking add latency to user-facing calls.
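A buffered writer along those lines might look like this sketch (the class name is my own); rows accumulate in memory and land in llm_calls in one transaction:

```python
import sqlite3
import threading

class BufferedCostWriter:
    """Accumulates cost rows in memory and writes them in one transaction."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self._buffer: list[tuple] = []
        self._lock = threading.Lock()

    def add(self, row: tuple) -> None:
        # row matches the llm_calls column order used above
        with self._lock:
            self._buffer.append(row)

    def flush(self) -> int:
        with self._lock:
            rows, self._buffer = self._buffer, []
        if not rows:
            return 0
        conn = sqlite3.connect(self.db_path)
        conn.executemany("""
            INSERT INTO llm_calls
                (timestamp, model, feature, user_id, input_tokens,
                 output_tokens, cost_usd, latency_ms)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
        """, rows)
        conn.commit()
        conn.close()
        return len(rows)
```

Call flush() on a timer from a daemon thread, and once at shutdown, so records aren’t lost.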
If you’re on n8n or Make for your automation workflows, you can POST cost data to this SQLite backend via a webhook node after each LLM call. The same schema works — just add a source column to differentiate API calls from workflow runs.
Who Should Build What
Solo founders building their first LLM feature: Start with the SQLite version above. It takes 30 minutes to wire in, costs nothing to run, and will immediately show you where money is going. Don’t over-engineer it until you have data that justifies something more complex.
Small teams with multiple services: Add a service tag alongside feature and push to a shared Postgres instance. Wire the daily budget alert into Slack. This is still three hours of work and buys you a lot of visibility.
Scaling products with high API volume: At this point you want time-series storage (InfluxDB or TimescaleDB), a proper dashboard in Grafana or Metabase, and per-user spend limits enforced at the application layer before calls are made. The tracker above is the data layer — the rest is infrastructure you probably already have.
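The per-user limit is the same SUM query scoped to one user. A minimal sketch, assuming the llm_calls schema above (the $5 daily cap is an arbitrary example):

```python
import sqlite3
import time

def user_over_daily_limit(db_path: str, user_id: str,
                          limit_usd: float = 5.00) -> bool:
    # Refuse new calls once a user's trailing-24h spend hits the cap
    since = time.time() - 86400
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT SUM(cost_usd) FROM llm_calls WHERE user_id = ? AND timestamp > ?",
        (user_id, since),
    ).fetchone()
    conn.close()
    return (row[0] or 0.0) >= limit_usd
```

Run the check in the wrapper before the API call and return a quota message instead of spending tokens.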
The core principle doesn’t change at any scale: tag every call, store the token counts and cost, and query it regularly. Good LLM cost tracking isn’t about fancy tooling — it’s about having attribution data you trust, close enough to real-time that you can act on it before the bill does.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
