Sunday, April 5

If you’re running more than two or three LLM-powered features in production, you’ve probably had the moment where you open your API billing page and feel mildly sick. Costs that seemed trivial in testing compound fast across agents, retries, long context windows, and multi-step chains. Building a proper LLM cost tracking calculator, one that breaks down spend by model, endpoint, agent, and request type, is one of the highest-leverage infrastructure investments you can make before you scale. This article walks through a complete implementation: token counting, per-request cost calculation, aggregated dashboards, and monthly forecasting, all wired together in Python with a lightweight SQLite backend.

Why Token-Level Cost Tracking Actually Matters

Most teams start by watching the aggregate billing number on their provider’s dashboard. That’s fine for a weekend project. For anything with multiple agents or models running concurrently, it’s nearly useless: you can’t tell which agent is burning money, whether your prompt compression efforts made any difference, or whether a particular workflow is actually cost-efficient compared to a simpler alternative.

The breakdown you actually need looks like this:

  • Cost per request, not just total spend
  • Input vs. output token split: output tokens cost 3-5x more on most models, so a chatty system prompt isn’t your main problem
  • Cost by agent or workflow, so you know which automation is expensive, not just “the API”
  • Cost trend over time: flat? growing linearly? spiking on certain days?
  • Forecasted monthly spend, based on current run rate, not hope

The good news: every major LLM API returns token counts in the response. You’re not guessing โ€” you have the data. You just need to capture it.
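To make that concrete, the usage fields live in slightly different places per provider. A small normalizer keeps downstream code uniform; this is a sketch, and the attribute names below reflect the Anthropic and OpenAI Python SDKs at time of writing, so verify against the versions you have pinned:

```python
# Hedged sketch: normalize provider-specific usage fields into one shape.
# Attribute names reflect the Anthropic and OpenAI Python SDKs at time of
# writing - verify against the SDK versions you actually use.
def extract_usage(provider: str, response) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) from a provider response object."""
    usage = response.usage
    if provider == "anthropic":
        return usage.input_tokens, usage.output_tokens
    if provider == "openai":
        return usage.prompt_tokens, usage.completion_tokens
    raise ValueError(f"No usage mapping for provider: {provider}")
```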

Pricing Reference You’ll Need in Code

Prices shift, so pin these and check monthly. As of mid-2025, ballpark figures per million tokens (input / output):

  • Claude 3.5 Haiku: ~$0.80 / $4.00
  • Claude 3.5 Sonnet: ~$3.00 / $15.00
  • GPT-4o mini: ~$0.15 / $0.60
  • GPT-4o: ~$2.50 / $10.00
  • Gemini 1.5 Flash: ~$0.075 / $0.30
  • Llama 3 via Groq: ~$0.05 / $0.08 (varies by size)

The practical implication: a 1,000-token output from GPT-4o costs roughly $0.01. Run that 10,000 times a month and you’re at $100 from output tokens alone, before you count the input. Your tracker needs to separate input and output costs or you’ll misattribute spend.
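The arithmetic is worth sanity-checking in code once, since every downstream number depends on it. Using the GPT-4o output figure from the table above:

```python
# Per-million pricing means cost = tokens / 1_000_000 * price.
PRICE_OUT_GPT4O = 10.00  # USD per million output tokens (figure from above)

one_response = (1_000 / 1_000_000) * PRICE_OUT_GPT4O   # one 1,000-token output
monthly_outputs = one_response * 10_000                # 10,000 such responses
# one_response is about $0.01; monthly_outputs about $100 - output side only
```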

Building the Core Cost Tracker

Data Model and SQLite Schema

Start with a schema that captures everything you’ll want to query later. Retrofitting columns is painful.

import sqlite3
from datetime import datetime

def init_db(db_path: str = "llm_costs.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS llm_requests (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            ts          DATETIME DEFAULT CURRENT_TIMESTAMP,
            agent_id    TEXT NOT NULL,        -- e.g. "email-classifier", "support-bot"
            model       TEXT NOT NULL,        -- e.g. "claude-3-5-haiku-20241022"
            endpoint    TEXT,                 -- optional: "chat", "batch", "tools"
            input_tok   INTEGER NOT NULL,
            output_tok  INTEGER NOT NULL,
            input_cost  REAL NOT NULL,        -- in USD
            output_cost REAL NOT NULL,
            total_cost  REAL NOT NULL,
            latency_ms  INTEGER,
            success     INTEGER DEFAULT 1,    -- 0 = error/retry
            metadata    TEXT                  -- JSON blob for custom tags
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_agent ON llm_requests(agent_id)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_ts ON llm_requests(ts)")
    conn.commit()
    return conn

The Cost Calculator Function

A clean pricing registry makes it easy to add new models without touching the calculation logic.

from dataclasses import dataclass
from typing import Optional
import json, time

# Per-million-token pricing (input, output) - verify before use
MODEL_PRICING = {
    "claude-3-5-haiku-20241022":    (0.80,  4.00),
    "claude-3-5-sonnet-20241022":   (3.00, 15.00),
    "gpt-4o-mini":                  (0.15,  0.60),
    "gpt-4o":                       (2.50, 10.00),
    "gemini-1.5-flash":             (0.075, 0.30),
    "llama-3.1-70b-versatile":      (0.059, 0.079),  # Groq pricing
}

@dataclass
class CostRecord:
    input_tok: int
    output_tok: int
    input_cost: float
    output_cost: float
    total_cost: float

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> CostRecord:
    if model not in MODEL_PRICING:
        raise ValueError(f"Unknown model: {model}. Add it to MODEL_PRICING.")
    
    price_in, price_out = MODEL_PRICING[model]
    # Pricing is per million tokens
    input_cost  = (input_tokens  / 1_000_000) * price_in
    output_cost = (output_tokens / 1_000_000) * price_out
    
    return CostRecord(
        input_tok=input_tokens,
        output_tok=output_tokens,
        input_cost=round(input_cost, 8),
        output_cost=round(output_cost, 8),
        total_cost=round(input_cost + output_cost, 8),
    )

def log_request(
    conn: sqlite3.Connection,
    agent_id: str,
    model: str,
    record: CostRecord,
    endpoint: Optional[str] = None,
    latency_ms: Optional[int] = None,
    success: bool = True,
    metadata: Optional[dict] = None,
):
    conn.execute("""
        INSERT INTO llm_requests
          (agent_id, model, endpoint, input_tok, output_tok,
           input_cost, output_cost, total_cost, latency_ms, success, metadata)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        agent_id, model, endpoint,
        record.input_tok, record.output_tok,
        record.input_cost, record.output_cost, record.total_cost,
        latency_ms, int(success),
        json.dumps(metadata) if metadata else None,
    ))
    conn.commit()
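Before wiring this into real traffic, it’s worth a quick end-to-end smoke test against an in-memory database. This sketch is deliberately self-contained, with the schema trimmed and the pricing math inlined, so it doesn’t depend on the snippets above:

```python
import sqlite3

# Trimmed schema + inlined per-million pricing, just to prove the plumbing.
PRICE_IN, PRICE_OUT = 2.50, 10.00  # GPT-4o figures from the pricing section

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE llm_requests (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        ts DATETIME DEFAULT CURRENT_TIMESTAMP,
        agent_id TEXT NOT NULL,
        model TEXT NOT NULL,
        input_tok INTEGER, output_tok INTEGER,
        input_cost REAL, output_cost REAL, total_cost REAL,
        success INTEGER DEFAULT 1
    )
""")

in_tok, out_tok = 500, 1_000
in_cost = in_tok / 1_000_000 * PRICE_IN      # $0.00125
out_cost = out_tok / 1_000_000 * PRICE_OUT   # $0.01
conn.execute(
    "INSERT INTO llm_requests (agent_id, model, input_tok, output_tok, "
    "input_cost, output_cost, total_cost) VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("smoke-test", "gpt-4o", in_tok, out_tok, in_cost, out_cost, in_cost + out_cost),
)
total = conn.execute("SELECT SUM(total_cost) FROM llm_requests").fetchone()[0]
# total comes back as roughly 0.01125
```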

Wrapping Your API Calls

The cleanest integration is a thin wrapper around your existing API calls. Here’s one for Anthropic; the pattern is identical for OpenAI, just swap usage.input_tokens and usage.output_tokens for usage.prompt_tokens and usage.completion_tokens.

import anthropic, time

client = anthropic.Anthropic()

def tracked_completion(
    conn: sqlite3.Connection,
    agent_id: str,
    model: str,
    messages: list,
    system: str = "",
    **kwargs,
) -> anthropic.types.Message:
    
    kwargs.setdefault("max_tokens", 1024)  # Anthropic requires max_tokens
    start = time.monotonic()
    response = client.messages.create(
        model=model,
        messages=messages,
        system=system,
        **kwargs,
    )
    latency_ms = int((time.monotonic() - start) * 1000)
    
    record = calculate_cost(
        model,
        response.usage.input_tokens,
        response.usage.output_tokens,
    )
    
    log_request(conn, agent_id, model, record, 
                endpoint="messages", latency_ms=latency_ms)
    
    return response  # pass through unchanged; zero impact on calling code

Drop this into your existing codebase and every call is tracked. The latency overhead is a single SQLite write, which is negligible.

Dashboard Queries: Where the Value Is

Spend by Agent (Last 30 Days)

def spend_by_agent(conn: sqlite3.Connection, days: int = 30) -> list[dict]:
    cur = conn.execute("""
        SELECT
            agent_id,
            COUNT(*)               AS requests,
            SUM(input_tok)         AS total_input_tok,
            SUM(output_tok)        AS total_output_tok,
            SUM(total_cost)        AS total_usd,
            AVG(total_cost)        AS avg_cost_per_req,
            AVG(latency_ms)        AS avg_latency_ms
        FROM llm_requests
        WHERE ts >= datetime('now', ?)
          AND success = 1
        GROUP BY agent_id
        ORDER BY total_usd DESC
    """, (f"-{days} days",))
    # Column names must come from this query's own cursor description -
    # the aggregate aliases don't match the table schema's columns.
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

# Cleaner version using row_factory
conn.row_factory = sqlite3.Row

def spend_by_agent_clean(conn, days=30):
    return conn.execute("""
        SELECT agent_id,
               COUNT(*)        requests,
               SUM(total_cost) total_usd,
               AVG(total_cost) avg_per_req
        FROM llm_requests
        WHERE ts >= datetime('now', ?) AND success = 1
        GROUP BY agent_id ORDER BY total_usd DESC
    """, (f"-{days} days",)).fetchall()

Monthly Spend Forecast

import calendar
from datetime import date

def forecast_monthly_spend(conn: sqlite3.Connection) -> dict:
    today = date.today()
    days_elapsed = today.day  # day of month (1-31)
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    
    row = conn.execute("""
        SELECT SUM(total_cost)
        FROM llm_requests
        WHERE ts >= date('now', 'start of month')
          AND success = 1
    """).fetchone()
    
    mtd = row[0] or 0.0  # positional access works with or without a row_factory
    daily_rate = mtd / days_elapsed if days_elapsed > 0 else 0
    forecast = daily_rate * days_in_month
    
    return {
        "month_to_date_usd": round(mtd, 4),
        "daily_rate_usd":    round(daily_rate, 6),
        "forecast_month_usd": round(forecast, 4),
        "days_elapsed":      days_elapsed,
    }

This gives you numbers like: “You’ve spent $47.20 in 12 days, running at $3.93/day, forecasted monthly total: $117.90.” That’s actionable. A raw billing page doesn’t tell you that.

Cost Optimization Signals You Can Extract

Once you have per-request data, some useful patterns emerge quickly:

  • Output-heavy agents: If output_tok / input_tok > 3, your agent may be over-generating. Cap max_tokens aggressively or switch to a structured output schema.
  • High-cost, low-complexity requests: If an agent consistently uses Claude Sonnet for tasks where Haiku’s outputs are identical, you’re paying 5x too much. Route by task complexity.
  • Retry waste: Track success = 0 separately. If 5% of your requests are retries, fix the upstream issue; you’re paying for failed completions in some billing models.
  • Latency outliers aren’t always expensive: A slow request with low token count is a network issue, not a cost issue. The two metrics together tell a different story than either alone.
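The first of those signals is easy to turn into a query. A sketch flagging agents whose aggregate output/input token ratio exceeds a threshold:

```python
import sqlite3

def output_heavy_agents(conn: sqlite3.Connection, ratio: float = 3.0) -> list[tuple]:
    """Agents generating more than `ratio` output tokens per input token."""
    return conn.execute("""
        SELECT agent_id,
               SUM(output_tok) * 1.0 / SUM(input_tok) AS out_in_ratio,
               SUM(total_cost) AS total_usd
        FROM llm_requests
        WHERE success = 1
        GROUP BY agent_id
        HAVING SUM(output_tok) * 1.0 / SUM(input_tok) > ?
        ORDER BY total_usd DESC
    """, (ratio,)).fetchall()
```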

Model Routing Based on Cost Data

Once you’ve run the tracker for a week, you’ll have real data to inform routing decisions. A simple threshold approach: if average response quality is acceptable on Haiku for a given agent (judge this manually or with an eval), the cost difference is often 5-10x. For an agent running 50,000 requests/month, that’s the difference between $200 and $20.
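One way to act on that data is a static route table. This is a sketch, not a prescription: the agent names are illustrative, and it assumes you have already eval’d which agents clear the quality bar on the cheaper model.

```python
# Hypothetical per-agent routing table - populate it from your own evals.
ROUTE_TABLE = {
    "email-classifier": "claude-3-5-haiku-20241022",    # Haiku passed the eval
    "contract-analyzer": "claude-3-5-sonnet-20241022",  # needs the bigger model
}

def pick_model(agent_id: str,
               default: str = "claude-3-5-sonnet-20241022") -> str:
    """Route each agent to the cheapest model that met its quality bar."""
    return ROUTE_TABLE.get(agent_id, default)
```

Pass the result straight into the tracked_completion wrapper and the cost data keeps flowing per agent, so you can verify the routing actually saved money.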

Adding a Lightweight Web Dashboard

If you want something visual without spinning up a full observability stack, a 30-line Flask endpoint + a simple HTML table is faster than setting up Grafana for a solo project:

from flask import Flask, jsonify
app = Flask(__name__)

@app.route("/costs/summary")
def cost_summary():
    conn = sqlite3.connect("llm_costs.db")
    conn.row_factory = sqlite3.Row
    return jsonify({
        "by_agent":  [dict(r) for r in spend_by_agent_clean(conn)],
        "forecast":  forecast_monthly_spend(conn),
    })

# Hit /costs/summary from a simple HTML page or pipe into 
# a Retool/Notion dashboard via their API integrations

For teams, I’d push this data into a Postgres instance and use Metabase: it’s free, connects directly to SQL, and you get charts without writing frontend code. For solo founders, a weekly cron that emails you the forecast is honestly enough.

What This Doesn’t Solve (Honest Assessment)

This tracker handles synchronous API calls cleanly. It gets messier with:

  • Streaming responses: Token counts are only available in the final chunk. You’ll need to buffer and log at stream end, not in real time.
  • Batch API: Anthropic and OpenAI batch endpoints report usage differently; check their docs and adjust the extraction logic.
  • Cached tokens: Claude’s prompt caching returns separate cache_read_input_tokens counts at a lower price. If you’re using caching (you should be for long system prompts), your cost calculation needs to handle that separately or you’ll overestimate.
  • Embeddings and image tokens: Different pricing model entirely; add separate entries to MODEL_PRICING and separate endpoint logic.
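For the cached-token case specifically, the input-cost math splits into three buckets. A sketch, using the cache-write (1.25x) and cache-read (0.1x) multipliers from Anthropic’s published pricing at time of writing; verify both before relying on them:

```python
def cached_input_cost(price_in_per_m: float, fresh_tok: int,
                      cache_write_tok: int, cache_read_tok: int) -> float:
    """Input cost in USD when prompt caching is in play.

    Multipliers assume Anthropic's published rates at time of writing:
    cache writes bill at 1.25x base input, cache reads at 0.1x. Verify.
    """
    per_tok = price_in_per_m / 1_000_000
    return (fresh_tok * per_tok
            + cache_write_tok * per_tok * 1.25
            + cache_read_tok * per_tok * 0.10)
```

At Sonnet’s $3.00/M input rate, a million cache-read tokens costs about $0.30 instead of $3.00, which is exactly the gap a naive tracker overestimates.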

When to Use This vs. a Third-Party Tool

Build this yourself if: you’re running a multi-agent system, need agent-level attribution that third-party dashboards don’t support, or want this data feeding into your own product metrics.

Use Helicone, LangSmith, or Portkey if: you want tracing and observability beyond cost (prompt versions, evals, session replay). They all offer cost tracking, but the agent-level tagging is less flexible than owning the schema yourself.

For most technical founders building production agents, the LLM cost tracking calculator approach described here takes an afternoon to wire in and pays for itself the first time it catches an agent loop burning $50 in an hour. The SQL queries are simple enough to extend as your needs grow, and you’re not locked into a vendor’s data model when you need to build a custom cost allocation report for a client.

Solo founders: start with the SQLite version and the email cron. Teams of three or more: push to Postgres on day one, you’ll thank yourself when two people are querying it simultaneously and you’re not debugging SQLite locking issues.

Editorial note: API pricing, model capabilities, and tool features change frequently; always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links; we may earn a commission if you sign up, at no extra cost to you.
