Most runaway LLM API bills aren’t caused by one catastrophic request — they’re caused by a loop that runs 500 times instead of 5, a batch job that didn’t respect token limits, or a user who refreshed the page 40 times while your agent re-generated a 2,000-token response on each hit. Rate limiting LLM API costs is the unglamorous discipline that separates “we hit $800 this month” from “we hit $8,000 and had a very uncomfortable Monday.” This article covers the practical mechanics: token budget enforcement, request throttling at the application layer, cost-aware queuing, and the fallback patterns that keep your service alive when you’re close to quota.
The documentation will tell you about the provider’s rate limits. This article is about the limits you set on yourself — which matter more.
Why Provider-Side Rate Limits Aren’t Enough
Anthropic, OpenAI, and Google all impose rate limits measured in requests per minute (RPM) and tokens per minute (TPM). Hit those limits and you get a 429. Most developers implement a basic retry-with-backoff and call it done. That’s necessary but not sufficient.
Provider limits are a ceiling on throughput, not a ceiling on spend. If your TPM limit is 100,000 and your average request uses 800 tokens, you can theoretically fire ~125 requests per minute. At Claude 3.5 Sonnet pricing (~$3/M input, ~$15/M output), a sustained 10-minute burst at the full limit moves a million tokens — roughly $3 if it’s all input, closer to $15 if output dominates. In a poorly designed agent loop, that’s not hypothetical.
The three failure modes I see most often in production:
- Unbounded agent loops — no max iteration count, so a confused agent keeps calling the LLM until it exhausts your daily budget.
- Missing user-level quotas — one power user hammers your product and consumes tokens that should be distributed across all users.
- No cost attribution — you can see total spend but have no idea which workflow, user, or feature is responsible.
Application-layer rate limiting solves all three. Provider-side limits do none of them.
Token Budget Enforcement: The Right Unit of Measurement
Request count is a weak proxy for cost. A 10-token classification request and a 4,000-token document summary are not the same thing, but naive rate limiters treat them identically. Budget by tokens, not requests.
Estimating tokens before you send
For English prose, a rough rule of thumb is 1 token ≈ 0.75 words; code tends to tokenize slightly denser. You can get an exact count before sending using tiktoken for OpenAI models, or Anthropic’s own counting endpoint:
```python
import anthropic

client = anthropic.Anthropic()

def count_tokens(messages: list[dict], model: str = "claude-3-5-sonnet-20241022") -> int:
    """Count tokens before sending — use this to enforce budgets pre-flight."""
    response = client.messages.count_tokens(
        model=model,
        messages=messages,
    )
    return response.input_tokens

# Example: enforce a per-request ceiling
MAX_INPUT_TOKENS = 4000

messages = [{"role": "user", "content": document_text}]  # document_text: user-supplied input
token_count = count_tokens(messages)
if token_count > MAX_INPUT_TOKENS:
    raise ValueError(f"Request exceeds token budget: {token_count} > {MAX_INPUT_TOKENS}")
```
The count_tokens endpoint itself costs nothing and returns in ~50ms. Worth calling on any request where input size is user-controlled or variable.
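When you can’t afford the extra round-trip, a rough local estimate is often good enough for budget gating. A sketch using the ~0.75-words-per-token heuristic above (the constant is a heuristic, not a spec):

```python
def estimate_tokens(text: str) -> int:
    """Rough local token estimate for English prose (~0.75 words per token).

    Good enough for pre-flight budget checks; call the provider's counting
    endpoint when you need the exact number.
    """
    return max(1, round(len(text.split()) / 0.75))
```

A 3,000-word document estimates to roughly 4,000 tokens under this rule, which is usually close enough to decide whether to admit or reject a request.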
Token bucket implementation for sustained throughput
For controlling spend across concurrent requests, a token bucket algorithm works well. It allows bursts but enforces a long-term rate:
```python
import threading
import time

class TokenBudgetLimiter:
    """
    Token bucket rate limiter measured in LLM tokens, not requests.

    tokens_per_minute: how many LLM tokens you allow across all requests per minute.
    """

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.tokens = tokens_per_minute
        self.refill_rate = tokens_per_minute / 60  # tokens per second
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def acquire(self, token_count: int, timeout: float = 30.0) -> bool:
        """Block until token_count tokens are available, or timeout."""
        deadline = time.monotonic() + timeout
        while True:
            with self.lock:
                self._refill()
                if self.tokens >= token_count:
                    self.tokens -= token_count
                    return True
                # Compute the wait while still holding the lock, so we read
                # a consistent token count.
                wait_time = (token_count - self.tokens) / self.refill_rate
            if time.monotonic() + wait_time > deadline:
                return False  # would exceed timeout — caller decides what to do
            time.sleep(min(wait_time, 1.0))

# Usage: ~40K tokens/min cap, roughly $0.12/min ceiling on Sonnet input side
limiter = TokenBudgetLimiter(tokens_per_minute=40_000)

def call_llm_with_budget(messages: list[dict]) -> str:
    estimated_tokens = count_tokens(messages)
    if not limiter.acquire(estimated_tokens, timeout=10.0):
        raise RuntimeError("Rate limit budget exhausted — try again shortly")
    # proceed with API call
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=messages,
    )
    return response.content[0].text
```
At 40,000 input tokens/minute on Claude 3.5 Sonnet, your worst-case input cost is capped at ~$0.12/min or ~$7.20/hr. That’s a ceiling you can reason about and alert on.
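That ceiling falls out of simple arithmetic, and it is worth encoding so you can sanity-check limiter settings against your budget (the per-million price is the assumption here, not a fixed constant):

```python
def input_cost_ceiling_per_min(tokens_per_minute: int, usd_per_million_input: float) -> float:
    """Worst-case input spend per minute implied by a token-bucket budget."""
    return tokens_per_minute / 1_000_000 * usd_per_million_input

per_min = input_cost_ceiling_per_min(40_000, 3.0)  # Sonnet-class input pricing assumed
per_hour = per_min * 60
```

Run this for a few candidate `tokens_per_minute` values when you set up the limiter, and pick the one whose hourly ceiling you would be comfortable seeing on an invoice.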
Per-User and Per-Feature Cost Attribution
Aggregate spend limits protect your total bill. User-level limits protect you from individual abuse and let you build fair usage tiers. You need both.
The pattern I use in production: Redis-backed counters with TTL, keyed by user ID and feature name. This gives you:
- Per-user daily token budgets
- Per-feature caps (e.g., “document summarization” gets 50K tokens/day, “chat” gets 100K)
- Real-time attribution you can surface in dashboards
```python
import time

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def check_and_increment_user_budget(
    user_id: str,
    feature: str,
    token_count: int,
    daily_limit: int = 50_000,
) -> bool:
    """Returns True if the request is within budget, False if it would exceed it."""
    key = f"token_budget:{user_id}:{feature}:{time.strftime('%Y-%m-%d')}"
    pipe = r.pipeline()
    pipe.incrby(key, token_count)
    pipe.expire(key, 86400)  # 24-hour TTL (the date in the key also rotates daily)
    new_usage, _ = pipe.execute()
    if new_usage > daily_limit:
        # Roll back the increment — over budget
        r.decrby(key, token_count)
        return False
    return True
```
This isn’t perfect — a burst of concurrent requests can briefly overshoot the limit before the rollbacks land, which you’d fix with a Lua script in high-concurrency scenarios — but it’s correct for the vast majority of production use cases.
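For the high-concurrency case, the whole check-and-increment can run atomically inside Redis as a Lua script. A hedged sketch (the script text and the wrapper function are illustrative, not a library API):

```python
# Runs atomically inside Redis: check the budget, increment only if allowed.
# KEYS[1] = budget key; ARGV = token cost, daily limit, TTL in seconds.
CHECK_AND_INCR_LUA = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
local cost = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
if current + cost > limit then
  return 0
end
redis.call('INCRBY', KEYS[1], cost)
redis.call('EXPIRE', KEYS[1], tonumber(ARGV[3]))
return 1
"""

def check_and_increment_atomic(r, key: str, token_count: int,
                               daily_limit: int, ttl: int = 86400) -> bool:
    """r is a redis.Redis client; returns True if the request fits the budget."""
    return bool(r.eval(CHECK_AND_INCR_LUA, 1, key, token_count, daily_limit, ttl))
```

Because Redis executes the script as a single atomic operation, there is no rollback path and no transient overshoot, at the cost of slightly more operational complexity.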
For observability, this token-level attribution pairs well with LLM monitoring platforms. If you’re not already tracking per-request cost and latency in production, take a look at the comparison of Helicone, LangSmith, and Langfuse — all three support cost tracking at this granularity.
Intelligent Fallback When Budgets Run Out
A rate limit hit shouldn’t mean a 500 error. It should trigger a degradation strategy. The right strategy depends on the request type:
Tiered model fallback
If a user is over their Sonnet budget, fall back to Haiku. Same API, roughly 4x cheaper on input (~$0.80/M vs ~$3/M for Claude 3.5 Haiku vs 3.5 Sonnet), meaningfully lower quality for complex tasks but fine for simple ones. Build this into your LLM call wrapper, not scattered across your codebase.
```python
MODEL_TIERS = [
    "claude-3-5-sonnet-20241022",  # Primary: best quality
    "claude-3-5-haiku-20241022",   # Fallback: substantially cheaper on input
]

def call_with_fallback(messages: list[dict], user_id: str) -> tuple[str, str]:
    """Returns (response_text, model_used)."""
    for model in MODEL_TIERS:
        token_count = count_tokens(messages, model=model)
        if check_and_increment_user_budget(user_id, "chat", token_count):
            response = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=messages,
            )
            return response.content[0].text, model
    # All tiers exhausted — return graceful degradation message
    return "You've reached your daily usage limit. Resets at midnight UTC.", "none"
```
For more complex degradation patterns — including retry logic with exponential backoff when you hit provider 429s — the LLM fallback and retry logic guide covers the full pattern including circuit breakers.
Request queuing for batch workloads
For non-interactive workloads — document processing, async analysis, bulk operations — don’t throttle at the HTTP layer, queue at the application layer. A simple queue with a worker that respects your token budget lets you process 10,000 documents without a single 429:
```python
import queue
import threading

work_queue = queue.Queue()
limiter = TokenBudgetLimiter(tokens_per_minute=40_000)

def worker():
    while True:
        item = work_queue.get()
        if item is None:  # sentinel: shut this worker down
            work_queue.task_done()
            break
        messages, callback = item  # callback signature: (text, error=None)
        token_count = count_tokens(messages)
        # Keep retrying past the default timeout — batch work can afford to wait.
        while not limiter.acquire(token_count):
            pass
        try:
            response = client.messages.create(
                model="claude-3-5-haiku-20241022",  # use cheaper model for batch
                max_tokens=1024,
                messages=messages,
            )
            callback(response.content[0].text)
        except Exception as e:
            callback(None, error=e)
        finally:
            work_queue.task_done()

# Start worker threads
for _ in range(3):  # 3 concurrent workers, all sharing the same limiter
    t = threading.Thread(target=worker, daemon=True)
    t.start()
```
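Feeding the queue and shutting it down cleanly uses a sentinel value per worker. A self-contained sketch with the LLM call stubbed out (all names here, like `demo_queue` and `_demo_worker`, are illustrative):

```python
import queue
import threading

demo_queue: "queue.Queue[str | None]" = queue.Queue()
demo_results: list[str] = []

def _demo_worker():
    while True:
        item = demo_queue.get()
        if item is None:  # sentinel: tell this worker to exit
            demo_queue.task_done()
            break
        demo_results.append(item.upper())  # stand-in for the real LLM call
        demo_queue.task_done()

threading.Thread(target=_demo_worker, daemon=True).start()

for doc in ["alpha", "beta"]:
    demo_queue.put(doc)
demo_queue.put(None)  # one sentinel per worker
demo_queue.join()     # blocks until every item (and the sentinel) is processed
```

With N workers you enqueue N sentinels, and `join()` gives you a clean "all documents processed" barrier without polling.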
If you’re processing documents at scale, Anthropic’s Batch API cuts costs by 50% — check out the batch processing workflow guide for the full implementation with real throughput numbers.
Common Misconceptions Worth Addressing
Misconception 1: “max_tokens controls my cost”
max_tokens caps your output, not your input. If you send a 10,000-token prompt and set max_tokens=100, you’re still billed for 10,000 input tokens. Input cost is determined by what you send, not what you receive. This is where most people’s cost models are wrong — they optimize output length but let input tokens run unchecked.
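The asymmetry is easy to see in numbers. A small cost helper (Sonnet-class per-million prices assumed as defaults):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 usd_in_per_m: float = 3.0, usd_out_per_m: float = 15.0) -> float:
    """Cost of one request, with input and output priced separately."""
    return input_tokens / 1e6 * usd_in_per_m + output_tokens / 1e6 * usd_out_per_m

# 10,000-token prompt with max_tokens=100: input cost dominates 20:1
cost = request_cost(10_000, 100)  # $0.03 input + $0.0015 output
```

Even though output tokens are 5x more expensive per token, the 10,000-token prompt dominates the bill. `max_tokens=100` did almost nothing here.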
Misconception 2: “Exponential backoff handles rate limiting”
Backoff handles provider 429s — it doesn’t prevent them, and it definitely doesn’t control your spend. If you have 50 concurrent requests all hitting backoff simultaneously, you’re still burning through your token budget while they wait to retry. Application-layer throttling before the request goes out prevents overconsumption; backoff only recovers from it.
Misconception 3: “Caching is a rate limiting strategy”
Caching is a cost reduction strategy that pairs well with rate limiting, but it doesn’t substitute for it. Cache hit rates in production are rarely above 20-30% for anything with user-supplied input. Don’t build your cost model around cache hits being the primary mechanism — use them as a bonus on top of proper budgeting.
That said, prompt caching — Anthropic’s feature for reusing a large, stable prompt prefix such as the system prompt — is worth implementing for any workflow that has one. Cache reads bill at roughly 10% of the normal input rate, a steep discount on your most expensive tokens, and a consistent agent architecture benefits on every request. See the self-hosting vs Claude API cost analysis for a breakdown of where caching moves the needle most.
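To see why a large stable prefix matters, compare billed-equivalent input tokens with and without caching (assuming the ~10% cache-read rate; a sketch of the arithmetic, not exact billing):

```python
def effective_input_tokens(cached_prefix: int, uncached: int,
                           cache_read_rate: float = 0.10) -> float:
    """Billed-equivalent input tokens when the prefix is served from cache."""
    return cached_prefix * cache_read_rate + uncached

full = 8_000 + 2_000                               # no caching: all 10,000 at full rate
with_cache = effective_input_tokens(8_000, 2_000)  # 800 + 2,000 = 2,800
savings = 1 - with_cache / full                    # ~72% cheaper on input
```

An 8,000-token stable prefix with 2,000 tokens of variable input lands squarely in the 60-80% savings range the next section mentions, provided the prefix actually stays byte-identical between requests.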
Putting It Together: A Production-Ready Cost Control Stack
Here’s what a complete stack looks like in practice. You don’t need all of these on day one, but you should have a roadmap to get here:
- Pre-flight token counting — estimate tokens before sending, reject requests that exceed per-request limits
- Application-layer token bucket — shared limiter across all workers, measured in tokens not requests
- Per-user Redis counters — daily budget tracking with TTL, per-feature attribution
- Tiered model fallback — Sonnet → Haiku when user hits budget ceiling
- Batch queue for async work — non-interactive jobs go through a rate-aware worker queue
- Spend alerting — webhook or PagerDuty alert when daily spend hits 50% and 80% of budget
- Observability — per-request cost logged with user ID, feature, model, and token count
A solo founder can implement steps 1-3 in an afternoon. Steps 4-7 come as you scale. The critical thing is that you instrument cost attribution from day one — retrofitting it later is painful.
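The spend-alerting bullet above can start as simple threshold checks on a running daily total. A minimal sketch (the thresholds are from this article; the webhook transport is a hypothetical placeholder you would supply):

```python
ALERT_THRESHOLDS = (0.5, 0.8)  # fire at 50% and 80% of the daily budget

def crossed_thresholds(spend_before: float, spend_after: float,
                       daily_budget: float) -> list[float]:
    """Return the alert thresholds this spend increment crossed, in order."""
    return [t for t in ALERT_THRESHOLDS
            if spend_before < t * daily_budget <= spend_after]

# Example: spend moves from $45 to $55 against a $100/day budget.
# for t in crossed_thresholds(45.0, 55.0, 100.0):
#     post_to_webhook(f"Daily LLM spend crossed {t:.0%}")  # post_to_webhook: yours
```

Checking before/after each increment (rather than polling the total) means each threshold fires exactly once per day, with no separate "already alerted" state to track.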
Frequently Asked Questions
How do I set a hard daily spend limit on the Claude or OpenAI API?
Both Anthropic and OpenAI offer spend limits in their billing dashboards — set a hard cap there as your absolute last line of defense. But don’t rely on it alone: their limits cut off your entire API key, not individual users. Implement application-layer per-user and per-feature budgets so you can throttle individual users without killing your service for everyone.
What’s the difference between RPM limits and TPM limits for LLM APIs?
RPM (requests per minute) limits the number of API calls regardless of size. TPM (tokens per minute) limits total token throughput. In practice, TPM is the binding constraint for most applications with variable-length inputs. If you’re hitting RPM limits but not TPM limits, your requests are small and you should batch them. If you’re hitting TPM, you need either a higher tier or to throttle your input sizes.
Can I use a token bucket algorithm for LLM rate limiting, or do I need something more sophisticated?
A token bucket is a solid choice for most production applications — it handles burst traffic gracefully while enforcing a sustainable average rate. For multi-tenant SaaS with complex fair-use requirements, you might layer in a leaky bucket or sliding window counter per user on top of the global token bucket. But start simple: a shared token bucket with per-user Redis counters covers 90% of real-world cases.
How do I handle rate limiting in an n8n or Make workflow that calls an LLM?
n8n has a built-in “Wait” node you can use to add delays between LLM calls — set it to enforce a minimum interval based on your TPM budget. For more sophisticated control, route your LLM calls through a thin API middleware layer you control, where you apply token bucket logic before forwarding to Anthropic or OpenAI. This gives you rate limiting, logging, and fallback in one place regardless of which automation platform calls it.
Does prompt caching reduce my effective rate limit usage?
Yes — Anthropic’s prompt caching bills cache reads at roughly 10% of the normal input rate, and depending on your plan, cache reads may also count less heavily against your rate limits than uncached tokens. For workflows with large, stable system prompts (like a RAG knowledge base preamble), caching can cut your effective input cost by 60-80% on the stable portion of each request.
What should I do when a user hits their daily token budget?
Don’t return a raw error — return a user-friendly message with a reset time and, if applicable, an upgrade path. In your backend, log the budget-hit event so you can see which users are consistently hitting limits (potential upsell signal) versus which ones hit it from a bug (potential product issue). Consider falling back to a cheaper model for a degraded-but-functional experience rather than hard-refusing the request.
Bottom Line: Who Needs What Level of Sophistication
Solo founder / early prototype: Start with pre-flight token counting and a simple global token bucket. Add per-user Redis counters before you open to more than ~20 users. Set a hard spend cap in the API dashboard as backup. Total implementation time: half a day.
Small team / growing product: Add tiered model fallback and observability. Route all LLM calls through a single internal service so rate limiting logic lives in one place. Log cost attribution from day one — you’ll thank yourself when a specific feature explodes in usage.
High-volume / enterprise: Implement the full stack above plus spend alerting at 50%/80% of daily budget. Consider dedicated throughput tiers from Anthropic (Provisioned Throughput) if you’re spending >$5K/month — the unit economics change significantly at scale. Controlling LLM API costs at scale is ultimately an architectural discipline: every LLM call goes through a gateway that enforces budgets, logs costs, and can degrade gracefully. Build that gateway early.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

