By the end of this tutorial, you’ll have a production-ready rate limiting layer for the Claude API: one that handles exponential backoff, tracks token budgets across concurrent workers, and queues requests intelligently instead of dropping them when you hit quota limits. These are the exact Claude API rate limiting strategies I’ve used to keep agent workloads stable under load — not the naive retry loop that every beginner ships and then regrets at 3am.
Anthropic’s rate limits come in three flavors that interact with each other in non-obvious ways: requests per minute (RPM), tokens per minute (TPM), and tokens per day (TPD). Hit any one of them and you get a 429. The documentation describes them clearly enough, but it doesn’t tell you that a burst of 10 concurrent Claude 3.5 Sonnet requests at ~4,000 tokens each will exhaust a Tier 1 TPM limit in a single second, leaving your other workers stalled for the rest of that minute. Understanding this interaction is the foundation of everything below.
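To make that concrete, here is the burst arithmetic with illustrative Tier 1-style numbers (check the official rate limits page for your actual tier):

```python
# Illustrative numbers; verify your tier's actual limits before relying on them
tpm_limit = 40_000               # tokens per minute (Tier 1-style)
workers, tokens_each = 10, 4_000

burst = workers * tokens_each    # one concurrent burst
print(burst)                 # 40000: the entire minute's budget in one second
print(burst >= tpm_limit)    # True: every other worker stalls until the window resets
```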
- Install dependencies — Set up the required Python packages for async API calls and rate tracking.
- Build an exponential backoff wrapper — Catch 429s and retry with jitter instead of hammering the API.
- Implement a token budget tracker — Count tokens before sending requests to avoid mid-burst quota exhaustion.
- Add a queue-based concurrency controller — Serialize or throttle concurrent workers against a shared rate limit.
- Wire it all together with a production client — Combine all three layers into one reusable class.
Step 1: Install Dependencies
You need anthropic, tenacity for retry logic, and tiktoken as a fast token counter (Claude uses a similar BPE tokenizer to GPT — not identical, but close enough for budget tracking as long as you pad the estimate, since tiktoken typically undercounts Claude’s tokens; the safety margin in Step 3 absorbs that).
```bash
pip install anthropic tenacity tiktoken
```
Pin your versions. The anthropic SDK changes its error hierarchy more than you’d expect.
```bash
pip install anthropic==0.29.0 tenacity==8.3.0 tiktoken==0.7.0
```
Step 2: Build an Exponential Backoff Wrapper
The naive approach is time.sleep(1) on 429. The production approach uses exponential backoff with full jitter — randomizing the wait prevents thundering herd when multiple workers hit the limit simultaneously. This is the same pattern discussed in our guide on building LLM fallback and retry logic for production, but adapted specifically for Anthropic’s error types.
```python
import anthropic
import logging

from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential_jitter,
    retry_if_exception_type,
    before_sleep_log,
)

logger = logging.getLogger(__name__)

# Anthropic raises RateLimitError for 429s and APIStatusError for 529 (overloaded).
# Note: RateLimitError is itself a subclass of APIStatusError.
RETRYABLE_EXCEPTIONS = (
    anthropic.RateLimitError,
    anthropic.APIStatusError,  # includes 529 overloaded
    anthropic.APIConnectionError,
)

@retry(
    retry=retry_if_exception_type(RETRYABLE_EXCEPTIONS),
    wait=wait_exponential_jitter(initial=1, max=60, jitter=8),
    stop=stop_after_attempt(6),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    reraise=True,
)
def call_with_backoff(client: anthropic.Anthropic, **kwargs):
    """
    Wraps client.messages.create with retry logic.
    Pass all standard create() kwargs directly.
    """
    return client.messages.create(**kwargs)
```
The wait_exponential_jitter config above starts at 1 second, caps at 60, and adds up to 8 seconds of random jitter. With 6 attempts, your worst-case time spent waiting before giving up is roughly 70 seconds: 31 seconds of exponential waits (1 + 2 + 4 + 8 + 16) plus up to 40 seconds of jitter. That’s acceptable for background jobs, too slow for user-facing requests. For synchronous user flows, drop stop_after_attempt to 3 and set max=10.
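You can sanity-check that worst case yourself; there are five sleeps between six attempts:

```python
# Worst-case sleep time for wait_exponential_jitter(initial=1, max=60, jitter=8)
# with stop_after_attempt(6): five waits between the six attempts.
waits = [min(1 * 2 ** n, 60) for n in range(5)]
worst_case = sum(waits) + 5 * 8   # exponential waits plus maximum jitter per wait
print(waits)        # [1, 2, 4, 8, 16]
print(worst_case)   # 71 seconds, ignoring time spent in the requests themselves
```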
One thing the docs don’t warn you about: APIStatusError covers all non-connection HTTP errors, including 400 bad request. You need to check the status code before retrying, or you’ll hammer the API with malformed requests for six attempts.
```python
def is_retryable(exc: Exception) -> bool:
    if isinstance(exc, anthropic.RateLimitError):
        return True
    if isinstance(exc, anthropic.APIStatusError):
        # Only retry 429 and 529, not 400/401/403
        return exc.status_code in (429, 529)
    if isinstance(exc, anthropic.APIConnectionError):
        return True
    return False
```
Step 3: Implement a Token Budget Tracker
Backoff handles the recovery. Token budgeting prevents the collision in the first place. The idea: count estimated tokens before each request and block if you’re within a threshold of your TPM limit. This is especially important when running batch document processing — if you’re processing thousands of documents with the Claude API, you’ll burn through token quotas faster than your RPM limit would suggest.
```python
import time
import threading

import tiktoken

class TokenBudgetTracker:
    """
    Sliding window token budget tracker.
    Tracks tokens used in the last 60 seconds and blocks
    if adding a new request would exceed the TPM limit.
    """

    def __init__(self, tpm_limit: int, safety_margin: float = 0.9):
        # safety_margin=0.9 means we stop at 90% of limit
        self.tpm_limit = int(tpm_limit * safety_margin)
        self.window = []  # list of (timestamp, token_count) tuples
        self.lock = threading.Lock()
        self._encoder = tiktoken.get_encoding("cl100k_base")  # approximates Claude's tokenizer

    def estimate_tokens(self, text: str) -> int:
        return len(self._encoder.encode(text))

    def _prune_window(self):
        """Remove entries older than 60 seconds."""
        cutoff = time.monotonic() - 60
        self.window = [(ts, count) for ts, count in self.window if ts > cutoff]

    def current_usage(self) -> int:
        with self.lock:
            self._prune_window()
            return sum(count for _, count in self.window)

    def wait_for_budget(self, token_count: int, poll_interval: float = 0.5):
        """Block until there's budget for token_count tokens."""
        if token_count > self.tpm_limit:
            # A single oversized request can never fit: fail fast instead of spinning forever
            raise ValueError(f"request of {token_count} tokens exceeds TPM budget of {self.tpm_limit}")
        while True:
            with self.lock:
                self._prune_window()
                used = sum(count for _, count in self.window)
                if used + token_count <= self.tpm_limit:
                    self.window.append((time.monotonic(), token_count))
                    return
            # Budget exhausted — wait and try again
            time.sleep(poll_interval)

    def record_actual_usage(self, input_tokens: int, output_tokens: int):
        """
        Call this after a response to correct the estimate.
        Output tokens are often larger than you'd guess from the prompt alone.
        """
        with self.lock:
            actual = input_tokens + output_tokens
            # Adjust the last entry if we recorded an estimate
            if self.window:
                ts, _ = self.window[-1]
                self.window[-1] = (ts, actual)
```
The 90% safety margin matters. tiktoken underestimates Claude’s actual token count by ~5-15% depending on the content (code, JSON, and non-English text are the worst offenders). Running at 90% of your limit gives you a buffer without wasting significant quota headroom.
Step 4: Add a Queue-Based Concurrency Controller
For async workloads with multiple concurrent workers, you need more than a token tracker — you need controlled concurrency. A semaphore limits simultaneous in-flight requests; combining it with the token tracker gives you both RPM and TPM protection.
```python
import asyncio

class AsyncRateLimitedQueue:
    """
    Async queue that enforces max concurrency and rate limits.
    Use this when you have many tasks to process (batch jobs, pipelines).
    """

    def __init__(
        self,
        max_concurrent: int = 5,
        rpm_limit: int = 50,
    ):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        # Minimum gap between requests based on RPM
        self.min_interval = 60.0 / rpm_limit
        self._last_request_time = 0.0
        self._lock = asyncio.Lock()

    async def _wait_for_rpm_slot(self):
        """Enforce minimum interval between requests."""
        async with self._lock:
            now = asyncio.get_running_loop().time()
            elapsed = now - self._last_request_time
            if elapsed < self.min_interval:
                await asyncio.sleep(self.min_interval - elapsed)
            self._last_request_time = asyncio.get_running_loop().time()

    async def run(self, coro):
        """Execute a coroutine within rate-limited concurrency bounds."""
        async with self.semaphore:
            await self._wait_for_rpm_slot()
            return await coro
```
Set max_concurrent conservatively. Five concurrent requests at 2,000 tokens each = 10,000 tokens/burst. If your Tier 1 Sonnet TPM is 40,000, that’s fine. If you’re on Haiku at 50,000 TPM with 500-token requests, you can push max_concurrent to 10-15 safely. Do the math for your actual workload before tuning this.
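That sizing math is worth encoding rather than redoing by hand every time you change models. A hypothetical helper (the four-bursts-per-minute divisor is an assumption; tune it to your traffic shape):

```python
def max_safe_concurrency(tpm_limit: int, avg_tokens_per_request: int,
                         bursts_per_minute: int = 4) -> int:
    """Rough heuristic: leave headroom for several full bursts per minute."""
    return max(1, tpm_limit // (avg_tokens_per_request * bursts_per_minute))

print(max_safe_concurrency(40_000, 2_000))   # 5, matching the Sonnet Tier 1 example above
```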
Step 5: Wire It All Together with a Production Client
Here’s the complete client that combines all three layers. This is what actually runs in production — not three separate things you have to remember to call in the right order.
```python
import asyncio
from dataclasses import dataclass

import anthropic

@dataclass
class RateLimitConfig:
    rpm_limit: int = 50          # requests per minute
    tpm_limit: int = 40_000      # tokens per minute
    max_concurrent: int = 5      # max parallel requests
    safety_margin: float = 0.90  # % of quota to actually use
    max_retries: int = 6         # mirrors stop_after_attempt in call_with_backoff

class ProductionClaudeClient:
    def __init__(self, api_key: str, config: RateLimitConfig | None = None):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.config = config or RateLimitConfig()
        self.budget = TokenBudgetTracker(
            tpm_limit=self.config.tpm_limit,
            safety_margin=self.config.safety_margin,
        )
        self.queue = AsyncRateLimitedQueue(
            max_concurrent=self.config.max_concurrent,
            rpm_limit=self.config.rpm_limit,
        )

    def _estimate_request_tokens(self, messages: list, system: str = "") -> int:
        combined = system + " ".join(
            m["content"] if isinstance(m["content"], str) else str(m["content"])
            for m in messages
        )
        return self.budget.estimate_tokens(combined)

    async def create_message(
        self,
        messages: list,
        model: str = "claude-3-5-sonnet-20241022",
        max_tokens: int = 1024,
        system: str = "",
        **kwargs,
    ):
        # Estimate tokens and wait for budget. Run the sync, time.sleep-based
        # wait in a worker thread so it can't block the event loop.
        estimated_tokens = self._estimate_request_tokens(messages, system) + max_tokens
        await asyncio.to_thread(self.budget.wait_for_budget, estimated_tokens)

        async def _call():
            loop = asyncio.get_running_loop()
            # Run sync SDK call in thread pool to avoid blocking event loop
            response = await loop.run_in_executor(
                None,
                lambda: call_with_backoff(
                    self.client,
                    model=model,
                    max_tokens=max_tokens,
                    messages=messages,
                    system=system,
                    **kwargs,
                ),
            )
            # Correct the token estimate with actuals from response
            self.budget.record_actual_usage(
                response.usage.input_tokens,
                response.usage.output_tokens,
            )
            return response

        return await self.queue.run(_call())
```
Usage is clean:
```python
import asyncio

async def main():
    client = ProductionClaudeClient(
        api_key="sk-ant-...",
        config=RateLimitConfig(rpm_limit=50, tpm_limit=40_000, max_concurrent=5),
    )
    tasks = [
        client.create_message(
            messages=[{"role": "user", "content": f"Summarize document {i}"}],
            system="You are a concise summarizer.",
            max_tokens=512,
        )
        for i in range(20)
    ]
    results = await asyncio.gather(*tasks)
    for r in results:
        print(r.content[0].text[:100])

asyncio.run(main())
```
Running 20 requests through this at Tier 1 Sonnet limits costs a few cents per run at current pricing ($3/$15 per million input/output tokens). The output side dominates, so the exact figure scales with response length.
Common Errors and How to Fix Them
Error 1: 429s still occurring after adding backoff
The backoff is working but you’re still getting 429s because multiple processes share the same API key without a shared token budget. The TokenBudgetTracker above is in-process only. If you run the same code on three servers, each thinks it has full quota. Fix: move the budget tracker to Redis with an atomic INCRBY + TTL pattern, or use Anthropic’s Workspaces feature to partition quota between services at the account level.
Error 2: APIStatusError on 400 getting retried indefinitely
You forgot to filter on status code in is_retryable(). Replace the retry_if_exception_type(RETRYABLE_EXCEPTIONS) predicate with retry_if_exception(is_retryable) using the function from Step 2. This is one of those things that burns you in staging when requests look fine but a schema change starts sending malformed payloads.
Error 3: Event loop blocking under high concurrency
TokenBudgetTracker.wait_for_budget() uses time.sleep() — that blocks the thread. In pure async code, call it from a thread executor or rewrite it using asyncio.sleep(). The AsyncRateLimitedQueue already uses asyncio.sleep, but the budget tracker is intentionally sync-friendly for teams mixing sync and async code. Pick one model and stick to it; mixing them is how you get subtle deadlocks that only appear under load.
If you’re also dealing with non-rate-limit failures — context length errors, tool call parsing failures, model refusals — the patterns in our error handling and fallback logic guide for production Claude agents pair well with what you’ve built here.
What to Build Next
Add observability. Right now you’re rate limiting correctly but flying blind on how often you’re hitting the budget ceiling, how long workers wait in the queue, and what your actual vs. estimated token counts look like over time. Instrument ProductionClaudeClient with Prometheus counters or push metrics to a service like Langfuse — track tokens_estimated, tokens_actual, budget_wait_seconds, and retry_count_per_request. Once you have those metrics, you can tune max_concurrent and safety_margin with actual data instead of guessing. That observability layer is also where you’d surface the kind of quota pressure signals that let you dynamically route cheaper requests to Claude Haiku when Sonnet quota is tight — a pattern worth building once your workload grows past a few hundred requests per day.
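If you don’t want a metrics dependency yet, even a dict-backed stand-in surfaces the estimate-accuracy signal. A sketch using the metric names suggested above; a real deployment would swap in Prometheus counters:

```python
from collections import defaultdict

class ClientMetrics:
    """In-process stand-in for real metrics counters (sketch)."""

    def __init__(self):
        self.counters: defaultdict[str, float] = defaultdict(float)

    def inc(self, name: str, value: float = 1.0) -> None:
        self.counters[name] += value

    def estimate_ratio(self) -> float:
        """tokens_actual / tokens_estimated; above 1.0 means you're undercounting."""
        est = self.counters["tokens_estimated"]
        return self.counters["tokens_actual"] / est if est else 0.0

m = ClientMetrics()
m.inc("tokens_estimated", 1_000)
m.inc("tokens_actual", 1_120)
print(round(m.estimate_ratio(), 2))   # 1.12: tune safety_margin off numbers like this
```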
Frequently Asked Questions
What are Claude API rate limits by tier?
Anthropic uses a tiered system where Tier 1 (new accounts) gives you roughly 50 RPM and 40,000 TPM for Claude 3.5 Sonnet, scaling up through Tier 4 (enterprise) at 4,000 RPM and 400,000 TPM. Exact numbers vary by model — Haiku has higher TPM limits than Sonnet at equivalent tiers. Always check the official rate limits page since these change as Anthropic scales capacity. You get bumped to higher tiers by spending more per calendar month.
How do I handle 529 overloaded errors differently from 429 rate limit errors?
Both warrant retry with backoff, but 529 means Anthropic’s infrastructure is under load — not that you’ve exceeded your quota. For 529s, use a longer initial backoff (5-10 seconds minimum) since hammering a temporarily overloaded endpoint wastes both retry attempts and RPM quota. For 429s, check the retry-after header if present; otherwise use exponential backoff. In code, both come through as APIStatusError — filter on exc.status_code to differentiate and tune the backoff parameters separately.
Can I use the Anthropic Python SDK’s built-in retry logic instead of implementing my own?
Yes — pass max_retries=5 to the anthropic.Anthropic() constructor and it handles basic retry with backoff automatically. The limitation is that it doesn’t integrate with token budgeting or queue concurrency control, so you’ll still hit TPM limits even if individual calls eventually succeed. Use the SDK’s built-in retry as a safety net, but add the token budget layer on top if you’re running concurrent workers or batch jobs.
How do I share rate limit state across multiple servers or worker processes?
The in-process TokenBudgetTracker in this tutorial won’t work across processes. For multi-server deployments, store the sliding window in Redis using a sorted set (timestamp as score, token count as value) with ZADD and ZRANGEBYSCORE for atomic reads. Alternatively, put a single rate-limiting proxy service in front of all your Claude API calls — a thin FastAPI or Go service that all workers route through. This is operationally simpler than distributed state and easier to monitor.
Is tiktoken accurate enough for estimating Claude’s token counts?
Close enough for budgeting purposes. Claude uses a custom tokenizer, but tiktoken’s cl100k_base encoding typically underestimates by 5-15% on mixed content. That’s why the TokenBudgetTracker uses a 90% safety margin — it compensates for the underestimate. If you’re processing content with lots of code, JSON, or non-Latin scripts, consider dropping the safety margin to 80% or using Anthropic’s token counting endpoint (client.beta.messages.count_tokens()) for high-value requests where accuracy matters more than speed.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

