By the end of this tutorial, you’ll have a working Python pipeline that submits 50,000+ documents to Claude’s Batch API, polls for completion, and retrieves results — at roughly half the cost of synchronous API calls. If you’re processing large document volumes and paying full price per request, you’re leaving real money on the table.
Batch processing LLM APIs is one of the most underused cost-reduction strategies available right now. Anthropic’s Message Batches API offers a guaranteed 50% discount over standard API pricing in exchange for a relaxed turnaround time (up to 24 hours). For workloads that don’t need real-time responses — content audits, document classification, data extraction, nightly enrichment jobs — this is a straightforward win.
At current Claude Haiku 3.5 pricing (~$0.0004 per 1K input tokens via batch vs ~$0.0008 synchronous), processing 50,000 documents averaging 500 input tokens each costs roughly $10 batch vs $20 synchronous on input alone. At Sonnet scale, the savings are even more significant. That’s before you factor in that you’re no longer building rate-limit retry logic around a sustained high-throughput stream.
- Install dependencies — set up the Anthropic SDK and supporting libraries
- Structure your batch requests — format documents into the JSONL request schema
- Submit the batch job — send the batch and capture the batch ID
- Poll for completion — implement a non-blocking status checker
- Retrieve and parse results — stream results and map back to source documents
- Handle failures and partial results — deal with individual request failures without losing the whole job
Step 1: Install Dependencies
You need anthropic>=0.30.0 for the Batches API. Earlier releases don’t expose the messages.batches namespace at all, and the streaming result retrieval in Step 5 depends on it.
```shell
pip install "anthropic>=0.30.0" python-dotenv tqdm
```

Note the quotes around the version specifier: an unquoted `>` is interpreted by the shell as output redirection.

```python
import os
import json
import time
from pathlib import Path

import anthropic
from dotenv import load_dotenv
from tqdm import tqdm

load_dotenv()
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
```
Step 2: Structure Your Batch Requests
Each request in the batch needs a unique custom_id you control — this is how you map results back to your source documents. The params block is identical to a standard Messages API call. There’s a hard limit of 10,000 requests per batch, so for 50,000 documents you’ll be splitting into five batches.
```python
def build_batch_requests(documents: list[dict], system_prompt: str) -> list[dict]:
    """
    documents: list of {"id": str, "text": str}
    Returns a list of batch request objects (max 10,000 per batch).
    """
    requests = []
    for doc in documents:
        requests.append({
            "custom_id": doc["id"],  # must be unique within the batch
            "params": {
                "model": "claude-3-5-haiku-latest",  # use Haiku for cost; swap for Sonnet if quality matters
                "max_tokens": 512,
                "system": system_prompt,
                "messages": [
                    {"role": "user", "content": doc["text"]}
                ]
            }
        })
    return requests


def chunk_requests(requests: list, chunk_size: int = 10_000) -> list[list]:
    """Split into sub-lists of up to 10,000 — the API hard limit per batch."""
    return [requests[i:i + chunk_size] for i in range(0, len(requests), chunk_size)]
```
If you’re extracting structured data from these documents, pair this with Claude’s structured extraction patterns — using JSON mode in your system prompt keeps downstream parsing trivial.
Step 3: Submit the Batch Job
Submission is a single API call. It returns immediately with a batch object containing an id you’ll need for polling. Save this ID to disk immediately — if your script dies before completion, you can still retrieve results without re-running the whole job.
```python
def submit_batch(requests: list[dict]) -> str:
    """Submit one batch (up to 10,000 requests). Returns the batch ID."""
    response = client.messages.batches.create(requests=requests)
    batch_id = response.id
    # Persist the batch ID so we can recover if the process restarts
    with open("batch_ids.txt", "a") as f:
        f.write(batch_id + "\n")
    print(f"Submitted batch: {batch_id} ({len(requests)} requests)")
    return batch_id


def submit_all_batches(documents: list[dict], system_prompt: str) -> list[str]:
    """Handle documents > 10,000 by splitting into multiple batch jobs."""
    all_requests = build_batch_requests(documents, system_prompt)
    chunks = chunk_requests(all_requests)
    batch_ids = []
    for i, chunk in enumerate(chunks):
        print(f"Submitting chunk {i + 1}/{len(chunks)}...")
        batch_id = submit_batch(chunk)
        batch_ids.append(batch_id)
        time.sleep(1)  # avoid hammering the submission endpoint
    return batch_ids
```
Step 4: Poll for Completion
Batches can take anywhere from 5 minutes to 24 hours. Poll on a fixed or exponentially growing interval rather than hammering the status endpoint every second. The batch status moves through in_progress → ended. The ended state covers both full success and partial completion (some requests failed).
```python
def poll_until_complete(batch_id: str, poll_interval_seconds: int = 60):
    """
    Block until the batch completes. Returns the final batch object.
    For production, run this in a background job — not in your web request handler.
    """
    print(f"Polling batch {batch_id}...")
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        status = batch.processing_status
        counts = batch.request_counts
        print(
            f"  Status: {status} | "
            f"Succeeded: {counts.succeeded} | "
            f"Errored: {counts.errored} | "
            f"Processing: {counts.processing}"
        )
        if status == "ended":
            return batch
        time.sleep(poll_interval_seconds)
```
For production jobs, I’d push this polling into a lightweight scheduled task (a cron job or an n8n workflow) rather than a blocking Python loop. That way your main process can exit and restart without losing state. If you’re comparing orchestration options, the n8n vs Make vs Zapier breakdown is worth reading before you commit to a scheduling layer.
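The function above polls on a fixed interval, which is fine for most jobs. If you want to cut down API chatter on long-running batches, an exponential backoff variant is a small change. This is a sketch; the start and cap values below are arbitrary choices, not recommendations.

```python
import time


def backoff_intervals(start: int = 60, cap: int = 1800, factor: float = 2.0):
    """Yield sleep intervals: 60s, 120s, 240s, ... capped at 30 minutes."""
    interval = start
    while True:
        yield interval
        interval = min(int(interval * factor), cap)


def poll_with_backoff(client, batch_id: str):
    """Like poll_until_complete, but with exponentially growing waits."""
    for interval in backoff_intervals():
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        time.sleep(interval)
```

Early checks stay frequent (a small batch may finish in minutes), while a batch that runs for hours is only polled every half hour.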
Step 5: Retrieve and Parse Results
Results are streamed as JSONL — don’t load the entire response into memory at once if you’re processing tens of thousands of documents. The SDK exposes a streaming iterator. Map each result back to your source document using the custom_id.
```python
def retrieve_results(batch_id: str, output_path: str = "results.jsonl"):
    """
    Stream results and write to a JSONL file.
    Each line: {"id": custom_id, "status": "succeeded"|"errored", "content": str|None, "error": str|None}
    """
    succeeded = 0
    errored = 0
    with open(output_path, "w") as out_file:
        # .results() returns a streaming iterator — memory-safe for large batches
        for result in client.messages.batches.results(batch_id):
            custom_id = result.custom_id
            if result.result.type == "succeeded":
                content = result.result.message.content[0].text
                out_file.write(json.dumps({
                    "id": custom_id,
                    "status": "succeeded",
                    "content": content,
                    "error": None
                }) + "\n")
                succeeded += 1
            else:
                # errored or expired
                error_msg = str(result.result.error) if hasattr(result.result, "error") else "unknown"
                out_file.write(json.dumps({
                    "id": custom_id,
                    "status": "errored",
                    "content": None,
                    "error": error_msg
                }) + "\n")
                errored += 1
    print(f"Done. Succeeded: {succeeded}, Errored: {errored}")
    print(f"Results written to {output_path}")
```
Step 6: Handle Failures and Partial Results
Individual request failures don’t kill the batch — the job completes and you get a mix of succeeded and errored results. The right pattern is to collect all failed custom_ids and resubmit them as a new batch. Don’t retry inline synchronously — you lose the cost benefit.
```python
def collect_failed_ids(results_path: str) -> list[str]:
    """Read the results JSONL and return IDs that need retrying."""
    failed_ids = []
    with open(results_path) as f:
        for line in f:
            record = json.loads(line)
            if record["status"] == "errored":
                failed_ids.append(record["id"])
    return failed_ids


def retry_failed_documents(failed_ids: list[str], original_documents: list[dict], system_prompt: str):
    """Look up the original document text by ID and resubmit as a new batch."""
    doc_map = {doc["id"]: doc for doc in original_documents}
    failed_docs = [doc_map[fid] for fid in failed_ids if fid in doc_map]
    if not failed_docs:
        print("No failed documents to retry.")
        return
    print(f"Retrying {len(failed_docs)} failed documents...")
    batch_ids = submit_all_batches(failed_docs, system_prompt)
    return batch_ids
```
For workloads where individual failures are unacceptable (compliance document processing, for instance), consider a hybrid approach: use batch for the bulk volume and fall back to the synchronous API for any failures. The fallback and retry logic patterns article covers how to structure that gracefully without building a spaghetti retry loop.
Putting It All Together
```python
SYSTEM_PROMPT = """You are a document classifier. Given the document text, respond with a JSON object:
{"category": "...", "sentiment": "positive|neutral|negative", "summary": "one sentence"}
Respond only with the JSON object, no other text."""

# Load your documents — each needs an "id" and "text" field
documents = [
    {"id": f"doc_{i}", "text": f"Sample document content {i}..."}
    for i in range(50_000)
]

# Submit
batch_ids = submit_all_batches(documents, SYSTEM_PROMPT)

# Poll all batches
for batch_id in batch_ids:
    poll_until_complete(batch_id, poll_interval_seconds=120)
    retrieve_results(batch_id, output_path=f"results_{batch_id}.jsonl")

# Retry failures
for batch_id in batch_ids:
    failed_ids = collect_failed_ids(f"results_{batch_id}.jsonl")
    if failed_ids:
        retry_failed_documents(failed_ids, documents, SYSTEM_PROMPT)
```
Common Errors
Error 1: “request_too_large” on individual items
This happens when a single document exceeds the token limit for the model. The batch will complete but those requests will be marked errored. Fix: pre-filter documents with a token estimator before submission. A rough heuristic is 4 characters per token — anything over 600K characters for a 200K context model should be chunked first. For very large documents, see the RAG pipeline pattern for splitting and summarising before you even hit the batch API.
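One way to sketch that pre-filter, using the 4-characters-per-token heuristic from above. The 150K-token threshold is an illustrative safety margin under a 200K context window, not a documented limit; tune it for your model and prompt overhead.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English prose."""
    return len(text) // 4


def filter_oversized(documents: list[dict],
                     max_tokens: int = 150_000) -> tuple[list[dict], list[dict]]:
    """Split documents into (submittable, too_large) before batch submission.

    Oversized documents go to a separate list so you can chunk or
    summarise them instead of letting them error inside the batch.
    """
    ok, too_large = [], []
    for doc in documents:
        if estimate_tokens(doc["text"]) <= max_tokens:
            ok.append(doc)
        else:
            too_large.append(doc)
    return ok, too_large
```

Run this once before `build_batch_requests` so that predictable `request_too_large` errors never consume a slot in the batch.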
Error 2: Batch expires before you poll
Batches expire after 24 hours. If your polling script has a bug or crashes, you lose the results. Fix: always persist the batch_id to durable storage (database, S3 object) immediately after submission. The API will let you retrieve results at any point during the 24-hour window — you don’t need to have been polling the whole time.
Error 3: Duplicate custom_id within a batch
The API rejects batches with duplicate custom_id values. This bites you when your document IDs aren’t actually unique (e.g., using database row IDs across different tables without namespacing). Fix: prefix IDs with a job identifier: f"job_{job_id}_{doc_id}". Always deduplicate before submission.
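A minimal sketch of that namespacing plus deduplication step (the `job_` prefix format mirrors the example above; keep-first is one reasonable dedupe policy):

```python
def namespace_ids(documents: list[dict], job_id: str) -> list[dict]:
    """Prefix each document ID with a job identifier and drop duplicates.

    Keeps the first occurrence of each resulting custom_id; later
    duplicates are skipped so the batch submission can't be rejected.
    """
    seen: set[str] = set()
    out = []
    for doc in documents:
        custom_id = f"job_{job_id}_{doc['id']}"
        if custom_id in seen:
            continue
        seen.add(custom_id)
        out.append({**doc, "id": custom_id})
    return out
```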
Real Cost Numbers
To make this concrete: processing 50,000 documents at 500 input tokens + 200 output tokens each with Claude Haiku 3.5:
- Synchronous API: 25M input tokens × $0.0008/1K + 10M output × $0.004/1K = $20 + $40 = $60
- Batch API (50% discount): $30 total
- Saving: $30 per 50K documents
At Sonnet 3.5 pricing ($3/1M input, $15/1M output), the same volume costs ~$225 synchronous vs ~$112.50 batch, a saving of roughly $110. If you’re running these jobs daily, that’s real infrastructure budget recovered. This also pairs well with thinking about model selection for high-volume workloads; sometimes the right answer is mixing models depending on the complexity of each document.
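To sanity-check these numbers for your own volumes, a small estimator helps. The pricing constants below are the assumed Haiku 3.5 rates at time of writing; verify current pricing before relying on them.

```python
# Assumed Claude Haiku 3.5 rates (USD per 1K tokens) at time of writing.
INPUT_PER_1K = 0.0008
OUTPUT_PER_1K = 0.004
BATCH_DISCOUNT = 0.5  # Batch API bills at half the synchronous rate


def estimate_cost(n_docs: int, in_tokens: int, out_tokens: int,
                  batch: bool = True) -> float:
    """Estimated USD cost for a run of n_docs documents."""
    cost = n_docs * (in_tokens * INPUT_PER_1K + out_tokens * OUTPUT_PER_1K) / 1000
    return cost * BATCH_DISCOUNT if batch else cost
```

With the article’s workload (50,000 documents, 500 input + 200 output tokens each) this reproduces the $60 synchronous / $30 batch figures above.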
What to Build Next
Add a job management layer — right now the batch IDs are written to a flat text file. The natural extension is a SQLite or Postgres table that tracks: batch_id, submitted_at, status, total_requests, succeeded_count, errored_count, and results_path. Wire a simple polling daemon to that table and you have a lightweight async document processing queue that survives process restarts, gives you job history, and makes it easy to build a status dashboard or Slack notification on completion.
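A minimal sketch of that table, assuming SQLite and the column names listed above (the schema and helper are illustrative, not a prescribed design):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS batch_jobs (
    batch_id        TEXT PRIMARY KEY,
    submitted_at    TEXT NOT NULL,
    status          TEXT NOT NULL DEFAULT 'in_progress',
    total_requests  INTEGER NOT NULL,
    succeeded_count INTEGER DEFAULT 0,
    errored_count   INTEGER DEFAULT 0,
    results_path    TEXT
)
"""


def record_submission(conn: sqlite3.Connection, batch_id: str, total: int) -> None:
    """Insert a row as soon as the batch is submitted, before any polling."""
    conn.execute(
        "INSERT INTO batch_jobs (batch_id, submitted_at, total_requests) "
        "VALUES (?, datetime('now'), ?)",
        (batch_id, total),
    )
    conn.commit()
```

The polling daemon then updates `status`, the counts, and `results_path` as batches complete, which replaces the flat `batch_ids.txt` file entirely.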
Frequently Asked Questions
How long does Claude’s Batch API actually take to process requests?
Anthropic guarantees results within 24 hours, but in practice smaller batches (under 5,000 requests) often complete in 5–30 minutes during off-peak hours. Larger batches or peak times can take several hours. Don’t build any workflow that assumes a specific turnaround time — design for “check after 1 hour, poll every 30 minutes after that”.
Can I cancel a batch job after submission?
Yes. Call client.messages.batches.cancel(batch_id). Requests that have already been processed will still be available in the results; unprocessed ones are cancelled. You’re billed only for requests that completed before cancellation.
What’s the maximum number of requests per batch and per day?
Each batch is capped at 10,000 requests. There’s also a limit of 100,000 requests across all in-flight batches at once — so you can’t just submit 50 batches simultaneously and walk away. You’ll need to throttle submissions and wait for earlier batches to complete before queuing more. Check your tier limits in the Anthropic console as these change with account tier.
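One way to sketch that throttling, assuming the SDK’s `client.messages.batches.list()` iterator and the `request_counts.processing` field; the 100,000 cap and 5-minute wait are placeholders for whatever your tier actually allows.

```python
import time


def count_in_flight(client) -> int:
    """Sum requests still processing across all non-ended batches."""
    total = 0
    for batch in client.messages.batches.list():
        if batch.processing_status != "ended":
            total += batch.request_counts.processing
    return total


def submit_when_capacity(client, submit_fn, chunk,
                         cap: int = 100_000, wait: int = 300):
    """Block until the in-flight request count leaves room for this chunk."""
    while count_in_flight(client) + len(chunk) > cap:
        time.sleep(wait)
    return submit_fn(chunk)
```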
Does the Batch API support all Claude models?
As of writing, the Batch API supports current Claude Haiku, Sonnet, and Opus models; legacy Claude 2 models are not supported. Always check the Anthropic documentation before building, as the supported list expands over time.
How do I handle documents that are too long for the context window?
Pre-chunk them before creating batch requests. A rough heuristic: estimate tokens as len(text) / 4, and if that exceeds 80% of your target model’s context window, split the document into overlapping chunks and submit each chunk as a separate request. Aggregate the results afterward. For large-scale RAG workflows this chunking step is a prerequisite anyway.
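A sketch of that chunking heuristic (the default sizes are illustrative; the len/4 estimate is rough, so leave headroom):

```python
def chunk_text(text: str, max_tokens: int = 150_000,
               overlap_tokens: int = 2_000) -> list[str]:
    """Split text into overlapping chunks using the ~4 chars/token estimate.

    Overlap preserves context across chunk boundaries so a classification
    or extraction isn't cut off mid-sentence.
    """
    max_chars = max_tokens * 4
    overlap_chars = overlap_tokens * 4
    if len(text) <= max_chars:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap_chars
    return chunks
```

Each chunk then becomes its own batch request (with a `custom_id` like `doc_7_chunk_2`) and the per-chunk results are merged afterward.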
Is the Batch API available on all Anthropic account tiers?
The Batch API requires at least a paid account. Free-tier and trial accounts don’t have access. Rate limits and concurrent batch caps also vary by tier — if you’re processing 50K+ documents regularly, you may need to request a limit increase through the Anthropic console.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

