If you’re running document analysis, classification, or extraction tasks against Claude one request at a time, you’re paying roughly twice what you need to and your pipeline probably breaks on anything over a few hundred documents. The batch processing API from Anthropic cuts costs by 50% and removes the per-minute rate limit problem entirely — but the documentation glosses over several sharp edges that will bite you in production. This article covers the real implementation: chunking strategies, async polling, error recovery, and what actually happens when a batch partially fails.
## Why Synchronous LLM Calls Break at Scale
The naive approach is a loop: iterate over your documents, call the API, write the result. Works fine for 50 documents. At 5,000, you’re fighting three separate problems simultaneously.
- Rate limits: Claude’s API enforces requests-per-minute and tokens-per-minute limits. With claude-3-haiku-20240307 on the default tier, you’d hit the tokens-per-minute ceiling somewhere around 500–800 document summaries per minute, depending on doc length.
- Cost: Synchronous calls use standard pricing. The Message Batches API costs 50% less — $0.000125 per 1K input tokens on Haiku vs $0.00025 synchronous.
- Failure amplification: A network hiccup at document 3,000 means either you retry everything or you build your own checkpoint system.
The batch API solves the first two directly. The third one you still have to handle yourself, but at least the batch itself is atomic — Anthropic’s infrastructure holds the requests and you poll for results rather than maintaining a persistent connection.
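To make the failure modes concrete, here is the naive pattern in sketch form. `classify_documents_sync` and `classify_fn` are illustrative names, not part of the Anthropic SDK; `classify_fn` stands in for a real `client.messages.create` call:

```python
def classify_documents_sync(documents: list[dict], classify_fn) -> dict[str, str]:
    """Naive synchronous loop: one API call per document.

    classify_fn is a stand-in for a real API call. Note the failure
    amplification: an exception at document 3,000 loses all progress
    unless you add your own checkpointing.
    """
    results = {}
    for doc in documents:
        results[doc["id"]] = classify_fn(doc["text"])
    return results
```

Every document here costs a full synchronous round trip at standard pricing, which is exactly what the batch API removes.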
## How the Message Batches API Actually Works
You submit a list of requests (up to 10,000 per batch) as a single API call. Each request in the batch gets a custom_id you define. Anthropic processes them asynchronously and you poll a status endpoint until the batch completes. Results are returned as a JSONL file you download.
Current limits worth knowing: 10,000 requests per batch, results are available for 29 days after the batch is created, and processing typically completes within an hour, though the stated processing window is up to 24 hours. In practice I’ve seen 5,000-item batches finish in 15–20 minutes during off-peak hours.
## Submitting a Batch
```python
import anthropic
import json
from pathlib import Path

client = anthropic.Anthropic(api_key="your-api-key")

def build_batch_requests(documents: list[dict]) -> list[dict]:
    """
    Convert a list of documents into batch request format.
    Each document needs a unique custom_id for tracking results.
    """
    requests = []
    for doc in documents:
        requests.append({
            "custom_id": f"doc-{doc['id']}",  # must be unique within batch
            "params": {
                "model": "claude-3-haiku-20240307",
                "max_tokens": 512,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Classify the following document into one of: [invoice, contract, report, correspondence, other]. Respond with just the category label.\n\n{doc['text'][:4000]}"  # truncate to 4,000 chars to cap input tokens
                    }
                ]
            }
        })
    return requests

# Load your documents however you normally would
documents = [
    {"id": "001", "text": "Dear Sir, Please find attached the invoice..."},
    {"id": "002", "text": "This agreement is entered into between..."},
    # ... up to 10,000
]

requests = build_batch_requests(documents)

# Submit the batch
batch = client.beta.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")

# Save the batch ID — you need this to poll and retrieve results
Path("batch_state.json").write_text(json.dumps({
    "batch_id": batch.id,
    "document_count": len(requests)
}))
```
Save that batch ID somewhere durable. If your polling script crashes before results download, you can resume from the ID without resubmitting.
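A minimal resume-safe wrapper might look like this. `load_or_create_batch_state` and `submit_fn` are hypothetical names; `submit_fn` stands in for the batch submission call above and should return the new batch ID:

```python
import json
from pathlib import Path

def load_or_create_batch_state(state_path: str, submit_fn) -> str:
    """Return a batch ID, resuming from disk if a previous run saved one.

    submit_fn submits the batch and returns its ID; it only runs when
    no saved state exists, so a crashed polling script never resubmits.
    """
    path = Path(state_path)
    if path.exists():
        return json.loads(path.read_text())["batch_id"]
    batch_id = submit_fn()
    path.write_text(json.dumps({"batch_id": batch_id}))
    return batch_id
```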
## Polling for Completion
```python
import time
import json
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

def poll_batch_until_complete(batch_id: str, poll_interval: int = 60):
    """
    Poll batch status until processing completes or fails.
    Returns the final batch object.

    poll_interval: seconds between checks (60s is reasonable — no point hammering)
    """
    while True:
        batch = client.beta.messages.batches.retrieve(batch_id)
        status = batch.processing_status
        counts = batch.request_counts
        print(f"Status: {status} | Processing: {counts.processing} | "
              f"Succeeded: {counts.succeeded} | Errored: {counts.errored}")
        if status == "ended":
            return batch
        if status not in ("in_progress", "canceling"):
            raise RuntimeError(f"Unexpected batch status: {status}")
        time.sleep(poll_interval)

with open("batch_state.json") as f:
    state = json.load(f)

final_batch = poll_batch_until_complete(state["batch_id"])
print(f"Complete. Succeeded: {final_batch.request_counts.succeeded}, "
      f"Errored: {final_batch.request_counts.errored}")
```
## Downloading and Processing Results
Results come back as JSONL — one result object per line, keyed by your custom_id. The order is not guaranteed to match submission order, which trips people up when they’re trying to join results back to the original dataset.
```python
def download_and_parse_results(batch_id: str) -> dict[str, str]:
    """
    Download batch results and return a dict mapping custom_id -> extracted text.
    Handles successful, errored, canceled, and expired results.
    """
    results = {}
    errors = {}
    # Stream the JSONL result file
    for result in client.beta.messages.batches.results(batch_id):
        custom_id = result.custom_id
        if result.result.type == "succeeded":
            # Extract text from the response
            message = result.result.message
            content_text = message.content[0].text if message.content else ""
            results[custom_id] = content_text.strip()
        elif result.result.type == "errored":
            error = result.result.error
            errors[custom_id] = f"{error.type}: {error.message}"
            print(f"Error on {custom_id}: {error.type}")
        else:
            # "canceled" and "expired" results carry no message body
            errors[custom_id] = result.result.type
    if errors:
        print(f"\n{len(errors)} requests failed. Saving error log.")
        with open("batch_errors.json", "w") as f:
            json.dump(errors, f, indent=2)
    return results

results = download_and_parse_results(state["batch_id"])

# Join back to original documents by stripping the "doc-" prefix
for doc_id, classification in results.items():
    original_id = doc_id.removeprefix("doc-")  # safer than replace(), which strips every occurrence
    print(f"Document {original_id}: {classification}")
```
## Handling Partial Failures and Retries
This is where most batch implementations fall apart. A batch with 10,000 requests will typically have 1–50 errors, usually from documents that hit token limits or contain content that triggers a refusal. You need a retry path that only resubmits the failures.
```python
def retry_failed_requests(error_ids: list[str], original_documents: list[dict]) -> dict:
    """
    Resubmit only the documents that failed in the original batch.
    Uses a smaller max_tokens budget to avoid repeat failures on edge cases.
    """
    # Build lookup from document ID
    doc_lookup = {f"doc-{doc['id']}": doc for doc in original_documents}
    retry_requests = []
    for custom_id in error_ids:
        if custom_id not in doc_lookup:
            print(f"Warning: can't find original doc for {custom_id}")
            continue
        doc = doc_lookup[custom_id]
        retry_requests.append({
            "custom_id": f"retry-{custom_id}",  # different prefix to avoid ID collision
            "params": {
                "model": "claude-3-haiku-20240307",
                "max_tokens": 256,  # reduced budget for retry
                "messages": [{
                    "role": "user",
                    "content": f"Classify as invoice/contract/report/correspondence/other:\n\n{doc['text'][:2000]}"  # smaller input slice
                }]
            }
        })
    if not retry_requests:
        return {}
    print(f"Retrying {len(retry_requests)} failed requests...")
    retry_batch = client.beta.messages.batches.create(requests=retry_requests)
    poll_batch_until_complete(retry_batch.id)
    return download_and_parse_results(retry_batch.id)
```
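Because the retry batch prefixes its IDs with `retry-`, you need to strip that prefix when folding retry results back into the main result set. A small sketch (`merge_retry_results` is an illustrative helper, not part of the SDK):

```python
def merge_retry_results(results: dict[str, str], retry_results: dict[str, str]) -> dict[str, str]:
    """Merge retry-batch results into the original result set.

    Retry custom_ids look like 'retry-doc-001'; stripping the 'retry-'
    prefix restores the original 'doc-001' key so downstream joins work.
    """
    merged = dict(results)
    for custom_id, value in retry_results.items():
        merged[custom_id.removeprefix("retry-")] = value
    return merged
```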
## Chunking Strategies for 10K+ Document Sets
The 10,000-request limit per batch means a 50,000-document job needs 5 batches. Don’t submit them all simultaneously — Anthropic’s systems handle concurrent batches, but staggering submissions even by 30–60 seconds tends to give smoother queuing.
```python
import math
import time

def chunk_and_submit_batches(documents: list[dict], chunk_size: int = 8000) -> list[str]:
    """
    Split documents into chunks and submit as separate batches.
    Returns list of batch IDs for later polling.

    Using 8000 instead of 10000 leaves headroom for retries.
    """
    batch_ids = []
    n_chunks = math.ceil(len(documents) / chunk_size)
    for i in range(n_chunks):
        chunk = documents[i * chunk_size:(i + 1) * chunk_size]
        requests = build_batch_requests(chunk)
        batch = client.beta.messages.batches.create(requests=requests)
        batch_ids.append(batch.id)
        print(f"Submitted batch {i+1}/{n_chunks}: {batch.id} ({len(chunk)} docs)")
        # Stagger submissions to avoid queuing issues
        if i < n_chunks - 1:
            time.sleep(30)
    return batch_ids
```
## Real Cost Numbers for Common Workloads
Let’s be specific. For a document classification task where the average document is 500 words (roughly 650 tokens) and you want a short label output (~10 tokens):
- Input tokens per request: ~700 (system prompt overhead included)
- Output tokens per request: ~15
- Synchronous Haiku cost per 10K docs: ~$1.75 (input) + ~$0.19 (output) = ~$1.94
- Batch API Haiku cost per 10K docs: ~$0.88 (input) + ~$0.09 (output) = ~$0.97
That’s a real ~$0.97 saving per 10K documents, or roughly $10/month at 100K documents/month, just from switching to batch. The discount scales with token volume: at document summarization scale (2,000-token input, 300-token output), synchronous Haiku runs about $8.75 per 10K documents and batch about $4.38, so you’re saving $4–5 per 10K documents.
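The arithmetic here is mechanical enough to wrap in a helper. This sketch hard-codes Haiku’s synchronous list prices ($0.25 per million input tokens, $1.25 per million output) and the 50% batch discount; verify current prices before relying on it, and treat `estimate_cost` as an illustrative name:

```python
HAIKU_INPUT_PER_MTOK = 0.25   # USD per million input tokens, synchronous
HAIKU_OUTPUT_PER_MTOK = 1.25  # USD per million output tokens, synchronous
BATCH_DISCOUNT = 0.5          # batch API is 50% of synchronous pricing

def estimate_cost(n_docs: int, input_tokens: int, output_tokens: int, batch: bool = True) -> float:
    """Back-of-envelope USD cost for a Haiku job at the prices above."""
    input_cost = n_docs * input_tokens / 1_000_000 * HAIKU_INPUT_PER_MTOK
    output_cost = n_docs * output_tokens / 1_000_000 * HAIKU_OUTPUT_PER_MTOK
    total = input_cost + output_cost
    return total * BATCH_DISCOUNT if batch else total
```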
Haiku is the right model for classification and extraction at this scale. Sonnet costs 12x more per input token and isn’t meaningfully better at structured output tasks with clear classification schemas.
## What Breaks in Production (And How to Fix It)
### Token limit errors on individual requests
Documents that exceed the model’s context window error in the batch result rather than failing at submission. Always truncate input at a safe limit — the example code cuts at 4,000 characters, roughly 1,000 tokens of English text, which leaves ample headroom in Haiku’s window even with prompt overhead. Build truncation into your build_batch_requests function, not as an afterthought.
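A character cap is the cheapest way to enforce a token budget. This sketch assumes roughly 4 characters per token for English text — a common rule of thumb, not an exact tokenizer — and `truncate_to_token_budget` is an illustrative name:

```python
def truncate_to_token_budget(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Crudely truncate text to an approximate token budget.

    Assumes ~4 characters per token for English prose. For exact budgets
    you'd count real tokens, but for batch classification a character
    cap is cheap and good enough.
    """
    return text[:max_tokens * chars_per_token]
```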
### The 29-day expiry trap
Batch results expire after 29 days. If you’re running monthly reconciliation jobs and submit on day 1 but don’t download until day 30, you lose the results. Download results within 24 hours of completion and store them in your own storage.
### Polling script restarts
Always persist batch IDs to disk immediately after submission. A batch that’s processing doesn’t need resubmission — you just need the ID to resume polling. The example above uses a JSON file; in production, write this to your database.
### Batch processing API rate limits on submission
Yes, you can hit rate limits on batch submission itself if you’re submitting too many batches quickly. If you get a 429 on batches.create(), wait 60 seconds and retry with exponential backoff. The staggered submission pattern above mostly avoids this.
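A minimal backoff wrapper looks like this. `create_batch_with_backoff` and `submit_fn` are illustrative names; `submit_fn` stands in for the `batches.create()` call, and in real code you’d catch the SDK’s rate-limit exception rather than the generic `RuntimeError` used here for illustration:

```python
import time

def create_batch_with_backoff(submit_fn, max_attempts: int = 5, base_delay: float = 60):
    """Retry batch submission with exponential backoff on rate-limit errors.

    Waits base_delay seconds, then 2x, 4x, ... between attempts, and
    re-raises after max_attempts failures.
    """
    for attempt in range(max_attempts):
        try:
            return submit_fn()
        except RuntimeError:  # stand-in for a 429 rate-limit error
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * 2 ** attempt
            print(f"Rate limited; retrying in {delay}s")
            time.sleep(delay)
```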
## When to Use Batch vs Synchronous
Use the batch processing API when: you have 500+ requests to run, latency doesn’t matter (results within hours is fine), and cost is a concern. Document pipelines, nightly data enrichment, bulk classification on newly ingested records — these are all ideal.
Stick with synchronous when: you need results in under 5 seconds, you’re building a user-facing feature, or you have fewer than 100 requests (the operational overhead isn’t worth it below that threshold).
My recommendation by reader type:
- Solo founder processing invoices/contracts: Batch everything. The cost savings alone justify it. Use Haiku, 8K-item chunks, poll every 2 minutes, store results in Postgres.
- Team running a data pipeline product: Build batch submission into your ingestion worker, synchronous for real-time user requests. Keep them as separate code paths — don’t try to abstract them together.
- Enterprise with compliance requirements: Note that batch requests go through the same data handling as synchronous — same retention policies apply. Check your Anthropic data processing agreement before batching sensitive documents.
- n8n/Make automation builders: These platforms don’t natively support the batch API (they’re built around synchronous calls). You’ll need a custom code node or a sidecar service to handle submission and polling. Worth it if you’re processing more than 1,000 documents per run.
The batch processing API is one of the few cases where the cost optimization is genuinely mechanical — no prompt engineering, no model switching, just a different submission pattern for the same results at half the price. Start with the chunking and polling code above, add your retry logic, and you’ll have a production-grade pipeline that handles 10K+ documents without burning your API budget.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

