Sunday, April 5

If you’re deciding which serverless platform for Claude agents to bet your infrastructure on, you’ve probably already hit the same wall most people do: the documentation looks similar, the pricing pages are confusing, and none of the “getting started” guides cover what actually matters at scale — cold start behavior under load, how they handle long-running inference loops, and what breaks when you hit concurrency limits at 2am.

I’ve deployed Claude-based agent workflows on all three — Modal, Replicate, and Beam — and the differences matter more than the marketing suggests. This comparison focuses specifically on the agent use case: stateful-ish Python functions, variable execution times, calls to the Anthropic API, and workloads that mix compute-heavy preprocessing with network-bound LLM calls. If you’re running simple one-shot completions behind a REST endpoint, any of these will work fine. If you’re building agents with tool use, multi-step reasoning chains, or document processing pipelines, read carefully.

What You’re Actually Choosing Between

The three platforms occupy slightly different points in the serverless compute space:

  • Modal — Python-native serverless compute with excellent developer experience, fine-grained container control, and first-class GPU support. Built by engineers who clearly use it themselves.
  • Replicate — Originally built for ML model hosting (Stable Diffusion, Llama, etc.), with a prediction API model. Less flexible for custom Python logic but has a massive model library.
  • Beam — Newer, less hyped, but genuinely competitive for CPU-bound agent workloads. Cheaper at moderate scale with a simpler pricing model.

One thing that matters for all three: your Claude API calls are the dominant cost. These platforms handle the compute wrapper — the preprocessing, postprocessing, queueing, and orchestration logic around your LLM calls. When you’re also dealing with retry logic and graceful degradation patterns, the platform’s timeout behavior and concurrency model become load-bearing.
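Since retries and timeouts live in your own code regardless of platform, it helps to keep a platform-agnostic wrapper. Here is a minimal sketch; `call_with_retries` and `backoff_delay` are hypothetical helpers, and real code should catch the Anthropic SDK's specific retryable errors (rate limits, 5xx responses) rather than bare `Exception`:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    # Exponential backoff with full jitter: spreads retries out so a
    # concurrency spike doesn't hammer the API in lockstep.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(make_request, max_attempts: int = 4, base: float = 1.0):
    # Retry transient failures; surface the last error if attempts run out.
    for attempt in range(max_attempts):
        try:
            return make_request()
        except Exception:  # narrow to the SDK's retryable error types in real code
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt, base=base))
```

Wrap your `client.messages.create(...)` call in a lambda or `functools.partial`, and be deliberate about stacking this with platform-level retry settings, or you multiply retry counts.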

Modal: Best Developer Experience, Best for Complex Agents

How It Works

Modal lets you decorate regular Python functions and run them serverlessly. The mental model is clean: you define an image (Docker-like), attach it to a function, and Modal handles provisioning. Deployments take roughly 10-30 seconds for a cached image.

import modal
import anthropic

app = modal.App("claude-agent")

# Define the container environment
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "anthropic==0.28.0",
    "httpx",
)

@app.function(
    image=image,
    timeout=300,          # 5-minute max for long agent runs
    retries=2,            # automatic retry on transient failures
    concurrency_limit=50, # cap parallel executions
    secrets=[modal.Secret.from_name("anthropic-key")],
)
def run_agent(task: str, context: dict) -> dict:
    client = anthropic.Anthropic()
    
    # Your agent logic here — tool use, multi-turn, etc.
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        messages=[{"role": "user", "content": task}],
    )
    
    return {"result": response.content[0].text, "usage": response.usage.model_dump()}

# Call from anywhere
@app.local_entrypoint()
def main():
    result = run_agent.remote("Summarize this document...", {})
    print(result)

Pricing

Modal charges per second of compute: ~$0.000016/vCPU-second and ~$0.000025/GB-second for memory. For a Claude agent that does 5 seconds of Python compute per request (preprocessing + postprocessing), you’re looking at roughly $0.0001 per invocation in compute cost. The real cost is still your Anthropic API tokens — Modal’s compute overhead is almost noise by comparison.

There’s a free tier of $30/month credit. Paid usage scales linearly with no minimum commitment.

What’s Actually Good

Cold starts are the best I’ve seen: typically 1-3 seconds for a cached image, 15-30 seconds for a fresh pull. The `keep_warm` parameter lets you maintain warm instances: `@app.function(keep_warm=1)` keeps one container hot at all times at ~$2/month for a minimal CPU container. For production agents where latency matters, this is essential.

The web endpoint feature is genuinely useful — you can expose any function as an HTTPS endpoint with one decorator change. No API Gateway, no load balancer config.
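For illustration, a sketch of that one-decorator change, assuming Modal's `@modal.web_endpoint` decorator (renamed in later SDK versions, so check current docs) and a secret named `anthropic-key`:

```python
import modal

app = modal.App("claude-agent-api")
image = modal.Image.debian_slim(python_version="3.11").pip_install("anthropic==0.28.0")

@app.function(image=image, secrets=[modal.Secret.from_name("anthropic-key")])
@modal.web_endpoint(method="POST")  # `modal deploy` prints the HTTPS URL
def agent_endpoint(payload: dict) -> dict:
    import anthropic

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": payload["task"]}],
    )
    return {"result": response.content[0].text}
```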

Limitations

Stateful agents are awkward. Modal functions are stateless by design, so if your agent needs persistent memory across invocations, you’re managing that externally (Postgres, Redis, etc.). This is the right architectural choice, but it’s extra work. Also, the free tier gets rate-limited aggressively during their platform load spikes — I’ve seen deployments queue for 30+ seconds on free tier during peak hours.
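The usual pattern is to rehydrate conversation state at the top of each invocation and persist it at the bottom. A store-agnostic sketch, where `run_turn`, `load_history`, and `save_history` are illustrative names, `store` is anything with `get`/`set` (Redis, a Postgres wrapper), and `complete` stands in for your `client.messages.create(...)` wrapper:

```python
import json

def load_history(store, session_id: str) -> list:
    # Fetch prior conversation turns from the external store.
    raw = store.get(f"agent:history:{session_id}")
    return json.loads(raw) if raw else []

def save_history(store, session_id: str, messages: list) -> None:
    store.set(f"agent:history:{session_id}", json.dumps(messages))

def run_turn(store, session_id: str, user_msg: str, complete) -> str:
    # One stateless invocation: rehydrate, call the model, persist.
    history = load_history(store, session_id)
    history.append({"role": "user", "content": user_msg})
    reply = complete(history)
    history.append({"role": "assistant", "content": reply})
    save_history(store, session_id, history)
    return reply
```

Trim or summarize `history` before it grows past your context budget; the store only needs string get/set, which is why any KV backend works.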

Replicate: Great for Model Hosting, Awkward for Custom Agent Logic

How It Works

Replicate’s model is different: you push a Docker image defining a Cog model, and Replicate runs it as a prediction endpoint. It was designed for stateless ML inference — input goes in, prediction comes out. Adapting this to Claude agent workflows requires some gymnastics.

# predict.py — Replicate's Cog format
import cog
from anthropic import Anthropic

class Predictor(cog.BasePredictor):
    def setup(self):
        # Runs once on container startup
        self.client = Anthropic()
    
    def predict(
        self,
        task: str = cog.Input(description="Agent task"),
        system_prompt: str = cog.Input(description="System prompt", default=""),
        max_tokens: int = cog.Input(description="Max tokens", default=2048),
    ) -> str:
        # No persistent state between calls — each prediction is isolated
        messages = [{"role": "user", "content": task}]
        
        response = self.client.messages.create(
            model="claude-opus-4-5",
            max_tokens=max_tokens,
            system=system_prompt,
            messages=messages,
        )
        return response.content[0].text

Pricing

Replicate charges by hardware type per second: CPU instances at $0.000100/second, Nvidia T4 at $0.000225/second. For a pure Claude agent (no local GPU needed), you’re on CPU. That’s $0.006/minute — about 6x Modal’s CPU pricing. The tradeoff is the model library and ecosystem if you need open-source models alongside Claude.

What’s Actually Good

If your agent workflow needs both Claude and open-source models (say, a vision model for image preprocessing before sending to Claude), Replicate’s model library is genuinely valuable. You can call Llama, SDXL, Whisper, and others via the same API client with no separate infrastructure. The streaming output support is solid and works reliably.

Limitations

The Cog format adds friction that Modal doesn’t. Custom Python dependencies require rebuilding and pushing images, which takes 5-10 minutes. Cold starts are longer — 20-60 seconds is common for CPU models that haven’t been recently active. For interactive agent workloads, this is painful.

More importantly, complex agent orchestration logic (multi-turn loops, tool dispatch, conditional branching) doesn’t fit cleanly into the input/output prediction model. You end up either cramming everything into one giant predict() call or managing orchestration externally. Neither is ideal.
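If you take the external-orchestration route, the shape is a driver loop that treats each prediction as one stateless step. This sketch invents a trivial `TOOL:name:arg` text protocol purely for illustration; real Claude tool use returns structured `tool_use` content blocks, and `predict` would wrap a Replicate prediction call:

```python
def orchestrate(predict, task: str, tools: dict, max_steps: int = 5) -> str:
    # Drive a one-shot prediction endpoint through a multi-step agent loop.
    # predict(prompt) -> str; a reply like "TOOL:name:arg" requests a tool call.
    prompt = task
    for _ in range(max_steps):
        reply = predict(prompt)
        if not reply.startswith("TOOL:"):
            return reply  # final answer, no more tool calls requested
        _, name, arg = reply.split(":", 2)
        result = tools[name](arg)
        # Feed the tool result back in for the next prediction.
        prompt = f"{task}\nTool {name} returned: {result}"
    raise RuntimeError("agent exceeded max_steps")
```

The loop itself is cheap; the cost is that every step pays Replicate's per-prediction overhead and potential cold start, which is the core mismatch described above.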

Beam: Underrated for High-Volume CPU Agent Workloads

How It Works

Beam is the least well-known of the three but has quietly improved a lot. Like Modal, it runs arbitrary Python functions serverlessly. The API surface is smaller but the pricing is more aggressive for CPU-heavy workloads.

from beam import Image, endpoint

# Beam's current SDK attaches resources directly to the decorator;
# older releases used a separate Runtime object — check the docs
# for the version you're on.
@endpoint(
    cpu=1,
    memory="512Mi",
    timeout=120,
    image=Image(
        python_version="python3.11",
        python_packages=["anthropic==0.28.0"],
    ),
)
def run_claude_task(task: str, **kwargs):
    import os

    from anthropic import Anthropic

    client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

    response = client.messages.create(
        model="claude-haiku-4-5",  # Haiku for high-volume tasks
        max_tokens=1024,
        messages=[{"role": "user", "content": task}],
    )

    return {"output": response.content[0].text}

Pricing

Beam prices at approximately $0.000010/vCPU-second — meaningfully cheaper than Modal for the same CPU compute. On a workload doing 1,000 agent runs/day with 5 seconds of Python compute each, that’s about $0.05/day in compute vs Modal’s ~$0.08/day. Not earth-shattering, but at 100,000 runs/day the gap grows to roughly $3/day, or ~$90/month.

What’s Actually Good

Beam’s async task queue is well-implemented and easier to reason about than Modal’s `spawn()` for fire-and-forget batch jobs. If you’re running large-scale document processing with Claude, Beam’s queue-first model handles backpressure cleanly.
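A sketch of the queue-first pattern, assuming Beam's `task_queue` decorator and `.put()` enqueue method as documented at the time of writing — verify both against current Beam docs before relying on this:

```python
from beam import Image, task_queue

@task_queue(
    cpu=1,
    memory="512Mi",
    image=Image(python_packages=["anthropic==0.28.0"]),
)
def process_document(doc_id: str):
    # Fetch the document, run the Claude call, write results to your store.
    # Beam drains the queue and handles backpressure for you.
    ...

# Fire-and-forget enqueue from your own code or a batch job:
# process_document.put(doc_id="doc-123")
```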

The deployment CLI is simple and the environment variable management is straightforward. Less magic than Modal, which is sometimes what you want.

Limitations

The ecosystem and documentation are thinner. Modal’s docs are genuinely excellent; Beam’s are adequate. Debugging failures is harder — the observability tooling (logs, traces, metrics) is basic compared to Modal. If you care about production monitoring, you’ll need to add your own layer. The community is smaller, so when you hit something undocumented, you’re on your own or waiting for support.

Cold starts are comparable to Modal (2-5 seconds cached), but the warm instance feature is less configurable.

Side-by-Side Comparison

| Feature | Modal | Replicate | Beam |
| --- | --- | --- | --- |
| CPU pricing (per vCPU-second) | ~$0.000016 | ~$0.000100 | ~$0.000010 |
| Cold start (cached image) | 1–3 seconds | 20–60 seconds | 2–5 seconds |
| Custom Python logic | Excellent | Limited (Cog format) | Good |
| GPU support | Excellent | Excellent | Limited |
| Web endpoint (HTTPS) | Yes (1 decorator) | Yes (via API) | Yes |
| Warm instances | Yes (keep_warm) | No direct control | Yes (limited) |
| Concurrency control | Per-function limit | Instance-level | Queue-based |
| Free tier | $30/month credit | Limited free predictions | $10/month credit |
| Max timeout | 24 hours | 10 minutes (CPU) | 24 hours |
| Observability / DX | Excellent | Good | Adequate |
| Open-source model library | None built-in | Extensive | None built-in |
| Best for | Complex agents, long tasks | ML model hosting + Claude hybrid | High-volume CPU batch |

Real Failure Modes to Know Before You Commit

Modal: The 50-container concurrency limit on the free/starter tier catches people off guard. If you’re running a high-traffic agent endpoint and hit this, requests start queuing silently. Also, large image builds (anything with heavy ML deps) can take 10+ minutes on first deploy. Pin your dependencies — I’ve had `pip install anthropic` pull in a breaking point release during a deployment.

Replicate: The 10-minute timeout on CPU models is a hard wall. If your agent needs to do extended reasoning chains — the kind of multi-step orchestration you’d build with the Claude Agent SDK — you’ll hit this. Replicate also occasionally has model loading queues during high demand that add unpredictable latency. Not acceptable for user-facing agents.

Beam: The smaller team means slower response to platform issues. I’ve seen Beam have 20-30 minute deployment delays during infrastructure maintenance with no status page update. For anything customer-facing, this is a risk. Also, their concurrency auto-scaling is less predictable than Modal’s under sudden traffic spikes.

Cost Reality Check: Running 10,000 Claude Agent Calls Per Day

Let’s ground this. Assume each agent invocation: 5 seconds of Python compute, calls Claude Haiku (at $0.00025/1k input tokens + $0.00125/1k output tokens), processes ~500 input tokens and returns ~300 output tokens.

  • Anthropic API cost per call: ~$0.000125 + ~$0.000375 = ~$0.0005
  • Modal compute per call: 5s × $0.000016 = ~$0.00008
  • Replicate compute per call: 5s × $0.000100 = ~$0.0005
  • Beam compute per call: 5s × $0.000010 = ~$0.00005

At 10,000 calls/day: Modal adds ~$0.80/day in compute, Replicate adds ~$5/day, Beam adds ~$0.50/day. The Anthropic costs are ~$5/day regardless. So Replicate’s compute costs 10x Beam’s (and over 6x Modal’s) for the same Claude workload. Over a month, that’s an extra ~$135 in compute alone — not catastrophic but not nothing for a solo founder.
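To sanity-check these numbers against your own workload, the arithmetic fits in a few lines. The defaults below are the Haiku prices assumed in this section; substitute current rates and your measured compute time:

```python
def cost_per_call(compute_s: float, rate_per_vcpu_s: float,
                  in_tokens: int, out_tokens: int,
                  in_price_per_1k: float = 0.00025,
                  out_price_per_1k: float = 0.00125) -> float:
    # Per-invocation cost: platform compute plus Anthropic token charges.
    compute = compute_s * rate_per_vcpu_s
    tokens = (in_tokens / 1000) * in_price_per_1k + (out_tokens / 1000) * out_price_per_1k
    return compute + tokens
```

With the section's assumptions (5 s compute, 500 in / 300 out tokens), `cost_per_call(5, 0.000016, 500, 300)` gives Modal's ~$0.00058 per call; swap in the other platforms' vCPU rates to reproduce the bullets above.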

Verdict: Choose Your Platform by Workload Type

Choose Modal if: you’re building complex, multi-step Claude agents with tool use, long timeouts, or workflows that mix CPU processing with LLM calls. The developer experience alone justifies the slight cost premium over Beam. If you’re the kind of person who also cares about structured output validation and hallucination reduction patterns, Modal’s Python-native environment makes adding those verification layers trivial. This is my default recommendation for most teams.

Choose Replicate if: you need open-source models alongside Claude in the same pipeline — image classification before Claude vision analysis, Whisper transcription feeding into Claude summarization, that kind of hybrid architecture. Don’t use it for pure Claude agent hosting; you’ll overpay and fight the timeout limits.

Choose Beam if: you’re running genuinely high-volume, relatively simple agent tasks — document classification, content generation at scale, batch processing jobs where the per-call economics matter more than developer ergonomics. The cost savings at volume are real, but you’re accepting thinner tooling and a smaller community.

The definitive recommendation for the most common case: If you’re a solo founder or small team building a Claude-powered product with agent workflows, start with Modal. The developer experience reduces your iteration time, the warm instance feature keeps latency predictable, and the pricing is reasonable enough that you won’t need to migrate until you’re at real scale. Switch to Beam when your compute bill is meaningfully impacting margins — you’ll have enough production data by then to know exactly what you’re optimizing for in a serverless platform for Claude agents.

Frequently Asked Questions

Can I run Claude agents with persistent memory on Modal, Replicate, or Beam?

All three platforms are stateless by design — each function invocation is isolated. To add persistent memory, you’ll need an external store like Redis or Postgres and pass relevant state into each invocation. Modal makes this the easiest to implement cleanly because of its flexible Python environment. See our guide on building Claude agents with persistent memory across sessions for the full architecture pattern.

What is the cheapest way to host Claude agents serverlessly?

Beam is the cheapest for pure CPU compute at roughly $0.000010/vCPU-second. However, the dominant cost for Claude agent workloads is almost always the Anthropic API tokens, not the compute wrapper. At moderate volume (under 50,000 calls/day), the compute cost difference between Beam and Modal is under $50/month. Optimize your prompt token usage before obsessing over compute platform costs.

Does Replicate support long-running Claude agent workflows?

No — Replicate enforces a 10-minute timeout on CPU instances, which is a hard limit. Multi-step agents with tool use loops, chained reasoning, or document processing pipelines regularly exceed this. For long-running agent workflows, Modal (24-hour max) or Beam (24-hour max) are better choices.

How do cold starts affect Claude agent performance on these platforms?

For user-facing agents, cold starts are the most painful failure mode. Modal’s cached image cold starts (1-3 seconds) are best-in-class. Beam is comparable at 2-5 seconds. Replicate is significantly slower at 20-60 seconds for CPU models, making it unsuitable for interactive agent endpoints. All three support warm instance configurations to eliminate cold starts entirely for a small additional cost.

Can I use these platforms alongside n8n or Make for agent orchestration?

Yes — all three expose HTTPS endpoints that n8n and Make can call as HTTP request nodes. Modal’s web endpoint decorator is the cleanest way to do this: one decorator turns a Python function into an authenticated HTTPS endpoint. This works well for hybrid architectures where n8n handles orchestration and Modal/Beam handle compute-heavy agent tasks.


Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
