Sunday, April 5

If you’re running Claude agents in production, you’ve probably already hit the wall with traditional deployment. You need something that scales to zero between runs, handles GPU bursts without pre-provisioning, and doesn’t charge you for idle time while your orchestration layer waits on API responses. That’s where serverless AI deployment platforms like Modal, Replicate, and Beam come in — and picking the wrong one for your workload pattern will cost you either money, latency, or both.

I’ve deployed Claude-backed agents on all three. Here’s what actually matters in production and where each platform quietly fails you.

What “Serverless” Actually Means for Agent Workloads

Agent workloads are weird. Unlike a single LLM call, an agent might run for 30 seconds or 8 minutes depending on tool use depth. It might spawn sub-agents. It might need a GPU for embedding retrieval but CPU-only for the Claude API orchestration layer. Most serverless platforms were designed for stateless functions or model inference — not orchestration loops with variable runtimes.

Before benchmarking, nail down your workload profile:

  • Run duration: Sub-10s, 10s–2min, or long-running (2–10+ min)?
  • Concurrency: Bursts of parallel agents or serial queue?
  • GPU dependency: Are you running local models alongside Claude, or is Claude doing all inference?
  • State requirements: Does your agent need persistent memory between steps, or is each invocation self-contained?

Your answers determine which platform’s pricing model screws you least.
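A quick way to get the run-duration answer is to log wall-clock times from a sample of real runs and bucket the 95th percentile. A minimal sketch, where the thresholds are illustrative and `classify_runs` is a hypothetical helper, not any platform's API:

```python
import statistics

def classify_runs(durations_s: list[float]) -> str:
    """Bucket a sample of agent run durations (seconds) into the
    profiles above, using the 95th-percentile duration."""
    p95 = statistics.quantiles(durations_s, n=20, method="inclusive")[-1]
    if p95 < 10:
        return "sub-10s"
    if p95 < 120:
        return "10s-2min"
    return "long-running"

# Two sampled workloads: quick lookups vs. deeper tool-use runs
print(classify_runs([4.2, 6.1, 8.0, 5.5, 9.3]))        # sub-10s
print(classify_runs([35.0, 90.0, 110.0, 48.0, 75.0]))  # 10s-2min
```

Profile against tail latency, not the average: one 9-minute outlier in a "30-second" workload is what trips platform timeouts.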

Modal: The Engineer’s Platform

What It Does Well

Modal is built by engineers who actually run ML workloads. The Python-native decorator syntax is genuinely pleasant, cold starts on CPU containers are typically under 2 seconds, and you can define your container image inline with pip installs that get cached between deploys. For Claude agent orchestration — where you’re mostly running Python and hitting external APIs — Modal’s CPU instances are fast and cheap.

import modal
from anthropic import Anthropic

app = modal.App("claude-agent")

# Image is built once and cached — no cold start rebuild on every invocation
image = modal.Image.debian_slim().pip_install("anthropic", "httpx")

@app.function(
    image=image,
    timeout=600,  # 10 minutes max — good for longer agent runs
    retries=2,
    # CPU-only since Claude handles inference; we just need orchestration
    cpu=2,
    memory=1024,
)
def run_agent(task: str) -> dict:
    client = Anthropic()
    
    # Simplified agent loop — real implementation would include tool use
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        messages=[{"role": "user", "content": task}]
    )
    
    return {
        "result": response.content[0].text,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }

The decorator-based approach means your deployment is version-controlled alongside your agent logic. No YAML, no Dockerfile, no separate CI pipeline for the infra layer. That alone saves hours per week on a team.
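In practice you trigger this with `modal run` or by calling `run_agent.remote(...)` from an entrypoint, and the `retries=2` parameter means a failed invocation is re-run automatically before the error surfaces. A plausible local sketch of those retry semantics, where `with_retries` and `flaky` are illustrative stand-ins rather than Modal APIs:

```python
def with_retries(fn, retries: int = 2):
    """Re-run fn on failure: one initial attempt plus `retries` retries,
    mirroring the retries= parameter in the decorator above."""
    last_exc = None
    for _ in range(retries + 1):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
    raise last_exc

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient upstream error")
    return "ok"

print(with_retries(flaky))  # ok — succeeds on the third and final attempt
```

For agent workloads this matters: transient Anthropic API errors become invisible instead of paging you, but make sure your agent loop is idempotent before leaning on automatic retries.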

Modal Pricing (Current at Time of Writing)

Modal charges by compute-second. CPU time runs roughly $0.000016 per vCPU-second, and memory is around $0.000002 per GB-second. A 30-second agent run on 2 vCPU / 1 GB RAM works out to roughly $0.001 in compute — essentially nothing. Add your Anthropic API costs on top.
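Those per-second rates make cost estimation a one-liner. A back-of-envelope sketch using the figures quoted above (verify current rates on Modal's pricing page before relying on them):

```python
# Back-of-envelope cost model for per-second CPU billing.
# Rates are the figures quoted above, at time of writing.
VCPU_SECOND = 0.000016  # $ per vCPU-second
GB_SECOND = 0.000002    # $ per GB-second of memory

def compute_cost(duration_s: float, vcpus: int, mem_gb: float) -> float:
    """Compute-only cost of one run; Anthropic API costs come on top."""
    return duration_s * (vcpus * VCPU_SECOND + mem_gb * GB_SECOND)

print(round(compute_cost(30, 2, 1), 5))  # 0.00102 — the 30 s run above
print(round(compute_cost(45, 2, 1), 5))  # 0.00153 — a longer 45 s run
```

At these rates, compute cost is noise next to your Claude token spend, which is exactly the point of scaling to zero.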

GPU pricing is where it gets interesting. An A10G costs around $0.000583 per second ($2.10/hour effective). If you’re running a local embedding model alongside Claude, Modal’s GPU cold starts are the Achilles heel — expect 15–40 seconds to initialize a GPU container with a loaded model. For sporadic requests, this is brutal.

Where Modal Falls Down

Modal’s web endpoint feature (turning functions into HTTP endpoints) works well, but the dashboard is basic and log aggregation is primitive. If you’re debugging a complex multi-step agent failure, you’ll be grepping through raw output. There’s no built-in tracing. Also: the free tier is generous for development (roughly $30/month in credit), but support for async generator patterns — useful for streaming agent outputs — has had rough edges in past versions. Pin to a specific Modal version in your requirements.

Replicate: Best for Model-Heavy Pipelines

Where It Shines

Replicate’s core value prop is its model registry — thousands of pre-packaged models you can call with a single API request. If your Claude agent needs to call an image generation model, a transcription model, or a custom fine-tune, Replicate makes that trivially easy. You don’t manage containers; you just call model versions by hash.

import replicate
from anthropic import Anthropic

anthropic_client = Anthropic()

def agent_with_vision_tool(user_request: str, image_url: str) -> str:
    """Agent that can call a vision model as a tool via Replicate."""
    
    # Step 1: Ask Claude to plan what the vision model should look for
    plan = anthropic_client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        messages=[{
            "role": "user", 
            "content": f"I need to analyze an image. Request: {user_request}. "
                       "In one short paragraph, what should a vision model focus on?"
        }]
    )
    vision_prompt = f"{user_request}\n\nFocus on: {plan.content[0].text}"
    
    # Step 2: Call a vision model on Replicate as a tool, using Claude's plan
    # Model version pinned by hash — reproducible across deploys
    output = replicate.run(
        "yorickvp/llava-13b:b5f6212d032508382d61ff00469ddda3e32fd8a0755a17f6dd602594f40c532",
        input={
            "image": image_url,
            "prompt": vision_prompt,
            "max_tokens": 512,
        }
    )
    
    # Replicate returns a generator for streamed output
    return "".join(output)

Replicate Pricing Reality

Replicate charges per-second of hardware use. An Nvidia T4 GPU runs at $0.000225/second (~$0.81/hour). An A100-80GB is $0.001400/second (~$5.04/hour). The issue: you’re paying for cold start time too. A cold Llama-3 8B container on Replicate takes 20–60 seconds to spin up. At T4 rates, that’s $0.005–$0.013 per cold start before your model does anything useful.

Replicate also has a per-prediction overhead — you’re using their managed infrastructure with a markup baked in. For commodity model inference at scale, it gets expensive fast compared to Modal where you control the container directly.

Replicate’s Real Limitation for Agent Workloads

Replicate is fundamentally designed for model inference, not orchestration. You can’t deploy arbitrary Python code with stateful loops, retry logic, or complex tool-calling orchestration as a first-class Replicate deployment. You’d deploy your Claude orchestration layer elsewhere and use Replicate as a remote model provider. That introduces a network hop and adds latency per tool call — typically 100–500ms of overhead depending on container state. For agents making 10+ tool calls, this compounds.
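Putting numbers on that compounding makes the trade-off concrete. A trivial latency-budget helper, with `added_latency_s` as an illustrative calculation using the overhead range quoted above:

```python
# Per-call network/container overhead compounds linearly with tool-call depth.
# The 100-500 ms range is the article's estimate, not a measured benchmark.

def added_latency_s(tool_calls: int, overhead_ms: float) -> float:
    """Total extra latency from per-call overhead alone (inference excluded)."""
    return tool_calls * overhead_ms / 1000.0

# A 10-tool-call agent across the quoted per-call overhead range:
print(added_latency_s(10, 100))  # 1.0 — one extra second, best case
print(added_latency_s(10, 500))  # 5.0 — five extra seconds, worst case
```

For a user-facing agent, five seconds of pure routing overhead is the difference between "responsive" and "broken", which is why the orchestration layer and the models want to live close together.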

Beam: The Underdog Worth Considering

What Beam Actually Offers

Beam (beam.cloud) is the least talked-about of the three, which is a shame because for specific workloads it’s genuinely the right answer. It sits between Modal’s developer experience and Replicate’s simplicity. Like Modal, you write Python and decorate functions. Like Replicate, the focus is compute-per-task rather than long-running services.

from beam import App, Runtime, Image, Output

# Beam uses a different abstraction — apps with explicit runtime specs
app = App(
    name="claude-batch-processor",
    runtime=Runtime(
        cpu=4,
        memory="4Gi",
        image=Image(python_packages=["anthropic>=0.30.0"]),
    ),
)

@app.task_queue(outputs=[Output(path="results")])
def process_batch(task_batch: list) -> None:
    """Process a batch of tasks using Claude — runs as async task queue."""
    from anthropic import Anthropic
    import json
    
    client = Anthropic()
    results = []
    
    for task in task_batch:
        response = client.messages.create(
            model="claude-haiku-4-5",  # Haiku for batch cost efficiency
            max_tokens=1024,
            messages=[{"role": "user", "content": task}]
        )
        results.append({
            "task": task,
            "result": response.content[0].text
        })
    
    # Write output — Beam's output system handles persistence
    with open("results/output.json", "w") as f:
        json.dump(results, f)

Beam Pricing and Cold Starts

Beam’s CPU pricing is competitive with Modal — roughly in the same range. Their GPU pricing is slightly higher for A10Gs but they’ve been more aggressive about pricing transparency. Cold starts for CPU workloads are generally under 3 seconds. GPU cold starts are similar to Modal — 15–30 seconds for loaded models.

The important caveat: Beam is smaller and earlier-stage than Modal. I’ve hit rate limits at moderate scale (~50 concurrent runs) that required a support ticket to resolve. Modal’s infrastructure handles bursts more gracefully at high concurrency. If you’re running production workloads expecting 100+ concurrent agent runs, Modal has a more proven track record.

Beam’s Best Use Case

Beam’s task queue abstraction is genuinely well-designed for batch agent workloads — processing 500 documents overnight, running evaluation suites, batch enrichment pipelines. The built-in queue semantics handle retries and dead-letter queuing without you wiring up Redis or SQS. For async, high-volume batch work where latency doesn’t matter but throughput and cost do, Beam deserves serious consideration.
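From the caller's side, the pattern is simple: split the corpus into fixed-size batches and enqueue each one. A platform-agnostic sketch, where `chunk` is a hypothetical helper; on Beam each batch would go to a task-queue function like `process_batch` above:

```python
def chunk(items: list, size: int) -> list:
    """Split a list of documents into fixed-size batches for enqueueing."""
    return [items[i:i + size] for i in range(0, len(items), size)]

docs = [f"doc-{n}" for n in range(500)]
batches = chunk(docs, 50)
print(len(batches))    # 10 — ten queue submissions for 500 documents
print(batches[0][:3])  # ['doc-0', 'doc-1', 'doc-2']
```

Batch size is a cost/retry trade-off: bigger batches amortize cold starts, but a failed batch retries all of its documents, so keep batches small enough that a retry is cheap.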

Head-to-Head: The Numbers That Actually Matter

For a representative Claude agent run: 45 seconds elapsed, 2 vCPU / 1GB RAM, no GPU, making 5 tool calls to external APIs:

  • Modal: ~$0.0015 compute cost (plus Anthropic API costs)
  • Beam: ~$0.0018 compute cost (similar range; Beam publishes less granular pricing data)
  • Replicate: Not applicable directly — you’d deploy orchestration elsewhere and call Replicate for models

Cold start latency for CPU containers (cached image, no model loading):

  • Modal: 1.5–3s average
  • Beam: 2–4s average
  • Replicate: 5–15s for simple containers; 20–60s for GPU model containers

Which Serverless AI Deployment Platform Should You Actually Use?

Stop reading comparison articles (including this one) if you haven’t profiled your workload. Once you have:

Use Modal if: You’re building Claude agent orchestration in Python, you need reliable production infrastructure, your team writes code (not YAML), and you want the ecosystem to still exist in 2 years. Modal has the strongest engineering team, the best Python-native developer experience, and handles burst concurrency better than the alternatives. This is my default recommendation for 80% of agent workloads.

Use Replicate if: Your agent is fundamentally model-heavy — you need access to dozens of community models as tools, you’re building something that composes multiple open-source models, and your orchestration logic lives elsewhere (n8n, a FastAPI server, a Claude.ai integration). Replicate is a model registry with an API, not a general compute platform, and it excels at exactly that.

Use Beam if: You’re running async batch agent workloads — document processing pipelines, evaluation harnesses, overnight enrichment jobs — and you want built-in queue semantics without managing separate infrastructure. Just don’t run your latency-sensitive, user-facing agent on Beam yet. It’s not there.

For solo founders watching burn rate: start with Modal’s free tier. The $30/month credit gets you thousands of agent runs before you pay anything. When your Claude API bill starts looking significant, that’s when you optimize infrastructure — not before.

For teams at scale: Modal still wins for serverless AI deployment platforms, but invest in proper observability (OpenTelemetry + your tracing layer of choice) before you hit production. None of these three platforms give you adequate debugging tools out of the box for complex agent failures.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
