Most teams building with Claude start by making one agent smarter. They tune the prompt, add tools, refine the system prompt. Then they hit a wall: the task is genuinely too complex, too long, or requires too many conflicting capabilities in a single context window. That’s when multi-agent workflows with Claude stop being a curiosity and become a real architecture decision.
The gap in most guides is that they either describe toy examples (two agents passing a string back and forth) or hand-wave over the hard parts: how agents communicate reliably, how you prevent error cascades, how you handle disagreement between agents, and how you keep costs from spiraling when you’re running four model calls per user request instead of one.
This article covers the production realities: communication protocols, task decomposition strategies, voting and consensus patterns, and failure recovery — with working Python code throughout.
Why a Single Claude Agent Eventually Breaks Down
A single Claude Sonnet call can handle a surprising amount of complexity. But there are failure modes that don’t yield to prompt engineering:
- Context length limits: Claude 3.5 Sonnet has a 200k token context window, which sounds vast until you’re processing a 300-page legal document plus conversation history plus tool outputs.
- Specialization vs. generalism: An agent asked to simultaneously write production Rust, audit security, and estimate business impact will be mediocre at all three. Separate agents with focused system prompts consistently outperform a single generalist.
- Parallel workloads: If three independent subtasks each take 8 seconds, running them sequentially costs 24 seconds. Running them in parallel costs 8 seconds plus a small coordination overhead.
- Error isolation: When a single agent fails mid-task, you lose everything. With decomposed agents, a failure in one node can be retried without restarting the entire pipeline.
None of these are unsolvable with a single agent. But they all get substantially easier with a multi-agent architecture.
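The parallel-workload claim is easy to demonstrate with a simulated workload. In the sketch below, `fake_agent_call` is an illustrative stand-in for a real API call — the names and timings are made up:

```python
import asyncio
import time

async def fake_agent_call(name: str, seconds: float) -> str:
    """Stand-in for an agent API call that takes `seconds` to complete."""
    await asyncio.sleep(seconds)
    return f"{name} done"

async def run_parallel() -> tuple[list[str], float]:
    start = time.perf_counter()
    # Three independent subtasks run concurrently:
    # wall-clock time is roughly max(durations), not their sum
    results = await asyncio.gather(
        fake_agent_call("research", 0.2),
        fake_agent_call("summarize", 0.2),
        fake_agent_call("classify", 0.2),
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(run_parallel())
print(f"{len(results)} tasks in {elapsed:.2f}s")  # ~0.2s, not 0.6s
```

Swap the sleeps for real Claude calls (via the async client) and the same structure holds.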
The Three Core Topologies
Before writing a line of code, pick your topology. The choice determines your communication model, your failure modes, and your cost structure.
Orchestrator-Worker (Hub and Spoke)
One orchestrator agent breaks down the task and dispatches subtasks to specialized worker agents. Workers return results; the orchestrator synthesizes them. This is the most common pattern and the easiest to debug because there’s a single coordination point.
Use when: tasks decompose cleanly, workers don’t need to communicate with each other, you want a single audit trail.
Pipeline (Sequential Agents)
Agent A processes input and passes output to Agent B, which passes to Agent C. Each agent transforms or enriches the artifact. Think: researcher → drafter → editor → fact-checker.
Use when: each stage depends on the previous stage’s full output, and you want each agent to have a clean, narrow responsibility. This is also where structured output verification between stages pays dividends — each handoff is a natural checkpoint.
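The shape is simple enough to sketch generically. The stage functions below are illustrative placeholders — in a real pipeline each one would wrap a Claude call with its own system prompt:

```python
from typing import Callable

def run_pipeline(stages: list[Callable[[str], str]], document: str) -> str:
    """Thread an artifact through each stage in order.

    Each stage receives the previous stage's full output, which is why
    every handoff is a natural place to validate structure.
    """
    artifact = document
    for stage in stages:
        artifact = stage(artifact)
    return artifact

# Stand-in stages; swap each for a real agent call.
researcher = lambda text: f"[research notes for: {text}]"
drafter = lambda notes: f"[draft based on {notes}]"
editor = lambda draft: draft.replace("draft", "edited draft")

result = run_pipeline([researcher, drafter, editor], "topic X")
```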
Peer-to-Peer (Mesh)
Agents communicate directly with each other without a central coordinator. Scales poorly and is hard to debug. I’d avoid this until you have specific reasons a hub-and-spoke won’t work.
Task Decomposition: The Part Everyone Underestimates
The orchestrator’s job is to take a complex task and produce a dependency graph of subtasks. Get this wrong and you’ll either create unnecessary serialization (slow) or dispatch tasks with missing dependencies (broken).
Here’s a working orchestrator implementation:
```python
import anthropic
import json
from dataclasses import dataclass
from typing import Optional

client = anthropic.Anthropic()


@dataclass
class Subtask:
    id: str
    description: str
    depends_on: list[str]  # IDs of tasks that must complete first
    worker_type: str       # "researcher", "writer", "reviewer", etc.
    result: Optional[str] = None


def orchestrate(task: str) -> list[Subtask]:
    """Ask Claude to decompose a task into a dependency graph."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        system="""You are a task decomposition specialist. Break complex tasks into
independent subtasks with explicit dependencies. Return valid JSON only.
Format: {"subtasks": [{"id": "t1", "description": "...", "depends_on": [],
"worker_type": "researcher"}, ...]}""",
        messages=[{"role": "user", "content": f"Decompose this task: {task}"}]
    )
    data = json.loads(response.content[0].text)
    return [Subtask(**s) for s in data["subtasks"]]


def run_worker(subtask: Subtask, context: dict[str, str]) -> str:
    """Run a specialized worker agent for a given subtask."""
    # Build context from completed dependencies
    dep_context = "\n".join([
        f"Result of {dep}: {context[dep]}"
        for dep in subtask.depends_on
        if dep in context
    ])
    system_prompts = {
        "researcher": "You are a rigorous research agent. Find relevant facts and sources.",
        "writer": "You are a precise technical writer. Use provided research.",
        "reviewer": "You are a critical reviewer. Find gaps, errors, and improvements.",
    }
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # Use Haiku for workers to cut cost roughly 4x
        max_tokens=1024,
        system=system_prompts.get(subtask.worker_type, "You are a helpful assistant."),
        messages=[{
            "role": "user",
            "content": f"Task: {subtask.description}\n\nContext from previous steps:\n{dep_context}"
        }]
    )
    return response.content[0].text
```
The key decision here: use Claude Sonnet for orchestration, Haiku for workers. Sonnet at $3/$15 per million tokens (input/output) handles complex reasoning. Claude 3.5 Haiku at $0.80/$4 per million tokens handles execution at roughly a quarter of the price. On a typical 5-agent pipeline, this hybrid approach costs roughly $0.008–$0.015 per full run versus $0.04+ if you run everything on Sonnet. That difference matters at volume.
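One piece the orchestrator code doesn't show is a scheduler that respects `depends_on`. A minimal sketch: group subtasks into "waves" whose dependencies are all satisfied by earlier waves, then dispatch each wave's workers in parallel. `execution_waves` is an illustrative helper (not an SDK function), and the `Subtask` dataclass is repeated so the snippet runs standalone:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Subtask:  # same shape as the dataclass defined earlier
    id: str
    description: str
    depends_on: list[str]
    worker_type: str
    result: Optional[str] = None

def execution_waves(subtasks: list[Subtask]) -> list[list[Subtask]]:
    """Group subtasks into waves. Every task in a wave has all of its
    dependencies completed in earlier waves, so each wave can be
    dispatched to workers in parallel (e.g. via asyncio.gather)."""
    done: set[str] = set()
    remaining = list(subtasks)
    waves: list[list[Subtask]] = []
    while remaining:
        ready = [t for t in remaining if set(t.depends_on) <= done]
        if not ready:
            raise ValueError("Cycle or missing dependency in subtask graph")
        waves.append(ready)
        done.update(t.id for t in ready)
        remaining = [t for t in remaining if t.id not in done]
    return waves
```

Two independent research tasks land in wave one; a writer task that depends on both lands in wave two. The cycle check matters because the decomposition itself comes from a model call and can be malformed.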
Multi-Agent Communication Protocols
Agents need a shared message format. Passing freeform strings between agents is a trap — you’ll end up with one agent producing markdown that the next agent misinterprets as JSON. Define a schema and enforce it with structured outputs.
```python
from pydantic import BaseModel
from typing import Literal


class AgentMessage(BaseModel):
    sender_id: str
    task_id: str
    status: Literal["in_progress", "complete", "failed", "needs_clarification"]
    payload: str           # The actual content/result
    confidence: float      # 0.0-1.0, agent's self-assessed confidence
    flags: list[str] = []  # e.g., ["needs_human_review", "low_confidence"]


class TaskResult(BaseModel):
    task_id: str
    output: str
    sources: list[str] = []
    warnings: list[str] = []
    requires_escalation: bool = False
```
The confidence field is underused in most implementations but critical for consensus patterns and escalation logic. Force agents to report their own uncertainty — it’s surprisingly well-calibrated in Claude 3.5 models.
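Enforcement is what makes the schema useful. A minimal validate-and-retry wrapper might look like this — `validated_call` is an illustrative helper, and the required-keys check stands in for a full Pydantic `model_validate_json` call in production:

```python
import json
from typing import Callable, Optional

def validated_call(call_fn: Callable[[Optional[str]], str],
                   required_keys: set[str], max_retries: int = 2) -> dict:
    """Parse an agent's raw output as JSON and check required keys.

    call_fn takes optional repair feedback and returns the agent's raw
    text; on failure, the validation error is fed back so the agent can
    fix its own output on the next attempt.
    """
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):
        raw = call_fn(feedback)
        try:
            data = json.loads(raw)
            missing = required_keys - data.keys()
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except (json.JSONDecodeError, ValueError) as exc:
            feedback = (f"Your previous response was invalid ({exc}). "
                        "Return only valid JSON with keys: "
                        + ", ".join(sorted(required_keys)))
    raise ValueError(f"Output failed validation after {max_retries} retries")
```

The same wrapper works around any worker call; treat every inter-agent handoff as untrusted input.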
Voting and Consensus: When Agents Disagree
For high-stakes decisions — content moderation, financial categorization, security assessments — running a single agent and trusting its output is risky. A voting pattern runs the same task across multiple agents (or the same agent with different temperature/prompt variations) and aggregates the results.
```python
import asyncio
import json
from collections import Counter

import anthropic

# The sync client can't be awaited — use the async client for parallel voting
async_client = anthropic.AsyncAnthropic()


async def consensus_vote(
    task: str,
    num_voters: int = 3,
    threshold: float = 0.66  # Supermajority: 2 of 3 voters must agree
) -> dict:
    """Run multiple agents and return majority verdict with confidence."""

    async def single_vote(voter_id: int) -> dict:
        # Vary temperature slightly to get independent assessments
        temp_variation = 0.3 + (voter_id * 0.1)
        response = await async_client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=512,
            temperature=temp_variation,
            system=f"""You are analyst #{voter_id}. Provide an independent assessment.
Return JSON: {{"verdict": "approve|reject|escalate", "reasoning": "...",
"confidence": 0.0-1.0}}""",
            messages=[{"role": "user", "content": task}]
        )
        return json.loads(response.content[0].text)

    # Run voters in parallel — this is the key performance win
    votes = await asyncio.gather(*[single_vote(i) for i in range(num_voters)])

    verdicts = [v["verdict"] for v in votes]
    verdict_counts = Counter(verdicts)
    top_verdict, top_count = verdict_counts.most_common(1)[0]
    agreement_ratio = top_count / num_voters
    avg_confidence = sum(v["confidence"] for v in votes) / num_voters

    return {
        "verdict": top_verdict if agreement_ratio >= threshold else "escalate",
        "agreement": agreement_ratio,
        "avg_confidence": avg_confidence,
        "individual_votes": votes,
        "requires_human": agreement_ratio < threshold
    }
```
Three Haiku voters running in parallel cost roughly $0.002 and take about 3 seconds. For decisions worth getting right, that’s cheap insurance. The escalation path when agents disagree is where you need solid fallback logic — see our guide on graceful degradation patterns for how to handle these escalation queues without blocking your pipeline.
Failure Recovery Without Restarting Everything
This is where most implementations break in production. A worker fails at step 4 of 6. Do you restart the entire pipeline? No — you checkpoint.
```python
import redis


class CheckpointedPipeline:
    def __init__(self, pipeline_id: str, redis_client: redis.Redis):
        self.pipeline_id = pipeline_id
        self.redis = redis_client
        self.ttl = 3600  # Checkpoint lives 1 hour

    def checkpoint_key(self, task_id: str) -> str:
        return f"pipeline:{self.pipeline_id}:task:{task_id}"

    def save_result(self, task_id: str, result: str):
        self.redis.setex(
            self.checkpoint_key(task_id),
            self.ttl,
            result
        )

    def get_result(self, task_id: str) -> str | None:
        result = self.redis.get(self.checkpoint_key(task_id))
        return result.decode() if result else None

    def run_with_checkpoint(self, subtask: Subtask, context: dict) -> str:
        # Check if this task already completed
        cached = self.get_result(subtask.id)
        if cached:
            print(f"Using checkpoint for {subtask.id}")
            return cached
        # Run the worker
        result = run_worker(subtask, context)
        # Save checkpoint immediately
        self.save_result(subtask.id, result)
        return result
```
With checkpointing, a retry of a failed pipeline skips all completed tasks and resumes from the failure point. In a 6-step pipeline where step 5 fails, you pay for steps 1-4 once and step 5 on retry only. This also matters during development — you’re not burning tokens re-running stable early stages every time you tweak a later prompt.
Three Common Misconceptions About Multi-Agent Systems
Misconception 1: More agents = better results
More agents means more API calls, more latency, more potential failure points, and more cost. A 3-agent pipeline with well-designed prompts will consistently beat a 7-agent pipeline where the extra agents are redundant. Start minimal. Add agents only when you have a specific, measurable reason.
Misconception 2: The orchestrator needs to be the most powerful model
Orchestration is mostly structured reasoning: “What are the subtasks? What’s the dependency order? Which worker handles which?” That’s well within Haiku’s capabilities for many use cases. Reserve Sonnet for the tasks that genuinely need deeper reasoning — complex synthesis, nuanced judgment calls, ambiguous instructions. For more nuance on model selection, the Claude vs GPT-4 benchmark breaks down where each model’s strengths actually lie.
Misconception 3: Agents communicate like microservices
Real microservices have defined APIs with strict contracts. LLM agents are probabilistic — they can misinterpret instructions, return malformed output, or confidently produce wrong answers. Your communication layer needs validation at every step, not just at the final output. This is why the AgentMessage schema above includes status codes, confidence scores, and flags. Treat inter-agent communication as untrusted input.
A Real Case Study: Document Intelligence Pipeline
A content analytics team was processing 500-page technical reports to extract: executive summary, key risks, financial figures, and a compliance checklist. Single-agent approach: consistently hitting context limits, taking 45+ seconds, and mixing up sections across documents.
Multi-agent redesign:
- Splitter agent (Haiku): Chunks document into logical sections
- Extractor agents (Haiku × 4, parallel): One each for summaries, risks, financials, compliance
- Synthesizer agent (Sonnet): Merges and deduplicates findings
- Reviewer agent (Haiku): Validates output completeness against a checklist
Results: Average processing time dropped from 47s to 14s (extractors run in parallel). Cost per document: $0.031 vs $0.028 for the single-agent approach — slightly more expensive, but within a rounding error given the quality and reliability improvement. Failures dropped from ~8% (context overruns) to under 1% (occasional Haiku API timeouts, handled by retry). The pattern also maps well to batch processing at scale if you’re handling thousands of documents daily.
Production Considerations You Won’t Find in the Docs
Prompt bleed between agents: If Agent B receives Agent A’s entire output as context, Agent B’s behavior will be influenced by Agent A’s tone, formatting preferences, and even errors. Be deliberate about what you pass between agents — extract structured data rather than passing raw prose when possible.
Token counting across the pipeline: Track cumulative token usage per pipeline run, not just per agent call. A pipeline that looks cheap per step can burn through budget fast when you account for the context passed between steps.
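One way to do that is a small ledger that accumulates the `usage.input_tokens` and `usage.output_tokens` counts from each API response. `TokenLedger` is an illustrative sketch, and the per-million-token prices baked into it are assumptions frozen at time of writing — verify current rates:

```python
from collections import defaultdict

class TokenLedger:
    """Accumulate token usage per pipeline run across every agent call."""

    # Assumed (input, output) USD prices per million tokens
    PRICES = {
        "claude-3-5-sonnet-20241022": (3.00, 15.00),
        "claude-3-5-haiku-20241022": (0.80, 4.00),
    }

    def __init__(self):
        self.totals = defaultdict(lambda: [0, 0])  # model -> [input, output]

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Call after each response, e.g. with response.usage counts."""
        self.totals[model][0] += input_tokens
        self.totals[model][1] += output_tokens

    def cost(self) -> float:
        """Cumulative dollar cost of the run so far."""
        return sum(
            (i * self.PRICES[m][0] + o * self.PRICES[m][1]) / 1_000_000
            for m, (i, o) in self.totals.items()
        )
```

Pass one ledger instance through the pipeline and you get a per-run number to alert on, rather than discovering the context-passing overhead in next month's invoice.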
Idempotency matters: If your orchestrator crashes and restarts, it should be able to reconstruct state from checkpoints without re-dispatching tasks. Design every step to be safely re-runnable.
Rate limits compound: Five parallel Haiku calls all hit your rate limit simultaneously. Build exponential backoff at the pipeline level, not just the individual call level. Our error handling and fallback patterns guide covers the specific retry logic worth implementing here.
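A sketch of pipeline-level backoff with full jitter — the idea is to wrap the dispatch of a whole wave of workers, not just each call. `with_backoff` is illustrative; in production you would catch `anthropic.RateLimitError` rather than bare `Exception`:

```python
import asyncio
import random

def backoff_delays(max_retries: int = 5, base: float = 1.0,
                   cap: float = 30.0) -> list[float]:
    """Exponential backoff with full jitter: each delay is drawn
    uniformly from [0, min(cap, base * 2**attempt)]."""
    return [random.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]

async def with_backoff(coro_fn, max_retries: int = 5, base: float = 1.0):
    """Retry an async call with jittered exponential backoff.

    Catching Exception keeps this sketch short; in production, catch the
    SDK's rate-limit and timeout errors specifically and let real bugs
    propagate.
    """
    delays = backoff_delays(max_retries, base)
    for attempt, delay in enumerate(delays):
        try:
            return await coro_fn()
        except Exception:
            if attempt == len(delays) - 1:
                raise
            await asyncio.sleep(delay)
```

The jitter matters precisely because of the compounding problem: five workers backing off by identical fixed delays will all retry in lockstep and hit the limit again together.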
Frequently Asked Questions
How do I prevent agents from contradicting each other in multi-agent workflows with Claude?
Use a shared context object that all agents read from and write to, enforced through a structured schema. When agents produce conflicting outputs, route them to a synthesis agent with explicit instructions to reconcile contradictions — don’t let contradictions silently pass through. For high-stakes decisions, use the voting pattern described above and escalate to human review when agreement falls below your threshold.
What’s the cheapest way to run multi-agent pipelines with Claude?
Use Claude Haiku for all worker agents and reserve Sonnet only for orchestration or high-judgment synthesis tasks. Run independent tasks in parallel with asyncio to reduce wall-clock time without increasing API costs. Implement checkpointing so retries don’t re-run completed steps. At current pricing, a well-optimized 5-agent pipeline using this hybrid approach costs roughly $0.005–$0.015 per run depending on document complexity.
Can I build multi-agent Claude workflows in n8n or Make without writing Python?
Yes, but with significant limitations. n8n supports Claude via HTTP Request nodes and can chain multiple Claude calls using its workflow branching. What you lose is parallel execution (n8n workflows are largely sequential unless you use split-in-batches), tight error handling, and the ability to pass structured typed objects between nodes. For simple pipelines up to 4-5 sequential agents, n8n works fine. For parallel workloads or consensus patterns, Python gives you much more control.
How do I handle a worker agent that keeps returning malformed output?
Add a validation layer after every agent call. If the output doesn’t match your schema, retry with an augmented prompt that includes the bad output and explicit instructions to fix it (“Your previous response was invalid JSON. Here’s what you returned: [output]. Return valid JSON matching this schema: [schema]”). Set a max retry count (2-3 is usually sufficient) and escalate to a fallback model or human review if validation still fails.
What’s the difference between a multi-agent workflow and a simple chain of LLM calls?
A chain of LLM calls is sequential with no coordination logic — call A, pass output to call B, done. Multi-agent workflows add: explicit agent roles with specialized system prompts, parallel execution where possible, dependency management between tasks, communication protocols with validation, consensus/voting patterns, and failure recovery with checkpointing. The distinction matters when your pipeline needs to branch, retry, or recover gracefully — a naive chain breaks on the first failure.
How many agents is too many for a single workflow?
As a practical heuristic: if you can’t draw the dependency graph on a whiteboard and explain each agent’s role in one sentence, you have too many. Most production workflows stay between 3 and 8 agents. Beyond that, coordination overhead and debugging complexity start to outweigh the benefits. The most common mistake is creating agents for every noun in the problem description rather than every genuinely distinct reasoning task.
Bottom Line: When to Actually Build This
Solo founders on a budget: Don’t reach for multi-agent workflows until a single-agent approach is genuinely failing. The added complexity is real. Start with one well-crafted agent, instrument it with proper observability tooling, and only decompose when you have evidence of a specific bottleneck.
Teams processing structured documents at scale: Multi-agent pipelines with the orchestrator-worker topology will pay off quickly. The parallelism gains alone justify the architecture once you’re processing more than a few hundred documents per day.
High-stakes decision automation: Use voting and consensus patterns. Three Haiku agents running in parallel for ~$0.002 total, with escalation logic for disagreements, is far more reliable than trusting a single call — and cheaper than a single Sonnet call.
Teams already hitting rate limits: Decompose tasks to distribute load. Five Haiku workers processing subtasks in parallel are less likely to hit burst rate limits than one Sonnet agent making sequential calls.
The core principle behind effective multi-agent workflows with Claude is simple: every agent should do one thing well, communicate through a validated schema, and fail loudly rather than silently degrading. Get those three properties right and the rest of the architecture is mostly plumbing.
Put this into practice
Try the Connection Agent — ready to use, no setup required.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

