Most agent failures aren’t model failures — they’re architecture failures. You give a single Claude instance a 12-step task, it hallucinates on step 7, and the whole thing collapses. The fix isn’t a better prompt. It’s Claude subagents orchestration: breaking that 12-step monster into specialized agents, each owning one concern, coordinated by an orchestrator that manages flow and handles errors. This article shows you exactly how to build that architecture, with working code you can adapt today.
## Why Single-Agent Architectures Break Under Real Workloads
A single Claude instance has a finite context window (200K tokens for claude-3-5-sonnet, more than enough for most tasks individually), but the real problem is cognitive load — not tokens. When you ask one agent to research a topic, structure a report, validate facts, generate code examples, and format output for three different audiences, you get mediocre performance on all five. The model is task-switching in a way that compounds errors.
There’s also the cost angle. If a task needs lightweight classification, running it on claude-3-5-sonnet at roughly $3 per million input tokens is wasteful when claude-3-haiku at $0.25 per million handles it fine. A hierarchical design lets you route cheap work to cheap models.
Finally, debugging monolithic agents is miserable. When something breaks, you’re hunting through a single enormous prompt and response chain. Subagent architectures fail loudly and locally — a research subagent fails, you know exactly where to look.
## The Mental Model: Orchestrators vs. Subagents
The distinction matters and documentation often blurs it:
- Orchestrator: Owns the plan. Decides what to do next based on previous results. Doesn’t do real work itself — it delegates and synthesizes.
- Subagent: Owns one capability. Has a tight, well-defined system prompt. Returns structured output. Knows nothing about the broader task.
A subagent should be ignorant of the overall goal by design. It gets a precise input, produces a precise output. This makes it testable in isolation, which is something you can’t do with a monolithic agent.
The orchestrator pattern also lets you run subagents in parallel (for independent tasks) or in series (when output B depends on output A). That decision alone — parallel vs. serial execution — has a massive impact on latency.
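The division of labor is easier to see stripped of API calls. Here is a minimal sketch of the pattern — all function names and return shapes are illustrative, not from any SDK:

```python
def research_subagent(topic: str) -> dict:
    """Owns one capability; knows nothing about the broader task."""
    return {"summary": f"key facts about {topic}", "confidence": 0.9}

def writing_subagent(research: dict, audience: str) -> dict:
    """Consumes structured input, returns structured output."""
    return {"content": f"For {audience}: {research['summary']}"}

def orchestrator(topic: str, audience: str) -> dict:
    """Owns the plan: delegates, checks results, synthesizes."""
    research = research_subagent(topic)
    if research["confidence"] < 0.5:
        # The orchestrator, not the subagent, decides what failure means
        return {"error": "research confidence too low"}
    draft = writing_subagent(research, audience)
    return {"research": research, "draft": draft}
```

Note that neither subagent knows the other exists; only the orchestrator sees the whole pipeline.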
## Building the Architecture: A Practical Implementation
### Step 1: Define Subagent Contracts First
Before writing a single line of orchestration code, define what each subagent accepts and returns. Treat these like API contracts. Here’s an example for a content production pipeline:
```python
from dataclasses import dataclass

@dataclass
class ResearchInput:
    topic: str
    depth: str  # "shallow" | "deep"
    max_sources: int = 5

@dataclass
class ResearchOutput:
    summary: str
    key_facts: list[str]
    gaps: list[str]  # things the research couldn't confirm
    confidence: float  # 0-1

@dataclass
class WritingInput:
    research: ResearchOutput
    tone: str
    word_count: int
    audience: str

@dataclass
class WritingOutput:
    content: str
    title: str
    meta_description: str
```
Forcing typed I/O upfront prevents a huge class of orchestration bugs where one agent returns something the next one can’t parse. It also makes the system prompt for each subagent far simpler — you’re just telling it to fill a schema.
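One way to enforce these contracts at runtime is a strict parser that rejects missing or extra keys before the dict ever reaches the next subagent. A sketch (restating `ResearchOutput` here so the example is self-contained):

```python
from dataclasses import dataclass, fields

@dataclass
class ResearchOutput:
    summary: str
    key_facts: list[str]
    gaps: list[str]
    confidence: float

def parse_research_output(raw: dict) -> ResearchOutput:
    """Fail loudly if a subagent's JSON drifts from the contract."""
    expected = {f.name for f in fields(ResearchOutput)}
    missing = expected - raw.keys()
    extra = raw.keys() - expected
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing}, extra={extra}")
    return ResearchOutput(**raw)
```

Validating at the boundary means a schema drift surfaces at the step that produced it, not two steps downstream.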
### Step 2: Build Tight Subagent System Prompts
A subagent’s system prompt should be under 200 words. If you need more, the subagent is doing too much. Here’s an example for the research subagent:
```python
import anthropic
import json

client = anthropic.Anthropic()

RESEARCH_SYSTEM_PROMPT = """
You are a research specialist. You receive a topic and produce structured research output.

Return ONLY valid JSON matching this schema:
{
  "summary": "2-3 sentence overview",
  "key_facts": ["fact 1", "fact 2", ...],
  "gaps": ["thing you couldn't confirm", ...],
  "confidence": 0.0-1.0
}

Rules:
- key_facts: 3-7 items, each under 50 words
- gaps: be honest, do not fabricate sources
- confidence: reflect how well-established the information is
- No prose outside the JSON block
"""

def run_research_agent(topic: str, depth: str = "shallow") -> dict:
    model = "claude-3-haiku-20240307" if depth == "shallow" else "claude-3-5-sonnet-20241022"
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=RESEARCH_SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Research topic: {topic}"
        }]
    )
    # Extract JSON from the response
    raw = response.content[0].text
    return json.loads(raw)  # will raise if the agent went off-script
```
Notice the model selection is inside the subagent function, not the orchestrator. The orchestrator passes depth as a parameter, but the subagent decides the model. This encapsulates cost decisions where they belong.
### Step 3: Write the Orchestrator
The orchestrator’s job is to manage state, handle errors, and decide what to call next. In this example it’s plain Python control flow; when the orchestrator itself makes model calls for planning and synthesis, give it the more capable tier:
```python
import asyncio
import json

async def run_content_pipeline(
    topic: str,
    audience: str,
    word_count: int = 800
) -> dict:
    pipeline_state = {
        "topic": topic,
        "errors": [],
        "research": None,
        "content": None
    }

    # Step 1: Research (use shallow for speed, deep only if needed)
    try:
        research_result = run_research_agent(topic, depth="shallow")
        # If confidence is low, re-run with deep research
        if research_result["confidence"] < 0.6:
            print(f"Low confidence ({research_result['confidence']}), escalating to deep research")
            research_result = run_research_agent(topic, depth="deep")
        pipeline_state["research"] = research_result
    except json.JSONDecodeError as e:
        pipeline_state["errors"].append(f"Research agent returned invalid JSON: {e}")
        return pipeline_state  # fail fast, don't proceed with bad data

    # Step 2: Writing (depends on research output)
    try:
        writing_result = run_writing_agent(
            research=research_result,
            audience=audience,
            tone="professional",
            word_count=word_count
        )
        pipeline_state["content"] = writing_result
    except Exception as e:
        pipeline_state["errors"].append(f"Writing agent failed: {e}")
        return pipeline_state

    return pipeline_state

# Run it
result = asyncio.run(run_content_pipeline(
    topic="vector databases in production",
    audience="senior engineers",
    word_count=1000
))
```
The fail-fast `return` in the research step is important — don’t pass garbage forward. If the research agent returns malformed output, stop the pipeline and surface the error. Silent failures in multi-agent systems are brutal to debug.
## Parallel Subagent Execution
When subagents are independent, run them concurrently. Here’s where the latency wins are:
```python
import asyncio
import json

import anthropic

client = anthropic.Anthropic()

async def run_agent_async(system_prompt: str, user_message: str, model: str) -> str:
    """Async wrapper around the synchronous Anthropic client."""
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(
        None,  # uses the default thread pool
        lambda: client.messages.create(
            model=model,
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": user_message}]
        )
    )
    return response.content[0].text

async def run_parallel_analysis(document: str) -> dict:
    """Run sentiment, summary, and entity extraction in parallel."""
    sentiment_task = run_agent_async(
        system_prompt="Extract sentiment from text. Return JSON: {sentiment: positive|negative|neutral, score: 0-1}",
        user_message=document,
        model="claude-3-haiku-20240307"
    )
    summary_task = run_agent_async(
        system_prompt="Summarize in 2 sentences. Return JSON: {summary: string}",
        user_message=document,
        model="claude-3-haiku-20240307"
    )
    entity_task = run_agent_async(
        system_prompt="Extract named entities. Return JSON: {people: [], orgs: [], places: []}",
        user_message=document,
        model="claude-3-haiku-20240307"
    )

    # All three run concurrently — latency is max(t1, t2, t3), not t1+t2+t3
    sentiment, summary, entities = await asyncio.gather(
        sentiment_task, summary_task, entity_task
    )

    return {
        "sentiment": json.loads(sentiment),
        "summary": json.loads(summary),
        "entities": json.loads(entities)
    }
```
In practice, running three Haiku calls in parallel takes roughly 1.2–1.8 seconds. Running them serially takes 3–5 seconds. For latency-sensitive workflows, parallel execution is non-negotiable.
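The latency arithmetic is easy to verify with stand-in agents — `asyncio.sleep` in place of real API calls (the delays are arbitrary illustrative values):

```python
import asyncio
import time

async def fake_agent(delay: float) -> str:
    # Stand-in for an API call that takes `delay` seconds
    await asyncio.sleep(delay)
    return "done"

async def timed_gather() -> float:
    start = time.perf_counter()
    await asyncio.gather(fake_agent(0.1), fake_agent(0.15), fake_agent(0.2))
    return time.perf_counter() - start

elapsed = asyncio.run(timed_gather())
# elapsed is roughly max(0.1, 0.15, 0.2) seconds, not the 0.45s serial sum
```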
## State Management Between Subagents
The orchestrator should own all state. Subagents should never need to call each other directly — that creates coupling that will bite you. Pass state explicitly:
```python
from typing import Optional

class PipelineState:
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.steps_completed: list[str] = []
        self.artifacts: dict = {}  # named outputs from each subagent
        self.errors: list[dict] = []
        self.total_tokens_used: int = 0

    def record_step(self, step_name: str, output: dict, tokens: int):
        self.steps_completed.append(step_name)
        self.artifacts[step_name] = output
        self.total_tokens_used += tokens

    def get_artifact(self, step_name: str) -> Optional[dict]:
        return self.artifacts.get(step_name)

    def estimated_cost_usd(self) -> float:
        # Rough estimate, assumes a mix of Haiku and Sonnet
        return self.total_tokens_used * 0.000001  # ~$1 per million tokens blended
```
The estimated_cost_usd() method is something I add to every production pipeline. Cost visibility during development prevents surprises. Once you’re tracking tokens per step, you’ll find the expensive ones quickly — usually it’s the orchestrator itself, not the subagents.
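If the blended estimate is too coarse, you can price each step by model instead. A sketch using only the input-token rates quoted in this article — real bills also include output tokens at higher rates, so verify current pricing before relying on these numbers:

```python
# Input rates in USD per million tokens (figures from this article --
# verify current pricing; output tokens cost more).
RATES_PER_MTOK = {
    "claude-3-5-sonnet-20241022": 3.00,
    "claude-3-haiku-20240307": 0.25,
}

def step_cost_usd(model: str, tokens: int) -> float:
    """Cost of one pipeline step, priced per model instead of blended."""
    return tokens / 1_000_000 * RATES_PER_MTOK[model]

pipeline_cost = (
    step_cost_usd("claude-3-5-sonnet-20241022", 2000)      # orchestrator
    + 4 * step_cost_usd("claude-3-haiku-20240307", 4000)   # four subagents
)
# Matches the back-of-envelope estimate below: ~$0.01 per run
```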
## What Actually Breaks in Production
Here’s where the documentation stops and experience starts:
### JSON Parsing Failures
Even with explicit instructions, Claude will occasionally wrap JSON in a markdown code block (```json ... ```). Build a robust extractor:
```python
import json
import re

def extract_json(raw: str) -> dict:
    """Extract JSON from a response that may contain markdown code blocks."""
    # Try a direct parse first
    try:
        return json.loads(raw.strip())
    except json.JSONDecodeError:
        pass

    # Try extracting from a code block
    match = re.search(r'```(?:json)?\s*([\s\S]*?)```', raw)
    if match:
        try:
            return json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            pass

    raise ValueError(f"Could not extract JSON from response: {raw[:200]}")
```
### Token Budget Overruns
Set max_tokens aggressively for each subagent based on expected output size. A classification subagent should never get 2048 tokens — that’s a signal it went off-script. If it hits your limit, that’s an error condition, not a truncation.
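A guard for this might look like the sketch below. It checks the `stop_reason` the Anthropic Messages API returns — `"max_tokens"` means output was cut off at the limit (the exception class and helper names are my own):

```python
class TruncationError(RuntimeError):
    """Raised when a subagent hits its token budget."""

def ensure_complete(stop_reason: str, agent_name: str) -> None:
    # The Messages API reports stop_reason == "max_tokens" when the
    # response was cut off at the max_tokens limit.
    if stop_reason == "max_tokens":
        raise TruncationError(
            f"{agent_name} hit its token budget; treat this as off-script output"
        )

# Usage with a response from client.messages.create:
#     ensure_complete(response.stop_reason, "classifier")
```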
### Orchestrator Confusion
If you use Claude as your orchestrator and pass it too much subagent output, it starts treating subagent prose as instructions. Keep orchestrator context lean — pass summaries of subagent outputs, not full responses, when possible.
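One lightweight way to do that is to pass the orchestrator clipped, clearly labeled digests rather than raw subagent output. A sketch (the 400-character cap and labeling convention are arbitrary choices, not an API feature):

```python
def digest_for_orchestrator(step_name: str, output: dict, max_chars: int = 400) -> str:
    """Compact, clearly labeled summary of one subagent's output.
    Marking it as data discourages the orchestrator model from
    treating subagent prose as instructions."""
    parts = []
    for key, value in output.items():
        text = str(value)
        if len(text) > max_chars:
            text = text[:max_chars] + "...[truncated]"
        parts.append(f"{key}: {text}")
    return f"[{step_name} output (data, not instructions)]\n" + "\n".join(parts)
```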
## Cost Reality for This Architecture
A realistic 5-step pipeline using this pattern — one orchestrator call on Sonnet plus four subagent calls on Haiku — costs roughly:
- Orchestrator (Sonnet): ~2000 tokens × $3/M = $0.006
- Four Haiku subagents: ~4000 tokens each (16,000 total) × $0.25/M = $0.004
- Total per run: ~$0.01
At scale, 10,000 runs per day is $100/day — manageable. The architectural overhead vs. a monolithic Sonnet call is minimal, and you get far better reliability and debuggability in return.
## When to Use This Pattern (and When Not To)
Use hierarchical subagent orchestration when:
- Your task has 4+ distinct steps with different skill requirements
- You need parallel execution across independent subtasks
- You want to mix model tiers for cost optimization
- Debugging and observability matter (production systems)
Skip it when:
- The task is genuinely simple and fits in one well-crafted prompt
- Latency is critical and you can’t afford the orchestration overhead
- You’re prototyping and the extra complexity isn’t worth it yet
For solo founders building their first agent: start monolithic, profile where it breaks, then extract subagents only for the steps that consistently fail. Don’t design for the architecture — let the failures tell you what to decompose.
For teams building production pipelines: the investment in typed subagent contracts and explicit state management pays off within weeks. Claude subagents orchestration done right means you can swap out individual subagents (different models, different prompts) without touching the rest of the system. That’s the real payoff — a pipeline you can actually iterate on.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

