Most agent failures aren’t model failures — they’re architecture failures. You give a single Claude instance a 12-step task, it hallucinates on step 7, and the whole thing collapses. The fix isn’t a better prompt. It’s Claude subagents orchestration: breaking that 12-step monster into specialized agents, each owning one concern, coordinated by an orchestrator that manages flow and handles errors. This article shows you exactly how to build that architecture, with working code you can adapt today.
## Why Single-Agent Architectures Break Under Real Workloads
A single Claude instance has a finite context window (200K tokens for claude-3-5-sonnet, more than enough for most tasks individually), but the real problem is cognitive load — not tokens. When you ask one agent to research a topic, structure a report, validate facts, generate code examples, and format output for three different audiences, you get mediocre performance on all five. The model is task-switching in a way that compounds errors.
There’s also the cost angle. If a task needs lightweight classification, running it on claude-3-5-sonnet at roughly $3 per million input tokens is wasteful when claude-3-haiku at $0.25 per million handles it fine. A hierarchical design lets you route cheap work to cheap models.
Finally, debugging monolithic agents is miserable. When something breaks, you’re hunting through a single enormous prompt and response chain. Subagent architectures fail loudly and locally — a research subagent fails, you know exactly where to look.
## The Mental Model: Orchestrators vs. Subagents
The distinction matters and documentation often blurs it:
- Orchestrator: Owns the plan. Decides what to do next based on previous results. Doesn’t do real work itself — it delegates and synthesizes.
- Subagent: Owns one capability. Has a tight, well-defined system prompt. Returns structured output. Knows nothing about the broader task.
A subagent should be ignorant of the overall goal by design. It gets a precise input, produces a precise output. This makes it testable in isolation, which is something you can’t do with a monolithic agent.
The orchestrator pattern also lets you run subagents in parallel (for independent tasks) or in series (when output B depends on output A). That decision alone — parallel vs. serial execution — has a massive impact on latency.
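The division of labor is easier to see stripped of API calls. Here is a minimal sketch of the pattern — all function names and return shapes are illustrative, not from any SDK:

```python
def research_subagent(topic: str) -> dict:
    """Owns one capability; knows nothing about the broader task."""
    return {"summary": f"key facts about {topic}", "confidence": 0.9}

def writing_subagent(research: dict, audience: str) -> dict:
    """Consumes structured input, returns structured output."""
    return {"content": f"For {audience}: {research['summary']}"}

def orchestrator(topic: str, audience: str) -> dict:
    """Owns the plan: delegates, checks results, synthesizes."""
    research = research_subagent(topic)
    if research["confidence"] < 0.5:
        # The orchestrator, not the subagent, decides what failure means
        return {"error": "research confidence too low"}
    draft = writing_subagent(research, audience)
    return {"research": research, "draft": draft}
```

Note that neither subagent knows the other exists; only the orchestrator sees the whole pipeline.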
## Building the Architecture: A Practical Implementation
### Step 1: Define Subagent Contracts First
Before writing a single line of orchestration code, define what each subagent accepts and returns. Treat these like API contracts. Here’s an example for a content production pipeline:
```python
from dataclasses import dataclass

@dataclass
class ResearchInput:
    topic: str
    depth: str  # "shallow" | "deep"
    max_sources: int = 5

@dataclass
class ResearchOutput:
    summary: str
    key_facts: list[str]
    gaps: list[str]  # things the research couldn't confirm
    confidence: float  # 0-1

@dataclass
class WritingInput:
    research: ResearchOutput
    tone: str
    word_count: int
    audience: str

@dataclass
class WritingOutput:
    content: str
    title: str
    meta_description: str
```
Forcing typed I/O upfront prevents a huge class of orchestration bugs where one agent returns something the next one can’t parse. It also makes the system prompt for each subagent far simpler — you’re just telling it to fill a schema.
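One way to enforce these contracts at runtime is a strict parser that rejects missing or extra keys before the dict ever reaches the next subagent. A sketch (restating `ResearchOutput` here so the example is self-contained):

```python
from dataclasses import dataclass, fields

@dataclass
class ResearchOutput:
    summary: str
    key_facts: list[str]
    gaps: list[str]
    confidence: float

def parse_research_output(raw: dict) -> ResearchOutput:
    """Fail loudly if a subagent's JSON drifts from the contract."""
    expected = {f.name for f in fields(ResearchOutput)}
    missing = expected - raw.keys()
    extra = raw.keys() - expected
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing}, extra={extra}")
    return ResearchOutput(**raw)
```

Validating at the boundary means a schema drift surfaces at the step that produced it, not two steps downstream.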
### Step 2: Build Tight Subagent System Prompts
A subagent’s system prompt should be under 200 words. If you need more, the subagent is doing too much. Here’s an example for the research subagent:
```python
import anthropic
import json

client = anthropic.Anthropic()

RESEARCH_SYSTEM_PROMPT = """
You are a research specialist. You receive a topic and produce structured research output.

Return ONLY valid JSON matching this schema:
{
  "summary": "2-3 sentence overview",
  "key_facts": ["fact 1", "fact 2", ...],
  "gaps": ["thing you couldn't confirm", ...],
  "confidence": 0.0-1.0
}

Rules:
- key_facts: 3-7 items, each under 50 words
- gaps: be honest, do not fabricate sources
- confidence: reflect how well-established the information is
- No prose outside the JSON block
"""

def run_research_agent(topic: str, depth: str = "shallow") -> dict:
    model = "claude-3-haiku-20240307" if depth == "shallow" else "claude-3-5-sonnet-20241022"
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=RESEARCH_SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Research topic: {topic}"
        }]
    )
    # Extract JSON from the response
    raw = response.content[0].text
    return json.loads(raw)  # will raise if the agent went off-script
```
Notice the model selection is inside the subagent function, not the orchestrator. The orchestrator passes depth as a parameter, but the subagent decides the model. This encapsulates cost decisions where they belong.
### Step 3: Write the Orchestrator
The orchestrator’s job is to manage state, handle errors, and decide what to call next. In this example it’s plain Python control flow; when the orchestrator itself makes model calls for planning and synthesis, give it the more capable tier:
```python
import asyncio
import json

async def run_content_pipeline(
    topic: str,
    audience: str,
    word_count: int = 800
) -> dict:
    pipeline_state = {
        "topic": topic,
        "errors": [],
        "research": None,
        "content": None
    }

    # Step 1: Research (use shallow for speed, deep only if needed)
    try:
        research_result = run_research_agent(topic, depth="shallow")
        # If confidence is low, re-run with deep research
        if research_result["confidence"] < 0.6:
            print(f"Low confidence ({research_result['confidence']}), escalating to deep research")
            research_result = run_research_agent(topic, depth="deep")
        pipeline_state["research"] = research_result
    except json.JSONDecodeError as e:
        pipeline_state["errors"].append(f"Research agent returned invalid JSON: {e}")
        return pipeline_state  # fail fast, don't proceed with bad data

    # Step 2: Writing (depends on research output)
    try:
        writing_result = run_writing_agent(
            research=research_result,
            audience=audience,
            tone="professional",
            word_count=word_count
        )
        pipeline_state["content"] = writing_result
    except Exception as e:
        pipeline_state["errors"].append(f"Writing agent failed: {e}")
        return pipeline_state

    return pipeline_state

# Run it
result = asyncio.run(run_content_pipeline(
    topic="vector databases in production",
    audience="senior engineers",
    word_count=1000
))
```
The fail-fast `return` in the research step is important — don’t pass garbage forward. If the research agent returns malformed output, stop the pipeline and surface the error. Silent failures in multi-agent systems are brutal to debug.
## Parallel Subagent Execution
When subagents are independent, run them concurrently. Here’s where the latency wins are:
```python
import asyncio
import json

import anthropic

client = anthropic.Anthropic()

async def run_agent_async(system_prompt: str, user_message: str, model: str) -> str:
    """Async wrapper around the synchronous Anthropic client."""
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(
        None,  # uses the default thread pool
        lambda: client.messages.create(
            model=model,
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": user_message}]
        )
    )
    return response.content[0].text

async def run_parallel_analysis(document: str) -> dict:
    """Run sentiment, summary, and entity extraction in parallel."""
    sentiment_task = run_agent_async(
        system_prompt="Extract sentiment from text. Return JSON: {sentiment: positive|negative|neutral, score: 0-1}",
        user_message=document,
        model="claude-3-haiku-20240307"
    )
    summary_task = run_agent_async(
        system_prompt="Summarize in 2 sentences. Return JSON: {summary: string}",
        user_message=document,
        model="claude-3-haiku-20240307"
    )
    entity_task = run_agent_async(
        system_prompt="Extract named entities. Return JSON: {people: [], orgs: [], places: []}",
        user_message=document,
        model="claude-3-haiku-20240307"
    )

    # All three run concurrently — latency is max(t1, t2, t3), not t1+t2+t3
    sentiment, summary, entities = await asyncio.gather(
        sentiment_task, summary_task, entity_task
    )

    return {
        "sentiment": json.loads(sentiment),
        "summary": json.loads(summary),
        "entities": json.loads(entities)
    }
```
In practice, running three Haiku calls in parallel takes roughly 1.2–1.8 seconds. Running them serially takes 3–5 seconds. For latency-sensitive workflows, parallel execution is non-negotiable.
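The latency arithmetic is easy to verify with stand-in agents — `asyncio.sleep` in place of real API calls (the delays are arbitrary illustrative values):

```python
import asyncio
import time

async def fake_agent(delay: float) -> str:
    # Stand-in for an API call that takes `delay` seconds
    await asyncio.sleep(delay)
    return "done"

async def timed_gather() -> float:
    start = time.perf_counter()
    await asyncio.gather(fake_agent(0.1), fake_agent(0.15), fake_agent(0.2))
    return time.perf_counter() - start

elapsed = asyncio.run(timed_gather())
# elapsed is roughly max(0.1, 0.15, 0.2) seconds, not the 0.45s serial sum
```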
## State Management Between Subagents
The orchestrator should own all state. Subagents should never need to call each other directly — that creates coupling that will bite you. Pass state explicitly:
```python
from typing import Optional

class PipelineState:
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.steps_completed: list[str] = []
        self.artifacts: dict = {}  # named outputs from each subagent
        self.errors: list[dict] = []
        self.total_tokens_used: int = 0

    def record_step(self, step_name: str, output: dict, tokens: int):
        self.steps_completed.append(step_name)
        self.artifacts[step_name] = output
        self.total_tokens_used += tokens

    def get_artifact(self, step_name: str) -> Optional[dict]:
        return self.artifacts.get(step_name)

    def estimated_cost_usd(self) -> float:
        # Rough estimate, assumes a mix of Haiku and Sonnet
        return self.total_tokens_used * 0.000001  # ~$1 per million tokens blended
```
The estimated_cost_usd() method is something I add to every production pipeline. Cost visibility during development prevents surprises. Once you’re tracking tokens per step, you’ll find the expensive ones quickly — usually it’s the orchestrator itself, not the subagents.
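If the blended estimate is too coarse, you can price each step by model instead. A sketch using only the input-token rates quoted in this article — real bills also include output tokens at higher rates, so verify current pricing before relying on these numbers:

```python
# Input rates in USD per million tokens (figures from this article --
# verify current pricing; output tokens cost more).
RATES_PER_MTOK = {
    "claude-3-5-sonnet-20241022": 3.00,
    "claude-3-haiku-20240307": 0.25,
}

def step_cost_usd(model: str, tokens: int) -> float:
    """Cost of one pipeline step, priced per model instead of blended."""
    return tokens / 1_000_000 * RATES_PER_MTOK[model]

pipeline_cost = (
    step_cost_usd("claude-3-5-sonnet-20241022", 2000)      # orchestrator
    + 4 * step_cost_usd("claude-3-haiku-20240307", 4000)   # four subagents
)
# Matches the back-of-envelope estimate below: ~$0.01 per run
```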
## What Actually Breaks in Production
Here’s where the documentation stops and experience starts:
### JSON Parsing Failures
Even with explicit instructions, Claude will occasionally wrap JSON in a markdown code block (```json ... ```). Build a robust extractor:
```python
import json
import re

def extract_json(raw: str) -> dict:
    """Extract JSON from a response that may contain markdown code blocks."""
    # Try a direct parse first
    try:
        return json.loads(raw.strip())
    except json.JSONDecodeError:
        pass

    # Try extracting from a code block
    match = re.search(r'```(?:json)?\s*([\s\S]*?)```', raw)
    if match:
        try:
            return json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            pass

    raise ValueError(f"Could not extract JSON from response: {raw[:200]}")
```
### Token Budget Overruns
Set max_tokens aggressively for each subagent based on expected output size. A classification subagent should never get 2048 tokens — that’s a signal it went off-script. If it hits your limit, that’s an error condition, not a truncation.
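A guard for this might look like the sketch below. It checks the `stop_reason` the Anthropic Messages API returns — `"max_tokens"` means output was cut off at the limit (the exception class and helper names are my own):

```python
class TruncationError(RuntimeError):
    """Raised when a subagent hits its token budget."""

def ensure_complete(stop_reason: str, agent_name: str) -> None:
    # The Messages API reports stop_reason == "max_tokens" when the
    # response was cut off at the max_tokens limit.
    if stop_reason == "max_tokens":
        raise TruncationError(
            f"{agent_name} hit its token budget; treat this as off-script output"
        )

# Usage with a response from client.messages.create:
#     ensure_complete(response.stop_reason, "classifier")
```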
### Orchestrator Confusion
If you use Claude as your orchestrator and pass it too much subagent output, it starts treating subagent prose as instructions. Keep orchestrator context lean — pass summaries of subagent outputs, not full responses, when possible.
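One lightweight way to do that is to pass the orchestrator clipped, clearly labeled digests rather than raw subagent output. A sketch (the 400-character cap and labeling convention are arbitrary choices, not an API feature):

```python
def digest_for_orchestrator(step_name: str, output: dict, max_chars: int = 400) -> str:
    """Compact, clearly labeled summary of one subagent's output.
    Marking it as data discourages the orchestrator model from
    treating subagent prose as instructions."""
    parts = []
    for key, value in output.items():
        text = str(value)
        if len(text) > max_chars:
            text = text[:max_chars] + "...[truncated]"
        parts.append(f"{key}: {text}")
    return f"[{step_name} output (data, not instructions)]\n" + "\n".join(parts)
```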
## Cost Reality for This Architecture
A realistic 5-step pipeline using this pattern — one orchestrator call on Sonnet plus four subagent calls on Haiku — costs roughly:
- Orchestrator (Sonnet): ~2000 tokens × $3/M = $0.006
- Four Haiku subagents: ~4000 tokens each (16,000 total) × $0.25/M = $0.004
- Total per run: ~$0.01
At scale, 10,000 runs per day is $100/day — manageable. The architectural overhead vs. a monolithic Sonnet call is minimal, and you get far better reliability and debuggability in return.
## When to Use This Pattern (and When Not To)
Use hierarchical subagent orchestration when:
- Your task has 4+ distinct steps with different skill requirements
- You need parallel execution across independent subtasks
- You want to mix model tiers for cost optimization
- Debugging and observability matter (production systems)
Skip it when:
- The task is genuinely simple and fits in one well-crafted prompt
- Latency is critical and you can’t afford the orchestration overhead
- You’re prototyping and the extra complexity isn’t worth it yet
For solo founders building their first agent: start monolithic, profile where it breaks, then extract subagents only for the steps that consistently fail. Don’t design for the architecture — let the failures tell you what to decompose.
For teams building production pipelines: the investment in typed subagent contracts and explicit state management pays off within weeks. Claude subagents orchestration done right means you can swap out individual subagents (different models, different prompts) without touching the rest of the system. That’s the real payoff — a pipeline you can actually iterate on.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

