By the end of this tutorial, you’ll have a working meta-guardrail system where one Claude instance evaluates the outputs of another Claude agent against a defined constitutional policy — catching harmful, biased, or off-brand responses before they reach users. This is the same principle Anthropic uses internally, implemented at the application layer so you control the rules.
Constitutional AI Claude guardrails aren’t just a compliance checkbox. When you’re running agents that touch customer data, generate public-facing content, or make decisions autonomously, you need a dedicated safety layer that’s more than “hope the model behaves.” The pattern we’re building here adds roughly $0.001 per evaluation at Claude 3.5 Haiku pricing, cheap enough to run on every output in production.
- Define your constitution — Write the value rules your agent must follow as structured principles
- Build the primary agent — The Claude instance doing the actual work
- Build the evaluator agent — A second Claude instance that checks outputs against the constitution
- Wire up the enforcement pipeline — Connect evaluation results to block, flag, or revise outputs
- Add structured logging — Track violations over time to spot drift and improve your constitution
What Constitutional AI Actually Means in Practice
Anthropic’s original constitutional AI paper trained models to critique their own outputs against a set of principles. At the application layer, we’re doing something similar but simpler: we write a set of rules, then use a fast Claude model to score every output against those rules before it’s returned to the user.
The key insight is that evaluation is cheaper and more reliable than generation constraints alone. System prompt rules get ignored under pressure, especially in long conversations or agentic chains. A separate evaluator that only has one job — “does this output violate these principles?” — is much harder to social-engineer.
This pairs well with the production agent safety monitoring patterns we’ve covered before. Guardrails at generation time catch individual bad outputs; monitoring catches drift patterns across thousands of runs.
Step 1: Define Your Constitution
Your constitution is a list of named principles with clear pass/fail criteria. Vague rules like “be helpful and harmless” are useless — the evaluator needs criteria it can score consistently.
```python
# constitution.py

CONSTITUTION = {
    "version": "1.0",
    "principles": [
        {
            "id": "no_pii_exposure",
            "name": "No PII Exposure",
            "description": "The response must not reveal, infer, or request personally identifiable information including names, emails, phone numbers, addresses, or financial data.",
            "severity": "critical"
        },
        {
            "id": "no_harmful_instructions",
            "name": "No Harmful Instructions",
            "description": "The response must not provide step-by-step instructions that could enable harm, even if framed as hypothetical, educational, or fictional.",
            "severity": "critical"
        },
        {
            "id": "factual_hedging",
            "name": "Factual Hedging",
            "description": "Claims presented as facts must be accurate or clearly hedged with uncertainty language. The response must not state unverifiable information as definitive fact.",
            "severity": "high"
        },
        {
            "id": "brand_tone",
            "name": "Brand Tone Compliance",
            "description": "The response must maintain a professional, supportive tone. It must not be dismissive, condescending, or use profanity.",
            "severity": "medium"
        },
        {
            "id": "no_competitor_disparagement",
            "name": "No Competitor Disparagement",
            "description": "The response must not make negative comparisons to named competitors or suggest competitors are unsafe, fraudulent, or inferior.",
            "severity": "medium"
        }
    ]
}
```
Severity matters because you’ll handle critical violations differently from lesser ones. Critical violations block the response immediately; high and medium violations are flagged for review but allowed through.
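To make that routing concrete, here’s a minimal sketch of severity-driven enforcement. The `enforcement_plan` helper and `BLOCKING_SEVERITIES` set are illustrative additions, and the constitution is abbreviated so the snippet runs standalone (in your project, import it from `constitution.py`):

```python
# Abbreviated copy of CONSTITUTION so this sketch runs standalone;
# in your project, import it from constitution.py instead.
CONSTITUTION = {
    "version": "1.0",
    "principles": [
        {"id": "no_pii_exposure", "severity": "critical"},
        {"id": "factual_hedging", "severity": "high"},
        {"id": "brand_tone", "severity": "medium"},
    ],
}

BLOCKING_SEVERITIES = {"critical"}  # everything else is flag-and-allow


def enforcement_plan(constitution: dict) -> dict:
    """Preview which principles will block a response vs. merely flag it."""
    plan = {"block": [], "flag": []}
    for p in constitution["principles"]:
        bucket = "block" if p["severity"] in BLOCKING_SEVERITIES else "flag"
        plan[bucket].append(p["id"])
    return plan


print(enforcement_plan(CONSTITUTION))
# {'block': ['no_pii_exposure'], 'flag': ['factual_hedging', 'brand_tone']}
```

Running this before deployment gives you a quick sanity check that every principle lands in the bucket you intended.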
Step 2: Build the Primary Agent
This is the Claude instance doing the actual work — in this example, a customer support agent. Keep it separate from the evaluator so you can swap either independently.
```python
# primary_agent.py

import anthropic

client = anthropic.Anthropic()


def run_primary_agent(user_message: str, conversation_history: list) -> str:
    """
    The working agent — focused purely on completing the task.
    Safety evaluation happens downstream.
    """
    messages = conversation_history + [
        {"role": "user", "content": user_message}
    ]

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=(
            "You are a customer support agent for Acme SaaS. "
            "Help users with billing, account, and product questions. "
            "Be concise and helpful."
        ),
        messages=messages,
    )
    return response.content[0].text
```
Step 3: Build the Evaluator Agent
This is where the meta-guardrail lives. The evaluator receives the primary agent’s output plus the original user message, then scores it against every principle. We use Claude 3.5 Haiku (`claude-3-5-haiku-latest`) here — it’s fast, cheap (roughly $0.80 per million input tokens at current pricing), and more than capable of binary compliance scoring.
```python
# evaluator_agent.py

import json

import anthropic

client = anthropic.Anthropic()

EVALUATOR_SYSTEM = """You are a constitutional AI evaluator. Your only job is to evaluate
whether an AI response violates the provided principles.

You must respond with a valid JSON object only. No other text.

Format:
{
  "passed": true/false,
  "violations": [
    {
      "principle_id": "string",
      "severity": "critical|high|medium",
      "explanation": "specific reason why this is a violation",
      "excerpt": "the specific text that caused the violation"
    }
  ],
  "confidence": 0.0-1.0
}

If no violations, violations array must be empty and passed must be true."""


def evaluate_output(
    user_message: str,
    agent_response: str,
    principles: list,
) -> dict:
    """
    Evaluate agent_response against constitutional principles.
    Returns a structured violation report.
    """
    principles_text = "\n".join(
        f"[{p['id']}] ({p['severity'].upper()}) {p['name']}: {p['description']}"
        for p in principles
    )

    eval_prompt = f"""Evaluate the following AI response against these constitutional principles:

PRINCIPLES:
{principles_text}

USER MESSAGE:
{user_message}

AI RESPONSE TO EVALUATE:
{agent_response}

Evaluate each principle carefully. Return your assessment as JSON."""

    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # Fast + cheap for evaluation
        max_tokens=512,
        system=EVALUATOR_SYSTEM,
        messages=[{"role": "user", "content": eval_prompt}],
    )
    raw = response.content[0].text.strip()

    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Evaluator itself returned malformed JSON — treat as evaluation failure
        return {
            "passed": False,
            "violations": [{
                "principle_id": "evaluator_error",
                "severity": "critical",
                "explanation": "Evaluator returned malformed response",
                "excerpt": raw[:100],
            }],
            "confidence": 0.0,
        }
```
Note the JSON-only instruction and the fallback handler. Never trust that even your evaluator returns parseable output — treat malformed evaluator responses as a critical failure so they don’t silently pass bad content through.
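Because the parse-and-fail-closed path is pure Python, you can unit-test it without an API call. A sketch — the `parse_evaluation` helper below is an illustrative extraction of the `try/except` logic, not a function defined in the tutorial code:

```python
import json


def parse_evaluation(raw: str) -> dict:
    """Mirror of evaluate_output's parse-or-fail-closed logic."""
    try:
        return json.loads(raw.strip())
    except json.JSONDecodeError:
        return {
            "passed": False,
            "violations": [{
                "principle_id": "evaluator_error",
                "severity": "critical",
                "explanation": "Evaluator returned malformed response",
                "excerpt": raw[:100],
            }],
            "confidence": 0.0,
        }


ok = parse_evaluation('{"passed": true, "violations": [], "confidence": 0.9}')
bad = parse_evaluation("Sure! Here is my assessment: it passed.")

assert ok["passed"] is True
assert bad["passed"] is False
assert bad["violations"][0]["principle_id"] == "evaluator_error"
```

The key property to verify: malformed evaluator output must never come back as a pass.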
Step 4: Wire Up the Enforcement Pipeline
Now connect evaluation results to actual enforcement logic. This is where severity levels earn their keep.
```python
# pipeline.py

import logging

from primary_agent import run_primary_agent
from evaluator_agent import evaluate_output
from constitution import CONSTITUTION

logger = logging.getLogger(__name__)

FALLBACK_RESPONSE = (
    "I'm not able to help with that specific request. "
    "Please contact our support team directly at support@acme.com."
)


def safe_agent_response(
    user_message: str,
    conversation_history: list | None = None,
    max_retries: int = 1,
) -> dict:
    """
    Full pipeline: generate → evaluate → enforce.
    Returns dict with response, evaluation result, and action taken.
    """
    if conversation_history is None:
        conversation_history = []

    for attempt in range(max_retries + 1):
        # Generate response from primary agent
        raw_response = run_primary_agent(user_message, conversation_history)

        # Evaluate against constitution
        evaluation = evaluate_output(
            user_message=user_message,
            agent_response=raw_response,
            principles=CONSTITUTION["principles"],
        )

        # Split violations by severity (use .get so a malformed entry
        # can't crash enforcement)
        critical_violations = [
            v for v in evaluation.get("violations", [])
            if v.get("severity") == "critical"
        ]
        noncritical_violations = [
            v for v in evaluation.get("violations", [])
            if v.get("severity") in ("medium", "high")
        ]

        if not critical_violations:
            # Log non-critical violations for review but let the response through
            if noncritical_violations:
                logger.warning(
                    "Non-critical violations detected",
                    extra={
                        "violations": noncritical_violations,
                        "user_message": user_message[:100],
                    },
                )
            return {
                "response": raw_response,
                "evaluation": evaluation,
                "action": "passed" if not noncritical_violations else "passed_with_warnings",
                "attempt": attempt + 1,
            }

        # Critical violation found
        logger.error(
            "Critical constitutional violation — blocking response",
            extra={
                "violations": critical_violations,
                "attempt": attempt + 1,
                "user_message": user_message[:100],
            },
        )

        if attempt < max_retries:
            # Optional: try once more before falling back.
            # In practice, retries rarely help for safety violations.
            continue

        # Block and return fallback
        return {
            "response": FALLBACK_RESPONSE,
            "evaluation": evaluation,
            "action": "blocked",
            "attempt": attempt + 1,
        }

    # Unreachable, but kept as a safety net
    return {
        "response": FALLBACK_RESPONSE,
        "evaluation": {},
        "action": "blocked_after_retries",
        "attempt": max_retries + 1,
    }
```
The retry logic is there but I’d caution against relying on it. If Claude violated a critical principle once, regenerating usually produces a similar response. Retries make more sense for hallucination/factual issues than for safety violations. For deeper thoughts on structured JSON outputs from Claude, the structured output guide has patterns that apply directly to the evaluator response parsing here.
Step 5: Add Structured Violation Logging
Without logging you can’t improve your constitution or detect if your primary agent is drifting. This should write to wherever your team actually looks at logs — Datadog, a Postgres table, whatever you use.
```python
# violation_logger.py

from datetime import datetime, timezone

from constitution import CONSTITUTION


def log_evaluation(
    session_id: str,
    user_message: str,
    agent_response: str,
    evaluation: dict,
    action: str,
    store_fn,  # callable that accepts a dict — plug in your own DB/log sink
):
    """
    Structured violation log. Call this regardless of pass/fail
    so you can track your violation rate over time.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "action": action,
        "passed": evaluation.get("passed", False),
        "violation_count": len(evaluation.get("violations", [])),
        "violations": evaluation.get("violations", []),
        "evaluator_confidence": evaluation.get("confidence", 0),
        # Hash or truncate sensitive content before storing
        "user_message_preview": user_message[:80],
        "response_preview": agent_response[:80] if agent_response else None,
        "constitution_version": CONSTITUTION["version"],
    }
    store_fn(record)
    return record
```
Tracking `constitution_version` is non-obvious but important. When you update your rules, you want to be able to compare violation rates before and after. This same pattern is discussed in the context of LLM output quality evaluation — the metrics infrastructure transfers directly.
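As an example of what versioning buys you, here’s a minimal before/after comparison over logged records. The `violation_rates_by_version` helper is a hypothetical addition, assuming records shaped like the ones `log_evaluation` emits:

```python
from collections import defaultdict


def violation_rates_by_version(records: list[dict]) -> dict[str, float]:
    """Fraction of evaluations with at least one violation, per constitution version."""
    totals: dict[str, int] = defaultdict(int)
    violated: dict[str, int] = defaultdict(int)
    for r in records:
        version = r["constitution_version"]
        totals[version] += 1
        if r["violation_count"] > 0:
            violated[version] += 1
    return {v: violated[v] / totals[v] for v in totals}


records = [
    {"constitution_version": "1.0", "violation_count": 0},
    {"constitution_version": "1.0", "violation_count": 2},
    {"constitution_version": "1.1", "violation_count": 0},
    {"constitution_version": "1.1", "violation_count": 0},
]
print(violation_rates_by_version(records))  # {'1.0': 0.5, '1.1': 0.0}
```

A drop in rate after a constitution update can mean either the rules improved or a principle stopped firing, so pair this number with spot checks of actual outputs.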
Common Errors
Evaluator returns inconsistent verdicts on the same input
This happens when your principle descriptions are ambiguous or when you’re using a higher temperature. Set temperature to 0 explicitly on evaluator calls — you want determinism, not creativity. Also: test your constitution against 20-30 edge case inputs before deploying. You’ll find principles that contradict each other or that the model interprets inconsistently.
```python
# Add this to your evaluator client.messages.create() call
temperature=0,  # Deterministic evaluation — no randomness
```
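A small harness makes the edge-case testing concrete. The deterministic stub below stands in for `evaluate_output` so the sketch runs offline; swap in the real function when calibrating your constitution:

```python
from collections import Counter


def consistency_rate(evaluate_fn, case: str, runs: int = 5) -> float:
    """Run the same input several times; return the share of runs
    that agree with the majority verdict (1.0 = fully consistent)."""
    verdicts = [evaluate_fn(case)["passed"] for _ in range(runs)]
    majority = Counter(verdicts).most_common(1)[0][1]
    return majority / runs


# Deterministic stub standing in for evaluate_output, so this runs offline.
def stub_evaluate(text: str) -> dict:
    return {"passed": "password" not in text.lower()}


for case in ["How do I reset my password?", "What's your refund policy?"]:
    print(f"{case!r}: {consistency_rate(stub_evaluate, case):.0%} agreement")
```

Anything scoring well below 100% agreement with the real evaluator at temperature 0 points to an ambiguous principle worth rewriting.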
Legitimate responses getting blocked by over-broad principles
The “factual hedging” principle is the usual culprit. If you write it too broadly, the evaluator flags accurate statements because they don’t include explicit uncertainty language. Fix this by adding examples of what does and doesn’t violate each principle in your description field. A few concrete examples in the principle text dramatically improve evaluator accuracy. This is related to why over-restrictive prompting causes unnecessary refusals — the same dynamic applies in evaluation.
Evaluation latency adding unacceptable overhead
Running evaluation synchronously adds 400–900ms per response at typical Haiku latency. For real-time chat this is often acceptable. For batch processing or sub-200ms SLAs it’s not. Solutions in order of preference: (1) run evaluation asynchronously and flag violations post-delivery with human review, (2) cache evaluations for semantically similar inputs, (3) run evaluation only on a sampling percentage (say, 20% of traffic) while blocking only on high-risk user segments. See the LLM caching strategies guide for implementation patterns that work here.
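Option (3) is easy to get wrong with `random.random()`, because the same session flips in and out of the sample between turns. A deterministic, hash-based gate avoids that; this is a sketch, and `should_evaluate` and its parameters are illustrative, not part of the tutorial code:

```python
import hashlib


def should_evaluate(session_id: str, sample_rate: float = 0.2,
                    high_risk: bool = False) -> bool:
    """Sampling gate: always evaluate high-risk traffic, and sample the
    rest deterministically by hashing the session ID into 100 buckets."""
    if high_risk:
        return True
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return bucket < sample_rate * 100


# The same session always gets the same decision, so a sampled
# conversation stays fully evaluated end to end.
print(should_evaluate("session-42"), should_evaluate("session-42"))
```

Deterministic bucketing also makes sampled violation rates comparable across days, since the sampled population doesn’t churn.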
Real Cost Numbers
At current Anthropic pricing (verify before you build): Claude 3.5 Haiku costs roughly $0.0008/1K input tokens and $0.004/1K output tokens. A typical evaluation prompt is ~600 input tokens, and the evaluator response is ~150 output tokens. That’s approximately $0.0011 per evaluation. At 100K evaluations/month: ~$110. At 1M evaluations/month: ~$1,100. This is genuinely cheap enough to run on everything in production for most use cases.
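The arithmetic is simple enough to keep in a helper so your estimates update when pricing does. The prices used below are assumptions based on published Claude 3.5 Haiku list pricing; verify current numbers on Anthropic’s pricing page before budgeting:

```python
def eval_cost_usd(input_tokens: int, output_tokens: int,
                  input_per_1k: float, output_per_1k: float) -> float:
    """Per-evaluation cost from token counts and per-1K-token prices."""
    return (input_tokens / 1000) * input_per_1k + (output_tokens / 1000) * output_per_1k


# Assumed Claude 3.5 Haiku list prices ($0.80/M input, $4.00/M output) —
# plug in the current numbers from Anthropic's pricing page.
per_eval = eval_cost_usd(600, 150, input_per_1k=0.0008, output_per_1k=0.004)
print(f"${per_eval:.5f} per evaluation, ${per_eval * 100_000:,.0f} per 100K evaluations")
```

Parameterizing the prices also lets you compare evaluator models side by side before committing to one.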
What to Build Next
The natural extension here is a self-improving constitution: take your violation logs, cluster them by type, and periodically run a Claude session that suggests new or revised principles based on patterns it sees. You feed it 50 recent violations, ask it to identify gaps in your current constitution, and output updated principle definitions. Combine this with the agent benchmarking framework to A/B test constitution versions against each other before deploying to production.
When to Use This Pattern
Solo founder shipping fast: implement just the critical severity blocking with a 3-5 principle constitution. Skip the logging infrastructure until you have user traffic to learn from.
Team building a customer-facing product: full pipeline with structured logging and asynchronous flagging for non-critical violations. Budget 1-2 weeks to calibrate your principles against real traffic before trusting the evaluator.
Enterprise with compliance requirements: add principle versioning, audit logging with immutable records, and human review queues for high-severity flags. The evaluator becomes evidence in your AI governance documentation.
The constitutional AI Claude guardrails pattern isn’t perfect — a determined adversary can still craft inputs that slip past an evaluator, and your evaluator itself can be wrong. But as a systematic defense against accidental harm, brand violations, and gradual model drift, it’s the highest-ROI safety investment you can make before your agent hits production.
Frequently Asked Questions
Can Claude evaluate its own outputs reliably?
Yes, but with caveats. Claude performs well at binary compliance scoring when principles are specific and backed by examples. Where it struggles is nuanced ethical judgment on ambiguous inputs — it’ll sometimes give inconsistent results on borderline cases. Setting temperature to 0 on the evaluator and writing concrete, testable principles (rather than abstract values) gets you to roughly 90-95% consistency on clear-cut cases.
What’s the difference between constitutional AI and a system prompt with safety rules?
System prompt safety rules are part of the same context as the generation — a sufficiently persistent user or a long conversation can cause them to erode. Constitutional AI evaluation happens as a separate call with fresh context, so the evaluator has no knowledge of the conversation that produced the output. This makes it significantly harder to bypass through prompt injection or context manipulation.
How many principles should my constitution have?
Start with 5-8. More than 12 and you’ll see two problems: evaluator responses get longer and slower, and overlapping principles cause inconsistent scoring. Each principle should cover a distinct failure mode. Audit your principles every month and consolidate or retire any that haven’t flagged a real violation — they’re probably redundant or over-broad.
Does this work with streaming responses?
Not directly — you need the complete response before you can evaluate it. The workaround is to stream to the user optimistically while simultaneously running evaluation, and if a critical violation is detected mid-stream, send a follow-up message that overrides or corrects the response. This is complex to implement well and only makes sense if your latency requirements are extremely tight.
Can I use a cheaper or open-source model as the evaluator?
Yes, and it’s worth testing. Llama 3.1 8B or Mistral 7B can handle simple binary compliance scoring at lower cost. The tradeoff is lower accuracy on nuanced principles and less reliable JSON formatting. For critical severity blocking in production, I’d keep Claude Haiku as the evaluator — the per-evaluation cost is already low enough that cutting it by half or more rarely justifies the accuracy regression.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

