Sunday, April 5

Most email automation tutorials show you how to auto-reply to a contact form. What they skip is everything that actually matters in production: threading context across a conversation, avoiding double-replies when Gmail delivers a message twice, handling the Portuguese customer who somehow ended up on your English-language list, and not burning $40/day on tokens because you’re classifying every newsletter as high-priority. Building a Claude email agent that handles real inbox complexity is a different problem than the demos suggest — and this guide covers it end to end.

By the end of this article you’ll have a working architecture for an email agent that reads incoming messages, classifies and filters them against configurable rules, drafts context-aware replies, and sends — with human-in-the-loop approval for anything the model isn’t confident about. I’ll include the actual code, the cost math, and the three edge cases that will absolutely bite you if you skip them.

What the Architecture Actually Looks Like

Before touching any code, get the component map clear. An email agent isn’t a single Claude call — it’s a pipeline with state. Here’s how I structure it:

  1. Ingestion layer — polls IMAP or listens to a Gmail push webhook, deduplicates, and normalises raw email into a structured envelope.
  2. Classification step — a fast, cheap model call (Haiku) decides the category and confidence score.
  3. Context retrieval — looks up conversation history, CRM data, or any prior tickets linked to the sender.
  4. Reply generation — a full Sonnet call drafts the response using retrieved context.
  5. Approval gate — if confidence is below threshold, the draft goes to a Slack channel for human review before send.
  6. Send + log — sends via SMTP or Gmail API, writes outcome to a database for future context retrieval.
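
Strung together, the six steps look roughly like this. This is a minimal orchestration sketch: the step functions are passed in as callables so nothing here depends on any one implementation, and the `steps` dict keys are my naming, not a fixed contract.

```python
def process_email(envelope, steps):
    """Drive one email through the pipeline. `steps` maps step names to
    callables (dedupe check, classifier, history lookup, drafter, router,
    sender, logger) so the orchestrator stays decoupled from any backend."""
    if steps["is_duplicate"](envelope):            # step 1: dedupe
        return {"action": "skip", "reason": "duplicate"}

    classification = steps["classify"](envelope)   # step 2: cheap triage
    if not classification["needs_reply"]:
        steps["log"](envelope, "no_reply")         # step 6: always log outcomes
        return {"action": "skip", "reason": classification.get("reason", "")}

    history = steps["fetch_history"](envelope)     # step 3: context retrieval
    draft = steps["generate_reply"](envelope, classification, history)  # step 4
    decision = steps["route"](envelope, classification, draft)          # step 5

    if decision["action"] == "auto_send":
        steps["send"](envelope, decision["draft"]) # step 6: send + log
        steps["log"](envelope, "sent")
    return decision
```

The dict-of-callables shape also makes the pipeline trivially testable with stubs, which matters once you want to replay production emails against a new prompt.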

This matters because you do not want to run every email through your most expensive model. Classification with Haiku costs a fraction of a cent per email at current pricing. Pushing all 500 daily emails through Sonnet for both classification and replies would run roughly $250/month, versus about $90/month with the two-model split (the full math is in the cost section below), and you would be paying reply-generation rates on the large share of messages that need no reply at all.

Setting Up the Email Ingestion and Deduplication Layer

Gmail push notifications via Pub/Sub are faster than IMAP polling, but IMAP is simpler to start with and works with any provider. The deduplication step is non-negotiable — Gmail will occasionally deliver the same message twice, and your agent will reply twice to a customer if you’re not tracking message IDs.

import imaplib
import email
import hashlib
import sqlite3
from email.header import decode_header

def fetch_unseen_emails(host, user, password, folder="INBOX"):
    mail = imaplib.IMAP4_SSL(host)
    mail.login(user, password)
    mail.select(folder)

    # Search for unseen messages only
    _, message_ids = mail.search(None, "UNSEEN")
    emails = []

    for msg_id in message_ids[0].split():
        _, msg_data = mail.fetch(msg_id, "(RFC822)")
        raw = msg_data[0][1]
        msg = email.message_from_bytes(raw)

        envelope = {
            "message_id": msg.get("Message-ID", "").strip(),
            "thread_id": msg.get("In-Reply-To", "").strip(),
            "subject": _decode_header(msg.get("Subject", "")),
            "sender": msg.get("From", ""),
            "body": _extract_body(msg),
            "raw_hash": hashlib.sha256(raw).hexdigest(),
        }
        emails.append(envelope)

    mail.logout()
    return emails

def _decode_header(value):
    # A header can decode into multiple (bytes, charset) fragments; join them all,
    # not just the first one.
    parts = []
    for decoded, encoding in decode_header(value):
        if isinstance(decoded, bytes):
            decoded = decoded.decode(encoding or "utf-8", errors="replace")
        parts.append(decoded)
    return "".join(parts)

def _extract_body(msg):
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                charset = part.get_content_charset() or "utf-8"
                return part.get_payload(decode=True).decode(charset, errors="replace")
        return ""  # no text/plain part; decoding the multipart container would crash
    payload = msg.get_payload(decode=True)
    if payload is None:
        return ""
    return payload.decode(msg.get_content_charset() or "utf-8", errors="replace")

def is_duplicate(conn, raw_hash):
    """Check processed_emails table before doing anything with a message."""
    cur = conn.execute(
        "SELECT 1 FROM processed_emails WHERE raw_hash = ?", (raw_hash,)
    )
    return cur.fetchone() is not None

Store raw_hash and message_id in a processed_emails table as soon as you start processing, not after you send the reply. If your process crashes mid-run you want to retry the draft, not re-send an already-sent message.
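
A minimal sketch of that pattern, with a status column so a crash-recovery pass can distinguish "started" from "sent". The schema here is an assumption for illustration, not a fixed contract:

```python
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS processed_emails (
    raw_hash   TEXT PRIMARY KEY,
    message_id TEXT,
    status     TEXT NOT NULL DEFAULT 'processing'  -- 'processing' | 'sent' | 'skipped'
)"""

def mark_processing(conn, envelope):
    """Record the email the moment we pick it up, before any model call."""
    conn.execute(
        "INSERT OR IGNORE INTO processed_emails (raw_hash, message_id) VALUES (?, ?)",
        (envelope["raw_hash"], envelope["message_id"]),
    )
    conn.commit()

def mark_sent(conn, raw_hash):
    """Flip status only after the SMTP/Gmail send call actually succeeds."""
    conn.execute(
        "UPDATE processed_emails SET status = 'sent' WHERE raw_hash = ?",
        (raw_hash,),
    )
    conn.commit()
```

On restart, any row still in `processing` is safe to re-draft; any row in `sent` must never be sent again.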

The Classification Step: Fast and Cheap First

The classification call should be a single Haiku call with a structured output request. Don’t ask Claude to classify and draft in one call — you’ll pay Sonnet rates on the roughly half of emails that get discarded after classification anyway.

import anthropic
import json

client = anthropic.Anthropic()

# Literal braces in the JSON schema are doubled ({{ }}) so that str.format()
# treats them as literal text instead of raising a KeyError on them.
CLASSIFY_PROMPT = """You are an email triage assistant. Classify the email below.

Return valid JSON only, no explanation:
{{
  "category": one of ["support", "sales_inquiry", "billing", "spam", "auto_reply", "other"],
  "confidence": float 0.0-1.0,
  "priority": one of ["high", "normal", "low"],
  "language": ISO 639-1 code,
  "needs_reply": boolean,
  "reason": one sentence
}}

Email:
From: {sender}
Subject: {subject}
Body: {body}"""

def classify_email(envelope):
    prompt = CLASSIFY_PROMPT.format(
        sender=envelope["sender"],
        subject=envelope["subject"],
        body=envelope["body"][:1500],  # cap tokens — full body rarely needed for classification
    )

    response = client.messages.create(
        model="claude-haiku-4-5",  # fast and cheap for classification
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )

    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Haiku occasionally adds a preamble — strip and retry
        text = response.content[0].text
        start = text.find("{")
        end = text.rfind("}") + 1
        return json.loads(text[start:end])

The needs_reply: false flag is your most important cost control. Auto-replies, internal notifications, marketing emails — once classified as no-reply, they’re done. In a typical SaaS support inbox, roughly half of inbound volume needs no reply at all. Skipping those saves real money and avoids accidentally engaging spam senders.

Building the Context-Aware Reply Generator

This is where Sonnet earns its cost premium. The reply generator gets the envelope, the classification result, and any relevant prior conversation history. Threading is the hard part — you need to look up prior messages by both message_id and sender address, because customers don’t always reply in-thread.
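
A lookup that matches on either the threading ID or the sender address might look like this. The `messages` table schema is an assumption for illustration:

```python
import sqlite3

def fetch_history(conn, thread_id, sender, limit=5):
    """Pull prior messages by thread OR sender, oldest first, because
    customers often start a fresh email instead of replying in-thread."""
    cur = conn.execute(
        """SELECT direction, body FROM messages
           WHERE thread_id = ? OR sender = ?
           ORDER BY received_at DESC LIMIT ?""",
        (thread_id, sender, limit),
    )
    rows = cur.fetchall()
    # Reverse so the prompt sees history in chronological order, most recent last
    return [{"direction": d, "body": b} for d, b in reversed(rows)]
```

The OR on sender is deliberately loose; if one address maps to multiple customers in your CRM, tighten it there rather than in SQL.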

REPLY_PROMPT = """You are a helpful support agent for {company_name}.

CONVERSATION HISTORY (most recent last):
{history}

CURRENT EMAIL:
From: {sender}
Subject: {subject}
Body: {body}

CLASSIFICATION: {category} | Priority: {priority} | Language: {language}

Instructions:
- Reply in the same language as the customer's email
- Be concise — 3-5 sentences unless complexity demands more
- Do not make promises about timelines unless they appear in our policy below
- If you cannot resolve the issue with certainty, say so and escalate

COMPANY POLICY EXCERPT:
{policy_snippet}

Reply (plain text, no subject line):"""

def generate_reply(envelope, classification, history, policy_snippet, company_name):
    history_text = "\n\n".join(
        [f"[{h['direction']}] {h['body'][:500]}" for h in history[-5:]]
    ) or "No prior history."

    prompt = REPLY_PROMPT.format(
        company_name=company_name,
        history=history_text,
        sender=envelope["sender"],
        subject=envelope["subject"],
        body=envelope["body"][:2000],
        category=classification["category"],
        priority=classification["priority"],
        language=classification["language"],
        policy_snippet=policy_snippet,
    )

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )

    return {
        "draft": response.content[0].text.strip(),
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }

The Approval Gate: Don’t Skip This in Production

Any agent that sends email automatically without an approval option is a liability. Confidently wrong is worse than uncertain — and Claude at high temperature will occasionally produce drafts that are technically coherent but entirely wrong for the situation. The confidence score from your classification step is your signal.

CONFIDENCE_THRESHOLD = 0.85  # tune this for your inbox

def route_draft(envelope, classification, draft_result, slack_webhook_url):
    if not classification["needs_reply"]:
        return {"action": "skip", "reason": classification["reason"]}

    if classification["confidence"] >= CONFIDENCE_THRESHOLD:
        return {"action": "auto_send", "draft": draft_result["draft"]}

    # Post to Slack for human review
    payload = {
        "text": f"*Email needs review*\n"
                f"From: {envelope['sender']}\n"
                f"Subject: {envelope['subject']}\n"
                f"Confidence: {classification['confidence']:.0%}\n"
                f"*Draft reply:*\n```{draft_result['draft']}```\n"
                f"Approve: /approve {envelope['message_id']}\n"
                f"Edit: /edit {envelope['message_id']}"
    }
    import requests  # in production, hoist this import to module level
    requests.post(slack_webhook_url, json=payload, timeout=10)
    return {"action": "pending_review", "draft": draft_result["draft"]}

Set the threshold conservatively at first. Start at 0.90, watch your review queue for a week, and lower it only once you’re confident in the pattern. I’ve seen teams flip this to 0.70 in week one and spend the next month explaining to customers why they got bizarre replies.

Edge Cases That Will Break You in Production

The Auto-Reply Loop

Your agent replies to an out-of-office. The out-of-office replies back. Your agent replies again. You’ve now created an infinite loop that will drain your token budget and annoy a human who reads their email on Monday morning. Fix: in classification, any email with headers Auto-Submitted: auto-replied or X-Autoreply, or subject prefixes like “Out of Office”, should be hard-coded to needs_reply: false before the model call. Don’t trust the model to catch this reliably.
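
A pre-classification guard along those lines might look like this. The header names come from RFC 3834 plus common vendor variants; the exact prefix list you need will vary by inbox, so treat these values as a starting point:

```python
AUTO_SUBJECT_PREFIXES = ("out of office", "automatic reply", "auto-reply", "autoreply")

def is_auto_generated(headers, subject):
    """Return True for machine-generated mail that must never get a reply.
    `headers` is a dict of lower-cased header names to values."""
    auto_submitted = headers.get("auto-submitted", "").lower()
    if auto_submitted and auto_submitted != "no":
        return True  # RFC 3834: any value other than 'no' means machine-generated
    if "x-autoreply" in headers or "x-autorespond" in headers:
        return True
    if headers.get("precedence", "").lower() in ("bulk", "junk", "auto_reply"):
        return True  # common hint from mailing-list and vacation responders
    return subject.strip().lower().startswith(AUTO_SUBJECT_PREFIXES)
```

Run this before the Haiku call and force needs_reply to false on a match; it is both cheaper and more reliable than asking the model.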

HTML Email Bodies Polluting Context

Feeding raw HTML into your prompt wastes tokens and confuses the model. A 3KB HTML newsletter strips down to 300 tokens of actual text. Use html2text or BeautifulSoup to strip markup before any model call. This alone can cut your average input token count by 60%.

from bs4 import BeautifulSoup

def clean_body(raw_body):
    if "<html" in raw_body.lower() or "<div" in raw_body.lower():
        soup = BeautifulSoup(raw_body, "html.parser")
        return soup.get_text(separator="\n", strip=True)
    return raw_body

Character Encoding Failures

You will get emails in Windows-1252, ISO-8859-1, and things that claim to be UTF-8 but aren’t. The errors="replace" flag in the ingestion code above handles this gracefully — it inserts a replacement character rather than crashing. Log when this happens; a high replacement-character count from a specific sender is often a sign something upstream is broken.
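
Tracking that signal is a few lines worth having. The 5% threshold here is an arbitrary starting point, not a recommendation:

```python
def replacement_ratio(text):
    """Fraction of characters that are U+FFFD, the marker errors='replace' inserts."""
    return text.count("\ufffd") / max(len(text), 1)

def flag_encoding_problem(envelope, threshold=0.05):
    """True when enough of the body was unrecoverable to warrant a log line."""
    return replacement_ratio(envelope["body"]) > threshold
```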

Real Cost Numbers for a 500-Email-Per-Day Inbox

Based on current Anthropic pricing (verify before you build — these change):

  • Classification (Haiku): ~500 calls × ~800 input tokens × $0.80/MTok = ~$0.32/day
  • Reply generation (Sonnet): ~225 emails need replies × ~2,000 input + 400 output tokens × $3/$15 per MTok = ~$1.35 + $1.35 = ~$2.70/day
  • Total AI cost: roughly $3.00/day, ~$90/month for 500 emails/day
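
The arithmetic above, as a quick script you can rerun when prices change. The hard-coded prices are this article's assumptions, not authoritative figures:

```python
HAIKU_IN = 0.80                     # $/MTok input; verify against current pricing
SONNET_IN, SONNET_OUT = 3.00, 15.00  # $/MTok input, output; also verify

def per_day(calls, in_tok, in_price, out_tok=0, out_price=0.0):
    """Dollars per day for `calls` requests at the given token counts."""
    return calls * (in_tok * in_price + out_tok * out_price) / 1_000_000

classify = per_day(500, 800, HAIKU_IN)                      # input-only estimate
reply = per_day(225, 2000, SONNET_IN, 400, SONNET_OUT)
total = classify + reply                                     # ~$3.02/day
```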

With roughly 275 no-reply messages skipped outright and ~80% of the 225 drafted replies auto-sent, the agent handles about 450 of 500 daily emails without a human touching them. That’s the equivalent of one part-time support hire just for volume triage — and the agent works nights and weekends.

Who Should Build This vs. Buy It

Build this yourself if: you have a developer on staff, your inbox has company-specific workflows that off-the-shelf tools won’t accommodate, or you’re handling sensitive data that you don’t want flowing through a third-party SaaS. This stack runs entirely on your infrastructure with the exception of the Anthropic API call.

Buy or use a no-code tool if: you’re a solo founder without Python chops — tools like n8n with Claude integration or Zapier AI can get you 70% of the way there in an afternoon, though they’ll cap out at basic classification and templated replies. They won’t handle threaded context or custom confidence routing.

The Claude email agent architecture described here is production-viable. The key discipline is the two-model split (Haiku for classification, Sonnet for generation), the hard-coded auto-reply detection, and the approval gate. Skip any of those three and you’ll hit a painful incident within the first two weeks of live traffic. Keep them, and this runs reliably at scale.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes.
