Most email automation tutorials show you how to forward messages to a webhook and call it a day. That’s not what this is. By the end of this article, you’ll have a working Claude email agent automation pipeline that reads incoming customer emails, classifies intent, generates contextually appropriate replies, and only escalates to a human when it genuinely can’t handle something — all with proper error handling and costs that won’t surprise you at the end of the month.
I’ve run variations of this in production for SaaS support queues and e-commerce inboxes. The architecture I’m describing here handles roughly 80–90% of routine emails automatically, depending on your domain. Here’s how to build it properly.
## What You’re Actually Building
The agent has four responsibilities: read emails from a mailbox, classify them by intent, generate a reply using Claude, and decide whether to send automatically or route to a human. That last part is where most tutorials fail — they skip the fallback logic entirely, which means your agent will confidently send a garbage response to an angry enterprise customer at 2am.
The stack I’ll use: Python, the Gmail API (or any IMAP source), Anthropic’s Python SDK, and a simple SQLite state store to track thread history. You can swap Gmail for Outlook or Postmark with minimal changes. The Claude model is claude-3-haiku-20240307 for classification (fast, cheap) and claude-3-5-sonnet-20241022 for reply generation (better reasoning). This two-model pattern is something I’d strongly recommend over using one model for everything — more on the cost breakdown below.
## Setting Up the Email Reader
First, get emails into your pipeline. Using Gmail’s API with a service account is the most reliable method for production. OAuth tokens expire; service accounts don’t.
```python
import base64
import json
import sqlite3

import anthropic
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ['https://www.googleapis.com/auth/gmail.modify']

def get_gmail_service(service_account_file: str, delegated_email: str):
    """Create a Gmail service using a service account with domain delegation."""
    creds = service_account.Credentials.from_service_account_file(
        service_account_file,
        scopes=SCOPES
    ).with_subject(delegated_email)  # Delegate to the mailbox you want to read
    return build('gmail', 'v1', credentials=creds)

def fetch_unread_emails(service, max_results: int = 50) -> list[dict]:
    """Pull unread emails from the inbox. Returns parsed message dicts."""
    results = service.users().messages().list(
        userId='me',
        q='is:unread in:inbox',
        maxResults=max_results
    ).execute()
    messages = results.get('messages', [])
    parsed = []
    for msg in messages:
        raw = service.users().messages().get(
            userId='me',
            id=msg['id'],
            format='full'
        ).execute()
        # Extract headers and body
        headers = {h['name']: h['value'] for h in raw['payload']['headers']}
        body = extract_body(raw['payload'])
        parsed.append({
            'id': msg['id'],
            'thread_id': raw['threadId'],
            'from': headers.get('From', ''),
            'subject': headers.get('Subject', ''),
            'body': body,
            'snippet': raw.get('snippet', '')
        })
    return parsed

def extract_body(payload: dict) -> str:
    """Recursively extract plain text body from Gmail payload parts."""
    if payload.get('mimeType') == 'text/plain':
        data = payload.get('body', {}).get('data', '')
        return base64.urlsafe_b64decode(data).decode('utf-8', errors='replace')
    for part in payload.get('parts', []):
        result = extract_body(part)
        if result:
            return result
    return ''
```
One thing the Gmail API docs undersell: format='full' returns the complete MIME tree. You’ll hit multipart emails constantly in the wild, which is why extract_body recurses through parts rather than just grabbing the top-level body.
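The flip side: plenty of messages ship only a `text/html` part, so `extract_body` as written returns an empty string for them. A minimal stdlib-only fallback sketch (the class and function names here are mine) that strips tags so you hand Claude plain text rather than raw HTML:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes from an HTML email, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def html_to_text(html: str) -> str:
    """Strip tags from an HTML email body, returning rough plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    # Collapse runs of whitespace left behind by removed tags
    return ' '.join(' '.join(parser.chunks).split())
```

Wiring it in is one extra branch in `extract_body`: if the `mimeType` is `text/html`, decode the part and pass it through `html_to_text` instead of returning the raw markup.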
## Classifying Intent With Claude Haiku
Before you write a reply, you need to know what kind of email you’re dealing with. This is the step that makes the whole system reliable. Classification is fast and cheap with Haiku — roughly $0.00025 per email at current input/output pricing.
```python
client = anthropic.Anthropic()

INTENT_CATEGORIES = [
    "billing_question",
    "technical_support",
    "feature_request",
    "cancellation",
    "positive_feedback",
    "abuse_or_legal",
    "other"
]

def classify_email(email_data: dict) -> dict:
    """
    Returns a dict with 'intent', 'confidence', and 'should_escalate'.
    Uses Haiku for speed and cost — this runs on every single email.
    """
    prompt = f"""Classify this customer email into exactly one category.

Categories: {', '.join(INTENT_CATEGORIES)}

Email subject: {email_data['subject']}
Email body: {email_data['body'][:1500]}

Respond with JSON only, no prose. Format:
{{"intent": "<category>", "confidence": <0.0-1.0>, "summary": "<one sentence>"}}"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        result = json.loads(response.content[0].text)
        # Escalate automatically for anything legal/abuse, or low confidence
        result['should_escalate'] = (
            result['intent'] == 'abuse_or_legal' or
            result['confidence'] < 0.7
        )
        return result
    except json.JSONDecodeError:
        # Parsing failed — safest thing is to escalate
        return {
            'intent': 'other',
            'confidence': 0.0,
            'summary': 'Classification failed',
            'should_escalate': True
        }
```
The should_escalate flag is doing important work here. Anything under 0.7 confidence gets flagged for human review. In my experience, that threshold catches most of the genuinely ambiguous emails while keeping the automation rate high. You can tune it — but don’t go above 0.85 or you’ll escalate too aggressively.
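Tuning the threshold doesn't have to be guesswork: if you log the classifier's confidence for each email, you can sweep candidate cutoffs offline and see exactly what fraction of traffic each one would escalate. A quick sketch (the `logged` values are made up for illustration):

```python
def escalation_rate(confidences: list[float], threshold: float) -> float:
    """Fraction of emails that would be flagged for human review at this cutoff."""
    if not confidences:
        return 0.0
    return sum(c < threshold for c in confidences) / len(confidences)

# Sweep candidate thresholds against confidences pulled from your logs
logged = [0.95, 0.91, 0.88, 0.72, 0.65, 0.99, 0.55, 0.93]
for t in (0.7, 0.8, 0.85):
    print(f"threshold {t}: {escalation_rate(logged, t):.0%} escalated")
```

Run this against a week of real traffic before touching the threshold in production; a cutoff that looks fine on eight samples can behave very differently at volume.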
## Generating Replies With Context
This is where Sonnet earns its keep. The reply generator needs to see the classification result, your knowledge base snippets (or system context about your product), and ideally the thread history if this is a follow-up email.
```python
def get_thread_history(db_conn: sqlite3.Connection, thread_id: str) -> list[dict]:
    """Retrieve past messages from this email thread for context."""
    cursor = db_conn.execute(
        "SELECT role, content FROM thread_messages WHERE thread_id = ? ORDER BY created_at ASC",
        (thread_id,)
    )
    return [{"role": row[0], "content": row[1]} for row in cursor.fetchall()]

def generate_reply(
    email_data: dict,
    classification: dict,
    system_context: str,
    db_conn: sqlite3.Connection
) -> dict:
    """
    Generates a reply using Sonnet. Returns dict with 'reply', 'send_automatically', 'reason'.
    System context should include your product FAQ, tone guidelines, etc.
    """
    thread_history = get_thread_history(db_conn, email_data['thread_id'])

    system_prompt = f"""You are a customer support agent. Product context:

{system_context}

Your job is to write a helpful, concise reply to customer emails.
- Match the customer's tone (professional if they're formal, friendly if they're casual)
- Never promise features that don't exist
- If you cannot fully resolve the issue, say so honestly and tell them what the next step is
- Sign off as 'The Support Team' unless told otherwise
- Keep replies under 200 words unless the complexity demands more

This email was classified as: {classification['intent']}
Summary: {classification['summary']}"""

    # Build message history for the API call
    messages = thread_history + [
        {
            "role": "user",
            "content": f"Subject: {email_data['subject']}\n\nFrom: {email_data['from']}\n\n{email_data['body']}"
        }
    ]

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        system=system_prompt,
        messages=messages
    )
    reply_text = response.content[0].text

    # Certain intents should always get human review before sending
    auto_send_blocked = classification['intent'] in ['cancellation', 'billing_question', 'abuse_or_legal']

    return {
        'reply': reply_text,
        'send_automatically': not auto_send_blocked and not classification['should_escalate'],
        'reason': 'blocked intent' if auto_send_blocked else 'auto-send approved'
    }
```
### Why Thread History Matters
Without thread context, your agent will re-ask for information the customer already provided in a previous message. That’s the fastest way to get a frustrated reply. Storing previous exchanges in SQLite and feeding them back into the conversation window costs almost nothing in tokens but makes the replies dramatically more coherent on follow-ups.
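One detail that matters here: the Anthropic Messages API expects strictly alternating `user` and `assistant` roles, so the history only works if you write back *both* sides of each exchange. A small helper sketch (the function name is mine; the table matches the `thread_messages` schema used elsewhere in this article):

```python
import sqlite3

def store_message(db_conn: sqlite3.Connection, thread_id: str,
                  role: str, content: str) -> None:
    """Append one message (the customer's email or our reply) to thread history."""
    db_conn.execute(
        "INSERT INTO thread_messages (thread_id, role, content) VALUES (?, ?, ?)",
        (thread_id, role, content),
    )
    db_conn.commit()
```

Call it once with `role='user'` for the incoming email and once with `role='assistant'` for the reply you sent, in that order, so the next `get_thread_history` call returns a well-formed conversation.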
## The State Store and Send Logic
You need a lightweight state layer so you don’t process the same email twice, and to track what was sent, when, and whether a human reviewed it.
```python
def init_db(db_path: str = "email_agent.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS processed_emails (
            email_id TEXT PRIMARY KEY,
            thread_id TEXT,
            intent TEXT,
            reply_sent TEXT,
            auto_sent INTEGER,
            processed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS thread_messages (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            thread_id TEXT,
            role TEXT,
            content TEXT,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    return conn

def send_reply(service, original_email: dict, reply_text: str) -> bool:
    """Send a reply via Gmail API, keeping it on the original thread."""
    from email.mime.text import MIMEText

    message = MIMEText(reply_text)
    message['to'] = original_email['from']
    message['subject'] = f"Re: {original_email['subject']}"
    raw = base64.urlsafe_b64encode(message.as_bytes()).decode()
    try:
        # threadId belongs in the API request body, not the MIME headers;
        # it's what tells Gmail to keep the reply in the same thread
        service.users().messages().send(
            userId='me',
            body={'raw': raw, 'threadId': original_email['thread_id']}
        ).execute()
        return True
    except Exception as e:
        print(f"Send failed for {original_email['id']}: {e}")
        return False
```
## Putting the Pipeline Together
```python
def run_agent(
    service_account_file: str,
    delegated_email: str,
    system_context: str,
    dry_run: bool = False  # Set True to test without actually sending
):
    service = get_gmail_service(service_account_file, delegated_email)
    db = init_db()
    emails = fetch_unread_emails(service)
    print(f"Processing {len(emails)} unread emails...")

    for email_data in emails:
        # Skip if already processed
        existing = db.execute(
            "SELECT 1 FROM processed_emails WHERE email_id = ?",
            (email_data['id'],)
        ).fetchone()
        if existing:
            continue

        classification = classify_email(email_data)
        print(f"  [{email_data['id']}] Intent: {classification['intent']} | Escalate: {classification['should_escalate']}")

        if classification['should_escalate']:
            # Tag in Gmail for human review, log it, move on
            service.users().messages().modify(
                userId='me',
                id=email_data['id'],
                body={'addLabelIds': ['Label_NEEDS_REVIEW']}
            ).execute()
            db.execute(
                "INSERT INTO processed_emails (email_id, thread_id, intent, auto_sent) VALUES (?, ?, ?, 0)",
                (email_data['id'], email_data['thread_id'], classification['intent'])
            )
            db.commit()
            continue

        reply_result = generate_reply(email_data, classification, system_context, db)

        sent = False
        if reply_result['send_automatically'] and not dry_run:
            sent = send_reply(service, email_data, reply_result['reply'])
            if sent:
                # Store both sides of the exchange so future replies on this
                # thread see a well-formed alternating user/assistant history
                db.execute(
                    "INSERT INTO thread_messages (thread_id, role, content) VALUES (?, ?, ?)",
                    (email_data['thread_id'], 'user', email_data['body'])
                )
                db.execute(
                    "INSERT INTO thread_messages (thread_id, role, content) VALUES (?, ?, ?)",
                    (email_data['thread_id'], 'assistant', reply_result['reply'])
                )

        # Record auto_sent=1 only if the send actually succeeded
        db.execute(
            "INSERT INTO processed_emails (email_id, thread_id, intent, reply_sent, auto_sent) VALUES (?, ?, ?, ?, ?)",
            (email_data['id'], email_data['thread_id'], classification['intent'],
             reply_result['reply'], 1 if sent else 0)
        )
        db.commit()

        # Mark as read in Gmail
        service.users().messages().modify(
            userId='me',
            id=email_data['id'],
            body={'removeLabelIds': ['UNREAD']}
        ).execute()
```
## Real Cost Breakdown at Scale
Let’s be specific. For a queue of 1,000 emails per day:
- Classification (Haiku): ~500 input tokens + 50 output tokens per email = roughly $0.28/day
- Reply generation (Sonnet): ~1,000 input + 300 output per email, but only ~70% reach this stage = roughly $4.20/day
- Total: approximately $4.50/day for 1,000 emails, or $135/month
That math changes significantly if you’re dealing with long email threads or large attachments you’re summarizing. For most SaaS support queues, $135/month to handle 80%+ of emails automatically is extremely defensible. The two-model pattern cuts costs by about 40% compared to running everything through Sonnet.
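If you want to sanity-check these figures against your own volumes, the arithmetic is a one-liner worth scripting. A sketch; the per-million-token prices in the comments are assumptions based on published rates at time of writing, so plug in current pricing and your own measured token counts before trusting the output:

```python
def daily_cost(emails_per_day: int,
               in_tokens: int, out_tokens: int,
               in_price: float, out_price: float,
               reach_rate: float = 1.0) -> float:
    """USD per day for one model stage; prices are USD per million tokens.

    reach_rate is the fraction of emails that actually hit this stage
    (replies are only generated for non-escalated mail).
    """
    per_email = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return emails_per_day * reach_rate * per_email

# Assumed prices per MTok -- verify against current vendor pricing
haiku = daily_cost(1000, 500, 50, 0.25, 1.25)            # classification, every email
sonnet = daily_cost(1000, 1000, 300, 3.00, 15.00, 0.7)   # ~70% reach reply generation
print(f"Haiku ${haiku:.2f}/day, Sonnet ${sonnet:.2f}/day, total ${haiku + sonnet:.2f}/day")
```

Exact numbers will drift from the estimates above as prices and your average email length change; the point is that the structure of the cost (cheap stage on everything, expensive stage on a fraction) is easy to model.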
## What Breaks in Production
A few things you’ll hit that the documentation won’t warn you about:
- HTML emails: The `extract_body` function above handles plain text. Many emails only contain HTML. You'll need to add an `html.parser` fallback or use BeautifulSoup to strip tags before sending to Claude — otherwise you're sending raw HTML as context and the model gets confused.
- Rate limits: Gmail API has per-user quotas (250 quota units per second). If you're processing a large backlog, add a `time.sleep(0.1)` between requests or you'll hit 429s.
- Claude context window on long threads: A 200-message thread will blow past your context budget. Cap thread history at the last 10 exchanges or summarize older messages before including them.
- Haiku JSON hallucination: Occasionally Haiku wraps JSON in a markdown code block (`` ```json ``). Add a strip step before parsing, e.g. `` text.strip().strip('`').replace('json\n', '') ``.
## Who Should Run This vs. Who Shouldn’t
Good fit: Solo founders with 50–500 emails/day who can’t hire support yet. E-commerce stores handling order status and return requests. SaaS products where 80% of support is the same 10 questions.
Bad fit: Regulated industries (healthcare, finance) where you need a documented human review step before any automated response goes out. High-stakes B2B accounts where a slightly off-tone automated reply could damage a $50k relationship. In those cases, use the agent for drafting, not sending — it still saves enormous time.
The pattern for the latter: force `send_automatically` to False for every email, push the draft into Gmail drafts using the API, and let your team review and hit send. You still get 70% of the time savings with zero risk of an autonomous reply causing damage.
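Pushing into drafts instead of sending is a single API call. A sketch reusing the same raw MIME encoding as `send_reply` (`service` is the authenticated Gmail client from earlier; the helper name is mine):

```python
import base64
from email.mime.text import MIMEText

def create_draft(service, to_addr: str, subject: str,
                 body_text: str, thread_id: str) -> dict:
    """Save a reply as a Gmail draft on the original thread for human review."""
    message = MIMEText(body_text)
    message['to'] = to_addr
    message['subject'] = f"Re: {subject}"
    raw = base64.urlsafe_b64encode(message.as_bytes()).decode()
    # drafts().create takes the same raw/threadId shape as messages().send,
    # nested under a 'message' key
    return service.users().drafts().create(
        userId='me',
        body={'message': {'raw': raw, 'threadId': thread_id}}
    ).execute()
```

Drafts created this way show up threaded in the reviewer's mailbox, so the human workflow is just: open draft, skim, send or edit.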
Running a full Claude email agent automation in production is genuinely within reach for a solo developer in a weekend. The architecture here — classify cheap, generate smart, escalate anything uncertain — is the same pattern I’d use whether you’re handling 100 emails a day or 10,000. Start in dry-run mode, review the first 200 outputs manually, then flip the auto-send flag once you trust the classifier on your specific email mix.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

