By the end of this tutorial, you’ll have a working Claude email automation agent that connects to Gmail via the Gmail API, reads unread messages, classifies them, generates context-aware replies, and sends them automatically — with proper error handling and cost controls baked in from the start.
This isn’t a toy demo. The architecture here handles rate limits, tracks what’s already been replied to, and uses Claude Haiku for cheap classification before deciding whether to escalate to Sonnet for complex drafts. Running against a typical support inbox of 200–300 emails/day, this costs well under $1/day at current Anthropic pricing (see the cost breakdown in the FAQ).
- Install dependencies — Set up the Python environment and authenticate with Gmail API
- Build the Gmail reader — Fetch and parse unread emails with threading support
- Add Claude classification — Categorize emails cheaply with Haiku before acting
- Generate context-aware replies — Use Sonnet selectively for drafting responses
- Send and mark emails — Reply via Gmail API and track processed messages
- Add retry and failure handling — Make it production-safe
Step 1: Install Dependencies and Configure Authentication
You need three things: the Google API client library for Gmail access, the Anthropic SDK, and a way to track processed emails. SQLite works fine for the latter — no infra overhead.
```bash
pip install anthropic google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client python-dotenv
```
Create a project in Google Cloud Console, enable the Gmail API, and download your credentials.json OAuth file. Then set up your environment:
```
# .env
ANTHROPIC_API_KEY=sk-ant-...
GMAIL_CREDENTIALS_PATH=credentials.json
GMAIL_TOKEN_PATH=token.json
REPLY_LABEL=auto-replied  # optional: label emails the agent has handled
```
```python
# auth.py
import os

from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from googleapiclient.discovery import build

SCOPES = ['https://www.googleapis.com/auth/gmail.modify']

def get_gmail_service():
    creds = None
    token_path = os.getenv('GMAIL_TOKEN_PATH', 'token.json')
    if os.path.exists(token_path):
        creds = Credentials.from_authorized_user_file(token_path, SCOPES)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                os.getenv('GMAIL_CREDENTIALS_PATH'), SCOPES
            )
            creds = flow.run_local_server(port=0)
        with open(token_path, 'w') as f:
            f.write(creds.to_json())
    return build('gmail', 'v1', credentials=creds)
```
Run this once interactively to generate your token. After that, the agent runs headlessly.
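Step 1 also mentioned SQLite for tracking processed emails. The pipeline below leans on a Gmail label for that job, but a local table is a cheap second line of defense if labels ever get out of sync. Here's a minimal sketch; the `processed` schema is my own illustration, not part of the tutorial's required setup:

```python
import sqlite3

def init_db(path='processed.db'):
    """Create the tracking table if it doesn't exist."""
    conn = sqlite3.connect(path)
    conn.execute('CREATE TABLE IF NOT EXISTS processed (message_id TEXT PRIMARY KEY)')
    conn.commit()
    return conn

def already_processed(conn, message_id):
    """Check whether this Gmail message ID has been handled before."""
    row = conn.execute(
        'SELECT 1 FROM processed WHERE message_id = ?', (message_id,)
    ).fetchone()
    return row is not None

def mark_processed(conn, message_id):
    # INSERT OR IGNORE keeps this idempotent if the same ID is seen twice
    conn.execute('INSERT OR IGNORE INTO processed VALUES (?)', (message_id,))
    conn.commit()
```

Call `already_processed()` before classifying and `mark_processed()` after a successful send.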
Step 2: Build the Gmail Reader
The tricky part here isn’t fetching emails — it’s parsing MIME correctly and grabbing the right thread context so Claude can see the conversation history, not just an isolated message.
```python
# reader.py
import base64

def fetch_unread_emails(service, max_results=50):
    """Fetch unread emails, excluding those already auto-replied."""
    result = service.users().messages().list(
        userId='me',
        q='is:unread -label:auto-replied',
        maxResults=max_results
    ).execute()
    messages = result.get('messages', [])
    emails = []
    for msg in messages:
        full_msg = service.users().messages().get(
            userId='me',
            id=msg['id'],
            format='full'
        ).execute()
        emails.append(parse_email(full_msg))
    return emails

def parse_email(msg):
    """Extract sender, subject, body, and thread ID from a Gmail message."""
    headers = {h['name']: h['value'] for h in msg['payload']['headers']}
    body = extract_body(msg['payload'])
    return {
        'id': msg['id'],
        'thread_id': msg['threadId'],
        'from': headers.get('From', ''),
        'subject': headers.get('Subject', ''),
        'body': body[:4000],  # truncate to avoid blowing context on spam
        'snippet': msg.get('snippet', '')
    }

def extract_body(payload):
    """Recursively extract the plain-text body from a MIME payload."""
    if payload.get('body', {}).get('data'):
        return base64.urlsafe_b64decode(payload['body']['data']).decode('utf-8', errors='replace')
    for part in payload.get('parts', []):
        if part['mimeType'] == 'text/plain':
            data = part.get('body', {}).get('data', '')
            if data:
                return base64.urlsafe_b64decode(data).decode('utf-8', errors='replace')
        elif part.get('parts'):
            result = extract_body(part)
            if result:
                return result
    return ''
```
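Gmail returns body data in URL-safe base64, not standard base64, which is why the code above uses `urlsafe_b64decode`. To see the round-trip in isolation, here's a standalone sketch with a fabricated single-part payload shaped like a `format='full'` response:

```python
import base64

# Fabricated payload mimicking Gmail's single-part message structure
sample_payload = {
    'mimeType': 'text/plain',
    'body': {
        'data': base64.urlsafe_b64encode('Café question: do you ship to the EU?'.encode('utf-8')).decode('ascii')
    }
}

# Same decode call the reader uses; errors='replace' guards against bad bytes
decoded = base64.urlsafe_b64decode(sample_payload['body']['data']).decode('utf-8', errors='replace')
# decoded == 'Café question: do you ship to the EU?'
```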
Step 3: Add Claude Classification (Cheap First Pass)
Don’t use Sonnet for everything — it’s overkill and expensive for classification. Run every email through Haiku first. At Haiku's input rates, a small fraction of Sonnet's (check Anthropic's current pricing page for exact figures), classifying 300 emails/day costs a few cents. That’s noise.
The classification schema matters here. Vague categories produce vague routing decisions. Be specific about what your agent should and shouldn’t reply to. This is related to the broader point about writing system prompts that actually constrain agent behavior — sloppy prompts produce unpredictable auto-replies, which is a support nightmare.
```python
# classifier.py
import anthropic
import json

client = anthropic.Anthropic()

# Note the doubled braces around the JSON example: str.format() would
# otherwise treat {"category": ...} as a placeholder and raise a KeyError.
CLASSIFICATION_PROMPT = """You are an email triage agent. Classify this email into exactly one category.

Categories:
- SALES_INQUIRY: prospect asking about pricing, features, or demos
- SUPPORT_REQUEST: existing customer with a technical problem
- BILLING: invoice, payment, or subscription question
- SPAM_OR_IRRELEVANT: newsletters, cold outreach, automated notifications
- NEEDS_HUMAN: complaints, legal, sensitive, or ambiguous situations

Respond with valid JSON only: {{"category": "...", "confidence": 0.0-1.0, "reason": "one sentence"}}

Email:
From: {from_addr}
Subject: {subject}
Body: {body}
"""

def classify_email(email_data):
    """Use Claude Haiku for cheap, fast classification."""
    prompt = CLASSIFICATION_PROMPT.format(
        from_addr=email_data['from'],
        subject=email_data['subject'],
        body=email_data['body'][:1500]  # Haiku input is cheap, but truncate anyway
    )
    response = client.messages.create(
        model="claude-haiku-4-5",  # cheapest tier — reserve Sonnet for drafting
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {"category": "NEEDS_HUMAN", "confidence": 0.0, "reason": "parse error"}
```
If confidence is below 0.75, route to NEEDS_HUMAN regardless of category. Don’t let a 60% confident SALES_INQUIRY send an auto-reply — the failure mode there is embarrassing.
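The gate described above is worth isolating into a single function so the rule lives in one place. A minimal sketch (the category set and threshold come from this tutorial; the helper name is my own):

```python
REPLY_CATEGORIES = {'SALES_INQUIRY', 'SUPPORT_REQUEST', 'BILLING'}
MIN_CONFIDENCE = 0.75

def should_auto_reply(classification):
    """Reply only to whitelisted categories at or above the confidence floor."""
    return (classification.get('category') in REPLY_CATEGORIES
            and classification.get('confidence', 0.0) >= MIN_CONFIDENCE)
```

Using `.get()` with defaults means a malformed classification dict safely falls through to "don't reply", which is the failure mode you want.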
Step 4: Generate Context-Aware Replies
For emails that pass the classification filter (SALES_INQUIRY, SUPPORT_REQUEST, BILLING with high confidence), use Claude Sonnet to draft a reply. The key is including your business context in the system prompt so it doesn’t hallucinate your pricing or policies — a point covered in depth in our guide on reducing LLM hallucinations in production systems.
```python
# reply_generator.py
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a customer support agent for Acme SaaS.

Company facts (use ONLY these — never invent):
- Pricing: Starter $29/mo, Pro $99/mo, Enterprise custom
- Free trial: 14 days, no credit card required
- Support hours: Mon-Fri 9am-6pm EST
- Escalation email: support@acme.com

Rules:
- Keep replies under 200 words
- Never promise features that don't exist
- If you're unsure, say "I'll have a team member follow up" — don't guess
- Match the tone of the incoming email (formal/casual)
- Sign off as "The Acme Team" — never claim to be a human
"""

def generate_reply(email_data, category):
    """Generate a draft reply using Claude Sonnet."""
    user_prompt = f"""Draft a reply to this {category} email.

Original email:
From: {email_data['from']}
Subject: {email_data['subject']}
Body: {email_data['body']}

Write only the reply body — no subject line, no metadata."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # ~$3/MTok input — only for actual reply drafting
        max_tokens=400,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": user_prompt}]
    )
    return response.content[0].text
```
Step 5: Send Replies and Mark Emails Processed
```python
# sender.py
import base64
from email.mime.text import MIMEText

def send_reply(service, original_email, reply_body):
    """Send a reply in the same Gmail thread."""
    # The To header accepts the full "Name <email>" form from the original From header
    message = MIMEText(reply_body)
    message['to'] = original_email['from']
    message['subject'] = f"Re: {original_email['subject']}"
    # NOTE: for correct threading these should carry the RFC 822 Message-ID
    # header, not Gmail's internal id — see "Error 3" under Common Errors below
    message['In-Reply-To'] = original_email['id']
    message['References'] = original_email['id']
    raw = base64.urlsafe_b64encode(message.as_bytes()).decode()
    service.users().messages().send(
        userId='me',
        body={
            'raw': raw,
            'threadId': original_email['thread_id']
        }
    ).execute()
    # Apply label and mark as read. Gmail expects label *IDs* here:
    # 'UNREAD' is a system label, but a user-created 'auto-replied' label
    # has an opaque ID (e.g. 'Label_123') that you must look up first.
    service.users().messages().modify(
        userId='me',
        id=original_email['id'],
        body={
            'addLabelIds': ['auto-replied'],  # replace with the real label ID
            'removeLabelIds': ['UNREAD']
        }
    ).execute()
```
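One gotcha with the `modify` call: Gmail's API expects label IDs, not display names, for user-created labels (only system labels like `UNREAD` and `INBOX` work by name). A small helper to resolve the name once at startup, using the real `users().labels().list` endpoint:

```python
def get_label_id(service, label_name):
    """Resolve a Gmail label's display name to its internal ID.

    User-created labels have opaque IDs (e.g. 'Label_123'); only system
    labels such as 'UNREAD' or 'INBOX' can be referenced by name.
    """
    resp = service.users().labels().list(userId='me').execute()
    for label in resp.get('labels', []):
        if label['name'] == label_name:
            return label['id']
    return None  # label doesn't exist yet — create it in Gmail first
```

Cache the result and pass it into `addLabelIds` instead of the hardcoded string.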
Step 6: Add Retry and Failure Handling
The Gmail API rate-limits at 250 quota units/second per user. If you’re processing a large backlog on startup, you’ll hit it. The fix is simple: exponential backoff with jitter. The pattern is the same one covered in our post on LLM fallback and retry logic for production systems.
```python
# main.py
import time
import random
import logging

from auth import get_gmail_service
from reader import fetch_unread_emails
from classifier import classify_email
from reply_generator import generate_reply
from sender import send_reply

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

REPLY_CATEGORIES = {'SALES_INQUIRY', 'SUPPORT_REQUEST', 'BILLING'}
MIN_CONFIDENCE = 0.75

def with_backoff(fn, max_retries=3):
    """Simple exponential backoff with jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            logger.warning(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait:.1f}s")
            time.sleep(wait)

def process_inbox():
    service = get_gmail_service()
    emails = fetch_unread_emails(service, max_results=50)
    logger.info(f"Found {len(emails)} unread emails to process")
    for email_data in emails:
        try:
            # Step 1: Classify cheaply with Haiku
            classification = with_backoff(lambda: classify_email(email_data))
            category = classification['category']
            confidence = classification['confidence']
            logger.info(f"Email '{email_data['subject']}' → {category} ({confidence:.0%})")

            # Step 2: Skip if not auto-repliable or low confidence
            if category not in REPLY_CATEGORIES or confidence < MIN_CONFIDENCE:
                logger.info(f"Skipping — category={category}, confidence={confidence:.0%}")
                continue

            # Step 3: Generate reply with Sonnet
            reply = with_backoff(lambda: generate_reply(email_data, category))

            # Step 4: Send and label
            with_backoff(lambda: send_reply(service, email_data, reply))
            logger.info(f"Replied to: {email_data['from']}")

            # Respect Gmail rate limits
            time.sleep(0.5)
        except Exception as e:
            logger.error(f"Failed to process email {email_data['id']}: {e}")
            # Continue to next email — don't let one failure kill the batch
            continue

if __name__ == '__main__':
    process_inbox()
```
Run this on a cron job every 5–10 minutes. For Outlook instead of Gmail, swap in the Microsoft Graph API — the Claude logic stays identical, only the email fetch/send layer changes. If you want a no-code version of this pipeline, the orchestration layer maps cleanly onto n8n or Make workflows with HTTP nodes for the Gmail API calls.
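If cron isn't available (say, inside a container without a scheduler), a simple in-process poller does the same job. A sketch; the `max_iterations` parameter is my addition purely so the loop can be tested, and in production you'd leave it as `None`:

```python
import logging
import time

logger = logging.getLogger(__name__)

def poll_forever(job, interval_seconds=300, max_iterations=None):
    """Run `job` on a fixed interval; max_iterations=None means run until killed."""
    runs = 0
    while max_iterations is None or runs < max_iterations:
        try:
            job()
        except Exception:
            # one bad cycle shouldn't kill the poller
            logger.exception("Poll cycle failed; will retry next interval")
        runs += 1
        if max_iterations is None or runs < max_iterations:
            time.sleep(interval_seconds)

# Production usage: poll_forever(process_inbox, interval_seconds=300)
```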
Common Errors and How to Fix Them
Error 1: Gmail API 403 — “Insufficient Permission”
This almost always means your OAuth token was generated with the wrong scopes. Delete token.json and re-authenticate. Make sure your Cloud Console OAuth consent screen includes the Gmail modify scope, not just read. Also confirm the account running the script is added as a test user if the app is still in “Testing” mode.
Error 2: Claude Returning Non-JSON from Classifier
Claude occasionally wraps JSON in markdown code fences (```json ... ```) even when you tell it not to. Fix this by stripping fence characters before parsing:
```python
import json
import re

def safe_parse_json(text):
    # Strip markdown code fences if present
    text = re.sub(r'```(?:json)?\s*', '', text).strip()
    return json.loads(text)
```
Error 3: Reply Threading Broken (Replies Show as New Emails)
Gmail threading requires both the In-Reply-To and References headers to contain the original Message-ID header, not the Gmail message ID. Fetch the full message headers and extract the Message-ID value separately from the Gmail API’s internal id field. These are different things and the Gmail docs don’t make this obvious.
```python
# In parse_email(), also extract (header capitalization varies by sender):
'message_id': headers.get('Message-ID', headers.get('Message-Id', ''))

# In send_reply(), use:
message['In-Reply-To'] = original_email['message_id']
message['References'] = original_email['message_id']
```
What to Build Next
Add a RAG layer for support replies. Right now, your system prompt hardcodes company facts. The next step is embedding your help docs into a vector store and doing a similarity search on each incoming email to inject the 3 most relevant support articles into Claude’s context before it drafts the reply. This cuts hallucinated policy responses dramatically and makes the replies genuinely useful. The architecture is the same pattern described in our RAG pipeline implementation guide — you’d be connecting it as a retrieval step before the generate_reply() call.
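As a toy illustration of where that retrieval step slots in, here's a sketch using naive word-overlap scoring as a stand-in for a real embedding search. `HELP_DOCS` and the scoring function are illustrative assumptions only; in production you'd replace `top_k_docs` with a vector-store query:

```python
HELP_DOCS = [
    "How to reset your password: visit Settings > Security and click Reset.",
    "Billing cycles run monthly from the date you first subscribed.",
    "The free trial lasts 14 days and requires no credit card.",
]

def top_k_docs(query, docs, k=3):
    """Rank docs by word overlap with the query (stand-in for embedding similarity)."""
    query_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(query_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_context(email_body):
    """Build the context block to prepend before calling generate_reply()."""
    snippets = top_k_docs(email_body, HELP_DOCS)
    return "Relevant help articles:\n" + "\n".join(f"- {s}" for s in snippets)
```

The returned string gets injected into the user prompt (or system prompt) ahead of the draft request, so Claude grounds its answer in retrieved text rather than memory.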
You could also extend this to add a human-approval queue: instead of sending automatically, route all drafted replies to a Slack message with Approve/Reject buttons. Start there if you’re nervous about autonomous sending — it gives you 2–3 weeks of data to tune confidence thresholds before going fully automated.
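A sketch of the Slack side of that approval queue, building a Block Kit payload with Approve/Reject buttons. This only constructs the message; posting it (via an incoming webhook or `chat.postMessage`) and handling the button callbacks require a Slack app with interactivity enabled, which is out of scope here. The `action_id` values are my own naming:

```python
def build_approval_message(email_data, draft_reply):
    """Build a Slack Block Kit payload presenting a draft reply for human review."""
    return {
        "blocks": [
            {"type": "section",
             "text": {"type": "mrkdwn",
                      "text": f"*Draft reply to:* {email_data['from']}\n*Subject:* {email_data['subject']}"}},
            {"type": "section",
             # Slack caps section text at 3000 characters
             "text": {"type": "mrkdwn", "text": draft_reply[:2900]}},
            {"type": "actions",
             "elements": [
                 {"type": "button", "text": {"type": "plain_text", "text": "Approve"},
                  "style": "primary", "action_id": "approve_reply", "value": email_data['id']},
                 {"type": "button", "text": {"type": "plain_text", "text": "Reject"},
                  "style": "danger", "action_id": "reject_reply", "value": email_data['id']},
             ]},
        ]
    }
```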
Bottom Line: Who Should Deploy This
Solo founders and small teams handling 50–200 emails/day: this setup pays for itself immediately. Run it on a $5 VPS or a free-tier cloud function. Total API cost at that volume is under $20/month.
Teams with compliance requirements (legal, healthcare, finance): don’t auto-send. Use this as a draft-generation layer that feeds into your existing helpdesk (Zendesk, Intercom). The Claude email automation agent still saves 70% of the time — it just requires a human to hit send.
High-volume operations (1000+ emails/day): look at batching API calls and consider whether Claude Haiku alone handles your reply categories. Many support use cases don’t need Sonnet if your system prompt is tight. Run a 500-email evaluation before committing to a model tier.
Frequently Asked Questions
Can I use this Claude email automation agent with Outlook instead of Gmail?
Yes. Replace the Gmail API calls with Microsoft Graph API — the endpoints for reading messages and sending replies are different, but the Claude classification and reply generation logic is completely unchanged. Microsoft Graph uses OAuth 2.0 the same way Gmail does, so the auth pattern maps directly. The main difference is that Graph API uses /me/messages and /me/sendMail instead of Gmail’s message resource structure.
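A minimal sketch of what the Graph side looks like, using only the standard library to build authenticated requests against the documented endpoints. Token acquisition (typically via the MSAL library) is not shown and is an assumption here:

```python
import urllib.request

GRAPH_BASE = "https://graph.microsoft.com/v1.0"

def graph_request(path, access_token, method="GET", body=None):
    """Build an authenticated Microsoft Graph request (token acquisition not shown)."""
    return urllib.request.Request(
        f"{GRAPH_BASE}{path}",
        data=body,  # bytes of a JSON payload for POST, None for GET
        method=method,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )

# Pass the result to urllib.request.urlopen() to execute, e.g.:
# unread = graph_request("/me/messages?$filter=isRead eq false", token)
# send   = graph_request("/me/sendMail", token, method="POST", body=payload_bytes)
```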
What does it cost to run this against a typical support inbox?
For 200 emails/day: Haiku classification costs a few cents per day (~500 tokens per email × 200 emails is only 0.1M input tokens at Haiku rates). Sonnet replies (assuming 40% of emails get auto-replied, 600 tokens each) cost roughly $0.14/day. Total: well under $0.50/day, or a few dollars per month. At 1,000 emails/day, budget $1–1.50/day depending on your reply rate.
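The arithmetic generalizes to a one-line estimator; since pricing changes, the per-MTok rates are parameters you fill in from Anthropic's current pricing page (the values below are this article's assumed figures):

```python
def daily_token_cost(emails_per_day, tokens_per_email, dollars_per_mtok):
    """Estimated daily spend for one model tier at a given per-MTok rate."""
    return emails_per_day * tokens_per_email * dollars_per_mtok / 1_000_000

# Using the article's assumed figures:
classification = daily_token_cost(200, 500, 0.25)       # input-side classification estimate
replies = daily_token_cost(int(200 * 0.4), 600, 3.00)   # 40% reply rate on Sonnet
# classification == 0.025, replies == 0.144 (dollars/day)
```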
How do I prevent the agent from auto-replying to emails it shouldn’t touch?
Two layers of protection: first, the confidence threshold (don’t reply below 0.75 confidence), and second, an explicit NEEDS_HUMAN category in your classifier that catches complaints, legal threats, and ambiguous cases. You should also maintain a blocklist of sender domains that should never receive auto-replies (your own company domain, known legal firms, etc.). Start with a dry-run mode that logs intended replies without sending, and audit 100 emails before enabling live sending.
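The blocklist layer is a few lines; `email.utils.parseaddr` handles the `"Name <user@domain>"` format correctly. The domains below are placeholders for your own list:

```python
from email.utils import parseaddr

BLOCKED_DOMAINS = {'acme.com', 'lawfirm-example.com'}  # your own domain, legal contacts, etc.

def sender_domain(from_header):
    """Extract the lowercase domain from a 'Name <user@domain>' From header."""
    _, addr = parseaddr(from_header)
    return addr.rsplit('@', 1)[-1].lower() if '@' in addr else ''

def is_blocked(from_header):
    return sender_domain(from_header) in BLOCKED_DOMAINS
```

Check `is_blocked()` before the classification call so blocked senders never even consume API tokens.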
What’s the difference between using Claude Haiku vs Sonnet for email replies?
Haiku is several times cheaper than Sonnet (check current per-token rates) and fast enough for classification (typically under a second), but produces noticeably flatter, less nuanced reply drafts. For simple FAQ-style responses (pricing questions, feature questions), Haiku replies are often good enough. For anything requiring empathy, complex troubleshooting, or handling an upset customer, Sonnet produces substantially better output. The two-tier approach in this tutorial — Haiku for classification, Sonnet for drafting — is the practical sweet spot.
Can I run this as a no-code workflow in n8n or Make?
Mostly yes. Both n8n and Make have Gmail nodes and HTTP request nodes for the Anthropic API. The classification and reply generation steps map to HTTP nodes with JSON parsing. The main limitation is that complex retry logic and custom MIME parsing are harder to implement cleanly in a visual workflow builder. A hybrid approach works well: n8n or Make handles the Gmail trigger and routing, a small Python function handles the Claude calls and MIME construction.
Put this into practice
Browse our directory of Claude Code agents — ready-to-use agents for development, automation, and data workflows.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

