By the end of this tutorial, you’ll have a working Claude agents email automation pipeline that fetches emails via the Gmail API, classifies them with Claude Haiku, and either routes, drafts replies, or escalates — all without hallucinating sender intent or fabricating quoted context. This is the production-grade version, not a weekend demo.
- Install dependencies — set up the Python environment with Anthropic SDK and Gmail API client
- Configure Gmail OAuth — authenticate with service account or user OAuth flow
- Fetch and parse email threads — pull messages and preserve thread context
- Build the triage classifier — use Claude Haiku to categorize with structured JSON output
- Draft conditional responses — have Claude Sonnet write replies only when confidence is high
- Wire up the polling loop — run the agent on a schedule with deduplication
Why most email automation agents break in production
The naive approach — dump a raw email into Claude and ask “what should I do?” — fails within a week of real traffic. You hit three problems fast: thread context gets lost when you only pass the latest message, the model hallucinates sender names when the display name doesn’t match the From header, and you have no deduplication so every restart reprocesses the same inbox.
The setup below solves all three. It’s built around the Gmail API (not IMAP, which is flaky and increasingly restricted), uses Haiku for cheap classification (~$0.0002 per email at current pricing), and only invokes Sonnet for actual draft generation. A typical inbox of 200 emails/day costs roughly $0.15–$0.25 in API calls depending on thread length.
Step 1: Install dependencies
```shell
pip install anthropic google-auth google-auth-oauthlib google-api-python-client python-dotenv
```
Pin these in your requirements.txt. The Google client library is stable but the auth flow breaks if you mix versions of google-auth and google-auth-oauthlib.
Step 2: Configure Gmail OAuth
For a personal or single-user setup, OAuth with a desktop app credential is fine. For a team or SaaS context, use a service account with domain-wide delegation. The key difference: service accounts can impersonate any user in your Google Workspace domain without storing refresh tokens per user.
```python
import os
import pickle

from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from googleapiclient.discovery import build

SCOPES = ['https://www.googleapis.com/auth/gmail.modify']

def get_gmail_service():
    creds = None
    # Token cache — avoids re-auth on every run
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as f:
            creds = pickle.load(f)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES
            )
            creds = flow.run_local_server(port=0)
        with open('token.pickle', 'wb') as f:
            pickle.dump(creds, f)
    return build('gmail', 'v1', credentials=creds)
```
Don’t commit token.pickle or credentials.json to git. Add both to .gitignore immediately. The gmail.modify scope lets you read, label, and mark emails — you don’t need gmail.send unless you’re auto-sending (more on why you probably shouldn’t auto-send below).
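A minimal .gitignore for this project, assuming the filenames used throughout this tutorial, looks like:

```
# Secrets and local state — never commit these
credentials.json
token.pickle
processed_threads.json
.env
```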
Step 3: Fetch and parse email threads
This is where most tutorials cut corners. You need the full thread, not just the latest message. Otherwise Claude has no context when a reply says “as I mentioned earlier.”
```python
import base64

def fetch_unread_threads(service, max_results=50):
    """Fetch unread threads, return a list of parsed thread dicts."""
    result = service.users().threads().list(
        userId='me',
        q='is:unread',
        maxResults=max_results
    ).execute()
    threads = []
    for thread_meta in result.get('threads', []):
        thread = service.users().threads().get(
            userId='me',
            id=thread_meta['id'],
            format='full'
        ).execute()
        threads.append(parse_thread(thread))
    return threads

def parse_thread(thread):
    """Extract clean text from all messages in a thread."""
    messages = []
    for msg in thread['messages']:
        headers = {h['name']: h['value'] for h in msg['payload']['headers']}
        body = extract_body(msg['payload'])
        messages.append({
            'message_id': msg['id'],
            'from': headers.get('From', ''),
            'subject': headers.get('Subject', ''),
            'date': headers.get('Date', ''),
            'body': body[:2000]  # truncate long emails to control token cost
        })
    return {
        'thread_id': thread['id'],
        'messages': messages
    }

def extract_body(payload):
    """Recursively extract text/plain from MIME parts."""
    if payload.get('mimeType') == 'text/plain':
        data = payload.get('body', {}).get('data', '')
        # Pad to a multiple of 4 — a fixed '==' suffix raises an
        # "incorrect padding" error when the data needs no padding
        padded = data + '=' * (-len(data) % 4)
        return base64.urlsafe_b64decode(padded).decode('utf-8', errors='replace')
    for part in payload.get('parts', []):
        result = extract_body(part)
        if result:
            return result
    return ''
```
The body[:2000] truncation is intentional — long marketing emails can be 15K tokens each. Truncating keeps costs predictable without losing signal. For replies, the meaningful content is almost always in the first 2,000 characters. If you need full content for contracts or legal email, remove the cap and track your costs carefully. LLM caching strategies can cut those costs significantly when threads repeat in subsequent runs.
Step 4: Build the triage classifier with Claude Haiku
Haiku handles classification reliably and costs a fraction of Sonnet. I use structured JSON output to avoid parsing ambiguity — if you’ve fought Claude into returning valid JSON before, read the guide on structured output without hallucinations first.
```python
import json
import re

import anthropic

client = anthropic.Anthropic()

TRIAGE_SYSTEM = """You are an email triage agent. Classify each email thread and return ONLY valid JSON.

Categories:
- "urgent_reply": needs a response within 2 hours (customer complaint, contract, outage)
- "standard_reply": needs a response but not urgent
- "no_reply": newsletter, notification, FYI — no action needed
- "escalate": legal, financial, or sensitive — flag for human review

Return format:
{"category": "<category>", "confidence": 0.0-1.0, "reason": "<one sentence>", "suggested_label": "<gmail_label>"}
"""

def classify_thread(thread):
    """Use Haiku to classify a thread. Returns dict."""
    # Format the thread for Claude — most recent message first
    thread_text = "\n---\n".join(
        f"From: {m['from']}\nDate: {m['date']}\nSubject: {m['subject']}\n\n{m['body']}"
        for m in reversed(thread['messages'])
    )
    response = client.messages.create(
        model="claude-haiku-4-5",  # cheapest, fast enough for classification
        max_tokens=256,
        system=TRIAGE_SYSTEM,
        messages=[{"role": "user", "content": f"Classify this thread:\n\n{thread_text}"}]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Fallback if Claude wraps JSON in markdown — shouldn't happen with this prompt
        match = re.search(r'\{.*\}', response.content[0].text, re.DOTALL)
        return json.loads(match.group()) if match else {"category": "escalate", "confidence": 0.0}
```
Step 5: Draft conditional responses with Claude Sonnet
Only invoke Sonnet (at roughly $0.003/1K input tokens) when confidence is above 0.8 and the category warrants a reply. Everything else gets labeled and skipped. This is the core economics of a viable Claude agents email automation workflow.
```python
DRAFT_SYSTEM = """You are a professional email assistant. Write a reply that:
- Matches the tone of the original (formal if formal, casual if casual)
- Is concise — 3-5 sentences max unless detail is explicitly needed
- Never fabricates facts, names, or dates not present in the thread
- Ends with a clear next step or question
- Does NOT include a subject line — only the reply body
"""

def draft_reply(thread, classification):
    """Draft a reply only for high-confidence cases."""
    if classification['confidence'] < 0.8:
        return None
    if classification['category'] not in ('urgent_reply', 'standard_reply'):
        return None
    context = "\n---\n".join(
        f"From: {m['from']}\n{m['body']}" for m in thread['messages']
    )
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system=DRAFT_SYSTEM,
        messages=[{
            "role": "user",
            "content": f"Write a reply to this email thread. The most recent message is at the bottom.\n\n{context}"
        }]
    )
    return response.content[0].text
```
Do not auto-send replies. Save drafts to Gmail’s Drafts folder and let a human approve before sending. Auto-sending breaks trust fast when the model misclassifies tone or context, and Gmail’s API makes saving drafts just as easy as sending. For teams that want to explore auto-reply in safer contexts like lead follow-up, see how we built a lead generation email agent with safeguards.
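Saving a draft is one API call once you have a MIME message. A sketch, assuming the reply goes back to the sender of the latest message; `build_draft_body` and `save_draft` are helper names introduced here, not part of the code above:

```python
import base64
from email.mime.text import MIMEText

def build_draft_body(thread_id, to_addr, subject, body_text):
    """Build the request body for Gmail's drafts.create endpoint."""
    mime = MIMEText(body_text)
    mime['To'] = to_addr
    mime['Subject'] = subject if subject.lower().startswith('re:') else f'Re: {subject}'
    raw = base64.urlsafe_b64encode(mime.as_bytes()).decode('ascii')
    # Including threadId attaches the draft to the existing conversation
    return {'message': {'raw': raw, 'threadId': thread_id}}

def save_draft(service, thread, draft_text):
    latest = thread['messages'][-1]
    body = build_draft_body(thread['thread_id'], latest['from'],
                            latest['subject'], draft_text)
    return service.users().drafts().create(userId='me', body=body).execute()
```

The draft then shows up in the user's Drafts folder attached to the right thread, ready for a human to review, edit, and send.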
Step 6: Wire up the polling loop with deduplication
```python
import json
import time
from pathlib import Path

PROCESSED_FILE = Path('processed_threads.json')

def load_processed():
    if PROCESSED_FILE.exists():
        return set(json.loads(PROCESSED_FILE.read_text()))
    return set()

def save_processed(processed):
    PROCESSED_FILE.write_text(json.dumps(list(processed)))

def run_triage_loop(interval_seconds=300):
    service = get_gmail_service()
    processed = load_processed()
    while True:
        print("Fetching unread threads...")
        threads = fetch_unread_threads(service)
        for thread in threads:
            tid = thread['thread_id']
            if tid in processed:
                continue  # skip already-handled threads
            classification = classify_thread(thread)
            print(f"Thread {tid}: {classification['category']} ({classification['confidence']:.2f})")
            # Apply Gmail label based on classification
            label = classification.get('suggested_label', 'INBOX')
            # (label application code omitted for brevity — use service.users().threads().modify())
            draft = draft_reply(thread, classification)
            if draft:
                print(f"Draft created for thread {tid}")
                # Save to Gmail Drafts via service.users().drafts().create()
            processed.add(tid)
            save_processed(processed)  # persist per thread so a crash loses nothing
        time.sleep(interval_seconds)

if __name__ == '__main__':
    run_triage_loop()
```
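For the label step the loop leaves as a comment, the real calls are `labels().list()`/`labels().create()` plus `threads().modify()`. A sketch under the assumption that you want one Gmail label per triage category; `resolve_label_id` and `apply_label` are helper names introduced here:

```python
def resolve_label_id(existing_labels, name):
    """Find a label id by name in a labels().list() response payload."""
    for label in existing_labels.get('labels', []):
        if label['name'] == name:
            return label['id']
    return None

def apply_label(service, thread_id, label_name):
    existing = service.users().labels().list(userId='me').execute()
    label_id = resolve_label_id(existing, label_name)
    if label_id is None:
        created = service.users().labels().create(
            userId='me', body={'name': label_name}
        ).execute()
        label_id = created['id']
    # Add the triage label and clear UNREAD so the thread isn't refetched
    service.users().threads().modify(
        userId='me', id=thread_id,
        body={'addLabelIds': [label_id], 'removeLabelIds': ['UNREAD']}
    ).execute()
```

Removing UNREAD doubles as a second layer of deduplication on top of the processed-threads file, since the `is:unread` query stops matching handled threads.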
For production deployment, replace the while True loop with a scheduled job. If you’re running this on a server, cron-scheduled Claude agents is the simplest path. If you want event-driven processing (fire on new email, not on a timer), Gmail Push Notifications via Pub/Sub is the right architecture — but that’s a separate setup.
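A crontab entry for a five-minute schedule might look like this; the paths are placeholders, and it assumes you add a flag (here called `--once`, not implemented above) that performs a single fetch-classify-draft pass and exits instead of looping:

```
*/5 * * * * /opt/mail-agent/venv/bin/python /opt/mail-agent/agent.py --once >> /var/log/mail-agent.log 2>&1
```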
Common errors and how to fix them
Token refresh failures after 7 days
Google OAuth refresh tokens expire after 7 days when the app is in "Testing" mode in the Google Cloud Console, even for accounts added as test users, and you'll see invalid_grant errors when they do. The fix is to publish the app to "In production" status (sensitive scopes like gmail.modify require Google's verification); the only alternative is re-authenticating manually every 7 days, which isn't practical. Service accounts with domain-wide delegation aren't affected by this.
Claude returning markdown-wrapped JSON
Even with explicit instructions, Haiku occasionally wraps JSON in ```json fences when the email contains code snippets. The regex fallback in Step 4 handles this, but a more robust fix is to add "Return only raw JSON, no markdown formatting." as the last line of your system prompt. If you’re still fighting inconsistent output, the structured JSON output guide covers tool-use-based extraction which is more reliable than prompting alone.
Thread context overflow on long conversations
Long support threads can exceed 8–10K tokens when concatenated. Claude handles this fine at the context level, but costs spike. Set a hard limit: include only the last 5 messages in any thread. Add a note to the system prompt: “You are seeing a summary of a longer thread. Earlier messages have been omitted.” This prevents the model from hallucinating references to messages it wasn’t shown. If you need full thread memory, consider a proper memory strategy — implementing long-term memory without a database covers lightweight approaches that fit this use case.
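The five-message cap is a one-liner; a small helper (`cap_thread` is a name introduced here) that also records whether anything was dropped, so the caller knows when to add the "earlier messages have been omitted" note to the prompt:

```python
MAX_THREAD_MESSAGES = 5

def cap_thread(thread, max_messages=MAX_THREAD_MESSAGES):
    """Keep only the newest messages and flag whether older ones were dropped."""
    msgs = thread['messages']
    return {
        'thread_id': thread['thread_id'],
        'messages': msgs[-max_messages:],  # negative slice keeps the tail
        'truncated': len(msgs) > max_messages,
    }
```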
What to build next
The logical extension is multi-step routing: instead of just classifying, have the agent check your CRM to see if the sender is a paying customer, pull their account tier, and adjust the reply tone and SLA accordingly. This turns a simple triage bot into something closer to a full customer support agent. The integration pattern — classify, lookup, draft with context — maps directly onto the customer support automation architecture we’ve documented with real production metrics.
Bottom line by reader type:
- Solo founder: Run this as a cron job on a $5 VPS. Haiku + Sonnet hybrid keeps costs under $5/month for most inboxes. Start with read-only, no auto-send.
- Small team: Add a shared Google Workspace service account, label emails by team member, and route drafts to individual Gmail Drafts folders. Add a Slack notification when something is classified as escalate.
- Enterprise / high-volume: Replace the polling loop with Gmail Push Notifications, add Pub/Sub, and deploy on a proper serverless platform. The decision framework in choosing the right serverless platform for Claude agents applies directly here.
Claude agents email automation done right is boring infrastructure — it runs quietly, saves hours per week, and never surprises you. The version above is deliberately conservative: classify, label, draft, wait for human approval. Once you trust the classification accuracy on your specific inbox (run it in read-only mode for a week and audit the labels), you can selectively enable auto-send for the highest-confidence, lowest-risk categories.
Frequently Asked Questions
Can I use this with Outlook or Microsoft 365 instead of Gmail?
Yes, but swap the Gmail API for Microsoft Graph API. The authentication flow uses Azure AD OAuth 2.0, and the message structure is similar — you’ll still fetch threads and extract body text. The Claude triage logic is identical; only the email client wrapper changes. Microsoft’s Graph SDK for Python is msgraph-sdk.
How do I prevent Claude from auto-replying to emails it shouldn’t touch?
Use the confidence threshold (0.8 in the example above) as your first gate, and add an explicit blocklist of domains or subjects in your triage prompt. Never auto-send to emails flagged as escalate. Running in draft-only mode for the first few weeks lets you audit false positives before enabling any send functionality.
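A domain blocklist is safer enforced in code than in the prompt, since the model can't talk its way past it. A sketch with placeholder domains; `sender_blocked` is a helper name introduced here:

```python
BLOCKED_DOMAINS = {'legal.example.com', 'payroll.example.com'}  # placeholders

def sender_blocked(from_header):
    """Crude domain check on a 'Name <user@domain>' From header."""
    addr = from_header.split('<')[-1].rstrip('>').strip()
    domain = addr.rsplit('@', 1)[-1].lower()
    return domain in BLOCKED_DOMAINS
```

Call it before `draft_reply` and force the category to escalate on a match, regardless of what the classifier said.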
What’s the actual cost to run this on a 200-email-per-day inbox?
At current Claude pricing: Haiku classification at ~500 tokens per email costs roughly $0.04/day for 200 emails. Sonnet drafts (assuming 30% of emails need replies, ~800 tokens each) add about $0.14/day. Total: under $0.20/day or ~$6/month. Costs scale linearly with volume and thread length.
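The arithmetic behind those numbers, using the per-email and per-token figures quoted earlier in this article (pricing changes often, so treat the constants as assumptions to re-check against current rates):

```python
HAIKU_COST_PER_EMAIL = 0.0002   # the ~$0.0002/email classification estimate above
SONNET_COST_PER_1K_IN = 0.003   # the ~$0.003/1K input tokens figure above

emails_per_day = 200
reply_rate = 0.30               # share of emails that get a Sonnet draft
tokens_per_draft = 800

haiku_daily = emails_per_day * HAIKU_COST_PER_EMAIL
sonnet_daily = emails_per_day * reply_rate * tokens_per_draft / 1000 * SONNET_COST_PER_1K_IN
print(f"${haiku_daily:.2f} + ${sonnet_daily:.2f} = ${haiku_daily + sonnet_daily:.2f}/day")
# → $0.04 + $0.14 = $0.18/day
```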
How do I handle email threads where the context is split across many messages?
Cap thread inclusion at the last 5 messages and tell the model in the system prompt that earlier context has been omitted. For threads where older context is critical (contract negotiations, long support cases), consider summarizing older messages into a single “thread history” block rather than truncating them entirely. This keeps tokens manageable while preserving signal.
Is it safe to store Gmail OAuth tokens on a VPS?
It’s acceptable for personal use if you encrypt the token file at rest and restrict file permissions to the running user. For team or production deployments, store credentials in a secrets manager (AWS Secrets Manager, HashiCorp Vault, or even a .env loaded from a secure CI/CD pipeline). Never commit tokens to version control.
Put this into practice
Browse our directory of Claude Code agents — ready-to-use agents for development, automation, and data workflows.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

