By the end of this tutorial, you’ll have a working Claude agents email automation pipeline that fetches emails via the Gmail API, classifies them with Claude Haiku, and either routes, drafts replies, or escalates — all without hallucinating sender intent or fabricating quoted context. This is the production-grade version, not a weekend demo.
- Install dependencies — set up the Python environment with Anthropic SDK and Gmail API client
- Configure Gmail OAuth — authenticate with service account or user OAuth flow
- Fetch and parse email threads — pull messages and preserve thread context
- Build the triage classifier — use Claude Haiku to categorize with structured JSON output
- Draft conditional responses — have Claude Sonnet write replies only when confidence is high
- Wire up the polling loop — run the agent on a schedule with deduplication
Why most email automation agents break in production
The naive approach — dump a raw email into Claude and ask “what should I do?” — fails within a week of real traffic. You hit three problems fast: thread context gets lost when you only pass the latest message, the model hallucinates sender names when the display name doesn’t match the From header, and you have no deduplication so every restart reprocesses the same inbox.
The setup below solves all three. It’s built around the Gmail API (not IMAP, which is flaky and increasingly restricted), uses Haiku for cheap classification (~$0.0002 per email at current pricing), and only invokes Sonnet for actual draft generation. A typical inbox of 200 emails/day costs roughly $0.15–$0.25 in API calls depending on thread length.
Step 1: Install dependencies
```shell
pip install anthropic google-auth google-auth-oauthlib google-api-python-client python-dotenv
```
Pin these in your requirements.txt. The Google client library is stable but the auth flow breaks if you mix versions of google-auth and google-auth-oauthlib.
Step 2: Configure Gmail OAuth
For a personal or single-user setup, OAuth with a desktop app credential is fine. For a team or SaaS context, use a service account with domain-wide delegation. The key difference: service accounts can impersonate any user in your Google Workspace domain without storing refresh tokens per user.
```python
import os
import pickle

from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from googleapiclient.discovery import build

SCOPES = ['https://www.googleapis.com/auth/gmail.modify']

def get_gmail_service():
    creds = None
    # Token cache — avoids re-auth on every run
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as f:
            creds = pickle.load(f)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES
            )
            creds = flow.run_local_server(port=0)
        with open('token.pickle', 'wb') as f:
            pickle.dump(creds, f)
    return build('gmail', 'v1', credentials=creds)
```
Don’t commit token.pickle or credentials.json to git. Add both to .gitignore immediately. The gmail.modify scope lets you read, label, and mark emails — you don’t need gmail.send unless you’re auto-sending (more on why you probably shouldn’t auto-send below).
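A minimal .gitignore for this project, assuming the filenames used throughout this tutorial, looks like:

```
# Secrets and local state — never commit these
credentials.json
token.pickle
processed_threads.json
.env
```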
Step 3: Fetch and parse email threads
This is where most tutorials cut corners. You need the full thread, not just the latest message. Otherwise Claude has no context when a reply says “as I mentioned earlier.”
```python
import base64

def fetch_unread_threads(service, max_results=50):
    """Fetch unread threads, return a list of parsed thread dicts."""
    result = service.users().threads().list(
        userId='me',
        q='is:unread',
        maxResults=max_results
    ).execute()
    threads = []
    for thread_meta in result.get('threads', []):
        thread = service.users().threads().get(
            userId='me',
            id=thread_meta['id'],
            format='full'
        ).execute()
        threads.append(parse_thread(thread))
    return threads

def parse_thread(thread):
    """Extract clean text from all messages in a thread."""
    messages = []
    for msg in thread['messages']:
        headers = {h['name']: h['value'] for h in msg['payload']['headers']}
        body = extract_body(msg['payload'])
        messages.append({
            'message_id': msg['id'],
            'from': headers.get('From', ''),
            'subject': headers.get('Subject', ''),
            'date': headers.get('Date', ''),
            'body': body[:2000]  # truncate long emails to control token cost
        })
    return {
        'thread_id': thread['id'],
        'messages': messages
    }

def extract_body(payload):
    """Recursively extract text/plain from MIME parts."""
    if payload.get('mimeType') == 'text/plain':
        data = payload.get('body', {}).get('data', '')
        # Pad to a multiple of 4 — a fixed '==' suffix raises an
        # "incorrect padding" error when the data needs no padding
        padded = data + '=' * (-len(data) % 4)
        return base64.urlsafe_b64decode(padded).decode('utf-8', errors='replace')
    for part in payload.get('parts', []):
        result = extract_body(part)
        if result:
            return result
    return ''
```
The body[:2000] truncation is intentional — long marketing emails can be 15K tokens each. Truncating keeps costs predictable without losing signal. For replies, the meaningful content is almost always in the first 2,000 characters. If you need full content for contracts or legal email, remove the cap and track your costs carefully. LLM caching strategies can cut those costs significantly when threads repeat in subsequent runs.
Step 4: Build the triage classifier with Claude Haiku
Haiku handles classification reliably and costs a fraction of Sonnet. I use structured JSON output to avoid parsing ambiguity — if you’ve fought Claude into returning valid JSON before, read the guide on structured output without hallucinations first.
```python
import json
import re

import anthropic

client = anthropic.Anthropic()

TRIAGE_SYSTEM = """You are an email triage agent. Classify each email thread and return ONLY valid JSON.

Categories:
- "urgent_reply": needs a response within 2 hours (customer complaint, contract, outage)
- "standard_reply": needs a response but not urgent
- "no_reply": newsletter, notification, FYI — no action needed
- "escalate": legal, financial, or sensitive — flag for human review

Return format:
{"category": "<category>", "confidence": 0.0-1.0, "reason": "<one sentence>", "suggested_label": "<gmail_label>"}
"""

def classify_thread(thread):
    """Use Haiku to classify a thread. Returns dict."""
    # Format the thread for Claude — most recent message first
    thread_text = "\n---\n".join(
        f"From: {m['from']}\nDate: {m['date']}\nSubject: {m['subject']}\n\n{m['body']}"
        for m in reversed(thread['messages'])
    )
    response = client.messages.create(
        model="claude-haiku-4-5",  # cheapest, fast enough for classification
        max_tokens=256,
        system=TRIAGE_SYSTEM,
        messages=[{"role": "user", "content": f"Classify this thread:\n\n{thread_text}"}]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Fallback if Claude wraps JSON in markdown — shouldn't happen with this prompt
        match = re.search(r'\{.*\}', response.content[0].text, re.DOTALL)
        return json.loads(match.group()) if match else {"category": "escalate", "confidence": 0.0}
```
Step 5: Draft conditional responses with Claude Sonnet
Only invoke Sonnet (at roughly $0.003/1K input tokens) when confidence is above 0.8 and the category warrants a reply. Everything else gets labeled and skipped. This is the core economics of a viable Claude agents email automation workflow.
```python
DRAFT_SYSTEM = """You are a professional email assistant. Write a reply that:
- Matches the tone of the original (formal if formal, casual if casual)
- Is concise — 3-5 sentences max unless detail is explicitly needed
- Never fabricates facts, names, or dates not present in the thread
- Ends with a clear next step or question
- Does NOT include a subject line — only the reply body
"""

def draft_reply(thread, classification):
    """Draft a reply only for high-confidence cases."""
    if classification['confidence'] < 0.8:
        return None
    if classification['category'] not in ('urgent_reply', 'standard_reply'):
        return None
    context = "\n---\n".join(
        f"From: {m['from']}\n{m['body']}" for m in thread['messages']
    )
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system=DRAFT_SYSTEM,
        messages=[{
            "role": "user",
            "content": f"Write a reply to this email thread. The most recent message is at the bottom.\n\n{context}"
        }]
    )
    return response.content[0].text
```
Do not auto-send replies. Save drafts to Gmail’s Drafts folder and let a human approve before sending. Auto-sending breaks trust fast when the model misclassifies tone or context, and Gmail’s API makes saving drafts just as easy as sending. For teams that want to explore auto-reply in safer contexts like lead follow-up, see how we built a lead generation email agent with safeguards.
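Saving a draft is one API call once you have a MIME message. A sketch, assuming the reply goes back to the sender of the latest message; `build_draft_body` and `save_draft` are helper names introduced here, not part of the code above:

```python
import base64
from email.mime.text import MIMEText

def build_draft_body(thread_id, to_addr, subject, body_text):
    """Build the request body for Gmail's drafts.create endpoint."""
    mime = MIMEText(body_text)
    mime['To'] = to_addr
    mime['Subject'] = subject if subject.lower().startswith('re:') else f'Re: {subject}'
    raw = base64.urlsafe_b64encode(mime.as_bytes()).decode('ascii')
    # Including threadId attaches the draft to the existing conversation
    return {'message': {'raw': raw, 'threadId': thread_id}}

def save_draft(service, thread, draft_text):
    latest = thread['messages'][-1]
    body = build_draft_body(thread['thread_id'], latest['from'],
                            latest['subject'], draft_text)
    return service.users().drafts().create(userId='me', body=body).execute()
```

The draft then shows up in the user's Drafts folder attached to the right thread, ready for a human to review, edit, and send.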
Step 6: Wire up the polling loop with deduplication
```python
import json
import time
from pathlib import Path

PROCESSED_FILE = Path('processed_threads.json')

def load_processed():
    if PROCESSED_FILE.exists():
        return set(json.loads(PROCESSED_FILE.read_text()))
    return set()

def save_processed(processed):
    PROCESSED_FILE.write_text(json.dumps(list(processed)))

def run_triage_loop(interval_seconds=300):
    service = get_gmail_service()
    processed = load_processed()
    while True:
        print("Fetching unread threads...")
        threads = fetch_unread_threads(service)
        for thread in threads:
            tid = thread['thread_id']
            if tid in processed:
                continue  # skip already-handled threads
            classification = classify_thread(thread)
            print(f"Thread {tid}: {classification['category']} ({classification['confidence']:.2f})")
            # Apply Gmail label based on classification
            label = classification.get('suggested_label', 'INBOX')
            # (label application code omitted for brevity — use service.users().threads().modify())
            draft = draft_reply(thread, classification)
            if draft:
                print(f"Draft created for thread {tid}")
                # Save to Gmail Drafts via service.users().drafts().create()
            processed.add(tid)
            save_processed(processed)  # persist per thread so a crash loses nothing
        time.sleep(interval_seconds)

if __name__ == '__main__':
    run_triage_loop()
```
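For the label step the loop leaves as a comment, the real calls are `labels().list()`/`labels().create()` plus `threads().modify()`. A sketch under the assumption that you want one Gmail label per triage category; `resolve_label_id` and `apply_label` are helper names introduced here:

```python
def resolve_label_id(existing_labels, name):
    """Find a label id by name in a labels().list() response payload."""
    for label in existing_labels.get('labels', []):
        if label['name'] == name:
            return label['id']
    return None

def apply_label(service, thread_id, label_name):
    existing = service.users().labels().list(userId='me').execute()
    label_id = resolve_label_id(existing, label_name)
    if label_id is None:
        created = service.users().labels().create(
            userId='me', body={'name': label_name}
        ).execute()
        label_id = created['id']
    # Add the triage label and clear UNREAD so the thread isn't refetched
    service.users().threads().modify(
        userId='me', id=thread_id,
        body={'addLabelIds': [label_id], 'removeLabelIds': ['UNREAD']}
    ).execute()
```

Removing UNREAD doubles as a second layer of deduplication on top of the processed-threads file, since the `is:unread` query stops matching handled threads.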
For production deployment, replace the while True loop with a scheduled job. If you’re running this on a server, cron-scheduled Claude agents is the simplest path. If you want event-driven processing (fire on new email, not on a timer), Gmail Push Notifications via Pub/Sub is the right architecture — but that’s a separate setup.
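A crontab entry for a five-minute schedule might look like this; the paths are placeholders, and it assumes you add a flag (here called `--once`, not implemented above) that performs a single fetch-classify-draft pass and exits instead of looping:

```
*/5 * * * * /opt/mail-agent/venv/bin/python /opt/mail-agent/agent.py --once >> /var/log/mail-agent.log 2>&1
```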
Common errors and how to fix them
Token refresh failures after 7 days
Google OAuth refresh tokens expire after 7 days when the app is in "Testing" mode in the Google Cloud Console, even for accounts added as test users, and you'll see invalid_grant errors when they do. The fix is to publish the app to "In production" status (sensitive scopes like gmail.modify require Google's verification); the only alternative is re-authenticating manually every 7 days, which isn't practical. Service accounts with domain-wide delegation aren't affected by this.
Claude returning markdown-wrapped JSON
Even with explicit instructions, Haiku occasionally wraps JSON in ```json fences when the email contains code snippets. The regex fallback in Step 4 handles this, but a more robust fix is to add "Return only raw JSON, no markdown formatting." as the last line of your system prompt. If you’re still fighting inconsistent output, the structured JSON output guide covers tool-use-based extraction which is more reliable than prompting alone.
Thread context overflow on long conversations
Long support threads can exceed 8–10K tokens when concatenated. Claude handles this fine at the context level, but costs spike. Set a hard limit: include only the last 5 messages in any thread. Add a note to the system prompt: “You are seeing a summary of a longer thread. Earlier messages have been omitted.” This prevents the model from hallucinating references to messages it wasn’t shown. If you need full thread memory, consider a proper memory strategy — implementing long-term memory without a database covers lightweight approaches that fit this use case.
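The five-message cap is a one-liner; a small helper (`cap_thread` is a name introduced here) that also records whether anything was dropped, so the caller knows when to add the "earlier messages have been omitted" note to the prompt:

```python
MAX_THREAD_MESSAGES = 5

def cap_thread(thread, max_messages=MAX_THREAD_MESSAGES):
    """Keep only the newest messages and flag whether older ones were dropped."""
    msgs = thread['messages']
    return {
        'thread_id': thread['thread_id'],
        'messages': msgs[-max_messages:],  # negative slice keeps the tail
        'truncated': len(msgs) > max_messages,
    }
```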
What to build next
The logical extension is multi-step routing: instead of just classifying, have the agent check your CRM to see if the sender is a paying customer, pull their account tier, and adjust the reply tone and SLA accordingly. This turns a simple triage bot into something closer to a full customer support agent. The integration pattern — classify, lookup, draft with context — maps directly onto the customer support automation architecture we’ve documented with real production metrics.
Bottom line by reader type:
- Solo founder: Run this as a cron job on a $5 VPS. Haiku + Sonnet hybrid keeps costs under $5/month for most inboxes. Start with read-only, no auto-send.
- Small team: Add a shared Google Workspace service account, label emails by team member, and route drafts to individual Gmail Drafts folders. Add a Slack notification when something is classified as escalate.
- Enterprise / high-volume: Replace the polling loop with Gmail Push Notifications, add Pub/Sub, and deploy on a proper serverless platform. The decision framework in choosing the right serverless platform for Claude agents applies directly here.
Claude agents email automation done right is boring infrastructure — it runs quietly, saves hours per week, and never surprises you. The version above is deliberately conservative: classify, label, draft, wait for human approval. Once you trust the classification accuracy on your specific inbox (run it in read-only mode for a week and audit the labels), you can selectively enable auto-send for the highest-confidence, lowest-risk categories.
Frequently Asked Questions
Can I use this with Outlook or Microsoft 365 instead of Gmail?
Yes, but swap the Gmail API for Microsoft Graph API. The authentication flow uses Azure AD OAuth 2.0, and the message structure is similar — you’ll still fetch threads and extract body text. The Claude triage logic is identical; only the email client wrapper changes. Microsoft’s Graph SDK for Python is msgraph-sdk.
How do I prevent Claude from auto-replying to emails it shouldn’t touch?
Use the confidence threshold (0.8 in the example above) as your first gate, and add an explicit blocklist of domains or subjects in your triage prompt. Never auto-send to emails flagged as escalate. Running in draft-only mode for the first few weeks lets you audit false positives before enabling any send functionality.
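A domain blocklist is safer enforced in code than in the prompt, since the model can't talk its way past it. A sketch with placeholder domains; `sender_blocked` is a helper name introduced here:

```python
BLOCKED_DOMAINS = {'legal.example.com', 'payroll.example.com'}  # placeholders

def sender_blocked(from_header):
    """Crude domain check on a 'Name <user@domain>' From header."""
    addr = from_header.split('<')[-1].rstrip('>').strip()
    domain = addr.rsplit('@', 1)[-1].lower()
    return domain in BLOCKED_DOMAINS
```

Call it before `draft_reply` and force the category to escalate on a match, regardless of what the classifier said.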
What’s the actual cost to run this on a 200-email-per-day inbox?
At current Claude pricing: Haiku classification at ~500 tokens per email costs roughly $0.04/day for 200 emails. Sonnet drafts (assuming 30% of emails need replies, ~800 tokens each) add about $0.14/day. Total: under $0.20/day or ~$6/month. Costs scale linearly with volume and thread length.
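The arithmetic behind those numbers, using the per-email and per-token figures quoted earlier in this article (pricing changes often, so treat the constants as assumptions to re-check against current rates):

```python
HAIKU_COST_PER_EMAIL = 0.0002   # the ~$0.0002/email classification estimate above
SONNET_COST_PER_1K_IN = 0.003   # the ~$0.003/1K input tokens figure above

emails_per_day = 200
reply_rate = 0.30               # share of emails that get a Sonnet draft
tokens_per_draft = 800

haiku_daily = emails_per_day * HAIKU_COST_PER_EMAIL
sonnet_daily = emails_per_day * reply_rate * tokens_per_draft / 1000 * SONNET_COST_PER_1K_IN
print(f"${haiku_daily:.2f} + ${sonnet_daily:.2f} = ${haiku_daily + sonnet_daily:.2f}/day")
# → $0.04 + $0.14 = $0.18/day
```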
How do I handle email threads where the context is split across many messages?
Cap thread inclusion at the last 5 messages and tell the model in the system prompt that earlier context has been omitted. For threads where older context is critical (contract negotiations, long support cases), consider summarizing older messages into a single “thread history” block rather than truncating them entirely. This keeps tokens manageable while preserving signal.
Is it safe to store Gmail OAuth tokens on a VPS?
It’s acceptable for personal use if you encrypt the token file at rest and restrict file permissions to the running user. For team or production deployments, store credentials in a secrets manager (AWS Secrets Manager, HashiCorp Vault, or even a .env loaded from a secure CI/CD pipeline). Never commit tokens to version control.
Put this into practice
Browse our directory of Claude Code agents — ready-to-use agents for development, automation, and data workflows.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

