Sunday, April 5

By the end of this tutorial, you’ll have a working Claude agent with persistent memory — one that stores conversation history, extracts structured facts, and retrieves relevant context across completely separate sessions using PostgreSQL and pgvector. This is not a toy demo: it’s the architecture pattern I’d use in a production support bot or personal assistant.

Most tutorials show you how to pass a conversation_history list to the Claude API. That works within a single session. The moment your server restarts, your Lambda function cold-starts, or your user comes back tomorrow, the list is gone. True persistence requires external storage, smart retrieval, and a memory management strategy that doesn’t blow your context window — or your budget.

  1. Install dependencies — Set up the Python environment with Anthropic SDK, psycopg2, and pgvector
  2. Design the memory schema — Create PostgreSQL tables for raw messages, extracted facts, and embeddings
  3. Build the memory writer — Extract and store structured facts after each turn
  4. Build the memory retriever — Semantic search to pull relevant context before each API call
  5. Wire the agent loop — Assemble the full conversational agent with session management
  6. Add memory decay and limits — Prevent unbounded growth without losing important context

Step 1: Install Dependencies

You need Python 3.10+, a running PostgreSQL 15+ instance with the pgvector extension enabled, and an Anthropic API key. I’m using Claude Haiku for the memory extraction step (roughly $0.00025 per extraction call) and Claude Sonnet for the main agent responses.

pip install anthropic psycopg2-binary pgvector openai python-dotenv pydantic

Enable pgvector in your Postgres instance:

CREATE EXTENSION IF NOT EXISTS vector;

If you’re using Supabase, pgvector comes pre-installed. For local dev, docker run -e POSTGRES_PASSWORD=postgres -p 5432:5432 pgvector/pgvector:pg16 gets you running in 30 seconds.

Step 2: Design the Memory Schema

The schema has three tables. Don’t collapse them into one — you’ll thank yourself when you need to query or expire them independently.

-- Raw conversation turns, retained for audit/replay
CREATE TABLE conversation_turns (
    id SERIAL PRIMARY KEY,
    session_id UUID NOT NULL,
    user_id TEXT NOT NULL,
    role TEXT NOT NULL CHECK (role IN ('user', 'assistant')),
    content TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Extracted structured facts about the user
CREATE TABLE user_facts (
    id SERIAL PRIMARY KEY,
    user_id TEXT NOT NULL,
    fact_type TEXT NOT NULL,  -- e.g. 'preference', 'goal', 'context'
    fact_text TEXT NOT NULL,
    confidence FLOAT DEFAULT 1.0,
    source_session UUID,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    last_accessed TIMESTAMPTZ DEFAULT NOW(),
    access_count INT DEFAULT 0
);

-- Semantic embeddings for fact retrieval
CREATE TABLE fact_embeddings (
    id SERIAL PRIMARY KEY,
    fact_id INT REFERENCES user_facts(id) ON DELETE CASCADE,
    embedding vector(1536),  -- OpenAI text-embedding-3-small dimensions
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Index for fast vector similarity search
CREATE INDEX ON fact_embeddings USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

CREATE INDEX ON conversation_turns (user_id, created_at DESC);
CREATE INDEX ON user_facts (user_id, fact_type);

The access_count and last_accessed columns on user_facts are how you implement memory decay later. Facts that never get retrieved are probably noise — you’ll expire them without losing anything the agent actually uses. For a deeper dive into semantic retrieval patterns here, see our guide on semantic search implementation for agent knowledge bases.
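To make the decay idea concrete before we get to the SQL in Step 6, here’s an illustrative Python scoring function combining those columns. The exact weights and half-life are my assumptions, not part of the schema — treat it as a sketch of how recency, usage, and confidence might trade off.

```python
from datetime import datetime, timedelta, timezone

def fact_retention_score(last_accessed: datetime, access_count: int,
                         confidence: float, half_life_days: float = 30.0) -> float:
    """Illustrative retention score: confidence, boosted by usage, decayed by staleness.

    A nightly job could expire anything below a threshold. The weighting
    here is a tunable guess, not a prescription.
    """
    age_days = (datetime.now(timezone.utc) - last_accessed).total_seconds() / 86400
    recency = 0.5 ** (age_days / half_life_days)     # exponential decay by age
    usage_boost = 1.0 + min(access_count, 10) * 0.1  # cap the usage bonus at 2x
    return confidence * recency * usage_boost

# A fact accessed today outscores one untouched for 90 days
fresh = fact_retention_score(datetime.now(timezone.utc), access_count=5, confidence=0.9)
stale = fact_retention_score(datetime.now(timezone.utc) - timedelta(days=90),
                             access_count=0, confidence=0.9)
assert fresh > stale
```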

Step 3: Build the Memory Writer

After each assistant turn, run a cheap extraction call to pull structured facts from the conversation. This is where Claude Haiku earns its keep — you don’t need Sonnet for JSON extraction.

import anthropic
import json
import psycopg2
from pydantic import BaseModel
from typing import List, Optional
import uuid

client = anthropic.Anthropic()

class ExtractedFact(BaseModel):
    fact_type: str  # 'preference', 'goal', 'personal_context', 'technical_context'
    fact_text: str
    confidence: float  # 0.0-1.0

def extract_facts_from_turn(user_message: str, assistant_response: str) -> List[ExtractedFact]:
    """Use Haiku to extract durable facts from a single conversation turn."""
    extraction_prompt = f"""Extract durable facts about the user from this conversation turn.
Only extract facts that would still be relevant in a future session.
Return a JSON array. Each item: {{"fact_type": str, "fact_text": str, "confidence": float}}

Fact types: preference, goal, personal_context, technical_context, constraint

User: {user_message}
Assistant: {assistant_response}

Return ONLY valid JSON. Empty array if no durable facts found."""

    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": extraction_prompt}]
    )
    
    try:
        raw = response.content[0].text.strip()
        facts_data = json.loads(raw)
        return [ExtractedFact(**f) for f in facts_data]
    except (json.JSONDecodeError, ValueError, TypeError):  # ValueError covers pydantic's ValidationError
        return []  # Fail silently — never block the main agent

def store_facts(conn, user_id: str, session_id: uuid.UUID, facts: List[ExtractedFact]):
    """Persist extracted facts, avoiding near-duplicate entries."""
    with conn.cursor() as cur:
        for fact in facts:
            if fact.confidence < 0.6:
                continue  # Skip low-confidence extractions
            
            # Simple deduplication: check if very similar fact already exists
            cur.execute("""
                SELECT id FROM user_facts 
                WHERE user_id = %s AND fact_type = %s 
                AND similarity(fact_text, %s) > 0.85
                LIMIT 1
            """, (user_id, fact.fact_type, fact.fact_text))
            
            if cur.fetchone():
                continue  # Skip near-duplicate
            
            cur.execute("""
                INSERT INTO user_facts (user_id, fact_type, fact_text, confidence, source_session)
                VALUES (%s, %s, %s, %s, %s)
                RETURNING id
            """, (user_id, fact.fact_type, fact.fact_text, fact.confidence, str(session_id)))
            fact_id = cur.fetchone()[0]
            
            # Embed the new fact so the retriever (Step 4) can actually find it —
            # without this insert, fact_embeddings stays empty and semantic
            # search returns nothing. embed_text is defined in Step 4.
            cur.execute("""
                INSERT INTO fact_embeddings (fact_id, embedding)
                VALUES (%s, %s::vector)
            """, (fact_id, embed_text(fact.fact_text)))
            
        conn.commit()

The similarity check requires the pg_trgm extension (CREATE EXTENSION pg_trgm;). Without deduplication, you’ll accumulate hundreds of near-identical facts within a week of real usage. This is one of those things the docs never mention.
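If your managed Postgres won’t let you enable pg_trgm, a client-side fallback with Python’s standard-library difflib catches the same near-identical extractions. Note this is a rough stand-in I’m suggesting, not trigram similarity — the ratios aren’t directly comparable, so tune the threshold separately.

```python
from difflib import SequenceMatcher

def is_near_duplicate(new_fact: str, existing_facts: list[str],
                      threshold: float = 0.85) -> bool:
    """Client-side stand-in for pg_trgm's similarity() check.

    difflib's ratio is not trigram similarity, but it flags the same
    near-identical extractions ("User prefers Python" vs "user prefers python").
    """
    new_norm = new_fact.lower().strip()
    return any(
        SequenceMatcher(None, new_norm, f.lower().strip()).ratio() >= threshold
        for f in existing_facts
    )

assert is_near_duplicate("User prefers Python", ["user prefers python"])
assert not is_near_duplicate("User prefers Python", ["User is based in Berlin"])
```

You’d call this in store_facts against the user’s existing facts for that fact_type before inserting, in place of the SQL similarity query.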

Step 4: Build the Memory Retriever

Before each API call, pull the facts most relevant to the current user message. This is a straightforward cosine similarity search against your stored embeddings.

from openai import OpenAI  # OpenAI embeddings — Anthropic doesn't offer an embeddings API

embed_client = OpenAI()

def embed_text(text: str) -> List[float]:
    """Generate embedding using text-embedding-3-small (~$0.00002 per call)."""
    response = embed_client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def retrieve_relevant_facts(conn, user_id: str, query: str, top_k: int = 8) -> List[dict]:
    """Semantic search over stored facts for this user."""
    query_embedding = embed_text(query)
    
    with conn.cursor() as cur:
        cur.execute("""
            SELECT uf.fact_type, uf.fact_text, uf.confidence,
                   1 - (fe.embedding <=> %s::vector) AS similarity
            FROM fact_embeddings fe
            JOIN user_facts uf ON uf.id = fe.fact_id
            WHERE uf.user_id = %s
            ORDER BY fe.embedding <=> %s::vector
            LIMIT %s
        """, (query_embedding, user_id, query_embedding, top_k))
        
        results = cur.fetchall()
        
        # Update access tracking for decay calculation
        if results:
            fact_texts = [r[1] for r in results]
            cur.execute("""
                UPDATE user_facts 
                SET last_accessed = NOW(), access_count = access_count + 1
                WHERE user_id = %s AND fact_text = ANY(%s)
            """, (user_id, fact_texts))
            conn.commit()
        
        return [
            {"fact_type": r[0], "fact_text": r[1], "confidence": r[2], "similarity": r[3]}
            for r in results if r[3] > 0.65  # Filter out low-relevance matches
        ]

def get_recent_turns(conn, user_id: str, limit: int = 6) -> List[dict]:
    """Pull the most recent raw conversation turns for short-term context."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT role, content FROM conversation_turns
            WHERE user_id = %s
            ORDER BY created_at DESC
            LIMIT %s
        """, (user_id, limit))
        rows = cur.fetchall()
        return [{"role": r[0], "content": r[1]} for r in reversed(rows)]
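If the <=> operator in the retrieval query above feels opaque: it’s pgvector’s cosine distance, which is why the query computes similarity as 1 - (embedding <=> query). Here’s the same computation in plain Python:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> operator computes: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Same direction -> distance 0; orthogonal -> distance 1
assert abs(cosine_distance([1.0, 0.0], [2.0, 0.0])) < 1e-9
assert abs(cosine_distance([1.0, 0.0], [0.0, 1.0]) - 1.0) < 1e-9
```

Ordering by <=> ascending therefore returns the most similar facts first, which is exactly what the LIMIT clause relies on.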

Step 5: Wire the Agent Loop

Now assemble everything into a session-aware agent. The key is building a memory context block that gets injected into the system prompt — not the conversation history. This keeps the memory consistent across turns and makes it harder for the model to “forget”.

def build_memory_context(facts: List[dict]) -> str:
    """Format retrieved facts into a system prompt section."""
    if not facts:
        return ""
    
    grouped = {}
    for f in facts:
        grouped.setdefault(f["fact_type"], []).append(f["fact_text"])
    
    lines = ["## What you know about this user:"]
    for fact_type, items in grouped.items():
        lines.append(f"\n**{fact_type.replace('_', ' ').title()}:**")
        for item in items:
            lines.append(f"- {item}")
    
    return "\n".join(lines)

def run_agent_turn(conn, user_id: str, session_id: uuid.UUID, user_message: str) -> str:
    """Execute one full agent turn with memory read/write."""
    
    # 1. Retrieve relevant long-term memory
    relevant_facts = retrieve_relevant_facts(conn, user_id, user_message)
    memory_context = build_memory_context(relevant_facts)
    
    # 2. Get recent short-term context
    recent_turns = get_recent_turns(conn, user_id, limit=6)
    
    # 3. Build system prompt with memory injected
    system_prompt = f"""You are a helpful assistant with memory of past interactions.
Use the context below to personalize your responses. Don't explicitly reference
having a "memory system" — just respond naturally with awareness of past context.

{memory_context}"""
    
    # 4. Build messages: recent history + current message
    messages = recent_turns + [{"role": "user", "content": user_message}]
    
    # 5. Call Claude
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=system_prompt,
        messages=messages
    )
    assistant_response = response.content[0].text
    
    # 6. Persist the new turn
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO conversation_turns (session_id, user_id, role, content)
            VALUES (%s, %s, 'user', %s), (%s, %s, 'assistant', %s)
        """, (str(session_id), user_id, user_message,
              str(session_id), user_id, assistant_response))
        conn.commit()
    
    # 7. Extract and store facts (run this as a background task in production)
    new_facts = extract_facts_from_turn(user_message, assistant_response)
    store_facts(conn, user_id, session_id, new_facts)
    
    return assistant_response

In production, run step 7 as a background task (Celery, FastAPI BackgroundTasks, or an async queue). Don’t block the response waiting for fact extraction — users notice latency above 300ms.
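A minimal sketch of that pattern using a thread pool, if you don’t already have Celery or FastAPI in the stack. The wrapper is generic; the closure you pass it would bundle extract_facts_from_turn and store_facts, using a fresh DB connection per task (psycopg2 connections shouldn’t be shared across threads).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# One background worker keeps extraction off the response path
_extraction_pool = ThreadPoolExecutor(max_workers=1)

def run_in_background(task: Callable[[], None]):
    """Fire-and-forget: the agent turn returns immediately while the task
    runs on the worker thread. Returns a Future so you can log errors."""
    return _extraction_pool.submit(task)

# Usage inside run_agent_turn, replacing the synchronous step 7 —
# open a fresh connection inside the closure rather than reusing `conn`:
#   run_in_background(lambda: store_facts(fresh_conn(), user_id, session_id,
#       extract_facts_from_turn(user_message, assistant_response)))
```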

This architecture pairs well with the RAG pipeline patterns we’ve covered if you also want to pull in external documents alongside user-specific memory.

Step 6: Add Memory Decay and Limits

Without expiry, your fact table grows indefinitely. Run this as a nightly cron job:

-- Delete facts never accessed after 60 days
DELETE FROM user_facts
WHERE last_accessed < NOW() - INTERVAL '60 days'
  AND access_count < 3;

-- Cap per-user facts at 500, keeping highest-confidence ones
DELETE FROM user_facts
WHERE id IN (
    SELECT id FROM (
        SELECT id, 
               ROW_NUMBER() OVER (
                   PARTITION BY user_id 
                   ORDER BY confidence DESC, access_count DESC
               ) AS rn
        FROM user_facts
    ) ranked
    WHERE rn > 500
);

The 500-fact cap and 60-day TTL are starting points. In a support bot context with high message volume, I’d drop these to 200 facts and 30 days. For a personal productivity agent with infrequent sessions, loosen them to 1000 facts and 180 days.
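Those three profiles as a parameterized config — the numbers are the article’s suggestions, but the dict structure and helper are my own sketch of how you might bind them into the nightly cron query:

```python
# Retention values from the tutorial; the config structure is illustrative
RETENTION_PROFILES = {
    "support_bot":        {"max_facts": 200,  "ttl_days": 30,  "min_access": 3},
    "default":            {"max_facts": 500,  "ttl_days": 60,  "min_access": 3},
    "personal_assistant": {"max_facts": 1000, "ttl_days": 180, "min_access": 3},
}

def decay_sql(profile: str) -> tuple[str, dict]:
    """Build the nightly DELETE with the profile's parameters bound."""
    p = RETENTION_PROFILES[profile]
    sql = """
        DELETE FROM user_facts
        WHERE last_accessed < NOW() - (%(ttl_days)s * INTERVAL '1 day')
          AND access_count < %(min_access)s
    """
    return sql, {"ttl_days": p["ttl_days"], "min_access": p["min_access"]}
```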

If you’re worried about runaway costs from extraction calls, check the LLM fallback and retry logic patterns we’ve covered — the same budget circuit breaker approach applies here.

Common Errors

pgvector dimension mismatch on insert

Error: ERROR: expected 1536 dimensions, not 3072

You changed embedding models mid-deployment (e.g., switched from text-embedding-3-small to text-embedding-3-large). The column definition locks the dimension. Fix: alter the column type or create a migration that drops and recreates the fact_embeddings table with the correct dimension. Never mix embedding models in the same column.
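A minimal migration for that dimension switch, assuming you’re willing to re-embed from scratch (the old vectors are incompatible with the new model anyway, so there’s nothing worth keeping):

```sql
-- Old embeddings are useless under the new model; drop them first
DELETE FROM fact_embeddings;

-- Change the column dimension (3072 = text-embedding-3-large)
ALTER TABLE fact_embeddings ALTER COLUMN embedding TYPE vector(3072);

-- Rebuild the ivfflat index, then backfill by re-running embed_text
-- over every row in user_facts
```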

Fact extraction returns malformed JSON

Claude Haiku occasionally returns a JSON object with a trailing comment or a prose explanation before the array. The try/except in extract_facts_from_turn handles this silently, but you’re losing extractions. Fix: add a pre-processing step that strips everything before the first [ character. Also add "Return ONLY valid JSON, no explanation" to your extraction prompt — it helps more than you’d expect. For more on getting reliable structured output from Claude, see our structured outputs and verification patterns guide.
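A sketch of that pre-processing step — slice from the first [ to the last ] before parsing, which handles both leading prose and trailing commentary (though not arrays nested inside stray brackets):

```python
import json

def parse_fact_array(raw: str) -> list:
    """Salvage a JSON array from model output that may have prose around it."""
    start = raw.find("[")
    end = raw.rfind("]")
    if start == -1 or end == -1 or end < start:
        return []  # No array present at all
    try:
        parsed = json.loads(raw[start:end + 1])
        return parsed if isinstance(parsed, list) else []
    except json.JSONDecodeError:
        return []

assert parse_fact_array("No durable facts found.") == []
```

Drop this in before the json.loads call in extract_facts_from_turn; the existing try/except still catches anything the slice can’t save.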

Context window overflow with many recent turns

If you’re using 6 recent turns and each turn is 800 tokens, that’s already 4,800 tokens before system prompt and new message. With Claude Sonnet’s 200K context this won’t hit the hard limit, but it will make responses slower and more expensive. Fix: summarize turns older than the last 3 using a Haiku call that produces a 2-3 sentence “conversation summary so far” block, and inject that instead of raw history. This keeps costs predictable as sessions grow.
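One way to structure that fix: split the history at the last 3 turns and replace the older remainder with a summary. The summarize_fn parameter is a placeholder for your Haiku call; the summary is folded into the oldest kept turn so the user/assistant roles still alternate cleanly.

```python
from typing import Callable, Dict, List, Optional

def compact_history(turns: List[Dict[str, str]], keep_last: int = 3,
                    summarize_fn: Optional[Callable[[str], str]] = None) -> List[Dict[str, str]]:
    """Replace everything older than the last `keep_last` turns with a
    short summary. summarize_fn would wrap a Haiku call in practice;
    without one, we fall back to naive truncation."""
    if len(turns) <= keep_last:
        return turns
    older, recent = turns[:-keep_last], turns[-keep_last:]
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in older)
    summary = summarize_fn(transcript) if summarize_fn else transcript[:500]
    # Fold the summary into the oldest kept turn so roles still alternate
    first = dict(recent[0])
    first["content"] = f"[Conversation summary so far: {summary}]\n\n{first['content']}"
    return [first] + recent[1:]
```

In run_agent_turn, you’d pass get_recent_turns output through compact_history before building the messages list.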

What to Build Next

Add a memory review tool call — give the agent a list_my_memories tool that users can invoke to see what the agent has learned about them, with a corresponding delete_memory tool. This is both a good product feature and a GDPR compliance requirement if you’re operating in the EU. Combine it with the Claude tool use patterns we’ve documented for clean implementation.

The bottom line by reader type:

  • Solo founder building a SaaS product: Use this exact architecture with Supabase as your Postgres host. Free tier handles development, and you’ll spend roughly $3–8/month on the embedding API calls for a moderately active user base.
  • Team building an internal agent: Add a Redis layer for the session-level short-term memory (last 6 turns) to reduce DB reads on high-frequency users. PostgreSQL handles the long-term facts fine.
  • Enterprise deployment: Replace the OpenAI embeddings with Cohere or a self-hosted model to keep all data in your VPC. The pgvector schema doesn’t care what model generated the vectors as long as you’re consistent.

Frequently Asked Questions

How do I give each user a separate memory store without creating separate databases?

Filter every query by user_id — that’s all the schema above does. You don’t need separate tables or databases per user. PostgreSQL row-level security (RLS) is worth adding if multiple services share the same DB connection pool, so one service can’t accidentally read another user’s facts.

What’s the difference between short-term and long-term memory in this architecture?

Short-term memory is the raw recent conversation turns (last N messages from conversation_turns) — it’s verbatim and session-aware. Long-term memory is the extracted facts in user_facts — structured, deduplicated, and retrieved semantically. You need both: recent turns give conversational continuity, facts give cross-session personalization.

Can I use this with the Anthropic Claude API without OpenAI embeddings?

Yes. Swap the embed_text function to use any embedding model — Cohere’s embed-english-v3.0 is a good alternative at similar pricing. Just make sure to update the vector dimension in your fact_embeddings column (Cohere v3 outputs 1024 dimensions, not 1536) and never mix dimensions in the same column.

How much does running this architecture cost at scale?

At 1,000 conversation turns/day: Haiku extraction ~$0.25/day, OpenAI embeddings ~$0.02/day, Sonnet responses ~$3–15/day depending on response length. PostgreSQL on a $25/month VPS or Supabase free tier handles this load easily. The database is not the cost driver — the main agent model is.

How do I handle memory for anonymous users who later create an account?

Create a temporary user_id from a browser fingerprint or session token when the user is anonymous. When they register, run an UPDATE user_facts SET user_id = 'real_user_id' WHERE user_id = 'anon_id' migration. This preserves the memory without gaps. Store the anon-to-real mapping in a separate table so you can handle edge cases like multiple anon sessions merging into one account.

Put this into practice

Try the Architecture Modernizer agent — ready to use, no setup required.

Browse Agents →

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

