By the end of this tutorial, you’ll have a working memory layer for your Claude agents that persists across sessions — using three different backends depending on your scale and budget. We’ll cover vector database retrieval, SQLite for single-server deployments, and Redis for low-latency lookups, with real code and honest tradeoffs for each.
Persistent memory is the single most common gap between demo-quality chatbots and production Claude agents. Claude’s context window resets on every API call. If your agent needs to remember that a user prefers metric units, closed a deal last Tuesday, or previously asked about topic X, you have to build that memory layer yourself. This tutorial shows you exactly how.
- Install dependencies — Set up the Python environment with anthropic, sentence-transformers, and your chosen memory backend
- Define the memory schema — Design a consistent structure for storing and tagging memories
- Implement vector DB memory (Qdrant) — Semantic retrieval for fuzzy, meaning-based recall
- Implement SQLite memory — Structured, queryable memory for single-server agents
- Implement Redis memory — Fast key-value memory for session-scoped and recent-turn recall
- Wire memory into the Claude API call — Inject retrieved context into the system prompt
- Add memory write-back — Extract and store new facts after each turn
Step 1: Install Dependencies
You need three things: the Anthropic SDK, a sentence embedding model, and at least one storage backend. I’m using sentence-transformers for embeddings because it runs locally (no extra API cost) and all-MiniLM-L6-v2 is fast enough for production at ~14ms per embed on CPU.
```bash
pip install anthropic sentence-transformers qdrant-client redis sqlite-utils numpy
```
If you’re cost-sensitive, skip qdrant-client and just use SQLite or Redis. If you need semantic search (“what did the user say about pricing last month?”), you need a vector store. The Pinecone vs Weaviate vs Qdrant comparison on this site is worth reading before you commit to one — Qdrant wins for self-hosted and Pinecone for managed, in my testing.
Step 2: Define the Memory Schema
Consistency here saves you pain later. Every memory entry should carry: content, a user/session ID, a timestamp, optional tags (e.g. “preference”, “fact”, “task”), and an embedding vector if you’re doing semantic retrieval.
```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, List

@dataclass
class MemoryEntry:
    user_id: str
    content: str        # The actual memory text
    memory_type: str    # "preference", "fact", "task", "summary"
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    session_id: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    importance: float = 0.5  # 0-1, used for pruning old memories
    embedding: Optional[List[float]] = None
```
The importance field matters in production. Without it, your memory store fills with noise and you’ll start injecting irrelevant facts into prompts. Score higher for explicit user statements (“I always want JSON output”), lower for transient context (“user seemed rushed today”).
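One way to keep that scoring consistent is a small heuristic keyed on memory type and phrasing. A minimal sketch; the base scores and marker phrases are assumptions to tune for your domain:

```python
def score_importance(content: str, memory_type: str) -> float:
    """Heuristic importance in [0, 1]: a base score per memory type,
    boosted when the user states something explicitly and durably."""
    base = {"preference": 0.7, "fact": 0.5, "task": 0.6, "summary": 0.4}.get(memory_type, 0.5)
    # Explicit, lasting statements get a boost over transient context
    explicit_markers = ("always", "never", "prefer", "my name is", "i work at")
    if any(m in content.lower() for m in explicit_markers):
        base = min(1.0, base + 0.2)
    return round(base, 2)
```

With this, “I always want JSON output” as a preference scores 0.9, while “user seemed rushed today” as a fact stays at the 0.5 default.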
Step 3: Implement Vector DB Memory (Qdrant)
Use this when you need semantic retrieval — finding memories by meaning, not exact match. Best for long-running agents with many users and diverse stored facts.
```python
from typing import List
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue,
)
from sentence_transformers import SentenceTransformer

class VectorMemoryStore:
    def __init__(self, host="localhost", port=6333, collection="agent_memory"):
        self.client = QdrantClient(host=host, port=port)
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.collection = collection
        self._ensure_collection()

    def _ensure_collection(self):
        existing = [c.name for c in self.client.get_collections().collections]
        if self.collection not in existing:
            self.client.create_collection(
                collection_name=self.collection,
                vectors_config=VectorParams(size=384, distance=Distance.COSINE),
            )

    def store(self, entry: MemoryEntry) -> str:
        embedding = self.encoder.encode(entry.content).tolist()
        point_id = str(uuid.uuid4())
        self.client.upsert(
            collection_name=self.collection,
            points=[PointStruct(
                id=point_id,
                vector=embedding,
                payload={
                    "user_id": entry.user_id,
                    "content": entry.content,
                    "memory_type": entry.memory_type,
                    "created_at": entry.created_at,
                    "importance": entry.importance,
                    "tags": entry.tags,
                },
            )],
        )
        return point_id

    def retrieve(self, user_id: str, query: str, top_k: int = 5) -> List[dict]:
        query_vec = self.encoder.encode(query).tolist()
        results = self.client.search(
            collection_name=self.collection,
            query_vector=query_vec,
            query_filter=Filter(must=[
                FieldCondition(key="user_id", match=MatchValue(value=user_id))
            ]),
            limit=top_k,
            with_payload=True,
        )
        return [
            {"content": r.payload["content"], "score": r.score, "type": r.payload["memory_type"]}
            for r in results
        ]
```
Cost reality check: Qdrant on a $6/mo DigitalOcean droplet handles ~100k vectors easily. Embedding with MiniLM is free (local CPU). The only API cost is Claude itself. Compare this to naively re-sending full conversation history — on Claude Haiku at $0.25/M input tokens, a 10-turn history of 500 tokens each costs ~$0.00125 per call. Multiply by 10,000 calls/day and you’re at $12.50/day just for context replay. Memory retrieval cuts this by 60-80% in practice.
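The arithmetic behind that claim, laid out so you can plug in your own traffic numbers (the token counts and call volume are the example figures above):

```python
# Cost of naively replaying conversation history vs. injecting retrieved memories
HAIKU_INPUT_PER_M = 0.25            # $ per million input tokens

history_tokens = 10 * 500           # 10 turns x 500 tokens each
per_call = history_tokens / 1_000_000 * HAIKU_INPUT_PER_M
daily = per_call * 10_000           # at 10,000 calls/day

memory_tokens = 5 * 100             # 5 retrieved memories x ~100 tokens each
memory_per_call = memory_tokens / 1_000_000 * HAIKU_INPUT_PER_M
savings = 1 - memory_per_call / per_call

print(f"${per_call:.5f}/call, ${daily:.2f}/day with full history replay")
print(f"${memory_per_call:.6f}/call with retrieval ({savings:.0%} less on this input)")
```

The idealized figure lands above the 60-80% observed in practice because real prompts also carry the system prompt and the current user message, which retrieval does not eliminate.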
Step 4: Implement SQLite Memory
If you’re running a single-server agent with structured queries — “get all preferences for user X” or “list unfinished tasks” — SQLite is simpler and zero-infrastructure. No separate process, no network hops.
```python
import sqlite3
import json
from typing import List, Optional

class SQLiteMemoryStore:
    def __init__(self, db_path="agent_memory.db"):
        self.conn = sqlite3.connect(db_path, check_same_thread=False)
        self._setup()

    def _setup(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS memories (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                user_id TEXT NOT NULL,
                content TEXT NOT NULL,
                memory_type TEXT NOT NULL,
                importance REAL DEFAULT 0.5,
                tags TEXT DEFAULT '[]',
                created_at TEXT NOT NULL
            )
        """)
        self.conn.execute("CREATE INDEX IF NOT EXISTS idx_user ON memories(user_id)")
        self.conn.commit()

    def store(self, entry: MemoryEntry):
        self.conn.execute(
            "INSERT INTO memories (user_id, content, memory_type, importance, tags, created_at) "
            "VALUES (?,?,?,?,?,?)",
            (entry.user_id, entry.content, entry.memory_type, entry.importance,
             json.dumps(entry.tags), entry.created_at),
        )
        self.conn.commit()

    def retrieve(self, user_id: str, memory_type: Optional[str] = None, limit: int = 10) -> List[dict]:
        if memory_type:
            rows = self.conn.execute(
                "SELECT content, memory_type, importance FROM memories "
                "WHERE user_id=? AND memory_type=? "
                "ORDER BY importance DESC, created_at DESC LIMIT ?",
                (user_id, memory_type, limit),
            ).fetchall()
        else:
            rows = self.conn.execute(
                "SELECT content, memory_type, importance FROM memories "
                "WHERE user_id=? ORDER BY importance DESC, created_at DESC LIMIT ?",
                (user_id, limit),
            ).fetchall()
        return [{"content": r[0], "type": r[1], "importance": r[2]} for r in rows]

    def prune(self, user_id: str, keep_top: int = 100):
        """Keep only the top-N most important memories per user."""
        self.conn.execute("""
            DELETE FROM memories WHERE user_id=? AND id NOT IN (
                SELECT id FROM memories WHERE user_id=? ORDER BY importance DESC LIMIT ?
            )
        """, (user_id, user_id, keep_top))
        self.conn.commit()
```
SQLite’s limitation is no semantic search — you get exact and filtered retrieval only. For many agents (customer support bots, preference-tracking assistants) that’s perfectly fine, and even at scale I’d start here before adding vector search complexity.
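A standalone round trip against an in-memory database makes the retrieval ordering concrete (importance first, then recency); the sample rows are made up for illustration and the schema mirrors the table above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE memories (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id TEXT NOT NULL, content TEXT NOT NULL,
    memory_type TEXT NOT NULL, importance REAL DEFAULT 0.5,
    tags TEXT DEFAULT '[]', created_at TEXT NOT NULL)""")

rows = [
    ("u1", "Prefers JSON output", "preference", 0.9, "[]", "2025-01-01T00:00:00"),
    ("u1", "Asked about pricing", "fact",       0.4, "[]", "2025-01-02T00:00:00"),
    ("u1", "Uses metric units",   "preference", 0.7, "[]", "2025-01-03T00:00:00"),
]
conn.executemany(
    "INSERT INTO memories (user_id, content, memory_type, importance, tags, created_at) "
    "VALUES (?,?,?,?,?,?)", rows)

# Highest-importance memories come back first, capped at the limit
top = conn.execute(
    "SELECT content FROM memories WHERE user_id=? "
    "ORDER BY importance DESC, created_at DESC LIMIT 2", ("u1",)).fetchall()
print([r[0] for r in top])
```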
Step 5: Implement Redis Memory
Redis is the right choice when you need sub-millisecond retrieval for recent session context or when you’re running multiple agent instances behind a load balancer and need shared state. Use it as a fast cache layer on top of SQLite or Qdrant, not as your primary long-term store.
```python
import json
from typing import List

import redis

class RedisMemoryStore:
    def __init__(self, host="localhost", port=6379, db=0, ttl_seconds=86400):
        self.r = redis.Redis(host=host, port=port, db=db, decode_responses=True)
        self.ttl = ttl_seconds  # Default: memories expire after 24 hours

    def store(self, entry: MemoryEntry):
        key = f"memory:{entry.user_id}"
        record = {
            "content": entry.content,
            "type": entry.memory_type,
            "importance": entry.importance,
            "created_at": entry.created_at,
        }
        # Store as a sorted set — score by importance for easy top-N retrieval
        self.r.zadd(key, {json.dumps(record): entry.importance})
        self.r.expire(key, self.ttl)

    def retrieve(self, user_id: str, top_k: int = 5) -> List[dict]:
        key = f"memory:{user_id}"
        # Get top-K by importance score, descending
        items = self.r.zrevrange(key, 0, top_k - 1)
        return [json.loads(item) for item in items]

    def store_session_context(self, session_id: str, messages: list, ttl: int = 3600):
        """Store the last N turns for in-session recall."""
        key = f"session:{session_id}"
        self.r.set(key, json.dumps(messages[-20:]), ex=ttl)  # Keep last 20 turns

    def get_session_context(self, session_id: str) -> list:
        key = f"session:{session_id}"
        data = self.r.get(key)
        return json.loads(data) if data else []
```
Step 6: Wire Memory Into the Claude API Call
This is where everything connects. Before each API call, retrieve relevant memories and inject them into the system prompt as a structured context block. Keep it tight — injecting 50 memories will hurt more than it helps.
```python
import anthropic

client = anthropic.Anthropic()

def run_agent_turn(user_id: str, user_message: str, memory_store, session_id: str = None) -> str:
    # Retrieve relevant memories
    memories = memory_store.retrieve(user_id=user_id, query=user_message, top_k=5)

    # Format memories for injection
    memory_block = ""
    if memories:
        memory_lines = "\n".join([f"- [{m['type']}] {m['content']}" for m in memories])
        memory_block = f"""
<user_memory>
The following facts are known about this user from previous sessions:
{memory_lines}
</user_memory>
"""

    system_prompt = f"""You are a helpful assistant with persistent memory across sessions.
{memory_block}
When responding, incorporate relevant user preferences and history naturally.
Do not explicitly mention that you are reading from memory unless asked."""

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text
```
This pattern keeps your context window usage predictable. Rather than dumping raw conversation history, you’re injecting a curated, scored summary. This is meaningfully cheaper than full context replay — see our breakdown of LLM caching strategies for the numbers behind why this matters at volume.
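If you want that predictability enforced rather than assumed, trim the ranked memory list to a hard token budget before building the block. A minimal sketch using a rough four-characters-per-token estimate (the budget and the estimate are assumptions; swap in a real tokenizer if you need precision):

```python
def trim_to_budget(memories: list[dict], max_tokens: int = 400) -> list[dict]:
    """Keep already-ranked memories until the rough token budget is spent."""
    kept, used = [], 0
    for m in memories:
        est = len(m["content"]) // 4 + 1  # crude chars/4 token estimate
        if used + est > max_tokens:
            break
        kept.append(m)
        used += est
    return kept
```

Call it on the retrieval result before formatting, so a burst of long memories degrades gracefully instead of silently inflating every prompt.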
Step 7: Add Memory Write-Back
After each turn, extract new facts and store them. Use a lightweight Claude call with a structured extraction prompt — or on a tight budget, a simple regex/keyword pass. Here’s the Claude-based approach:
```python
import json
import re

def extract_and_store_memories(user_id: str, user_message: str, assistant_response: str, memory_store):
    extraction_prompt = f"""Given this conversation exchange, extract any new facts, preferences, or decisions worth remembering for future sessions.

User said: {user_message}
Assistant responded: {assistant_response}

Return ONLY a JSON array of memory objects, or [] if nothing memorable.
Format: [{{"content": "...", "type": "preference|fact|task|decision", "importance": 0.1-1.0}}]
Be selective — only extract genuinely useful, lasting information."""

    extraction = client.messages.create(
        model="claude-haiku-4-5",  # Use Haiku for extraction — it's fast and cheap
        max_tokens=512,
        messages=[{"role": "user", "content": extraction_prompt}],
    )

    raw = extraction.content[0].text
    # Strip markdown fences if present
    raw = re.sub(r"```json|```", "", raw).strip()
    try:
        extracted = json.loads(raw)
        for item in extracted:
            entry = MemoryEntry(
                user_id=user_id,
                content=item["content"],
                memory_type=item.get("type", "fact"),
                importance=float(item.get("importance", 0.5)),
            )
            memory_store.store(entry)
    except json.JSONDecodeError:
        pass  # Log this in production — silent failure is bad practice here
```
Using Haiku for extraction costs roughly $0.0003 per extraction call. At 1,000 daily conversations, that’s $0.30/day for the write-back layer — negligible. This is also a good pattern to study if you’re building more complex multi-step prompt chains that depend on prior state.
Common Errors
Memory injection inflating token costs unexpectedly
If you retrieve top_k=10 memories averaging 100 tokens each, you’re adding 1,000 tokens to every system prompt. At Claude Opus pricing ($15/M input), that’s $0.015 per call, or $15 per 1,000 calls, and it compounds with volume. Fix: Cap retrieved memories at 5, enforce a 50-token max per memory at write time, and use importance scoring to prune aggressively. Monitor with per-call token logging.
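The write-time cap can be a small guard in your store path. A sketch using a rough four-characters-per-token estimate (the truncation strategy is an assumption; 50 tokens is the figure above):

```python
MAX_MEMORY_TOKENS = 50

def enforce_memory_cap(content: str, max_tokens: int = MAX_MEMORY_TOKENS) -> str:
    """Truncate a memory to roughly max_tokens (chars/4 estimate) before storing."""
    max_chars = max_tokens * 4
    if len(content) <= max_chars:
        return content
    # Cut at the last word boundary inside the budget and mark the truncation
    return content[:max_chars].rsplit(" ", 1)[0] + "…"
```

Run every candidate memory through this before calling store(), so no single entry can blow the per-prompt budget no matter what the extraction step returns.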
Qdrant filter returning empty results for valid users
This happens when you store a user_id with one type (e.g. integer) and query with another (e.g. string). Qdrant’s payload filter is type-strict. Fix: Always cast user_id to string at both write and query time. Add an assertion in your store method: assert isinstance(entry.user_id, str).
Duplicate memories accumulating over time
If a user repeatedly mentions the same preference, your agent will store it five times and inject it five times. Fix: Before storing, do a similarity check — if the closest existing memory scores above 0.92 cosine similarity, skip the write. Add this to your VectorMemoryStore.store() method as a dedup guard.
```python
from qdrant_client.models import Filter, FieldCondition, MatchValue

def _is_duplicate(self, user_id: str, content: str, threshold: float = 0.92) -> bool:
    embedding = self.encoder.encode(content).tolist()
    results = self.client.search(
        collection_name=self.collection,
        query_vector=embedding,
        query_filter=Filter(must=[
            FieldCondition(key="user_id", match=MatchValue(value=user_id))
        ]),
        limit=1,
        with_payload=False,
    )
    return len(results) > 0 and results[0].score >= threshold
```
Choosing the Right Architecture
Solo founder, single server, <1,000 users: Start with SQLite. Zero infrastructure, zero ops, structured queries for free. Add vector search only when users ask “why doesn’t it remember what I said last month?” — that’s your signal.
Small team, multi-user product, moderate scale: Qdrant (self-hosted on a $12/mo server) + Redis for session context. This covers semantic recall and fast in-session state without managed database costs.
Enterprise / high concurrency: Managed Pinecone or Weaviate Cloud for vectors, Redis Cluster for session state, PostgreSQL with pgvector as a backup. You’ll also want observability on memory retrieval latency — check out the guide on observability for production Claude agents for the instrumentation patterns that actually surface problems.
If you’re deploying these agents serverlessly, the memory connection pooling story changes significantly — worth reading the serverless platform comparison before you commit to an architecture, since cold starts interact badly with SQLite file locks and Redis connection limits.
What to Build Next
The natural extension is a memory summarization job that runs nightly. As memories accumulate, older low-importance entries get summarized into a single “user profile” document and the raw memories are pruned. This keeps retrieval fast and context injections concise. Implement it as a cron job using Claude Haiku to summarize per-user memory clusters — the pattern is the same write-back loop shown in Step 7, but operating on stored memories instead of live conversation turns. Wire it with a scheduled Claude agent and you have a fully autonomous memory management system.
Frequently Asked Questions
How do I give Claude agents persistent memory without a vector database?
SQLite is the simplest path — store memories in a local database and retrieve them by user ID and type before each API call. You lose semantic search but gain zero infrastructure overhead. There’s a detailed guide on implementing Claude agent memory without a database if you want an even lighter approach using flat files.
What’s the best embedding model for agent memory retrieval?
all-MiniLM-L6-v2 is the practical default — 384 dimensions, runs on CPU at ~14ms, free to use locally. If you need higher quality retrieval for domain-specific content, text-embedding-3-small from OpenAI costs ~$0.02 per million tokens and outperforms MiniLM on most benchmarks. Avoid over-engineering: for <50k memories, model quality matters less than retrieval consistency.
How many memories should I inject per Claude API call?
3–7 is the practical sweet spot. More than 10 starts degrading response quality as Claude has to reconcile potentially conflicting or irrelevant context. Keep each memory under 100 tokens at write time, and always rank by relevance score before injecting — not by recency.
Can I use Redis alone for persistent memory, or is it too volatile?
Redis with persistence enabled (AOF or RDB snapshots) is reliable enough for many production workloads, but it’s not designed for long-term semantic retrieval. Use Redis for session-scoped context (last 20 turns) and as a fast cache layer, with a durable store (SQLite or a vector DB) as the source of truth for long-term memory.
How do I prevent my Claude agent from injecting outdated or wrong memories?
Add a TTL to low-importance memories and a deduplication check at write time. For critical facts (user’s company name, product preferences), include an update mechanism — if the new memory contradicts an existing one above a similarity threshold, overwrite rather than append. Explicit user corrections (“actually I prefer X now”) should trigger a targeted delete of the conflicting entry.
Put this into practice
Try the Architecture Modernizer agent — ready to use, no setup required.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

