Sunday, April 5

By the end of this tutorial, you’ll have a working chatbot that remembers users across sessions — storing conversation history in SQLite for lightweight deployments and optionally upgrading to vector-indexed memory for semantic retrieval. Implementing chatbot memory with the Claude API is one of those problems that looks simple until you hit token limits, multi-user collisions, or a 200k-token context window you’re filling up with irrelevant old messages.

Claude’s API is stateless by design — every request is a clean slate. That’s actually a good thing for scalability, but it means memory is your problem to solve. Here’s how to do it properly.

What You’ll Build

A Python chatbot backend with three memory layers:

  • Session buffer — in-memory list for the current conversation turn
  • Persistent store — SQLite for durable cross-session history
  • Semantic retrieval — optional vector search to pull only the relevant past context

The full implementation handles multiple users, respects token budgets, and degrades gracefully when the database is unavailable. For production error handling patterns around API failures specifically, see our guide on LLM fallback and retry logic for production.

  1. Install dependencies — set up the project with anthropic (sqlite3 ships with Python's standard library) and optional vector libraries
  2. Design the database schema — store messages with user IDs, timestamps, and token counts
  3. Build the memory manager — load, trim, and save conversation history
  4. Wire it to Claude — construct the messages array with injected memory
  5. Add semantic retrieval — embed past turns and retrieve by relevance, not recency
  6. Run a multi-turn test — verify persistence survives process restarts

Step 1: Install Dependencies

You need Python 3.10 or newer (the code uses the X | None union syntax in its type hints). The base implementation only requires anthropic and the standard library. For vector memory, add sentence-transformers and numpy.

pip install anthropic==0.28.0
# For vector memory (optional but recommended for long-term deployments):
pip install sentence-transformers==3.0.1 numpy==1.26.4

Pin those versions. The sentence-transformers library has had breaking API changes between minor versions that will silently produce wrong embeddings. Don’t learn this the hard way in production.

Step 2: Design the Database Schema

The schema needs to support fast lookups by user_id and session_id, store raw message content, and track token counts so you can enforce budget limits without re-counting on every load.

import sqlite3
import json
from datetime import datetime

def init_db(db_path: str = "chat_memory.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path, check_same_thread=False)
    conn.execute("PRAGMA journal_mode=WAL")  # Better concurrent read performance
    conn.execute("""
        CREATE TABLE IF NOT EXISTS messages (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id     TEXT NOT NULL,
            session_id  TEXT NOT NULL,
            role        TEXT NOT NULL,          -- 'user' or 'assistant'
            content     TEXT NOT NULL,
            token_count INTEGER DEFAULT 0,
            embedding   TEXT,                   -- JSON-serialized float list (optional)
            created_at  DATETIME DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.execute("""
        CREATE INDEX IF NOT EXISTS idx_user_session
        ON messages (user_id, session_id, created_at DESC)
    """)
    conn.commit()
    return conn

The embedding column stores serialized vectors as JSON text. It’s not as fast as a dedicated vector database, but for under ~50k messages it’s perfectly usable and keeps your stack simple. If you’re scaling to millions of rows, look at our Pinecone vs Qdrant vs Weaviate comparison for the right upgrade path.
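The round-trip for that column is plain JSON serialization. A minimal sketch (the variable names are illustrative):

```python
import json
import numpy as np

# Serialize an embedding for the TEXT column...
vec = [0.12, -0.45, 0.89]
stored = json.dumps(vec)

# ...and restore it as a NumPy array for the similarity math later
restored = np.array(json.loads(stored))
```

JSON round-trips Python floats exactly, so you lose no precision compared to a binary format — just some space and parse time.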

Step 3: Build the Memory Manager

This is the core class. It loads history from SQLite, trims to a token budget, and saves new messages. The token budget approach beats naive “keep last N messages” because a single long message can blow your context window just as badly as 20 short ones.

import anthropic

class MemoryManager:
    def __init__(self, conn: sqlite3.Connection, max_tokens: int = 4000):
        self.conn = conn
        self.max_tokens = max_tokens  # Budget for history (not counting current turn)

    def load_history(self, user_id: str, session_id: str) -> list[dict]:
        """Load recent messages within token budget, newest-first then reversed."""
        rows = self.conn.execute("""
            SELECT role, content, token_count FROM messages
            WHERE user_id = ? AND session_id = ?
            ORDER BY created_at DESC
            LIMIT 100
        """, (user_id, session_id)).fetchall()

        selected, total_tokens = [], 0
        for role, content, token_count in rows:
            if total_tokens + token_count > self.max_tokens:
                break
            selected.append({"role": role, "content": content})
            total_tokens += token_count

        selected.reverse()  # Chronological order for the API
        return selected

    def save_message(self, user_id: str, session_id: str,
                     role: str, content: str, token_count: int = 0,
                     embedding: list[float] | None = None):
        self.conn.execute("""
            INSERT INTO messages (user_id, session_id, role, content, token_count, embedding)
            VALUES (?, ?, ?, ?, ?, ?)
        """, (
            user_id, session_id, role, content, token_count,
            json.dumps(embedding) if embedding is not None else None
        ))
        self.conn.commit()

    def estimate_tokens(self, text: str) -> int:
        """Rough estimate: ~4 chars per token. Good enough for budgeting."""
        return max(1, len(text) // 4)

The token estimation here is intentionally rough: it's a budget guard, not a billing calculator. If you need exact counts, the API offers a token-counting endpoint (exposed as client.messages.count_tokens() in recent SDK versions, newer than the pin in Step 1), but that adds an extra API call per save. For most applications, the 4-chars-per-token heuristic plus a comfortable margin in max_tokens is safe enough.
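To get a feel for how the heuristic behaves, here it is applied to a couple of chat-sized strings (the sample text is illustrative):

```python
def estimate_tokens(text: str) -> int:
    # Same heuristic as MemoryManager: ~4 chars per token
    return max(1, len(text) // 4)

samples = [
    "My name is Alex and I work in fintech.",
    "Can you summarise our last conversation about the Q3 budget?",
]
for s in samples:
    print(f"{estimate_tokens(s):>3} est. tokens | {s}")
```

English prose averages roughly four characters per token; code and non-English text run denser, which is another reason to keep max_tokens comfortably below your real context budget.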

Step 4: Wire It to Claude

Now connect the memory manager to the actual API calls. The key is injecting stored history into the messages array before the current user input.

class ChatBot:
    def __init__(self, db_path: str = "chat_memory.db"):
        self.client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from env
        self.conn = init_db(db_path)
        self.memory = MemoryManager(self.conn, max_tokens=4000)
        self.model = "claude-3-5-haiku-20241022"  # ~$0.0008/1k input tokens

    def chat(self, user_id: str, session_id: str, user_message: str) -> str:
        # Load persistent history
        history = self.memory.load_history(user_id, session_id)

        # Build the full messages array
        messages = history + [{"role": "user", "content": user_message}]

        response = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            system="You are a helpful assistant. You remember the user's previous messages in this conversation.",
            messages=messages
        )

        assistant_reply = response.content[0].text

        # Persist both turns
        user_tokens = self.memory.estimate_tokens(user_message)
        reply_tokens = self.memory.estimate_tokens(assistant_reply)

        self.memory.save_message(user_id, session_id, "user", user_message, user_tokens)
        self.memory.save_message(user_id, session_id, "assistant", assistant_reply, reply_tokens)

        return assistant_reply


# Usage
bot = ChatBot()
print(bot.chat("user_123", "session_abc", "My name is Alex and I work in fintech."))
print(bot.chat("user_123", "session_abc", "What did I just tell you about my job?"))
# Kill and restart the process — history survives
print(bot.chat("user_123", "session_abc", "What do you remember about me?"))

Running this on Claude 3.5 Haiku costs roughly $0.0005–0.005 per exchange depending on how much history gets injected, so a chatbot handling 10,000 conversations/day lands somewhere in the $5–50/day range. For higher-traffic deployments, check your token consumption with an observability tool early; the costs creep up fast as history buffers fill.

Step 5: Add Semantic Retrieval for Long-Term Memory

Recency-based retrieval breaks down when users return after weeks. “What was my budget last quarter?” requires semantic search, not just the last 20 messages. This step adds embedding-based retrieval as an alternative loading strategy.

from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticMemoryManager(MemoryManager):
    def __init__(self, conn: sqlite3.Connection, max_tokens: int = 4000):
        super().__init__(conn, max_tokens)
        # all-MiniLM-L6-v2: 384-dim, ~80MB, fast inference, good retrieval quality
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def save_message(self, user_id, session_id, role, content,
                     token_count=0, embedding=None):
        """Override to auto-generate embeddings on save."""
        if embedding is None and role == "user":  # Only embed user turns
            embedding = self.encoder.encode(content).tolist()
        super().save_message(user_id, session_id, role, content, token_count, embedding)

    def load_relevant_history(self, user_id: str, session_id: str,
                               query: str, top_k: int = 10) -> list[dict]:
        """Retrieve messages semantically similar to the current query."""
        query_vec = self.encoder.encode(query)

        rows = self.conn.execute("""
            SELECT role, content, embedding, token_count FROM messages
            WHERE user_id = ? AND session_id = ? AND embedding IS NOT NULL
            ORDER BY created_at DESC
            LIMIT 500
        """, (user_id, session_id)).fetchall()

        if not rows:
            return self.load_history(user_id, session_id)  # Fallback to recency

        scored = []
        for role, content, emb_json, token_count in rows:
            emb = np.array(json.loads(emb_json))
            # Cosine similarity
            score = float(np.dot(query_vec, emb) /
                         (np.linalg.norm(query_vec) * np.linalg.norm(emb) + 1e-9))
            scored.append((score, role, content, token_count))

        scored.sort(key=lambda item: item[0], reverse=True)  # Highest similarity first
        selected, total_tokens = [], 0
        for score, role, content, token_count in scored[:top_k]:
            if total_tokens + token_count > self.max_tokens:
                break
            selected.append({"role": role, "content": content})
            total_tokens += token_count

        return selected  # Note: ranked by relevance, not chronology, and every hit is a user turn

This approach stores embeddings inline in SQLite. Two caveats: the hits come back ranked by relevance rather than chronology, and because only user turns are embedded, the result is a run of consecutive user-role messages. Inject it as a single context block (prepended to the current user message or the system prompt) rather than as raw history, or you'll trip the alternating-role rule covered under Common Errors below. Performance-wise, inline embeddings work well up to ~100k messages before cosine similarity over raw rows gets slow. Beyond that, you're looking at a proper vector database. Our semantic search implementation guide covers the full production path including HNSW indexing and approximate nearest-neighbor search.
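Two practical tweaks are worth sketching here: carry timestamps through scoring so you can restore chronological order, and, since only user turns carry embeddings, wrap the hits in a single context block instead of feeding them in as raw message history. format_context_block is a hypothetical helper, not part of the class above:

```python
def format_context_block(hits: list[tuple[float, str, str]]) -> str:
    """Format (score, created_at, content) hits as one context block,
    restored to chronological order, for prepending to the user message."""
    ordered = sorted(hits, key=lambda h: h[1])  # Sort by timestamp, oldest first
    lines = [f"[{ts}] {content}" for _, ts, content in ordered]
    return "Relevant earlier messages:\n" + "\n".join(lines)

hits = [
    (0.91, "2024-06-02 10:15:00", "My budget last quarter was $40k."),
    (0.84, "2024-05-30 09:00:00", "We're planning the Q2 review."),
]
print(format_context_block(hits))
```

Passing retrieved snippets as one labelled block also makes it obvious to the model (and to you, when debugging) which parts of the prompt are memory versus live conversation.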

Step 6: Run a Multi-Turn Persistence Test

The whole point is cross-session memory. Here’s a minimal test that verifies it:

def test_persistence():
    bot = ChatBot()
    uid, sid = "test_user", "test_session"

    # Clear any existing test data
    bot.conn.execute("DELETE FROM messages WHERE user_id = ?", (uid,))
    bot.conn.commit()

    # Session 1: establish facts
    r1 = bot.chat(uid, sid, "My product is called Nexus and we're in Series A.")
    print(f"Turn 1: {r1[:80]}...")

    # Simulate a new process by creating a fresh bot instance (new DB connection)
    bot2 = ChatBot()
    r2 = bot2.chat(uid, sid, "What's the name of my product?")
    print(f"Turn 2 (new instance): {r2[:80]}...")

    assert "Nexus" in r2, "Memory didn't persist across instances!"
    print("✓ Persistence test passed")

test_persistence()

If “Nexus” doesn’t appear in the second response, your save isn’t committing before the connection closes — check that conn.commit() is called after every insert.

Common Errors

1. “Messages must alternate between user and assistant roles”

Claude’s API requires strictly alternating roles. If your history has two consecutive user or assistant messages (which happens when a save fails midway), the API throws a 400. Fix: add a deduplication pass when loading history.

def deduplicate_roles(messages: list[dict]) -> list[dict]:
    """Remove consecutive same-role messages, keeping the last one."""
    result = []
    for msg in messages:
        if result and result[-1]["role"] == msg["role"]:
            result[-1] = msg  # Overwrite with the more recent one
        else:
            result.append(msg)
    return result
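For example, a history where a failed save left two consecutive user turns collapses like this (the helper is repeated so the snippet runs standalone):

```python
def deduplicate_roles(messages):
    # Same helper as above, repeated so this snippet is self-contained
    result = []
    for msg in messages:
        if result and result[-1]["role"] == msg["role"]:
            result[-1] = msg  # Overwrite with the more recent one
        else:
            result.append(msg)
    return result

broken = [
    {"role": "user", "content": "first"},
    {"role": "user", "content": "retry after failed save"},  # Consecutive user turn
    {"role": "assistant", "content": "reply"},
]
clean = deduplicate_roles(broken)
```

The duplicate user turn collapses to the most recent one, leaving a strictly alternating history the API will accept.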

2. Context window overflow on long-running conversations

The token budget in MemoryManager trims history going in, but your estimate can be off, and two distinct failure modes are easy to conflate. If the combined history plus current message exceeds the model's context window, the API returns a 400 error; fix that by shrinking the history budget. If instead the response comes back with a stop_reason of "max_tokens", the reply itself was truncated at your output limit; fix that by raising the max_tokens parameter on the request. The second case is a common source of subtle bugs: responses look complete but are silently cut off.

3. SQLite “database is locked” under concurrent requests

SQLite’s default timeout is 5 seconds. Under async web server concurrency (FastAPI, etc.), you’ll hit lock contention. Fix: use a connection pool with check_same_thread=False and set a longer timeout, or switch to PostgreSQL. For a web API, I’d use PostgreSQL from day one — the SQLite approach is only appropriate for single-process deployments or local tools.

# For concurrent access, use a timeout
conn = sqlite3.connect(db_path, check_same_thread=False, timeout=30)
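For web frameworks, an alternative to one shared handle is a short-lived connection per request. A minimal sketch, assuming you stick with SQLite at all:

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def db_connection(db_path: str = "chat_memory.db"):
    """Open a fresh connection per request; commit on success, always close."""
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        yield conn
        conn.commit()
    finally:
        conn.close()

# Usage inside a request handler:
with db_connection() as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS demo (x INTEGER)")
    conn.execute("INSERT INTO demo VALUES (1)")
```

Opening a connection per request costs a little latency but eliminates cross-thread sharing entirely, which is usually the right trade until you outgrow SQLite.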

When to Use Which Memory Architecture

Choosing the right memory pattern depends on your scale and use case:

  • Solo product / internal tool, <1k users: SQLite + recency-based retrieval. Dead simple, zero infra, works.
  • SaaS product, 1k–100k users: PostgreSQL + recency retrieval with a token budget. Add semantic search if your use case involves long-term recall (personal assistants, account managers, coaching bots).
  • Enterprise / high-volume: Dedicated vector store (Qdrant or Weaviate), async message persistence, conversation summarisation to compress old history before storing. See our guide on Claude agents with persistent memory across sessions for the full production architecture.

If you’re building something with structured user data beyond conversation history — preferences, account info, CRM fields — pair this pattern with explicit profile storage and inject it as a system prompt prefix rather than mixing it into the message history. It’s cleaner to reason about and easier to update without touching the conversation log.
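A minimal sketch of that separation; build_system_prompt and the profile fields are illustrative, not a fixed schema:

```python
def build_system_prompt(base: str, profile: dict) -> str:
    """Prepend structured user-profile facts to the base system prompt,
    keeping them out of the conversation history entirely."""
    if not profile:
        return base
    facts = "\n".join(f"- {key}: {value}" for key, value in profile.items())
    return f"{base}\n\nKnown user profile (from account data, not chat):\n{facts}"

profile = {"name": "Alex", "industry": "fintech", "plan": "pro"}
system = build_system_prompt("You are a helpful assistant.", profile)
# Pass `system` as the system= parameter; the messages array stays purely conversational
```

Updating a profile field now means one row update, not rewriting conversation history.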

What to Build Next

Add conversation summarisation: when a session exceeds 8,000 tokens of history, call Claude to produce a 200-token summary, store it as a special "summary" role message, and use that as the history prefix going forward. This keeps memory costs flat for long-running conversations and avoids the cliff where old history suddenly disappears. For prompting patterns that make summaries consistent and grounded, our article on reducing LLM hallucinations with structured outputs has applicable techniques — the same grounding strategies that prevent hallucinated facts also prevent summaries that misrepresent what was said.
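The trigger logic can be sketched up front. needs_summary and summary_prompt below are hypothetical helpers: the 8,000-token threshold mirrors the figure above, and the actual summarisation call is omitted since it needs an API key:

```python
import sqlite3

SUMMARY_THRESHOLD = 8000  # Tokens of stored history before we compress

def needs_summary(conn: sqlite3.Connection, user_id: str, session_id: str) -> bool:
    """True once a session's stored history exceeds the summarisation threshold."""
    total = conn.execute(
        "SELECT COALESCE(SUM(token_count), 0) FROM messages "
        "WHERE user_id = ? AND session_id = ?",
        (user_id, session_id),
    ).fetchone()[0]
    return total > SUMMARY_THRESHOLD

def summary_prompt(history_text: str) -> str:
    """Prompt for the compression call; grounding instructions keep it factual."""
    return (
        "Summarise the conversation below in under 200 tokens. "
        "Keep concrete facts (names, numbers, decisions); drop pleasantries.\n\n"
        + history_text
    )
```

When needs_summary fires, send summary_prompt(history) to Claude, store the result as the special "summary" message, and delete or archive the raw rows it replaces.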

Frequently Asked Questions

How does chatbot memory work with the Claude API?

The Claude API is stateless — each request has no built-in memory of previous calls. You implement memory by storing past messages in a database and injecting them into the messages array on each new request. The model sees the conversation history as part of the input, not through any session state on Anthropic’s side.

What’s the best database for storing chatbot conversation history?

SQLite works well for single-process apps and prototypes. PostgreSQL is the right call for any multi-user production deployment. If you need semantic search over history (retrieve by topic, not just recency), add a vector store like Qdrant alongside your relational DB — don’t try to do everything in one system.

How do I prevent conversation history from exceeding Claude’s context window?

Track token counts when you save each message and enforce a budget when loading. A hard cap of 4,000–6,000 tokens for injected history leaves plenty of room for the current message and a substantial response. For longer-running conversations, implement summarisation: periodically condense old history into a compact summary and use that instead of the raw messages.

Can I give different users separate memory with the Claude API?

Yes — the API itself doesn’t separate users, so you handle it in your storage layer by partitioning messages by user_id (and optionally session_id). The implementation in this tutorial does exactly this: every database query scopes to a specific user and session, so users never see each other’s history.

Is vector-based memory retrieval worth the complexity over simple recency?

For short sessions (under an hour, focused topic), recency-based retrieval is fine. Vector retrieval pays off when users return after days or weeks and ask about things mentioned long ago, or when conversation topics jump around enough that the most relevant context isn’t the most recent. If you’re building a personal assistant or a long-term coaching tool, vector retrieval is worth the extra setup cost.

How much does it cost to run a chatbot with memory on Claude?

On Claude 3.5 Haiku (the cost-optimised model), a typical exchange with 2,000 tokens of injected history plus a 500-token response costs roughly $0.0035 at current pricing (about $0.80 per million input tokens and $4 per million output tokens). At 10,000 conversations/day, that's around $35/day before any volume discounts. Memory adds cost proportional to how much history you inject, which is why token budgeting matters.


Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

