Sunday, April 5

Most Claude agent implementations are stateless by default — every conversation starts cold, with no memory of what happened before. If you’re building anything beyond a single-turn chatbot, that’s a serious constraint. Stateful agent memory is the difference between an assistant that learns your codebase over weeks and one that asks you to re-explain your stack every session. The good news: you don’t need Redis, Postgres, or a vector database to build agents that remember. You need the right patterns and a clear-eyed understanding of what each one costs you.

This article covers four practical memory strategies you can implement today — from in-process context windows to file-backed rolling summaries — with working code and honest tradeoffs for each.

Why “Just Use a Database” Isn’t Always the Right Answer

The default advice for persistent agent state is “spin up a vector DB and store embeddings.” That works. It’s also overkill for a huge range of use cases. A solo founder building a client-facing research agent doesn’t want to manage a Pinecone index. A developer shipping a prototype needs something that works on a weekend, not something that requires infrastructure decisions before writing a single line of logic.

Database-free memory strategies make sense when:

  • Your agent handles one user or a small, bounded set of users
  • Memory volume is modest — hundreds of facts, not millions
  • You’re prototyping and want to validate behavior before adding infra
  • You’re deploying serverless or in constrained environments
  • The latency cost of a DB round-trip matters at your scale

None of these strategies are “better” than a proper database at scale. But they’re faster to implement, cheaper to run, and surprisingly capable for mid-size production workloads.

Strategy 1: Compressed Context Window Accumulation

The simplest form of agent memory is just keeping a running transcript and passing it back with every request. The problem is that Claude’s context window, while large (200k tokens on Claude 3 models), isn’t infinite — and at roughly $3 per million input tokens on Sonnet, you burn money fast with naive accumulation.
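To see why naive accumulation gets expensive, here’s a quick back-of-envelope sketch. The 500-tokens-per-turn figure is illustrative, and the $3/M price is the Sonnet input rate mentioned above — the point is the quadratic growth, not the exact dollars:

```python
# Back-of-envelope: naive accumulation re-sends the whole transcript every
# turn, so cumulative input tokens grow quadratically with turn count.

def naive_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when turn i re-sends all of turns 1..i."""
    return sum(i * tokens_per_turn for i in range(1, turns + 1))

def cost_usd(input_tokens: int, price_per_million: float = 3.0) -> float:
    return input_tokens / 1_000_000 * price_per_million

# 100 turns at ~500 tokens each: 2,525,000 cumulative input tokens, ~$7.58
total = naive_input_tokens(100, 500)
```

Double the conversation length and the input bill roughly quadruples — which is exactly why the compression step below pays for itself.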

The smarter version is selective compression: after each exchange, you summarize the conversation so far into a compact “memory block” rather than appending raw turns.

import anthropic

client = anthropic.Anthropic()

def compress_history(history: list[dict], model: str = "claude-3-haiku-20240307") -> str:
    """Compress conversation history into a compact memory block."""
    history_text = "\n".join(
        f"{msg['role'].upper()}: {msg['content']}" for msg in history
    )
    response = client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                f"Summarize the following conversation into a concise memory block. "
                f"Preserve: decisions made, facts established, user preferences, "
                f"any open questions. Be specific, not vague.\n\n{history_text}"
            )
        }]
    )
    return response.content[0].text

class CompressedMemoryAgent:
    def __init__(self, system_prompt: str):
        self.system = system_prompt
        self.memory_block = ""       # compressed long-term memory
        self.recent_turns = []       # last few raw turns (short-term)
        self.compress_every = 6      # compress after every 6 turns (3 exchanges)

    def chat(self, user_message: str) -> str:
        # Build messages: inject memory block as context if we have one
        messages = []
        if self.memory_block:
            messages.append({
                "role": "user",
                "content": f"[MEMORY CONTEXT]\n{self.memory_block}\n[END MEMORY]"
            })
            messages.append({
                "role": "assistant",
                "content": "Understood, I have the context from our previous interactions."
            })

        messages.extend(self.recent_turns)
        messages.append({"role": "user", "content": user_message})

        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            system=self.system,
            messages=messages
        )
        assistant_reply = response.content[0].text

        # Update recent turns
        self.recent_turns.append({"role": "user", "content": user_message})
        self.recent_turns.append({"role": "assistant", "content": assistant_reply})

        # Compress when we hit the threshold
        if len(self.recent_turns) >= self.compress_every:
            new_summary = compress_history(self.recent_turns)
            # Merge with existing memory block
            if self.memory_block:
                self.memory_block = compress_history([
                    {"role": "system", "content": f"Previous memory: {self.memory_block}"},
                    {"role": "system", "content": f"New events: {new_summary}"}
                ])
            else:
                self.memory_block = new_summary
            self.recent_turns = []  # clear after compression

        return assistant_reply

The compression step costs roughly $0.0001 per run at Haiku pricing — negligible. The memory block stays under 500 tokens, so your context overhead is small and predictable. The tradeoff: fine-grained detail gets lost in summarization. If your agent needs to recall exact phrasing or specific numbers from three sessions ago, you’ll need a different approach.
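Turn count is a blunt compression trigger. A variant worth considering is triggering on an estimated token budget instead — sketched here with the rough heuristic of ~4 characters per token. The `should_compress` helper is hypothetical, not part of the class above:

```python
# A budget-based trigger: compress when the estimated token footprint of the
# recent turns exceeds a threshold, rather than after a fixed turn count.
# Assumes ~4 characters per token — a crude heuristic for English text only.

def estimate_tokens(text: str) -> int:
    """Very rough token estimate; good enough for a compression trigger."""
    return max(1, len(text) // 4)

def should_compress(recent_turns: list[dict], budget: int = 2000) -> bool:
    total = sum(estimate_tokens(msg["content"]) for msg in recent_turns)
    return total >= budget
```

This keeps compression frequency proportional to actual context pressure: six terse exchanges won’t trigger it, while two long pasted-code exchanges will.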

Strategy 2: Structured JSON State Files

For agents that need to track discrete, structured facts — user preferences, task state, entity attributes — a local JSON file beats compressed summaries. It’s explicit, inspectable, and survives process restarts without any infra.

import json
import os
from pathlib import Path

class PersistentStateAgent:
    def __init__(self, state_file: str, system_prompt: str):
        self.state_path = Path(state_file)
        self.system = system_prompt
        self.state = self._load_state()

    def _load_state(self) -> dict:
        if self.state_path.exists():
            with open(self.state_path) as f:
                return json.load(f)
        # Default state schema
        return {
            "user_preferences": {},
            "known_facts": [],
            "task_history": [],
            "session_count": 0
        }

    def _save_state(self):
        with open(self.state_path, "w") as f:
            json.dump(self.state, f, indent=2)

    def _extract_state_updates(self, user_msg: str, assistant_reply: str) -> dict:
        """Ask Claude to extract structured updates from the exchange."""
        extraction_prompt = f"""
Given this exchange, extract any new facts, preferences, or task updates.
Return ONLY valid JSON matching this schema (use null if nothing new):
{{
  "new_facts": ["fact1", "fact2"] or null,
  "preferences": {{"key": "value"}} or null,
  "task_update": {{"description": "...", "status": "..."}} or null
}}

User: {user_msg}
Assistant: {assistant_reply}
"""
        response = client.messages.create(
            model="claude-3-haiku-20240307",  # cheap extraction model
            max_tokens=256,
            messages=[{"role": "user", "content": extraction_prompt}]
        )
        try:
            raw = response.content[0].text.strip()
            # Claude sometimes wraps JSON in markdown fences; strip them first
            if raw.startswith("```"):
                raw = raw.strip("`").removeprefix("json").strip()
            return json.loads(raw)
        except json.JSONDecodeError:
            return {}  # extraction failed, don't crash the agent

    def chat(self, user_message: str) -> str:
        # Inject current state as context
        state_context = json.dumps(self.state, indent=2)
        messages = [
            {
                "role": "user",
                "content": f"[AGENT STATE]\n{state_context}\n[/AGENT STATE]\n\n{user_message}"
            }
        ]

        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            system=self.system,
            messages=messages
        )
        reply = response.content[0].text

        # Extract and persist state updates
        updates = self._extract_state_updates(user_message, reply)
        if updates.get("new_facts"):
            self.state["known_facts"].extend(updates["new_facts"])
        if updates.get("preferences"):
            self.state["user_preferences"].update(updates["preferences"])
        if updates.get("task_update"):
            self.state["task_history"].append(updates["task_update"])

        self.state["session_count"] += 1  # note: counts exchanges, not distinct sessions
        self._save_state()
        return reply

The extraction call costs about $0.0002 per exchange on Haiku. The JSON file is human-readable, trivially editable, and easy to back up. For single-user agents or automation workflows running on a server, this is often the right call over anything more complex.

What Breaks with JSON State

Two things will bite you. First, the extraction step sometimes misses nuance — Claude summarizes “the user wants fast responses” as a preference but doesn’t know to tag it as overriding a previous preference. Second, large JSON state (hundreds of facts) bloats your context window quickly. If you’re injecting the entire state object with every call, you’ll hit cost and latency issues past about 200 facts. At that point, you need selective retrieval — which is where vector search actually earns its keep.
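Before reaching for embeddings, there’s a stopgap: score stored facts by keyword overlap with the incoming message and inject only the top matches. A minimal sketch — the helper is hypothetical and deliberately crude, but it keeps the context footprint flat as the fact list grows:

```python
# Stopgap selective retrieval: rank stored facts by word overlap with the
# incoming message and inject only the top-k hits. No embeddings, no infra.

def select_relevant_facts(facts: list[str], query: str, k: int = 10) -> list[str]:
    query_words = set(query.lower().split())

    def score(fact: str) -> int:
        # Number of query words that appear verbatim in the fact
        return len(query_words & set(fact.lower().split()))

    ranked = sorted(facts, key=score, reverse=True)
    return [f for f in ranked[:k] if score(f) > 0]
```

This fails on synonyms and paraphrase (“deploy” won’t match “ship”), which is precisely the gap vector search fills — but it buys you headroom well past the 200-fact mark before you need it.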

Strategy 3: Rolling File-Based Episode Memory

A middle ground between compressed summaries and structured JSON: append-only episode logs with a rolling summary. Think of it like a ship’s log — every session gets a brief entry, and periodically you consolidate entries into a higher-level summary.

import os
from datetime import datetime

EPISODES_FILE = "agent_episodes.txt"
SUMMARY_FILE = "agent_summary.txt"
MAX_EPISODES_BEFORE_CONSOLIDATION = 10

def log_episode(user_goal: str, outcome: str, key_facts: list[str]):
    """Append a single session episode to the log."""
    entry = (
        f"[{datetime.now().isoformat()}]\n"
        f"Goal: {user_goal}\n"
        f"Outcome: {outcome}\n"
        f"Facts: {'; '.join(key_facts)}\n"
        f"---\n"
    )
    with open(EPISODES_FILE, "a") as f:
        f.write(entry)

def get_memory_context() -> str:
    """Load summary + recent episodes as memory context."""
    summary = ""
    if os.path.exists(SUMMARY_FILE):
        with open(SUMMARY_FILE) as f:
            summary = f.read()

    recent_episodes = ""
    if os.path.exists(EPISODES_FILE):
        with open(EPISODES_FILE) as f:
            recent_episodes = f.read()

    return f"LONG-TERM SUMMARY:\n{summary}\n\nRECENT EPISODES:\n{recent_episodes}"

def maybe_consolidate():
    """Consolidate episodes into summary when threshold is hit."""
    if not os.path.exists(EPISODES_FILE):
        return

    with open(EPISODES_FILE) as f:
        content = f.read()

    episode_count = content.count("---")
    if episode_count < MAX_EPISODES_BEFORE_CONSOLIDATION:
        return

    # Consolidate: merge episodes into updated summary
    existing_summary = ""
    if os.path.exists(SUMMARY_FILE):
        with open(SUMMARY_FILE) as f:
            existing_summary = f.read()

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": (
                f"Consolidate this agent memory into a concise summary. "
                f"Preserve important patterns, facts, and user preferences.\n\n"
                f"EXISTING SUMMARY:\n{existing_summary}\n\n"
                f"NEW EPISODES:\n{content}"
            )
        }]
    )
    with open(SUMMARY_FILE, "w") as f:
        f.write(response.content[0].text)

    # Clear processed episodes
    os.remove(EPISODES_FILE)

This pattern works well for agents that run on a schedule — daily research agents, weekly report generators, anything with discrete sessions rather than continuous conversation. The episode log is human-auditable, which matters when something goes wrong in production and you need to understand what the agent “thinks” it knows.

When These Patterns Break Down (And What to Do)

Be honest with yourself about the limits here. None of these strategies handle concurrent users — if two users hit your agent simultaneously and both write to the same state file, you get corruption. Use file locking or per-user state files (easy: key by user ID) to handle this.
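Two cheap hardening steps for file-backed state, sketched below: key the state file by user ID, and write through a temp file plus `os.replace` so a crash mid-write can’t leave half a JSON object on disk. The `agent_state` directory layout is an assumption for illustration, not part of the class above:

```python
import json
import os
import tempfile
from pathlib import Path

def state_path_for(user_id: str, state_dir: str = "agent_state") -> Path:
    """One state file per user; sanitize the ID so it's filesystem-safe."""
    safe = "".join(c if c.isalnum() else "_" for c in user_id)
    return Path(state_dir) / f"{safe}.json"

def atomic_save(path: Path, state: dict) -> None:
    """Write to a temp file, then rename over the target in one step."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
        os.replace(tmp, path)  # atomic rename: readers see old or new, never half
    except BaseException:
        os.unlink(tmp)
        raise
```

Per-user paths eliminate cross-user clobbering entirely; the atomic rename covers the remaining case of a process dying mid-write. Concurrent writes by the *same* user still need a lock, but for most single-user-per-process deployments this is enough.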

Summarization introduces drift. Over many compression cycles, nuance degrades. An agent that starts knowing “user prefers TypeScript, specifically the strict mode config they mentioned in session 3” can end up with just “user prefers TypeScript” after a few compression rounds. If precision matters, keep raw logs alongside summaries and surface them when stakes are high.

These patterns also don’t give you semantic search. If you need to retrieve “what did the user say about deployment three months ago?”, you need embeddings. The file-based approaches here work for recent context and structured facts — not fuzzy retrieval over long histories.

Picking the Right Pattern for Your Use Case

Here’s how I’d choose between these strategies based on what you’re building:

  • Single-session agents or prototypes: Start with compressed context window accumulation. No files, no persistence, easiest to iterate on.
  • Single-user persistent agents (personal assistant, code reviewer): JSON state files. Fast to implement, human-editable, survives restarts.
  • Scheduled/batch agents with discrete sessions: Rolling episode memory. Best auditing, handles long time horizons gracefully.
  • Multi-user agents at any meaningful scale: You’ve outgrown database-free. Use SQLite at minimum, or a proper vector store if you need semantic retrieval.
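For reference, the SQLite step-up can be far smaller than people expect — one table, no ORM, still zero external services. The schema and names below are illustrative:

```python
import sqlite3

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the facts database. Pass a file path to persist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts ("
        " user_id TEXT NOT NULL,"
        " fact TEXT NOT NULL,"
        " created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn

def add_fact(conn: sqlite3.Connection, user_id: str, fact: str) -> None:
    conn.execute("INSERT INTO facts (user_id, fact) VALUES (?, ?)", (user_id, fact))
    conn.commit()

def facts_for(conn: sqlite3.Connection, user_id: str) -> list[str]:
    rows = conn.execute(
        "SELECT fact FROM facts WHERE user_id = ? ORDER BY rowid", (user_id,)
    ).fetchall()
    return [r[0] for r in rows]
```

You get per-user isolation and concurrent-read safety essentially for free, and the migration path from the JSON schema above is a straight loop over `known_facts`.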

For solo founders shipping something fast: start with JSON state files. You can add the compression layer when conversations get long, and you can migrate to a real database when you actually need concurrent users — not before. The premature infrastructure jump to vector DBs has killed more prototypes than it’s saved.

The key insight about stateful agent memory is that “memory” is just structured context injection — you’re always working within the constraints of what Claude can see at inference time. The strategies above are different ways of deciding what to keep, what to compress, and what to discard. Get that decision right for your use case, and you can build surprisingly capable agents without any external dependencies at all.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes.
