By the end of this tutorial, you’ll have a working Claude agent that remembers user preferences, past decisions, and conversation summaries across completely separate sessions — stored in plain JSON files, with zero database setup. Claude agent persistent memory doesn’t require Redis, Postgres, or any external infrastructure. You just need a smart context injection strategy and about 80 lines of Python.
Most tutorials skip straight to vector databases when “memory” comes up. That’s overkill for 90% of use cases. If you’re building a personal assistant, a sales agent that tracks prospect history, or any agent handling repeat users, flat file storage with structured summaries is faster to ship, cheaper to run, and easier to debug.
- Install dependencies — set up the Anthropic SDK and folder structure
- Design the memory schema — define what gets stored and how it’s structured
- Build the memory manager — read/write/summarize session state to disk
- Inject memory into the system prompt — wire context into each Claude call
- Extract and persist new memories — have Claude update its own memory after each session
- Test across sessions — verify continuity with a real conversation loop
Step 1: Install Dependencies
You need the anthropic SDK for the core logic, plus python-dotenv to load your API key; pathlib and json are stdlib. Python 3.10+ is recommended.
```bash
pip install anthropic python-dotenv
```
Create your project structure:
```bash
mkdir claude-memory-agent && cd claude-memory-agent
mkdir memory
touch agent.py memory_manager.py .env
```
Add your API key to .env:
```
ANTHROPIC_API_KEY=sk-ant-...
```
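Before running anything, it's worth confirming the key actually parses out of your .env file. A quick dependency-free sketch (the helper name is my own, not part of the tutorial's files):

```python
def env_has_key(text: str, key: str = "ANTHROPIC_API_KEY") -> bool:
    """Return True if the given .env-style text defines a non-empty value for `key`."""
    for line in text.splitlines():
        name, _, value = line.partition("=")
        if name.strip() == key and value.strip():
            return True
    return False

sample = "ANTHROPIC_API_KEY=sk-ant-xxxx"
print(env_has_key(sample))  # True
```

Run it against `Path(".env").read_text()` if you want to check the real file.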
Step 2: Design the Memory Schema
Don’t store raw conversation logs — that wastes tokens fast. Store distilled facts: preferences, decisions, key context, and a rolling summary. Here’s the schema we’ll use:
```json
{
  "user_id": "alice",
  "last_session": "2024-01-15T14:32:00",
  "session_count": 7,
  "preferences": {
    "communication_style": "direct, no fluff",
    "timezone": "UTC+1",
    "primary_goal": "launch SaaS product by Q2"
  },
  "key_facts": [
    "Using Python and FastAPI for backend",
    "Has a team of 2 engineers",
    "Previously rejected idea of adding Redis to stack"
  ],
  "running_summary": "Alice is building a B2B SaaS tool for contract review. She wants weekly progress check-ins and prefers bullet-pointed responses. Last session she finalized the pricing page copy.",
  "recent_decisions": [
    {"topic": "pricing model", "decision": "usage-based at $0.02/doc", "date": "2024-01-14"}
  ]
}
```
The running_summary is the key piece — it compresses prior sessions into a single paragraph that fits comfortably in the system prompt. This is the same principle discussed in our article on LLM caching strategies for cutting API costs, where reusing structured context across calls saves both money and latency.
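If you want editor autocomplete and a light sanity check on this shape, the schema maps cleanly onto a TypedDict. A sketch (the type names are my own, not part of the tutorial's code):

```python
from typing import Optional, TypedDict

class Decision(TypedDict):
    topic: str
    decision: str
    date: str

class UserMemory(TypedDict):
    user_id: str
    last_session: Optional[str]  # ISO 8601 timestamp; None for new users
    session_count: int
    preferences: dict[str, str]
    key_facts: list[str]
    running_summary: str
    recent_decisions: list[Decision]

# The blank-slate template from Step 3, typed
blank: UserMemory = {
    "user_id": "alice",
    "last_session": None,
    "session_count": 0,
    "preferences": {},
    "key_facts": [],
    "running_summary": "",
    "recent_decisions": [],
}
print(sorted(blank.keys()))
```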
Step 3: Build the Memory Manager
```python
# memory_manager.py
import json
from datetime import datetime, timezone
from pathlib import Path

MEMORY_DIR = Path("memory")

def load_memory(user_id: str) -> dict:
    """Load user memory from disk, or return a blank slate."""
    path = MEMORY_DIR / f"{user_id}.json"
    if path.exists():
        with open(path) as f:
            return json.load(f)
    # First-time user — return empty template
    return {
        "user_id": user_id,
        "last_session": None,
        "session_count": 0,
        "preferences": {},
        "key_facts": [],
        "running_summary": "",
        "recent_decisions": []
    }

def save_memory(user_id: str, memory: dict) -> None:
    """Persist updated memory to disk."""
    MEMORY_DIR.mkdir(exist_ok=True)
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
    memory["last_session"] = datetime.now(timezone.utc).isoformat()
    memory["session_count"] = memory.get("session_count", 0) + 1
    path = MEMORY_DIR / f"{user_id}.json"
    with open(path, "w") as f:
        json.dump(memory, f, indent=2)

def format_memory_for_prompt(memory: dict) -> str:
    """Convert memory dict into a compact string for the system prompt."""
    if not memory.get("running_summary") and not memory.get("key_facts"):
        return "No prior context available. This is a new user."
    lines = ["## User Context (from previous sessions)"]
    if memory["running_summary"]:
        lines.append(f"**Summary:** {memory['running_summary']}")
    if memory["key_facts"]:
        lines.append("**Key facts:**")
        for fact in memory["key_facts"][-5:]:  # keep last 5 to avoid bloat
            lines.append(f"- {fact}")
    if memory["preferences"]:
        prefs = ", ".join(f"{k}: {v}" for k, v in memory["preferences"].items())
        lines.append(f"**Preferences:** {prefs}")
    if memory.get("recent_decisions"):
        last = memory["recent_decisions"][-1]
        lines.append(f"**Last decision:** {last['topic']} → {last['decision']}")
    return "\n".join(lines)
```
Step 4: Inject Memory into the System Prompt
This is where most implementations go wrong — they dump the entire memory blob into the user message instead of the system prompt. Keeping it in the system prompt means Claude treats it as persistent background context rather than something the user said.
```python
# agent.py
from anthropic import Anthropic
from dotenv import load_dotenv

from memory_manager import load_memory, save_memory, format_memory_for_prompt

load_dotenv()
client = Anthropic()

BASE_SYSTEM_PROMPT = """You are a helpful assistant with persistent memory across sessions.
You remember the user's preferences, past decisions, and prior context.
Use the provided user context to give continuity — reference it naturally without being robotic about it.

{memory_context}
"""

def build_system_prompt(user_id: str, memory: dict) -> str:
    memory_context = format_memory_for_prompt(memory)
    return BASE_SYSTEM_PROMPT.format(memory_context=memory_context)

def run_session(user_id: str):
    memory = load_memory(user_id)
    system_prompt = build_system_prompt(user_id, memory)
    conversation_history = []

    print(f"\n--- Session {memory['session_count'] + 1} for {user_id} ---\n")
    if memory["running_summary"]:
        print(f"[Memory loaded: {memory['running_summary'][:80]}...]\n")

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ["exit", "quit", "bye"]:
            # Before exiting, extract and save updated memory
            update_memory_from_session(client, user_id, memory, conversation_history)
            print("Memory saved. Goodbye.")
            break

        conversation_history.append({"role": "user", "content": user_input})
        response = client.messages.create(
            model="claude-haiku-4-5",  # ~$0.0008 per 1K input tokens — cheap for this loop
            max_tokens=1024,
            system=system_prompt,
            messages=conversation_history
        )
        assistant_message = response.content[0].text
        conversation_history.append({"role": "assistant", "content": assistant_message})
        print(f"\nClaude: {assistant_message}\n")
```
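One caveat with this loop: conversation_history is re-sent in full on every call, so very long sessions get progressively more expensive. A sketch of an optional trimming helper (my own addition, not in the tutorial's code) that you could apply before each messages.create call:

```python
def trim_history(history: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep only the most recent messages, making sure the window starts on a
    user turn so the alternating user/assistant structure stays valid."""
    trimmed = history[-max_messages:]
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed

# 30 alternating turns: even indices are user, odd are assistant
history = [
    {"role": "user", "content": f"msg {i}"} if i % 2 == 0
    else {"role": "assistant", "content": f"msg {i}"}
    for i in range(30)
]
print(len(trim_history(history, max_messages=9)))  # 8 (leading assistant turn dropped)
```

Because the memory system already captures the distilled facts, dropping old turns loses less than you'd expect.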
Step 5: Extract and Persist New Memories
After each session, run a second Claude call — a cheap extraction pass — to update the memory file. This is the piece most people skip, and it’s what makes the system actually useful long-term.
```python
# agent.py (continued) — these imports belong at the top of the module
import json
import re

def update_memory_from_session(
    client: Anthropic,
    user_id: str,
    existing_memory: dict,
    conversation: list
) -> None:
    """Ask Claude to extract memory updates from the conversation."""
    if not conversation:
        return

    # Format conversation for extraction
    conv_text = "\n".join(
        f"{msg['role'].upper()}: {msg['content']}"
        for msg in conversation
    )

    extraction_prompt = f"""
You are a memory extraction assistant. Given a conversation, extract structured updates.

EXISTING MEMORY:
{json.dumps(existing_memory, indent=2)}

NEW CONVERSATION:
{conv_text}

Return a JSON object with these fields (only include fields that changed or have new info):
{{
  "running_summary": "Updated 2-3 sentence summary combining old and new context",
  "new_facts": ["any new key facts learned"],
  "updated_preferences": {{"key": "value"}},
  "new_decision": {{"topic": "...", "decision": "...", "date": "today"}}
}}

Return ONLY valid JSON. If nothing meaningful was learned, return an empty object {{}}.
"""

    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": extraction_prompt}]
    )

    raw = response.content[0].text.strip()
    # Strip markdown code fences if present
    raw = re.sub(r"^```(?:json)?\n?", "", raw)
    raw = re.sub(r"\n?```$", "", raw)

    try:
        updates = json.loads(raw)
    except json.JSONDecodeError:
        print("[Warning: memory extraction returned invalid JSON — skipping save]")
        return

    if not updates:
        save_memory(user_id, existing_memory)
        return

    # Merge updates into existing memory
    if "running_summary" in updates:
        existing_memory["running_summary"] = updates["running_summary"]
    if "new_facts" in updates:
        existing_memory["key_facts"].extend(updates["new_facts"])
        existing_memory["key_facts"] = existing_memory["key_facts"][-10:]  # cap at 10
    if "updated_preferences" in updates:
        existing_memory["preferences"].update(updates["updated_preferences"])
    if "new_decision" in updates and updates["new_decision"]:
        existing_memory["recent_decisions"].append(updates["new_decision"])
        existing_memory["recent_decisions"] = existing_memory["recent_decisions"][-5:]

    save_memory(user_id, existing_memory)
    print(f"[Memory updated for {user_id}]")
```
The extraction call on claude-haiku-4-5 costs roughly $0.0003–0.0008 per session depending on conversation length. Negligible. For production agents that handle hundreds of users, you’d want to look at managing LLM API costs at scale to track this across your fleet.
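You can exercise the fence-stripping logic from the extraction function in isolation. This standalone sketch mirrors the two re.sub calls above:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Strip optional markdown code fences, then parse the remaining JSON."""
    raw = raw.strip()
    raw = re.sub(r"^```(?:json)?\n?", "", raw)  # opening fence, with or without "json"
    raw = re.sub(r"\n?```$", "", raw)           # closing fence
    return json.loads(raw)

fenced = '```json\n{"running_summary": "Alice shipped the pricing page."}\n```'
print(parse_model_json(fenced))  # {'running_summary': 'Alice shipped the pricing page.'}
```

Bare JSON with no fences passes through the two substitutions untouched, so the same parser handles both cases.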
Step 6: Test Across Sessions
Add the entry point and run it twice to verify continuity:
```python
# bottom of agent.py
if __name__ == "__main__":
    user_id = input("Enter user ID: ").strip() or "default_user"
    run_session(user_id)
```
```bash
python agent.py
# Session 1: Tell Claude your name, your project, your preferred communication style
# Type 'exit' to save

python agent.py
# Session 2: Claude should reference what you told it — try asking "what do you remember about me?"
```
After two sessions you’ll see a memory/your_user_id.json file with the distilled facts from both conversations. The agent will open session 3 already knowing your project context without you repeating yourself.
This approach pairs well with more complex agent architectures — if you’re building multi-step workflows, see how advanced prompt chaining for Claude can extend this pattern into longer task pipelines.
Common Errors
Memory extraction returns malformed JSON
Claude occasionally wraps JSON in markdown code fences or adds explanation text before the object. The re.sub stripping in step 5 handles most cases, but if you’re seeing persistent failures, add "Return ONLY the JSON object, no explanation, no markdown" to your extraction prompt. If you need reliable structured output more broadly, the full treatment is covered in structured output mastery for Claude.
Memory file grows unbounded
The caps in the code (key_facts[-10:], recent_decisions[-5:]) prevent this for facts and decisions, but running_summary can slowly bloat if your extraction prompt lets Claude append instead of rewrite. Fix: explicitly instruct “rewrite the summary in 2-3 sentences max — compress, don’t append.” Add a hard character cap check before saving:
```python
if len(existing_memory["running_summary"]) > 500:
    existing_memory["running_summary"] = existing_memory["running_summary"][:500]
```
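A hard slice can cut mid-sentence. A slightly gentler variant (my own sketch) truncates at the last full sentence that fits under the cap:

```python
def cap_summary(summary: str, limit: int = 500) -> str:
    """Truncate at the last complete sentence within `limit` characters."""
    if len(summary) <= limit:
        return summary
    clipped = summary[:limit]
    cut = clipped.rfind(". ")  # last sentence boundary inside the window
    return clipped[: cut + 1] if cut != -1 else clipped

s = "First sentence. " * 50  # 800 characters
print(len(cap_summary(s)) <= 500)  # True
```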
Context doesn’t appear in responses
If Claude ignores the memory context, your system prompt injection isn’t working. Verify with a debug print before the API call:
```python
print("SYSTEM PROMPT:", system_prompt[:300])  # should show memory context
```
Also confirm you’re passing system= as a top-level parameter to client.messages.create — the Messages API does not accept a "system" role inside the messages list, so memory stuffed into messages as a system-role entry will be rejected. Pinning anthropic>=0.25.0 avoids older SDK quirks.
When This Approach Breaks Down
Flat file memory works until you hit one of these walls: multiple processes writing to the same user file simultaneously (race conditions), needing to search across all user memories semantically, or managing thousands of concurrent users where filesystem I/O becomes a bottleneck. At that point you want SQLite at minimum, or a proper vector store if you need semantic retrieval — see our comparison of Pinecone vs Weaviate vs Qdrant for RAG agents when you’re ready to graduate.
For anything under ~500 active users with non-concurrent sessions, file storage will hold up fine. A JSON file per user is also trivially portable — zip the memory folder and move it anywhere, no migration scripts required.
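That portability claim is easy to act on with the stdlib: one shutil call snapshots the whole memory folder (the paths below are illustrative):

```python
import shutil
import tempfile
from pathlib import Path

# Build a throwaway memory dir to demonstrate against
workdir = Path(tempfile.mkdtemp())
mem = workdir / "memory"
mem.mkdir()
(mem / "alice.json").write_text('{"user_id": "alice"}')

# One call produces a zip of the folder; no migration scripts needed
archive = shutil.make_archive(str(workdir / "memory_backup"), "zip", mem)
print(Path(archive).name)  # memory_backup.zip
```

In the real project you'd pass `"memory"` as the root directory instead of the temp path.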
What to Build Next
Add a memory confidence score. When Claude extracts facts, have it also emit a confidence level (high/medium/low) for each fact. During prompt injection, only include high-confidence facts automatically — surface medium-confidence ones with a note like “I think you mentioned X but I’m not certain.” This prevents the agent from reinforcing incorrect memories over time, which is the main failure mode in production assistants that handle sensitive user information.
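At injection time, that filtering could look something like this sketch (the confidence field and helper name are hypothetical, not part of the tutorial's schema):

```python
def facts_for_prompt(facts: list[dict]) -> list[str]:
    """Inject high-confidence facts directly, hedge medium ones, drop low ones."""
    lines = []
    for fact in facts:
        if fact["confidence"] == "high":
            lines.append(f"- {fact['text']}")
        elif fact["confidence"] == "medium":
            lines.append(f"- (uncertain) I think you mentioned: {fact['text']}")
        # low-confidence facts are omitted entirely
    return lines

facts = [
    {"text": "Uses FastAPI", "confidence": "high"},
    {"text": "Team of 2 engineers", "confidence": "medium"},
    {"text": "Lives in Berlin", "confidence": "low"},
]
print(len(facts_for_prompt(facts)))  # 2
```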
Frequently Asked Questions
How much does Claude agent persistent memory cost per session?
With claude-haiku-4-5, the memory extraction call at the end of each session costs roughly $0.0003–$0.0008 depending on conversation length. The system prompt injection adds ~200–400 tokens per call, which at Haiku pricing adds about $0.00016–$0.00032 per API call. For a 10-turn session, your total memory overhead is under $0.005.
Can I use this pattern with other LLMs besides Claude?
Yes — the memory manager and file storage are model-agnostic. Swap the Anthropic client for OpenAI or any other provider. The extraction prompt works with GPT-4o and most instruction-following models. You may need to adjust the JSON extraction prompt slightly if you see higher rates of malformed output from other models.
How do I handle multiple users in a web app with this approach?
Use the authenticated user’s ID (from your session or JWT) as the user_id parameter. Each user gets their own JSON file. For concurrent write safety in a web context, use a file lock library like filelock around the save_memory call. For more than ~200 concurrent active users, migrate to SQLite with a simple schema — the memory dict maps directly to a single table row.
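Even before adding a lock library, an atomic write (temp file plus os.replace) guarantees readers never see a half-written JSON file; it complements locking rather than replacing it. A sketch adapting the write step of save_memory (the demo paths are illustrative):

```python
import json
import os
import tempfile
from pathlib import Path

def save_memory_atomic(path: Path, memory: dict) -> None:
    """Write to a temp file in the same directory, then atomically swap it in."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(memory, f, indent=2)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)  # clean up the temp file if anything went wrong
        raise

target = Path(tempfile.mkdtemp()) / "alice.json"
save_memory_atomic(target, {"user_id": "alice"})
print(json.loads(target.read_text())["user_id"])  # alice
```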
What’s the difference between storing raw conversation history vs. summarized memory?
Raw history is accurate but expensive — 10 sessions of 20 turns each is 200 messages that inflate your context window and cost. Summarized memory is lossy but cheap: a 300-token memory block replaces thousands of tokens of history. For most agents, the distilled facts are more useful than verbatim history anyway. Use raw history only if your use case requires exact recall of prior messages (e.g., legal or compliance agents).
How do I prevent Claude from hallucinating memories that don’t exist?
Two things help: first, instruct the model explicitly in the system prompt to “only reference memories from the context provided, never invent history.” Second, during memory extraction, require a null value for fields where no relevant information was learned rather than having Claude fill in guesses. Hallucinated memory is rare but becomes a real issue in long-running assistants without these guardrails.
Put this into practice
Try the Database Admin agent — ready to use, no setup required.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.