If you've played with the agents from the last few days you've probably noticed something frustrating: every time you start a new conversation, the agent has no idea who you are or what you discussed before. That's not a bug — it's the default behaviour. LLMs are stateless; they only know what's in the current context window.
Memory is how we fix that. There are three distinct tiers, each with different trade-offs, and picking the wrong one is a common source of over-engineering (or under-engineering). Let's break them down.
The simplest form of memory is the conversation history you pass in the messages array every turn. The model "remembers" everything in that list.
```python
messages = [
    {"role": "user", "content": "My name is Hitesh."},
    {"role": "assistant", "content": "Hi Hitesh! How can I help?"},
    {"role": "user", "content": "What's my name?"},  # model will say "Hitesh"
]
```

Strengths: Zero setup. Works out of the box with every SDK.
Limitations: Bounded by the model's context window. claude-3-7-sonnet supports up to 200k tokens; gpt-4.1 up to 1M — but token cost scales linearly with history length, and very long conversations slow down inference. In-context memory also vanishes when the process restarts.
Use it when: you need memory only within a single session and the conversation will stay under a few hundred turns.
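If a session does start pushing against those limits, the usual mitigation is to trim the oldest turns before each call. Here's a minimal sketch; `MAX_HISTORY_TOKENS`, `estimate_tokens`, and `trim_history` are illustrative names, and the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer:

```python
MAX_HISTORY_TOKENS = 8_000  # illustrative budget, well under any model's limit

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return len(text) // 4

def trim_history(messages: list[dict]) -> list[dict]:
    """Drop the oldest turns until the remaining history fits the budget."""
    trimmed = list(messages)
    total = sum(estimate_tokens(m["content"]) for m in trimmed)
    while trimmed and total > MAX_HISTORY_TOKENS:
        total -= estimate_tokens(trimmed.pop(0)["content"])
    return trimmed
```

A real agent would usually summarise the dropped turns rather than discard them outright, which is exactly what the episodic tier below is for.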
For memory that persists across sessions, you need to write it somewhere outside the process. The two common shapes are:
Key-value store — good for structured facts: user preferences, entity properties, configuration. Fast reads, no semantic search.
Vector store — good for unstructured text you want to query by meaning: past conversations, documents, notes. Slightly more setup, but lets you ask "what have we talked about that's relevant to X?" rather than "give me the record for key Y". A toy sketch of the shape follows below.
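Here's that sketch, an in-memory stand-in where `embed()` is a placeholder for whatever embedding API you use (Day 9 wires up a real one) and cosine similarity does the "query by meaning":

```python
import math

# store holds (text, embedding) pairs; embed() stands in for a real embedding call
store: list[tuple[str, list[float]]] = []

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def add_note(text: str, embed) -> None:
    store.append((text, embed(text)))

def search(query: str, embed, k: int = 3) -> list[str]:
    """Return the k stored texts most similar in meaning to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```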
Here's a minimal persistent memory store backed by a JSON file. It's adequate for low-volume use and easy to swap for Redis or DynamoDB later.
```python
import json
from datetime import datetime, timezone
from pathlib import Path

MEMORY_PATH = Path("agent_memory.json")

def load_memory() -> dict:
    if MEMORY_PATH.exists():
        return json.loads(MEMORY_PATH.read_text())
    return {}

def save_memory(memory: dict) -> None:
    MEMORY_PATH.write_text(json.dumps(memory, indent=2))

def remember(key: str, value: str) -> None:
    """Store a fact in long-term memory."""
    memory = load_memory()
    memory[key] = {"value": value, "updated_at": datetime.now(timezone.utc).isoformat()}
    save_memory(memory)

def recall(key: str) -> str | None:
    """Retrieve a fact from long-term memory."""
    memory = load_memory()
    entry = memory.get(key)
    return entry["value"] if entry else None
```

Expose `remember` and `recall` as tools so the agent can decide what to store and when to look things up:
```python
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "remember",
        "description": "Store an important fact about the user or the current project for future sessions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "key": {"type": "string", "description": "A short identifier, e.g. 'user_name' or 'preferred_language'"},
                "value": {"type": "string", "description": "The fact to store"},
            },
            "required": ["key", "value"],
        },
    },
    {
        "name": "recall",
        "description": "Look up a previously stored fact by key.",
        "input_schema": {
            "type": "object",
            "properties": {
                "key": {"type": "string"},
            },
            "required": ["key"],
        },
    },
]

def run_tool(name: str, inputs: dict) -> str:
    if name == "remember":
        remember(inputs["key"], inputs["value"])
        return f"Stored: {inputs['key']} = {inputs['value']}"
    if name == "recall":
        value = recall(inputs["key"])
        return value if value else f"No memory found for key '{inputs['key']}'"
    raise ValueError(f"Unknown tool: {name}")
```

Now the agent can say "I'll remember your preferred language is TypeScript" and it will persist across restarts.
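To see these fire end to end, here's a condensed sketch of the tool-use dispatch loop; the structure mirrors the agent loop from earlier in the series, `chat` is a hypothetical wrapper, and error handling is omitted:

```python
def chat(user_text: str) -> str:
    """One user turn, looping until Claude stops asking for tools."""
    messages = [{"role": "user", "content": user_text}]
    while True:
        response = client.messages.create(
            model="claude-3-7-sonnet-20250219",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text
        # Run every requested tool and feed the results back to the model.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```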
Episodic memory is for retaining the gist of past conversations — not every token, but a compressed summary that can be injected into the context of a new session.
The pattern:
```python
def summarise_session(messages: list[dict]) -> str:
    """Ask Claude to produce a one-paragraph episode summary."""
    conversation_text = "\n".join(
        f"{m['role'].upper()}: {m['content']}" for m in messages
    )
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=256,
        messages=[
            {
                "role": "user",
                "content": f"Summarise this conversation in 2–3 sentences, capturing key decisions and context:\n\n{conversation_text}",
            }
        ],
    )
    return response.content[0].text

def build_system_prompt(user_id: str) -> str:
    episodes = recall(f"episodes_{user_id}") or "No previous sessions."
    return f"""You are a helpful assistant.
Previous session context:
{episodes}
Use this context when relevant, but don't mention it unless asked."""
```

This keeps your context window lean while preserving continuity across sessions.
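The missing piece is the end-of-session hook that writes the episode. One way to close the loop, reusing `remember`/`recall` and the datetime imports from the JSON store above (`end_session` and the `keep` cap are hypothetical names, not part of any SDK):

```python
def end_session(user_id: str, messages: list[dict], keep: int = 5) -> None:
    """Summarise the finished session and append it to the user's episodes."""
    summary = summarise_session(messages)
    existing = recall(f"episodes_{user_id}")
    episodes = existing.split("\n---\n") if existing else []
    episodes.append(f"[{datetime.now(timezone.utc).date()}] {summary}")
    # Cap the list so the system prompt stays small.
    remember(f"episodes_{user_id}", "\n---\n".join(episodes[-keep:]))
```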
| Scenario | Recommended tier |
|---|---|
| Multi-turn chat within one session | In-context (messages array) |
| User preferences / profile facts | Long-term key-value |
| "What did we discuss yesterday?" | Episodic summaries |
| "Find docs relevant to my question" | Vector store (see Day 9) |
| All of the above in production | All three, layered |
The JSON file store above is a drop-in for low traffic. When you're ready to scale, swap `json.loads(MEMORY_PATH.read_text())` for `redis.get(key)` and the corresponding write for `redis.set`. No other changes to the tool interface.
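As a rough sketch of that swap, assuming the `redis-py` package and a reachable Redis server (the `memory:` key prefix is an illustrative convention, not required):

```python
import json
from datetime import datetime, timezone

import redis  # assumes the redis-py package and a running Redis instance

r = redis.Redis(decode_responses=True)

def remember(key: str, value: str) -> None:
    """Same interface as the JSON version, one Redis call instead of a file write."""
    entry = {"value": value, "updated_at": datetime.now(timezone.utc).isoformat()}
    r.set(f"memory:{key}", json.dumps(entry))

def recall(key: str) -> str | None:
    raw = r.get(f"memory:{key}")
    return json.loads(raw)["value"] if raw else None
```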
Tomorrow we leave hosted models and go local: NVIDIA Nemotron 3 running on your own machine via Ollama, wired into the same agent loop with an OpenAI-compatible endpoint.