Yesterday we used OpenAI's File Search to give an assistant access to documents. File Search is convenient, but it's a black box — you can't control how documents are chunked, which embedding model is used, or how results are ranked. When those things matter (and in production they often do), you build your own RAG pipeline.
RAG stands for Retrieval-Augmented Generation. The idea is simple: instead of hoping the LLM's training data contains the answer, you retrieve relevant text from your own documents at query time and include it in the context. The model reasons over your actual data rather than over whatever it may (or may not) have memorized during training.
Why not just paste all your documents into the prompt? You could. Claude 3.7 Sonnet accepts 200k tokens of context — several hundred pages of text. But:

- You pay for every one of those tokens on every request.
- Latency grows with prompt size.
- Irrelevant text is noise, and important details can get buried in it.

RAG retrieves only what's relevant — typically 3–10 chunks — keeping cost, latency, and noise low.
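To make the cost argument concrete, here's a back-of-envelope comparison. The $3-per-million-input-tokens figure is an assumption (Claude 3.7 Sonnet's list price at the time of writing); substitute your own numbers:

```python
# Rough cost comparison: stuffing the full context vs. retrieving chunks.
# Assumes $3 per million input tokens; adjust for your model's pricing.
PRICE_PER_TOKEN = 3 / 1_000_000

full_context_cost = 200_000 * PRICE_PER_TOKEN   # ~ $0.60 per question
rag_cost = (4 * 400 + 100) * PRICE_PER_TOKEN    # 4 chunks + query: ~ $0.005

print(f"full context: ${full_context_cost:.2f}  rag: ${rag_cost:.4f}")
```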
```
Documents → Chunk → Embed → Store in vector DB
                                    ↓
Query → Embed → Similarity search → Top-k chunks → LLM → Answer
```

Every RAG implementation follows this shape. We'll use:
- ChromaDB — a local, persistent vector database.
- text-embedding-3-small — fast, cheap, good quality. (Or use a local embedding model via Ollama.)
- Claude — the model that answers questions and decides when to retrieve.

Install the pieces:

```bash
pip install chromadb anthropic openai
```

Chunking strategy affects quality more than most developers expect. A few rules of thumb, all reflected in the splitter below:

- Keep chunks small enough to be specific but large enough to stand alone; a few hundred tokens is a common sweet spot.
- Overlap adjacent chunks (this splitter repeats 50 words) so facts that straddle a boundary survive intact.
- Prefer natural boundaries (paragraphs, sections) when you can; plain word counts are the simplest thing that works.
```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by approximate token count.

    Splits on whitespace, so sizes are really word counts, which is a
    decent proxy for tokens in English prose.
    """
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        # Advance by less than a full chunk so consecutive chunks
        # share `overlap` words across the boundary.
        start += chunk_size - overlap
    return chunks
```
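A quick sanity check of the overlap behaviour, using synthetic filler words:

```python
# 1,000 words with chunk_size=400 and overlap=50 yields three chunks:
# words 0-399, 350-749, and 700-999.
sample = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_text(sample)
print(len(chunks))           # 3
print(chunks[1].split()[0])  # "word350": repeats the tail of chunk 0
```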
```python
import chromadb
from openai import OpenAI

openai_client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("knowledge_base")

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts with OpenAI's embeddings endpoint."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

def ingest_document(doc_id: str, text: str) -> None:
    """Chunk, embed, and store a document."""
    chunks = chunk_text(text)
    embeddings = embed(chunks)
    collection.add(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source": doc_id} for _ in chunks],
    )
    print(f"Ingested {len(chunks)} chunks from {doc_id}")
```

PersistentClient writes to disk — your vector index survives process restarts.
```python
def retrieve(query: str, n_results: int = 4) -> list[str]:
    """Return the top-k most relevant chunks for a query."""
    query_embedding = embed([query])[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
    )
    return results["documents"][0]  # list of chunk strings
```

Expose retrieve as a tool so the agent decides when to consult the knowledge base:
```python
import anthropic

claude = anthropic.Anthropic()

tools = [
    {
        "name": "search_knowledge_base",
        "description": (
            "Search the internal knowledge base for information relevant to the query. "
            "Use this before answering any question that may require specific facts, "
            "documentation, or internal procedures."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query — phrase it as a question or key concept.",
                },
            },
            "required": ["query"],
        },
    }
]
```
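Before reading the loop below, it helps to know what a tool call looks like. When Claude decides to search, `response.content` includes a `tool_use` block, roughly shaped like this (the id value is made up for illustration):

```python
# Shape of the tool_use content block the SDK returns
# (illustrative values; the id here is fabricated):
#
#   ToolUseBlock(
#       type="tool_use",
#       id="toolu_01XYZ...",                     # echoed back as tool_use_id
#       name="search_knowledge_base",
#       input={"query": "2FA reset procedure"},
#   )
```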
```python
def run_agent(user_question: str) -> str:
    messages = [{"role": "user", "content": user_question}]
    while True:
        response = claude.messages.create(
            model="claude-3-7-sonnet-20250219",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason == "tool_use":
            tool_use = next(b for b in response.content if b.type == "tool_use")
            chunks = retrieve(tool_use.input["query"])
            context = "\n\n---\n\n".join(chunks)
            # Echo the assistant turn back, then supply the tool result.
            messages.append({"role": "assistant", "content": response.content})
            messages.append({
                "role": "user",
                "content": [
                    {
                        "type": "tool_result",
                        "tool_use_id": tool_use.id,
                        "content": context,
                    }
                ],
            })
            continue
        # end_turn (or max_tokens): return whatever text the model produced,
        # skipping any non-text blocks, and never loop forever.
        return "".join(b.text for b in response.content if b.type == "text")
```

The agent decides when retrieval is needed. If the answer is obvious from context, it skips the tool call entirely — no wasted embedding calls.
```python
# Ingest some documents once
ingest_document("onboarding", open("docs/onboarding.txt").read())
ingest_document("api_reference", open("docs/api.txt").read())

# Then answer questions
answer = run_agent("What's the process for resetting a user's 2FA?")
print(answer)
```

The ChromaDB setup above handles millions of vectors comfortably on a laptop. Upgrade to a hosted vector database when your needs outgrow a single process, for example when several services must share the index or the corpus no longer fits on one machine.
The retrieval function signature (query: str → list[str]) stays identical regardless of the backend — swapping is a one-line change.
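One way to keep that contract explicit is a structural type. A minimal sketch; the `Retriever` protocol and the `retrieve_hosted` stub are hypothetical, not part of the pipeline above:

```python
from typing import Protocol

class Retriever(Protocol):
    """Anything callable as (query, n_results) -> list of chunk strings."""
    def __call__(self, query: str, n_results: int = 4) -> list[str]: ...

def retrieve_hosted(query: str, n_results: int = 4) -> list[str]:
    """Hypothetical drop-in replacement backed by a hosted vector DB."""
    ...  # query your hosted index here and return the top chunks

# run_agent never changes; point it at whichever backend you like:
search: Retriever = retrieve_hosted
```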
Tomorrow we connect multiple specialised agents together using LangGraph — a graph-based orchestration library that makes multi-agent pipelines easy to reason about and extend.