RAG for Agents: Give Your Agent a Long-Term Brain

Retrieval-Augmented Generation (RAG) turns a static document collection into a dynamic tool any agent can query. This post builds a full local RAG pipeline — chunking, embedding, storing in ChromaDB — and wires it into a Claude agent as a callable tool, all without any hosted infrastructure.


Yesterday we used OpenAI's File Search to give an assistant access to documents. File Search is convenient, but it's a black box — you can't control how documents are chunked, which embedding model is used, or how results are ranked. When those things matter (and in production they often do), you build your own RAG pipeline.

RAG stands for Retrieval-Augmented Generation. The idea is simple: instead of hoping the LLM's training data contains the answer, you retrieve relevant text from your own documents at query time and include it in the context. The model reasons over your data rather than over whatever it may have half-remembered from training.

Why Not Just Stuff Documents Into the Context?

You could. claude-3-7-sonnet supports a 200k-token context window — roughly 150,000 words, or around 500 pages of text. But:

  • Cost scales linearly with tokens. At $3/M input tokens, 200k tokens per query costs $0.60 per call.
  • Latency increases with context length. A 200k-token prompt is slower than a 2k-token one.
  • Quality degrades with noise. Models perform better when given the 3 most relevant paragraphs than when given 300 pages and asked to find the needle.

RAG retrieves only what's relevant — typically 3–10 chunks — keeping cost, latency, and noise low.
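
To put numbers on the cost point, here's a back-of-the-envelope comparison (the chunk count, chunk size, and prompt overhead below are illustrative assumptions, not measurements):

# Back-of-the-envelope: full-context stuffing vs. retrieving a handful of chunks.
PRICE_PER_MTOK = 3.00            # $ per million input tokens (the rate quoted above)

full_context_tokens = 200_000    # the whole corpus in every prompt
rag_tokens = 5 * 400 + 500       # ~5 chunks of ~400 tokens plus the question and instructions

def cost(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_MTOK

print(f"Full context: ${cost(full_context_tokens):.2f} per query")   # $0.60
print(f"RAG:          ${cost(rag_tokens):.4f} per query")            # $0.0075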

The Pipeline

Documents → Chunk → Embed → Store in vector DB

Query → Embed → Similarity search → Top-k chunks → LLM → Answer

Every RAG implementation follows this shape. We'll use:

  • ChromaDB — a local, in-process vector database. No server, no API key, no Docker.
  • OpenAI text-embedding-3-small — fast, cheap, good quality. (Or use a local embedding model via Ollama.)
  • Claude as the reasoning model.

Install Dependencies

Terminal
pip install chromadb anthropic openai
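
Both SDK clients read their API keys from environment variables, so export those before running anything below (the values shown are placeholders, not real keys):

Terminal
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."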

Step 1: Chunk Your Documents

Chunking strategy affects quality more than most developers expect. A few rules of thumb:

  • Chunks should be semantically coherent — prefer paragraph boundaries over fixed character counts (a paragraph-aware variant is sketched after the code below).
  • Overlap between chunks (e.g. 50 tokens) means a sentence cut at a chunk boundary usually survives intact in the next chunk, so nothing is lost at the seams.
  • For most use cases, 256–512 tokens per chunk is a reasonable starting point.

def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by approximate token count."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap
    return chunks
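
If your source text has clean paragraph breaks, you can honour the first rule of thumb with a paragraph-aware splitter. The sketch below is an optional alternative, not part of the pipeline above; it packs whole paragraphs into chunks of roughly the target size:

def chunk_by_paragraph(text: str, max_words: int = 400) -> list[str]:
    """Pack whole paragraphs into chunks of up to ~max_words words each."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        # A single paragraph longer than max_words becomes its own oversized chunk.
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

For sources without reliable paragraph structure (chat logs, OCR output), fall back to chunk_text.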

Step 2: Embed and Store in ChromaDB

import chromadb
from openai import OpenAI
 
openai_client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("knowledge_base")
 
def embed(texts: list[str]) -> list[list[float]]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]
 
def ingest_document(doc_id: str, text: str) -> None:
    """Chunk, embed, and store a document."""
    chunks = chunk_text(text)
    embeddings = embed(chunks)
    collection.add(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source": doc_id} for _ in chunks],
    )
    print(f"Ingested {len(chunks)} chunks from {doc_id}")

PersistentClient writes to disk — your vector index survives process restarts.
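
One caveat: calling ingest_document twice with the same doc_id tries to add the same chunk IDs again, which Chroma either rejects or silently skips depending on the version. If you expect to re-ingest updated documents, here's a sketch of a safer pattern (delete the old chunks, then upsert the new ones):

def reingest_document(doc_id: str, text: str) -> None:
    """Re-chunk a document and overwrite its existing entries."""
    # Remove stale chunks first, in case the new version produces fewer of them.
    collection.delete(where={"source": doc_id})
    chunks = chunk_text(text)
    collection.upsert(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embed(chunks),
        metadatas=[{"source": doc_id} for _ in chunks],
    )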

Step 3: Retrieve Relevant Chunks

def retrieve(query: str, n_results: int = 4) -> list[str]:
    """Return the top-k most relevant chunks for a query."""
    query_embedding = embed([query])[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
    )
    return results["documents"][0]  # list of chunk strings
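
If you want answers that can cite their sources, query can also return metadata and distances. A small variant of retrieve, reusing the same collection and embed helpers:

def retrieve_with_sources(query: str, n_results: int = 4) -> list[tuple[str, str]]:
    """Return (chunk, source) pairs so answers can point back to their origin."""
    query_embedding = embed([query])[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "metadatas"],
    )
    return [
        (doc, meta["source"])
        for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    ]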

Step 4: Wire Retrieval as a Claude Tool

Expose retrieve as a tool so the agent decides when to consult the knowledge base:

import anthropic
 
claude = anthropic.Anthropic()
 
tools = [
    {
        "name": "search_knowledge_base",
        "description": (
            "Search the internal knowledge base for information relevant to the query. "
            "Use this before answering any question that may require specific facts, "
            "documentation, or internal procedures."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query — phrase it as a question or key concept.",
                },
            },
            "required": ["query"],
        },
    }
]
 
def run_agent(user_question: str) -> str:
    messages = [{"role": "user", "content": user_question}]
 
    while True:
        response = claude.messages.create(
            model="claude-3-7-sonnet-20250219",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
 
        if response.stop_reason == "end_turn":
            return response.content[0].text
 
        if response.stop_reason == "tool_use":
            tool_use = next(b for b in response.content if b.type == "tool_use")
            chunks = retrieve(tool_use.input["query"])
            context = "\n\n---\n\n".join(chunks)
 
            messages.append({"role": "assistant", "content": response.content})
            messages.append({
                "role": "user",
                "content": [
                    {
                        "type": "tool_result",
                        "tool_use_id": tool_use.id,
                        "content": context,
                    }
                ],
            })
            continue

        # Any other stop_reason (e.g. max_tokens) would otherwise loop forever.
        raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")

The agent decides when retrieval is needed. If the answer is obvious from context, it skips the tool call entirely — no wasted embedding calls.

Putting It Together

from pathlib import Path

# Ingest some documents once
ingest_document("onboarding", Path("docs/onboarding.txt").read_text())
ingest_document("api_reference", Path("docs/api.txt").read_text())
 
# Then answer questions
answer = run_agent("What's the process for resetting a user's 2FA?")
print(answer)

When to Upgrade to a Hosted Vector Store

The ChromaDB setup above comfortably handles collections of hundreds of thousands of vectors on a laptop, and can stretch into the low millions. Upgrade when:

  • You need multi-tenant isolation per user or workspace (Pinecone namespaces, pgvector row-level security)
  • You need the index updated by other services or processes in real time, not just by the process that owns the local files (Weaviate, Qdrant)
  • You're running on serverless infrastructure where local disk isn't persistent (Pinecone, Turso)

The retrieval function signature (query: str → list[str]) stays identical regardless of the backend — swapping means rewriting only the body of retrieve.
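
If you want to make that seam explicit, one option (purely illustrative, not part of the code above) is to type the retriever as a small Protocol, so a pgvector- or Qdrant-backed implementation can slot in without touching the agent loop:

from typing import Protocol

class Retriever(Protocol):
    """Anything that maps a query string to a ranked list of chunk strings."""
    def __call__(self, query: str, n_results: int = 4) -> list[str]: ...

# run_agent would then accept any Retriever, e.g.:
#     def run_agent(user_question: str, retrieve: Retriever) -> str: ...
# The ChromaDB retrieve() defined earlier already satisfies this shape.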

What's Next

Tomorrow we connect multiple specialised agents together using LangGraph — a graph-based orchestration library that makes multi-agent pipelines easy to reason about and extend.