By the end of this tutorial, you’ll have a working RAG pipeline for Claude agents that ingests PDFs, chunks them intelligently, embeds them into a vector store, retrieves relevant context at query time, and feeds that context into a Claude agent response. Every step includes code that actually runs — not pseudocode.
This is the pipeline I’d use for a production customer support bot, internal documentation search, or any agent that needs to answer questions grounded in your own documents. It avoids LangChain abstractions where they add friction, uses sentence-transformers for embeddings (free, fast, good enough), and ChromaDB for local vector storage that you can swap for Pinecone or Qdrant later.
- Install dependencies — set up your Python environment with the required libraries
- Extract and chunk PDFs — parse PDF text and split into semantically coherent chunks
- Generate embeddings — embed chunks using a local sentence-transformer model
- Store in ChromaDB — persist vectors with metadata for filtering
- Build the retrieval function — query the store and rank results by relevance
- Integrate with Claude — pass retrieved context into a Claude agent prompt
Step 1: Install Dependencies
You need Python 3.10+. Create a virtual environment first — don't skip this; the dependency graph here is messy.
```bash
pip install anthropic chromadb sentence-transformers pymupdf tiktoken
```
What each one does: anthropic for Claude API calls, chromadb for local vector storage, sentence-transformers for generating embeddings locally (no API cost), pymupdf (imported as fitz) for PDF extraction, and tiktoken for accurate token counting when sizing chunks.
```python
import fitz  # pymupdf
import chromadb
import anthropic
import tiktoken
from sentence_transformers import SentenceTransformer

# Initialize once at module level — loading the model takes ~2s
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
tokenizer = tiktoken.get_encoding("cl100k_base")
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env
```
Step 2: Extract and Chunk PDFs
PDF extraction is where most tutorials lie to you. Raw PDF text is messy — headers bleed into body copy, columns merge, footnotes appear mid-paragraph. PyMuPDF is the most reliable free option I’ve found for text extraction.
For chunking strategy: fixed-size with overlap beats paragraph splitting for most document types. Paragraph splits look cleaner but produce wildly uneven chunk sizes, which makes retrieval unpredictable. Use 512-token chunks with a 50-token overlap.
```python
def extract_pdf_text(pdf_path: str) -> str:
    """Extract raw text from a PDF file using PyMuPDF."""
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text("text")  # "text" mode preserves reading order
    doc.close()
    return text


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """
    Split text into token-aware chunks with overlap.
    overlap preserves context at chunk boundaries.
    """
    tokens = tokenizer.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk = tokenizer.decode(tokens[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap  # slide window with overlap
    return chunks
```
One thing the docs don’t tell you: PyMuPDF’s get_text("blocks") mode gives you bounding boxes per text block, which is useful if you need to preserve section structure. For most RAG use cases, the plain "text" mode is fine.
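If you do want to experiment with block-level output, each block from `page.get_text("blocks")` is a tuple `(x0, y0, x1, y1, text, block_no, block_type)`, with block_type 0 for text and 1 for images. The helper below is a sketch of restoring reading order from those tuples — the simple top-to-bottom, left-to-right sort is an assumption that works for single-column layouts, not true multi-column pages:

```python
# Sketch: reading-order join for PyMuPDF block tuples.
# Each block is (x0, y0, x1, y1, text, block_no, block_type);
# block_type 0 = text, 1 = image.

def blocks_to_text(blocks: list[tuple]) -> str:
    """Join text blocks top-to-bottom, left-to-right; drop image blocks."""
    text_blocks = [b for b in blocks if b[6] == 0]
    # Round y so blocks on the same visual line sort left-to-right
    text_blocks.sort(key=lambda b: (round(b[1]), b[0]))
    return "\n".join(b[4].strip() for b in text_blocks)

# With a real page (assuming PyMuPDF is installed):
# doc = fitz.open("your_document.pdf")
# ordered_text = blocks_to_text(doc[0].get_text("blocks"))
```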
Step 3: Generate Embeddings
all-MiniLM-L6-v2 produces 384-dimensional embeddings and runs at ~2,000 sentences/second on CPU. For most document collections under 100k chunks, this is fast enough and costs $0. If you're embedding millions of chunks or need better recall, look at all-mpnet-base-v2 or OpenAI's text-embedding-3-small (~$0.02 per million tokens); for multilingual corpora, consider a multilingual sentence-transformer such as paraphrase-multilingual-MiniLM-L12-v2.
For a deeper look at embedding choices and how they affect retrieval quality, the guide on semantic search for agent knowledge bases covers tuning strategies that apply directly here.
```python
def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """
    Generate embeddings for a list of text chunks.
    Returns a list of float vectors — each vector is 384 dims for MiniLM.
    """
    # encode() handles batching internally; no need to batch manually
    embeddings = embedding_model.encode(
        chunks,
        batch_size=64,  # tune based on available RAM
        show_progress_bar=True,
        normalize_embeddings=True,  # cosine similarity works better normalized
    )
    return embeddings.tolist()
```
Step 4: Store in ChromaDB
ChromaDB is the right choice for local development and small-to-medium production deployments (under ~1M vectors). For scale beyond that, check out the Pinecone vs Qdrant vs Weaviate comparison — Qdrant wins on self-hosted performance, Pinecone on managed simplicity.
```python
def build_knowledge_base(pdf_path: str, collection_name: str = "documents") -> chromadb.Collection:
    """
    Full ingestion pipeline: PDF -> chunks -> embeddings -> ChromaDB.
    Returns the collection handle for querying.
    """
    chroma_client = chromadb.PersistentClient(path="./chroma_db")  # persists to disk

    # Delete existing collection if re-ingesting (avoids duplicate IDs)
    try:
        chroma_client.delete_collection(collection_name)
    except Exception:
        pass

    collection = chroma_client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},  # cosine distance for normalized embeddings
    )

    raw_text = extract_pdf_text(pdf_path)
    chunks = chunk_text(raw_text)
    embeddings = embed_chunks(chunks)

    # ChromaDB requires unique string IDs per document
    ids = [f"chunk_{i}" for i in range(len(chunks))]
    # Store metadata alongside each chunk — useful for filtering later
    metadatas = [{"source": pdf_path, "chunk_index": i} for i in range(len(chunks))]

    # Batch insert in groups of 500 to avoid memory issues on large docs
    batch_size = 500
    for i in range(0, len(chunks), batch_size):
        collection.add(
            ids=ids[i:i + batch_size],
            embeddings=embeddings[i:i + batch_size],
            documents=chunks[i:i + batch_size],
            metadatas=metadatas[i:i + batch_size],
        )

    print(f"Ingested {len(chunks)} chunks from {pdf_path}")
    return collection
```
Step 5: Build the Retrieval Function
Retrieval is where you decide what context Claude actually sees. The two dials that matter most: how many chunks to retrieve (top-k) and whether to re-rank them.
For a 60-minute build, retrieve top 5 chunks and skip re-ranking. In production, I’d add a cross-encoder re-ranker — it adds ~100ms latency but meaningfully improves precision on ambiguous queries.
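As a sketch of what that production step looks like, the re-ranking logic can be separated from the scoring model. The `rerank` helper below is a hypothetical addition that accepts any `(query, text) -> score` function; in production you'd plug in a cross-encoder such as sentence-transformers' `cross-encoder/ms-marco-MiniLM-L-6-v2` (shown in comments, not exercised here):

```python
from typing import Callable

def rerank(
    query: str,
    chunks: list[dict],
    score_fn: Callable[[str, str], float],
    top_k: int = 5,
) -> list[dict]:
    """Re-score retrieved chunks against the query and keep the best top_k."""
    scored = sorted(chunks, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return scored[:top_k]

# With a cross-encoder (assumed model name — verify before using):
# from sentence_transformers import CrossEncoder
# ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# reranked = rerank(query, chunks, lambda q, t: float(ce.predict([(q, t)])[0]))
```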
```python
def retrieve_context(
    query: str,
    collection: chromadb.Collection,
    n_results: int = 5,
) -> list[dict]:
    """
    Embed the query and retrieve the top-n most similar chunks.
    Returns a list of dicts with 'text' and 'metadata'.
    """
    query_embedding = embedding_model.encode(
        [query],
        normalize_embeddings=True,
    ).tolist()

    results = collection.query(
        query_embeddings=query_embedding,
        n_results=n_results,
        include=["documents", "metadatas", "distances"],
    )

    retrieved = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        # distance is cosine distance (0 = identical, 2 = opposite);
        # filter out low-relevance chunks above the distance threshold
        if dist < 1.2:
            retrieved.append({
                "text": doc,
                "metadata": meta,
                "relevance_score": 1 - dist,  # convert to similarity
            })
    return retrieved
```
Step 6: Integrate with Claude
This is where the RAG pipeline for Claude agents pays off. You pass the retrieved chunks as context in the system prompt and let Claude answer grounded in that material — not its training data.
A critical point on hallucinations: RAG significantly reduces them, but doesn’t eliminate them. Claude will sometimes “bridge” between retrieved chunks with inferences that aren’t in the source material. Adding explicit instructions like “only answer based on the provided context” helps — see the patterns in reducing LLM hallucinations in production for a fuller treatment.
```python
def ask_claude_with_rag(
    query: str,
    collection: chromadb.Collection,
    model: str = "claude-haiku-4-5",  # Haiku at ~$0.0008/1k input tokens
) -> str:
    """
    Full RAG query: retrieve context, build prompt, call Claude.
    """
    context_chunks = retrieve_context(query, collection)
    if not context_chunks:
        return "I couldn't find relevant information in the knowledge base."

    # Format retrieved chunks into a readable context block
    context_text = "\n\n---\n\n".join(
        f"[Source chunk {i + 1}, relevance: {c['relevance_score']:.2f}]\n{c['text']}"
        for i, c in enumerate(context_chunks)
    )

    system_prompt = """You are a precise assistant that answers questions based strictly on provided context.

Rules:
- Only use information from the CONTEXT block below
- If the context doesn't contain the answer, say so explicitly
- Quote or reference specific parts of the context when answering
- Do not add information from your training data"""

    user_message = f"""CONTEXT:
{context_text}

QUESTION:
{query}"""

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text
```
```python
# Usage — wire it all together
if __name__ == "__main__":
    # One-time ingestion (skip this step if collection already exists)
    collection = build_knowledge_base("your_document.pdf")

    # Query the RAG pipeline
    answer = ask_claude_with_rag(
        "What are the main risk factors mentioned in the document?",
        collection,
    )
    print(answer)
```
At current Haiku pricing (~$0.0008 per 1K input tokens), a typical RAG query with 5 context chunks averages around 1,500-2,000 input tokens — roughly $0.0012-0.0016 per query in input costs. At 10,000 queries/day that works out to roughly $12-16 a day. If you need faster responses or lower cost on high-volume workloads, the GPT-5.4 mini vs Claude Haiku comparison has current benchmarks worth checking.
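The arithmetic is easy to sketch. The price constant below is an assumption based on Haiku pricing at the time of writing — verify it before budgeting:

```python
# Back-of-envelope input-token cost for a RAG query. Ignores output
# tokens, which add a little on top. Price is assumed — check current
# pricing on Anthropic's site before relying on it.
HAIKU_INPUT_PRICE_PER_1K = 0.0008  # USD per 1K input tokens (assumed)

def query_cost(input_tokens: int, price_per_1k: float = HAIKU_INPUT_PRICE_PER_1K) -> float:
    """Input-token cost of a single RAG query."""
    return (input_tokens / 1000) * price_per_1k

def daily_cost(queries_per_day: int, avg_input_tokens: int = 2000) -> float:
    """Total input-token cost for a day of queries."""
    return queries_per_day * query_cost(avg_input_tokens)
```

At 2,000 input tokens per query, `daily_cost(10_000)` comes to about $16 a day under these assumptions.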
Common Errors and How to Fix Them
Error 1: ChromaDB “Embedding dimension mismatch”
This happens when you change embedding models after a collection already exists. ChromaDB stores the dimension with the collection and rejects mismatched inserts. Fix: delete the old collection (or the entire ./chroma_db directory) and re-ingest. In production, encode the model name in the collection name: documents_minilm_v6.
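One way to follow that advice — a hypothetical helper that slugs the embedding model name into the collection name, so a model swap never collides with an old collection:

```python
def collection_name_for(model_name: str, base: str = "documents") -> str:
    """Derive a collection name that encodes the embedding model."""
    slug = model_name.lower().replace("/", "_").replace("-", "_").replace(".", "_")
    return f"{base}_{slug}"

# e.g. chroma_client.create_collection(collection_name_for("all-MiniLM-L6-v2"))
```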
Error 2: PyMuPDF extraction returns garbled text from scanned PDFs
PyMuPDF extracts embedded text. Scanned PDFs are images — there's no text layer. If `page.get_text("text")` returns empty strings or garbage, you're dealing with a scan. You need OCR: pytesseract or AWS Textract for production quality. Check with: `if len(raw_text.strip()) < 100: raise ValueError("PDF appears to be scanned — OCR required")`.
Error 3: Claude returns “I don’t have information about that” despite relevant chunks existing
Two likely causes. First, your relevance threshold is too strict — lower the distance filter from 1.2 to 1.5. Second, the query and the relevant text use different terminology (e.g., query says “revenue” but document says “income”). Fix: add a query expansion step where you prompt Claude to generate 2-3 synonymous queries and merge the retrieval results. This is also covered in the semantic search implementation guide.
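The merge step of that query-expansion fix can be sketched independently of the Claude call. `merge_retrievals` below is a hypothetical helper that dedupes by chunk text, keeping the best score per chunk (the variant-generation prompt itself is omitted):

```python
def merge_retrievals(result_lists: list[list[dict]]) -> list[dict]:
    """Merge retrieval results from several query variants, deduping by text."""
    best: dict[str, dict] = {}
    for results in result_lists:
        for r in results:
            key = r["text"]
            # Keep the highest relevance score seen for each unique chunk
            if key not in best or r["relevance_score"] > best[key]["relevance_score"]:
                best[key] = r
    return sorted(best.values(), key=lambda r: r["relevance_score"], reverse=True)

# Usage with the pipeline's retrieve_context:
# merged = merge_retrievals([retrieve_context(q, collection) for q in query_variants])
```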
What to Build Next
Add multi-document support with metadata filtering. The current pipeline handles one PDF. Extend it by ingesting multiple documents and storing {"source": filename, "doc_type": "contract"} metadata per chunk. Then let users filter by document type at query time: collection.query(where={"doc_type": "contract"}, ...). This turns your single-document RAG pipeline into a full knowledge base that can answer “what do our contracts say about termination clauses?” across a 500-document corpus — which is where this pattern gets genuinely useful in production.
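A sketch of the per-document ID and metadata step for that extension — `doc_type` is a hypothetical field, so use whatever taxonomy fits your corpus:

```python
def build_doc_metadata(source: str, doc_type: str, n_chunks: int) -> tuple[list[str], list[dict]]:
    """Unique IDs and filterable metadata for every chunk of one document."""
    stem = source.rsplit("/", 1)[-1]  # filename keeps IDs unique across docs
    ids = [f"{stem}_chunk_{i}" for i in range(n_chunks)]
    metadatas = [
        {"source": source, "doc_type": doc_type, "chunk_index": i}
        for i in range(n_chunks)
    ]
    return ids, metadatas

# Query-time filtering then looks like:
# collection.query(query_embeddings=..., n_results=5, where={"doc_type": "contract"})
```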
If you’re building agents that need to persist this knowledge across sessions rather than re-querying a static collection, the persistent memory architecture guide covers how to layer episodic memory on top of this kind of vector store setup.
When to Use This Approach
Solo founders and small teams: This stack (ChromaDB + MiniLM + Haiku) is the right starting point. Zero infrastructure cost, runs locally, and the ingestion/query code is under 200 lines. Start here, add Pinecone when you hit 500k+ chunks or need multi-region replication.
Enterprise teams: Swap ChromaDB for Qdrant (self-hosted) or Pinecone Serverless, replace MiniLM with text-embedding-3-large for better recall on technical documents, and add a cross-encoder re-ranker. Also add fallback and retry logic around both the embedding calls and the Claude API calls — embedding services go down and you don’t want that to take out your entire RAG pipeline.
Budget-conscious builders: The MiniLM embedding model is free and runs on CPU. The only cost is Claude API calls. At Haiku pricing, you can run ~15,000 RAG queries for $20. The entire pipeline described here, for a typical small business knowledge base, costs less than a Heroku dyno per month.
Frequently Asked Questions
What chunk size should I use for a RAG pipeline?
512 tokens with 50-token overlap works well for most document types. Go smaller (256 tokens) for dense technical content where precision matters more than context breadth. Go larger (1024 tokens) for narrative documents like contracts or reports where answers require surrounding context to make sense. Avoid chunks under 100 tokens — they rarely contain enough meaning to be useful.
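When sizing a corpus, it helps to estimate how many chunks the sliding window produces. This sketch mirrors the loop in chunk_text above: one chunk per stride-sized step, where stride = chunk_size - overlap:

```python
import math

def num_chunks(total_tokens: int, chunk_size: int = 512, overlap: int = 50) -> int:
    """Chunks produced by a sliding window over total_tokens tokens."""
    stride = chunk_size - overlap  # 462 tokens per step with the defaults
    return max(1, math.ceil(total_tokens / stride))
```

A report of roughly 50,000 tokens yields about 109 chunks at the default settings — useful for estimating embedding time and storage before ingesting.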
Can I use Claude’s own API for embeddings instead of sentence-transformers?
Anthropic doesn’t currently offer a dedicated embeddings API — Claude is a generation model, not an embedding model. Your options are sentence-transformers (free, local), OpenAI’s text-embedding-3-small (~$0.02/million tokens), or Cohere’s embed models. For most production RAG pipelines, OpenAI’s embedding API is the simplest managed option if you don’t want to host a model yourself.
How do I handle PDFs with tables and charts in a RAG pipeline?
Standard text extraction ignores tables entirely or produces garbled output. For tables, use PyMuPDF's `page.get_text("blocks")` and look for blocks where text is arranged in rows, or use a dedicated library like camelot-py for structured table extraction. For charts, you need a vision model — either Claude's vision API or a specialized chart-to-text model. Most RAG pipelines just skip charts or log them as gaps in coverage.
How many documents can ChromaDB handle before I need to switch to Pinecone or Qdrant?
ChromaDB handles up to roughly 1-2 million vectors reliably on a standard server with 16GB RAM. Beyond that, query latency starts climbing. The real trigger to switch isn’t usually size — it’s operational requirements: if you need horizontal scaling, multi-region replication, or a managed SLA, move to Pinecone Serverless or a hosted Qdrant instance before you hit the size limit.
What’s the difference between RAG and fine-tuning for giving Claude domain knowledge?
RAG retrieves relevant text at query time and puts it in the context window — the model’s weights never change. Fine-tuning bakes knowledge into the model’s weights permanently. Use RAG when your knowledge base changes frequently, when you need to cite sources, or when you want to update documents without retraining. Fine-tuning is better for adjusting tone, format, or domain-specific reasoning patterns that don’t change often. For most production use cases, RAG is faster to build, cheaper to maintain, and easier to debug.
Put this into practice
Try the Connection Agent — ready to use, no setup required.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.