By the end of this tutorial, you’ll have a working production RAG pipeline with Claude that ingests PDFs, chunks and embeds them, stores vectors in a local database, and answers questions with grounded citations — deployable in under 60 minutes. No LangChain magic you can’t debug, no black-box abstractions. Just Python, anthropic, chromadb, and pypdf.
This is the stack I’d actually use for a first production deployment: minimal dependencies, observable at every step, and extensible without a full rewrite when requirements change.
- Install dependencies — Set up Python environment with the four libraries you actually need
- Extract text from PDFs — Parse PDFs and handle encoding edge cases cleanly
- Chunk the text — Split with overlap so context doesn’t get cut at chunk boundaries
- Generate embeddings — Use OpenAI’s text-embedding-3-small to vectorize chunks
- Store vectors in ChromaDB — Persist the index locally with metadata attached
- Build the retrieval function — Query by semantic similarity and return ranked chunks
- Wire up the Claude agent — Feed retrieved context into Claude with a grounded system prompt
Step 1: Install Dependencies
Keep the dependency surface small. You need four packages: anthropic for the Claude API, chromadb for local vector storage, pypdf for PDF parsing, and openai for embeddings (OpenAI’s embedding models are better value per token than Voyage for most use cases).
```bash
pip install anthropic chromadb pypdf openai python-dotenv
```
Pin your versions in requirements.txt. ChromaDB in particular has had breaking API changes between minor versions:
```
anthropic==0.34.2
chromadb==0.5.5
pypdf==4.3.1
openai==1.45.0
python-dotenv==1.0.1
```
Step 2: Extract Text from PDFs
PDF parsing fails more than you’d expect. Scanned documents return empty strings. Some PDFs have ligatures that mangle search. Handle these upfront rather than debugging silent failures downstream.
```python
from pypdf import PdfReader
from pathlib import Path

def extract_pdf_text(pdf_path: str) -> list[dict]:
    """
    Returns a list of dicts: {"page": int, "text": str, "source": str}
    Skips pages with fewer than 50 characters (likely scanned/blank).
    """
    reader = PdfReader(pdf_path)
    source_name = Path(pdf_path).stem
    pages = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        text = text.strip()
        if len(text) < 50:
            # Warn but don't crash — scanned pages are common
            print(f"Warning: page {i+1} of {source_name} may be scanned or empty")
            continue
        pages.append({
            "page": i + 1,
            "text": text,
            "source": source_name
        })
    return pages
```
Step 3: Chunk the Text
This is the step most tutorials get wrong. Naive sentence splitting at 512 tokens destroys context when a sentence spans the end of one chunk and the start of another. Use a sliding window with overlap — 20% overlap is a reasonable starting point, tweak based on your document type.
```python
def chunk_text(pages: list[dict], chunk_size: int = 600, overlap: int = 120) -> list[dict]:
    """
    Splits page text into overlapping chunks.
    chunk_size and overlap are in characters, not tokens.
    Rule of thumb: 1 token ≈ 4 chars, so 600 chars ≈ 150 tokens.
    """
    chunks = []
    for page in pages:
        text = page["text"]
        start = 0
        while start < len(text):
            end = start + chunk_size
            chunk = text[start:end]
            # Don't create tiny trailing chunks
            if len(chunk) < 50:
                break
            chunks.append({
                "text": chunk,
                "source": page["source"],
                "page": page["page"],
                "chunk_id": f"{page['source']}_p{page['page']}_c{len(chunks)}"
            })
            start += chunk_size - overlap  # slide forward with overlap
    return chunks
```
For legal docs or dense technical manuals, I’d push chunk_size up to 900-1000 chars. For FAQ-style docs with short answers, drop to 300. There’s no universal right answer — retrieval quality is the test. See our guide on semantic search and embeddings for how to benchmark retrieval quality on your specific corpus.
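Benchmarking can start as simple as precision@k over a handful of hand-labeled queries. The sketch below is not part of the pipeline itself: the `retrieve` callable (for example, a thin wrapper around your retrieval function) and the labeled (query, expected source) pairs are placeholders you supply from your own corpus.

```python
def precision_at_k(retrieve, labeled_queries, k: int = 5) -> float:
    """
    retrieve: callable(query) -> ranked list of chunk dicts with a "source" key.
    labeled_queries: list of (query, expected_source) pairs labeled by hand.
    Returns the fraction of queries whose expected source appears in the top k.
    """
    hits = 0
    for query, expected_source in labeled_queries:
        top_k = retrieve(query)[:k]
        if any(chunk["source"] == expected_source for chunk in top_k):
            hits += 1
    return hits / len(labeled_queries)
```

Run it once per candidate chunk_size (say 300, 600, 900) over the same 20-30 queries and keep the setting with the best score.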
Step 4: Generate Embeddings
This pipeline uses text-embedding-3-small at $0.02 per million tokens. Embedding a 100-page PDF with ~300 chunks costs less than $0.001. Run this once per document and persist the results.
```python
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def embed_chunks(chunks: list[dict], batch_size: int = 100) -> list[dict]:
    """
    Adds 'embedding' key to each chunk dict.
    Batches requests to stay under API limits.
    """
    texts = [c["text"] for c in chunks]
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = oai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        all_embeddings.extend([e.embedding for e in response.data])
    for chunk, embedding in zip(chunks, all_embeddings):
        chunk["embedding"] = embedding
    return chunks
```
Step 5: Store Vectors in ChromaDB
ChromaDB gives you persistent local vector storage with no infrastructure overhead. For production with 100k+ chunks you’d want Qdrant or Pinecone, but for most deployments ChromaDB handles it fine and is significantly easier to operate.
```python
import chromadb

def build_vector_store(chunks: list[dict], collection_name: str = "knowledge_base") -> chromadb.Collection:
    """
    Creates or loads a persistent ChromaDB collection.
    Re-running with the same collection_name will skip existing IDs.
    """
    client = chromadb.PersistentClient(path="./chroma_db")
    # get_or_create avoids duplicate errors on re-runs
    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}  # cosine similarity for text
    )
    # Filter out chunks already in the collection
    existing_ids = set(collection.get()["ids"])
    new_chunks = [c for c in chunks if c["chunk_id"] not in existing_ids]
    if not new_chunks:
        print("All chunks already indexed.")
        return collection
    collection.add(
        ids=[c["chunk_id"] for c in new_chunks],
        embeddings=[c["embedding"] for c in new_chunks],
        documents=[c["text"] for c in new_chunks],
        metadatas=[{"source": c["source"], "page": c["page"]} for c in new_chunks]
    )
    print(f"Indexed {len(new_chunks)} new chunks.")
    return collection
```
Step 6: Build the Retrieval Function
This is the “R” in RAG. Query the vector store with an embedded version of the user’s question, return the top-k most similar chunks. The n_results parameter is a tuning knob — more results means more context for Claude but higher token cost per query.
```python
def retrieve_context(
    query: str,
    collection: chromadb.Collection,
    n_results: int = 5
) -> list[dict]:
    """
    Returns top-n relevant chunks for a query.
    Each result includes text, source filename, and page number.
    """
    # Embed the query using the same model as the documents
    response = oai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[query]
    )
    query_embedding = response.data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )
    chunks = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        chunks.append({
            "text": doc,
            "source": meta["source"],
            "page": meta["page"],
            "similarity": 1 - dist  # convert cosine distance to similarity
        })
    return chunks
```
Filter out chunks below a similarity threshold (I use 0.4 as a floor) to avoid injecting irrelevant noise into the prompt. Hallucination from low-quality retrieval is a bigger problem than hallucination from the model itself — worth reading our strategies for reducing LLM hallucinations in production if this is a concern for your use case.
Step 7: Wire Up the Claude Agent
Now wire Claude into the pipeline. The key is a system prompt that explicitly tells Claude to cite sources and refuse to answer if the context doesn’t contain relevant information. This single instruction does more to prevent hallucination than any post-processing filter.
```python
import anthropic

claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

SYSTEM_PROMPT = """You are a knowledgeable assistant with access to a document knowledge base.

Rules:
1. Answer ONLY based on the provided context chunks.
2. Always cite your sources using [Source: filename, p.X] format.
3. If the context doesn't contain enough information to answer, say so explicitly.
4. Do not infer or extrapolate beyond what the documents state."""

def ask_claude(query: str, collection: chromadb.Collection) -> str:
    """
    Full RAG query: retrieve relevant chunks, then ask Claude.
    """
    # Retrieve relevant chunks
    chunks = retrieve_context(query, collection, n_results=5)
    # Filter weak matches
    strong_chunks = [c for c in chunks if c["similarity"] > 0.4]
    if not strong_chunks:
        return "I couldn't find relevant information in the knowledge base for that question."
    # Format context for the prompt
    context_str = "\n\n".join([
        f"[Source: {c['source']}, p.{c['page']}]\n{c['text']}"
        for c in strong_chunks
    ])
    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[
            {
                "role": "user",
                "content": f"Context:\n{context_str}\n\nQuestion: {query}"
            }
        ]
    )
    return response.content[0].text
```
Cost estimate at current pricing: each query embeds ~50 tokens + 5 chunks of ~150 tokens = ~800 tokens context, plus the user question. With Claude 3.5 Sonnet at $3/M input tokens, a typical query costs around $0.003. At scale, swap to Claude Haiku — you’ll drop to roughly $0.0003 per query for most retrieval tasks. See our breakdown of when the Claude API vs Agent SDK makes sense for higher-volume deployments.
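As a sanity check on those numbers, the per-query arithmetic is simple enough to write out. The $3/M input price is the one quoted above; the $15/M output price is an assumption about Sonnet output pricing — verify both against current rates.

```python
def estimate_query_cost(input_tokens: int = 850, output_tokens: int = 300,
                        input_per_m: float = 3.00, output_per_m: float = 15.00) -> float:
    """Rough per-query cost in dollars. Prices are per million tokens."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000
```

Swapping in Haiku's per-million prices gives you the budget comparison without running a single query.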
Putting It All Together
```python
def build_knowledge_base(pdf_paths: list[str]) -> chromadb.Collection:
    """Index multiple PDFs into a single ChromaDB collection."""
    all_chunks = []
    for path in pdf_paths:
        print(f"Processing {path}...")
        pages = extract_pdf_text(path)
        chunks = chunk_text(pages)
        embedded = embed_chunks(chunks)
        all_chunks.extend(embedded)
    return build_vector_store(all_chunks)

if __name__ == "__main__":
    # Build the index (run once, or incrementally as new PDFs arrive)
    collection = build_knowledge_base([
        "docs/product_manual.pdf",
        "docs/faq.pdf",
        "docs/technical_spec.pdf"
    ])
    # Interactive query loop
    while True:
        query = input("\nAsk a question (or 'quit'): ").strip()
        if query.lower() == "quit":
            break
        answer = ask_claude(query, collection)
        print(f"\n{answer}")
```
Common Errors and How to Fix Them
Empty or garbled text from PDFs
Symptom: Chunks contain mostly whitespace, random characters, or are suspiciously short. Cause: Scanned PDFs, password-protected files, or PDFs built from images. Fix: Check len(text) < 50 as shown above. For scanned docs you’ll need OCR — pytesseract or cloud Vision APIs. For password protection, pypdf supports a password argument in PdfReader.
ChromaDB “duplicate ID” errors on re-runs
Symptom: chromadb.errors.DuplicateIDError when you run the indexer twice. Fix: The get_or_create_collection + existing ID filter in Step 5 handles this. If you’ve already got a broken collection, delete the ./chroma_db directory and re-index.
Retrieval returns irrelevant chunks
Symptom: Claude answers questions with context that’s clearly off-topic. Cause: Either your similarity threshold is too low, your chunks are too large (diluting signal), or the question phrasing doesn’t match the document vocabulary. Fix: Lower chunk_size to 400, raise the similarity filter to 0.5, and consider adding a where filter in ChromaDB to scope retrieval to specific documents. For persistent retrieval issues, adding retry logic with query reformulation is worth the 30 minutes it takes to implement.
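The retry-with-reformulation idea can be sketched as a small loop: if no chunk clears the similarity floor, rephrase the query and search again. The `retrieve` and `reformulate` callables are injected here to keep the sketch self-contained; in the real pipeline they would wrap `retrieve_context` and a cheap model call that rephrases the question.

```python
def retrieve_with_retry(query, retrieve, reformulate,
                        threshold: float = 0.4, max_attempts: int = 3):
    """
    retrieve: callable(query) -> list of chunk dicts with a "similarity" key.
    reformulate: callable(query) -> str, e.g. a Haiku call that rephrases.
    Returns the first result set with at least one chunk above threshold,
    or the last attempt's results if none succeed.
    """
    current = query
    results = []
    for _ in range(max_attempts):
        results = retrieve(current)
        if any(c["similarity"] > threshold for c in results):
            return results
        current = reformulate(current)  # rephrase and try again
    return results
```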
What to Build Next
The natural extension here is multi-document routing: instead of querying a single collection, build a lightweight classifier that identifies which document category a question belongs to (product docs vs. legal docs vs. support FAQs), then routes to the appropriate ChromaDB collection before retrieval. This dramatically improves precision when your corpus spans very different domains. Combine it with Claude’s tool use to let the agent decide which collection to search based on the question — that’s the point where your RAG system starts to feel genuinely agentic rather than a glorified search box.
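A minimal first pass at that router doesn’t need a trained classifier at all — keyword overlap against each collection’s vocabulary gets you surprisingly far before you reach for a model call. The category keywords below are illustrative placeholders, not anything from the pipeline above.

```python
def route_query(query: str, category_keywords: dict[str, set[str]]) -> str:
    """
    category_keywords: maps collection name -> set of vocabulary keywords.
    Returns the collection whose keywords overlap the query most;
    falls back to the first collection on a tie or no match.
    """
    words = set(query.lower().split())
    return max(category_keywords,
               key=lambda name: len(words & category_keywords[name]))
```

When keyword overlap stops being enough, the same function signature can hide a Haiku-based classifier, so the rest of the pipeline never changes.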
Bottom Line: Who Should Use This Stack
Solo founders and small teams: This exact setup — ChromaDB local + text-embedding-3-small + Claude Sonnet — is appropriate up to about 50,000 chunks (~500 dense PDFs). Total infrastructure cost is zero beyond API calls. Start here.
Teams with 100k+ chunks or multi-user concurrency: Swap ChromaDB for Qdrant (self-hosted) or Pinecone. Keep the same chunking and retrieval logic — it’s portable. The vector store is the only thing that changes.
Budget-sensitive deployments: Claude Haiku cuts per-query cost by 10x with acceptable quality degradation on most retrieval-augmented tasks. Run both on 50 representative queries and compare — for most document Q&A use cases, Haiku is good enough.
This stack is deliberately boring: no frameworks, no magic, nothing you can’t step through with a debugger. That’s intentional. The complexity should live in your data and your prompts, not in your infrastructure.
Frequently Asked Questions
What’s the best chunk size for a RAG pipeline?
There’s no universal answer — it depends on your document structure and query type. Start with 600 characters (~150 tokens) with 20% overlap, then benchmark retrieval precision on 20-30 representative queries. Dense technical documents often need larger chunks (900-1000 chars); FAQ-style docs work better at 300 chars. Chunk size affects retrieval far more than model choice.
Do I need OpenAI embeddings if I’m using Claude?
No — you can use any embedding model. text-embedding-3-small is a good default for cost and performance. Alternatives include Voyage AI’s voyage-3-lite (Anthropic-recommended), Cohere embeddings, or open-source models like nomic-embed-text via Ollama if you need fully local inference. The embedding model and the generation model are completely independent — mix and match freely.
Can I use this RAG setup with multiple PDF files?
Yes — the build_knowledge_base function in the “Putting It All Together” section accepts a list of paths. Each chunk stores its source filename and page number as metadata, so Claude can cite specific documents in its answers. For very large corpora (500+ PDFs), run indexing as a background job; the existing_ids filter from Step 5 already makes re-runs incremental.
How do I prevent Claude from hallucinating in a RAG pipeline?
Two things matter most: quality of retrieval and explicit instructions in the system prompt. Instruct Claude to answer only from provided context and to say “I don’t know” when context is insufficient. Then filter out low-similarity chunks before they reach the prompt — injecting irrelevant context is the primary driver of hallucination in RAG systems, not the model’s baseline behavior.
Is ChromaDB suitable for production use?
For collections under ~50,000 vectors and single-server deployments, yes. It’s persistent, fast, and requires no infrastructure beyond a filesystem. For multi-user APIs, high concurrency, or larger corpora, move to Qdrant (self-hosted, excellent performance) or Pinecone (managed, zero ops). The migration is straightforward because your chunking and embedding code stays identical.
How much does running this RAG pipeline cost?
Indexing a 100-page PDF costs under $0.001 in embedding API calls using text-embedding-3-small. Each query costs roughly $0.003 with Claude 3.5 Sonnet (embedding + ~800 token context + response) or ~$0.0003 with Claude Haiku. A typical internal tool serving 1,000 queries/day runs under $3/day on Sonnet, under $0.30/day on Haiku.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

