Most RAG implementations fail not because the concept is wrong, but because developers skip the boring parts: chunking strategy, embedding model selection, retrieval ranking, and context assembly. The result is a Claude-backed RAG pipeline that retrieves the wrong documents 40% of the time and hallucinates the rest. This guide walks through building one that actually works — with real code, real tradeoffs, and the specific decisions that separate a demo from a production system.
We’ll cover: choosing and generating embeddings, setting up a vector store, implementing hybrid retrieval with reranking, and wiring it to Claude’s API with properly structured context. By the end, you’ll have a working pipeline you can drop into a real product.
Why Most RAG Systems Underperform in Production
The naive approach is: chunk your docs, embed them, store in a vector DB, retrieve top-k by cosine similarity, stuff into a prompt. This works in demos. It fails in production because:
- Fixed chunk sizes ignore document structure. A 512-token chunk that cuts a table in half is useless.
- Cosine similarity alone is brittle. Semantic search misses exact keyword matches that users actually care about.
- No reranking means garbage in, garbage out. The top-5 by embedding similarity often isn’t the best 5 for the actual query.
- Context assembly is an afterthought. Dumping retrieved chunks into a prompt without structure confuses even strong models.
None of this is theoretical. These are the failure modes you hit when you move beyond 10 test documents and real users start asking unpredictable questions.
Step 1 — Chunking Strategy That Doesn’t Lose Context
Before you touch embeddings, get your chunking right. I’d recommend a hybrid approach: split on semantic boundaries first (paragraphs, headings, code blocks), then enforce a max token limit as a fallback. Use overlapping windows so context isn’t lost at chunk edges.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    # Try paragraph breaks, then sentences, then characters
    separators=["\n\n", "\n", ".", " ", ""],
    chunk_size=512,       # with len() this counts characters, not tokens
    chunk_overlap=64,     # overlap to preserve context across boundaries
    length_function=len,  # swap this for a tokenizer-based counter in prod
)
chunks = splitter.split_text(document_text)

# Always store the source metadata alongside the chunk
documents = [
    {"text": chunk, "source": doc_id, "chunk_index": i}
    for i, chunk in enumerate(chunks)
]
For structured documents (PDFs with tables, code files, markdown), write custom splitters. The 20 minutes you spend on this saves hours of debugging retrieval failures later.
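For markdown, such a custom splitter can be sketched in a few lines: split prose on headings, keep fenced code blocks whole, and fall back to a size cap. This is an illustrative stdlib-only sketch, not a library API — the function name and the character-based cap are assumptions you'd tune for your corpus.

```python
import re

FENCE = "`" * 3  # markdown code-fence delimiter

def split_markdown(text: str, max_chars: int = 2000) -> list[str]:
    """Split on headings, keeping fenced code blocks intact."""
    # Pull fenced code blocks out first so a chunk never cuts one in half
    pattern = f"({FENCE}.*?{FENCE})"
    chunks: list[str] = []
    for block in re.split(pattern, text, flags=re.DOTALL):
        if block.startswith(FENCE):
            chunks.append(block)  # keep the whole code block as one chunk
            continue
        # Split prose on markdown headings, then enforce a size cap
        for section in re.split(r"(?m)^(?=#{1,6} )", block):
            section = section.strip()
            while len(section) > max_chars:
                cut = section.rfind("\n\n", 0, max_chars)
                cut = cut if cut > 0 else max_chars
                chunks.append(section[:cut].strip())
                section = section[cut:].strip()
            if section:
                chunks.append(section)
    return chunks
```

The same pattern generalizes: identify the structural units you must never split (tables, code, list items), extract them first, then apply size limits to what remains.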
Step 2 — Choosing and Generating Embeddings
You have three realistic options: OpenAI’s text-embedding-3-small, Cohere’s embed-english-v3.0, or an open-source model like BAAI/bge-large-en-v1.5 running locally. Here’s the honest breakdown:
- OpenAI text-embedding-3-small: $0.02 per million tokens. Best default choice if you’re already in the OpenAI ecosystem. 1536 dimensions. Good general performance.
- Cohere embed-v3: Slightly better retrieval benchmarks on specialized domains. Supports input types (search_document vs search_query) which actually matters for recall. $0.10 per million tokens for the English model.
- BGE-large local: Zero marginal cost after infrastructure. Roughly competitive with the hosted options on English text. Worth it if you’re embedding millions of documents or have privacy requirements.
My pick for most production use cases: OpenAI text-embedding-3-small. The cost is negligible at typical document volumes, latency is predictable, and you get dimension reduction support down to 256 if you need it.
from openai import OpenAI

# Distinct name so it doesn't collide with the Qdrant client below
openai_client = OpenAI()

def embed_documents(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed in batches to avoid rate limits and payload size issues."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        # Response maintains order, so this is safe
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
    return all_embeddings

def embed_query(query: str) -> list[float]:
    """Single embedding for a search query — same model as documents."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[query],
    )
    return response.data[0].embedding
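The dimension reduction mentioned above works because the v3 embedding models are trained so that a prefix of the vector is still a usable embedding. The API accepts a `dimensions` parameter directly; you can also shorten stored vectors client-side by truncating and re-normalizing, as in this stdlib sketch:

```python
import math

def truncate_embedding(vec: list[float], dim: int = 256) -> list[float]:
    """Shorten a text-embedding-3 vector: take the prefix, re-normalize.

    Re-normalizing to unit length is required — cosine similarity on a
    raw truncated prefix gives skewed scores.
    """
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

At 256 dimensions you cut vector storage by 6x with a modest retrieval-quality tradeoff — worth benchmarking on your own corpus before committing.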
Step 3 — Vector Store Setup with Qdrant
For a production RAG pipeline, I’d use Qdrant over Pinecone for self-hosted flexibility, or Pinecone if you want zero infrastructure. Chroma is fine for local development but don’t run it in production — it has consistency issues under concurrent writes.
Here’s a Qdrant setup that includes the payload filtering you’ll actually need in production (filtering by source, date, user permissions, etc.):
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid
client = QdrantClient(url="http://localhost:6333") # or your cloud endpoint
COLLECTION_NAME = "knowledge_base"
VECTOR_SIZE = 1536 # match your embedding model dimensions
# Create collection. Note: recreate_collection drops any existing data,
# so run this once at setup, not on every application start
client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
)
def upsert_documents(documents: list[dict], embeddings: list[list[float]]):
    """Store documents with their embeddings and metadata."""
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=embedding,
            payload={
                "text": doc["text"],
                "source": doc["source"],
                "chunk_index": doc["chunk_index"],
            },
        )
        for doc, embedding in zip(documents, embeddings)
    ]
    # Upsert in batches of 100 to avoid timeouts
    for i in range(0, len(points), 100):
        client.upsert(
            collection_name=COLLECTION_NAME,
            points=points[i:i + 100],
        )
def vector_search(query_embedding: list[float], top_k: int = 20) -> list[dict]:
    """Retrieve candidates — we'll rerank these before sending to Claude."""
    results = client.search(
        collection_name=COLLECTION_NAME,
        query_vector=query_embedding,
        limit=top_k,
        with_payload=True,
    )
    # Keep chunk_index in the result — the RRF step uses it to build
    # document IDs, and dropping it would collide chunks from one source
    return [
        {
            "text": r.payload["text"],
            "source": r.payload["source"],
            "chunk_index": r.payload["chunk_index"],
            "score": r.score,
        }
        for r in results
    ]
Note that we’re fetching top-20, not top-5. We’ll cut this down after reranking. Fetching a larger candidate set before reranking is one of the highest-impact changes you can make to retrieval quality.
Adding Hybrid Search — The Part Most Guides Skip
Pure vector search misses keyword matches. A user asking about “PCI DSS requirement 6.4.2” expects that exact term to matter. Hybrid search combines dense vector retrieval with sparse BM25 keyword search, then merges results using Reciprocal Rank Fusion (RRF).
from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self, documents: list[dict]):
        self.documents = documents
        # Tokenize for BM25 — simple whitespace split is fine for English
        tokenized = [doc["text"].lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def bm25_search(self, query: str, top_k: int = 20) -> list[tuple[int, float]]:
        tokens = query.lower().split()
        scores = self.bm25.get_scores(tokens)
        # Return (doc_index, score) sorted by score descending
        ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
        return ranked[:top_k]
    def reciprocal_rank_fusion(
        self,
        vector_results: list[dict],
        bm25_results: list[tuple[int, float]],
        k: int = 60,  # RRF constant — 60 is the standard default
    ) -> list[dict]:
        scores = {}
        for rank, result in enumerate(vector_results):
            doc_id = result["source"] + str(result.get("chunk_index", 0))
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
        for rank, (doc_idx, _) in enumerate(bm25_results):
            doc = self.documents[doc_idx]
            doc_id = doc["source"] + str(doc.get("chunk_index", 0))
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
        # Sort by fused score
        sorted_ids = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        # Rebuild results — include BM25-only hits, otherwise keyword
        # matches absent from the vector results get silently dropped
        result_map = {
            r["source"] + str(r.get("chunk_index", 0)): r
            for r in vector_results
        }
        for doc_idx, score in bm25_results:
            doc = self.documents[doc_idx]
            doc_id = doc["source"] + str(doc.get("chunk_index", 0))
            result_map.setdefault(
                doc_id,
                {"text": doc["text"], "source": doc["source"], "score": score},
            )
        return [result_map[doc_id] for doc_id, _ in sorted_ids]
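If you want to unit-test the fusion step without a live vector store, the same RRF scoring can be factored into a stdlib-only helper over plain ID lists. This is a sketch; `rrf_fuse` is an illustrative name, not part of any library:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score each ID by summing 1/(k + rank + 1)
    across every ranked list it appears in, then sort by fused score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return [d for d, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)]
```

For example, `rrf_fuse([["a", "b", "c"], ["b", "c", "a"]])` ranks `"b"` first: it places near the top of both lists, which RRF rewards over a single first-place finish.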
Reranking Before Sending to Claude
After hybrid retrieval, run a cross-encoder reranker to cut your candidate pool from 20 to 5. Cohere’s Rerank API is the easiest option here at $2 per thousand calls — for most use cases that’s pennies. Alternatively, run a local cross-encoder with cross-encoder/ms-marco-MiniLM-L-6-v2 from Hugging Face.
import cohere

co = cohere.Client("your-api-key")

def rerank_results(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    """Use Cohere Rerank to select the best chunks from candidates."""
    if not candidates:
        return []
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=top_n,
    )
    # Map back to our candidate dicts using the returned indices
    return [candidates[r.index] for r in response.results]
Wiring It to Claude — Context Assembly That Actually Works
This is where the Claude side of your RAG pipeline comes together. The key insight: don’t just dump chunks into the prompt. Structure the context so Claude knows what each piece is, where it came from, and how confident it should be.
import anthropic

claude = anthropic.Anthropic()

def build_context_block(retrieved_chunks: list[dict]) -> str:
    """Format retrieved chunks into structured context for Claude."""
    context_parts = []
    for i, chunk in enumerate(retrieved_chunks, 1):
        context_parts.append(
            f"[Source {i}: {chunk['source']}]\n{chunk['text']}"
        )
    return "\n\n---\n\n".join(context_parts)

def query_with_rag(user_query: str, retriever: HybridRetriever) -> str:
    # 1. Embed the query
    query_embedding = embed_query(user_query)
    # 2. Hybrid retrieval — get 20 candidates
    vector_results = vector_search(query_embedding, top_k=20)
    bm25_results = retriever.bm25_search(user_query, top_k=20)
    fused = retriever.reciprocal_rank_fusion(vector_results, bm25_results)
    # 3. Rerank to top 5
    top_chunks = rerank_results(user_query, fused, top_n=5)
    # 4. Build structured context
    context = build_context_block(top_chunks)
    # 5. Call Claude with explicit instructions for grounded responses
    response = claude.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        system=(
            "You are a precise assistant. Answer questions using ONLY the provided context.\n"
            "If the context doesn't contain enough information to answer confidently, say so explicitly.\n"
            "Always cite which source number(s) support your answer."
        ),
        messages=[
            {
                "role": "user",
                "content": f"Context:\n\n{context}\n\n---\n\nQuestion: {user_query}",
            }
        ],
    )
    return response.content[0].text
The system prompt matters here. Telling Claude to cite sources and acknowledge gaps dramatically reduces hallucination — and the citations let you build a feedback loop to measure retrieval quality in production.
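To close that feedback loop, you can extract the cited source numbers from Claude’s answer with a small parser. A sketch, assuming the `[Source N: ...]` labeling used above; the function name and regex are illustrative:

```python
import re

def extract_cited_sources(answer: str) -> set[int]:
    """Pull source numbers cited in an answer, handling forms like
    'Source 2', 'Sources 1 and 3', or 'sources 1, 2 and 4'."""
    cited: set[int] = set()
    for match in re.finditer(r"[Ss]ources?\s+((?:\d+(?:\s*(?:,|and)\s*)?)+)", answer):
        cited.update(int(n) for n in re.findall(r"\d+", match.group(1)))
    return cited
```

Log these alongside the retrieved chunks and you can measure which retrieved sources Claude actually used — a direct, cheap signal of retrieval precision.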
Production Considerations You’ll Hit Quickly
Latency
A full pipeline run (embed → retrieve → rerank → LLM) takes 1.5–4 seconds depending on your infrastructure. The reranker API call is often the bottleneck. If you need sub-2s latency, skip the external reranker and use a local cross-encoder, or limit your candidate pool to 10 instead of 20.
Embedding Drift
If you switch embedding models later, you must re-embed your entire document corpus. Pin your embedding model version in production and document it explicitly. OpenAI’s text-embedding-3-small is stable but they’ve shipped breaking changes in the past under different model names.
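One lightweight way to make that pin explicit is a small config record checked at startup. This is a sketch; the field names and helper are illustrative, not a standard pattern:

```python
# Pin everything retrieval correctness depends on, in one place.
# If any of these change, the entire corpus must be re-embedded.
EMBEDDING_CONFIG = {
    "model": "text-embedding-3-small",
    "dimensions": 1536,
    "corpus_embedded_at": "2024-06-01",  # bump when you re-embed
}

def assert_config_matches(stored: dict) -> None:
    """Fail fast at startup if the live config drifts from what the
    corpus was embedded with."""
    for key in ("model", "dimensions"):
        if stored.get(key) != EMBEDDING_CONFIG[key]:
            raise RuntimeError(f"Embedding config drift on {key!r}: re-embed required")
```

Store the same record in your vector store’s collection metadata so the check can run against what was actually embedded, not what the code believes.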
Cost Estimate
For a typical knowledge base query at current pricing: embedding one query costs ~$0.000002 (negligible), Cohere rerank costs ~$0.002 per call, Claude Opus input/output for a 5-chunk context costs roughly $0.015–0.025 per query. Total: under $0.03 per query. At 10,000 queries/month you’re looking at ~$300 on LLM costs, dominated by Claude’s output tokens.
Observability
Log the retrieved chunks and their scores for every query. You can’t improve what you don’t measure. A simple Postgres table with (query, retrieved_sources, response, user_feedback) will tell you more about your retrieval quality than any benchmark after two weeks of real traffic.
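A minimal sketch of that table and its write path, using stdlib `sqlite3` here so the example is self-contained — swap in your Postgres driver and DSN in production. Schema and helper names are illustrative:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use your Postgres connection in production
conn.execute("""
    CREATE TABLE IF NOT EXISTS rag_queries (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        query TEXT NOT NULL,
        retrieved_sources TEXT NOT NULL,  -- JSON: [{"source": ..., "score": ...}]
        response TEXT,
        user_feedback INTEGER,            -- 1 = helpful, 0 = not, NULL = no signal
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

def log_query(query: str, chunks: list[dict], response: str) -> None:
    """Record one pipeline run: the query, what was retrieved, what was said."""
    sources = json.dumps(
        [{"source": c["source"], "score": c.get("score")} for c in chunks]
    )
    conn.execute(
        "INSERT INTO rag_queries (query, retrieved_sources, response) VALUES (?, ?, ?)",
        (query, sources, response),
    )
    conn.commit()
```

Call `log_query` at the end of every pipeline run; the `user_feedback` column gets filled in later from thumbs-up/down events keyed by row id.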
When to Use This Stack and When Not To
Use this approach if: you’re building a knowledge base assistant, document Q&A, or any system where accuracy on specific content matters more than creative generation. This is the right architecture for internal wikis, legal document search, technical support bots, and compliance tooling.
Don’t use it if: your documents change faster than you can re-embed them (consider streaming ingestion pipelines instead), you have fewer than ~500 documents (just use a long-context prompt), or your queries are so open-ended that retrieval can’t meaningfully narrow the search space.
- Solo founder on a budget: use OpenAI embeddings + Qdrant Cloud free tier, and skip the reranker for your first 1,000 users. Add Cohere reranking when retrieval quality becomes a visible problem.
- Team building a production product: implement the full hybrid + rerank pipeline from day one. The cost is trivial and retrofitting it later is painful. Use Claude Haiku for low-stakes queries to cut costs 10x when you’re volume-sensitive.
A well-tuned Claude RAG pipeline doesn’t require exotic tooling — it requires getting the fundamentals right. Chunking, hybrid retrieval, reranking, and structured context assembly are the levers that actually move the needle. Ship something simple, instrument it heavily, and iterate on the parts that are actually failing — that’s the loop that gets you to production.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

