By the end of this tutorial, you’ll have a working hybrid search pipeline that combines BM25 keyword matching with dense vector retrieval, fused using Reciprocal Rank Fusion (RRF), ready to drop into any RAG system backed by Claude. The improvement over pure vector search is not marginal — on domain-specific corpora with exact product names, error codes, or medical terminology, hybrid search RAG retrieval consistently outperforms either approach alone by 15-30% on recall@10.
The core problem: dense embeddings are excellent at semantic similarity but notoriously bad at exact token matching. Ask a vector-only system for “error code E4023” and it’ll surface documents that are semantically adjacent to error codes — not necessarily the one containing “E4023”. BM25 finds it instantly. But ask BM25 about “memory leaks in concurrent Python applications” without those exact words, and it fails. You need both. If you’re building on top of a RAG pipeline, check this RAG pipeline from scratch guide first if you haven’t set up the base layer yet.
- Install dependencies — Set up rank-bm25, sentence-transformers, and supporting libraries
- Build the BM25 index — Tokenize and index your document corpus
- Build the dense vector index — Embed documents with sentence-transformers
- Implement Reciprocal Rank Fusion — Merge ranked results from both retrievers
- Wire it into Claude — Feed fused results as RAG context to Claude’s API
- Tune and evaluate — Adjust RRF k-parameter and measure retrieval quality
Step 1: Install Dependencies
You need four packages. rank-bm25 for keyword retrieval, sentence-transformers for dense embeddings, anthropic for Claude, and numpy for score fusion math. Optionally, faiss-cpu for fast approximate nearest neighbour search at scale.
```shell
pip install rank-bm25 sentence-transformers anthropic numpy faiss-cpu
```
At time of writing, sentence-transformers==2.7.0 and rank-bm25==0.2.2 are stable. Pin these. The sentence-transformers API has broken silently between minor versions more than once.
Step 2: Build the BM25 Index
BM25 operates on tokenized text. The tokenization quality matters more than most tutorials admit — simple whitespace splitting works, but a proper tokenizer that lowercases and strips punctuation gives meaningfully better results.
```python
from rank_bm25 import BM25Okapi
import re

def tokenize(text: str) -> list[str]:
    # Lowercase, strip punctuation, split on whitespace
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return text.split()

# Your document corpus — list of strings
documents = [
    "Error code E4023 indicates a network timeout in the payment module",
    "Memory management in Python requires understanding garbage collection",
    "Concurrent applications often suffer from race conditions and deadlocks",
    # ... your actual documents
]

tokenized_docs = [tokenize(doc) for doc in documents]
bm25_index = BM25Okapi(tokenized_docs)

def bm25_search(query: str, top_k: int = 20) -> list[tuple[int, float]]:
    tokenized_query = tokenize(query)
    scores = bm25_index.get_scores(tokenized_query)
    # Return (doc_index, score) pairs sorted by score descending
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
```
Request top_k=20 from each retriever before fusion, not just the final number you want. RRF needs ranking depth to work well — if you only fetch 5 from each, you lose the re-ranking benefit.
Step 3: Build the Dense Vector Index
For the embedding model, all-MiniLM-L6-v2 is the default recommendation you’ll see everywhere. It’s fast and decent. For production RAG on technical or domain-specific content, BAAI/bge-base-en-v1.5 scores consistently higher on retrieval benchmarks and costs the same (it’s local). The semantic search implementation guide covers embedding model selection in detail if you want to go deeper on that decision.
```python
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Encode all documents — do this once and cache
doc_embeddings = model.encode(documents, normalize_embeddings=True, show_progress_bar=True)
doc_embeddings = doc_embeddings.astype('float32')

# Build FAISS index (cosine similarity via inner product on normalized vectors)
embedding_dim = doc_embeddings.shape[1]
faiss_index = faiss.IndexFlatIP(embedding_dim)
faiss_index.add(doc_embeddings)

def dense_search(query: str, top_k: int = 20) -> list[tuple[int, float]]:
    query_embedding = model.encode([query], normalize_embeddings=True).astype('float32')
    scores, indices = faiss_index.search(query_embedding, top_k)
    # Returns (doc_index, score) pairs
    return list(zip(indices[0].tolist(), scores[0].tolist()))
```
For corpora under ~50k documents, IndexFlatIP (exact search) is fine. Above that, switch to IndexIVFFlat with a trained quantizer. The FAISS docs are actually good on this — one of the few ML libraries where the documentation matches reality.
Step 4: Implement Reciprocal Rank Fusion
RRF is elegant: for each document, sum 1 / (k + rank) across all retriever rankings, where k is a constant (typically 60). Higher total score wins. It’s robust to score scale differences between BM25 (unbounded) and cosine similarity (bounded in [-1, 1]), which is the main reason to prefer it over simple score averaging.
```python
def reciprocal_rank_fusion(
    results_list: list[list[tuple[int, float]]],
    k: int = 60,
    top_k: int = 10,
) -> list[tuple[int, float]]:
    """
    results_list: list of ranked result lists, each [(doc_idx, score), ...]
    k: RRF constant — higher k reduces the impact of top rankings
    Returns: fused ranking as [(doc_idx, rrf_score), ...]
    """
    fused_scores: dict[int, float] = {}
    for results in results_list:
        for rank, (doc_idx, _score) in enumerate(results):
            if doc_idx not in fused_scores:
                fused_scores[doc_idx] = 0.0
            # RRF formula: 1 / (k + rank) with 1-based ranks;
            # enumerate is 0-indexed, hence the +1
            fused_scores[doc_idx] += 1.0 / (k + rank + 1)
    # Sort by fused score descending
    sorted_results = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:top_k]

def hybrid_search(query: str, top_k: int = 10) -> list[str]:
    """Full hybrid search returning document strings."""
    bm25_results = bm25_search(query, top_k=20)
    dense_results = dense_search(query, top_k=20)
    fused = reciprocal_rank_fusion([bm25_results, dense_results], k=60, top_k=top_k)
    # Return actual document text
    return [documents[doc_idx] for doc_idx, _score in fused]
```
The k=60 default is well-established empirically. In practice I’ve found that for short, factual queries (error codes, IDs), dropping k to 20-30 gives BM25 results more weight in the fusion, which helps. For longer semantic queries, k=60 to 80 is better. Worth exposing as a tunable parameter.
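One hypothetical way to expose that: pick k from cheap query features. The heuristic, regex, and thresholds below are illustrative choices, not values from any benchmark:

```python
import re

def choose_rrf_k(query: str) -> int:
    """Illustrative heuristic: short queries containing code-like tokens
    get a lower k (more BM25 weight); long queries get a higher k."""
    tokens = query.split()
    # Code-like token: letters adjacent to digits, e.g. "E4023" or "v2"
    has_code_like = any(re.search(r'[A-Za-z]\d|\d[A-Za-z]', t) for t in tokens)
    if has_code_like and len(tokens) <= 4:
        return 25   # short factual lookup: favor exact keyword matches
    if len(tokens) >= 8:
        return 80   # long semantic query: flatten the rank contributions
    return 60       # the well-established default

print(choose_rrf_k("error code E4023"))  # → 25
print(choose_rrf_k("how do memory leaks arise in concurrent python applications"))  # → 80
```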
Step 5: Wire It Into Claude
Now the payoff. Pass the hybrid-retrieved context chunks directly into Claude’s message. The quality of what you pass in is what determines hallucination rates — garbage retrieval means Claude will fill gaps with plausible fiction. That’s covered well in the guide on reducing LLM hallucinations in production.
```python
import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY from env

def rag_query(user_question: str, top_k: int = 5) -> str:
    # Retrieve relevant chunks via hybrid search
    relevant_chunks = hybrid_search(user_question, top_k=top_k)

    # Format context block
    context = "\n\n---\n\n".join(
        f"[Source {i+1}]\n{chunk}"
        for i, chunk in enumerate(relevant_chunks)
    )

    system_prompt = (
        "You are a helpful assistant. Answer questions using only the provided context. "
        "If the context doesn't contain enough information to answer, say so explicitly."
    )

    user_message = f"""Context:
{context}

Question: {user_question}

Answer based on the context above."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # ~$0.0008 per 1k input tokens
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text

# Usage
answer = rag_query("What does error code E4023 mean?")
print(answer)
```
Using Haiku here costs roughly $0.0008 per 1k input tokens. For a typical RAG call with 5 chunks averaging 200 tokens each (1,000 token context) plus a short question, you’re looking at about $0.001 per query before output tokens. At 10,000 queries/day that’s ~$10/day — completely viable for most products. If you’re processing at higher volume, batch processing with Claude API can cut this further.
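That back-of-envelope math is easy to parameterize. A small sketch, using the article's assumed numbers; the default price is the Haiku-class input rate at time of writing and will drift, so treat it as a placeholder:

```python
def estimate_daily_cost(
    queries_per_day: int,
    chunks_per_query: int = 5,
    tokens_per_chunk: int = 200,
    question_tokens: int = 100,
    price_per_1k_input: float = 0.0008,  # placeholder: Haiku-class input pricing
) -> float:
    """Rough daily input-token cost in dollars, ignoring output tokens."""
    tokens_per_query = chunks_per_query * tokens_per_chunk + question_tokens
    cost_per_query = tokens_per_query / 1000 * price_per_1k_input
    return queries_per_day * cost_per_query

print(f"${estimate_daily_cost(10_000):.2f}/day")  # → $8.80/day
```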
Step 6: Tune and Evaluate
Don’t ship without measuring retrieval quality. The fastest approach: take 20-30 real queries from your use case, manually label which documents are relevant, then compute recall@5 and recall@10 for BM25-only, dense-only, and hybrid. You will almost always see hybrid win, but the margin tells you how to weight the RRF k parameter.
```python
def recall_at_k(
    relevant_doc_indices: list[int],
    results: list[tuple[int, float]],
    k: int,
) -> float:
    """Calculate recall@k given known relevant document indices."""
    retrieved_indices = {doc_idx for doc_idx, _ in results[:k]}
    relevant_set = set(relevant_doc_indices)
    if not relevant_set:
        return 0.0
    hits = len(retrieved_indices & relevant_set)
    return hits / len(relevant_set)

# Example evaluation loop
test_cases = [
    {"query": "error code E4023", "relevant": [0]},
    {"query": "Python concurrency memory issues", "relevant": [1, 2]},
]

for test in test_cases:
    bm25_r = bm25_search(test["query"])
    dense_r = dense_search(test["query"])
    hybrid_r = reciprocal_rank_fusion([bm25_r, dense_r])
    print(f"Query: {test['query']}")
    print(f"  BM25   recall@5: {recall_at_k(test['relevant'], bm25_r, 5):.2f}")
    print(f"  Dense  recall@5: {recall_at_k(test['relevant'], dense_r, 5):.2f}")
    print(f"  Hybrid recall@5: {recall_at_k(test['relevant'], hybrid_r, 5):.2f}")
```
Common Errors and How to Fix Them
BM25 returns zero scores for all documents
This usually means your tokenizer is producing tokens that don’t overlap with the corpus. Check whether your corpus contains significant non-ASCII content (product names, unicode chars) that your regex strips entirely. The fix: use a proper tokenizer like nltk.word_tokenize, or fall back to character-level n-grams for short, code-like tokens.
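One possible sketch of that character n-gram fallback. The n-gram size and the letters-plus-digits trigger rule are illustrative choices, not tested recommendations:

```python
def tokenize_with_ngram_fallback(text: str, n: int = 3) -> list[str]:
    """Word tokens, plus character n-grams for code-like tokens, so a
    partial match on an identifier like 'E4023' can still score in BM25."""
    words = text.lower().split()
    tokens: list[str] = []
    for w in words:
        tokens.append(w)
        # Illustrative rule: only tokens mixing letters and digits get n-grams
        if any(c.isdigit() for c in w) and any(c.isalpha() for c in w):
            tokens.extend(w[i:i + n] for i in range(len(w) - n + 1))
    return tokens

print(tokenize_with_ngram_fallback("error E4023"))
# → ['error', 'e4023', 'e40', '402', '023']
```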
FAISS index gives wrong results after adding documents
IndexFlatIP requires L2-normalized vectors for cosine similarity. If you forget normalize_embeddings=True in the encode call, or add un-normalized vectors to the index, scores become meaningless inner products instead of cosine similarities. Always normalize before adding to the index and before querying. Quick check: np.linalg.norm(doc_embeddings[0]) should be very close to 1.0 (floating-point error means it won’t be exactly 1.0).
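A quick sanity check for that invariant, as a sketch using numpy only (the tolerance is an arbitrary choice):

```python
import numpy as np

def assert_normalized(embeddings: np.ndarray, atol: float = 1e-5) -> None:
    """Raise if any row of the embedding matrix is not approximately unit length."""
    norms = np.linalg.norm(embeddings, axis=1)
    if not np.allclose(norms, 1.0, atol=atol):
        bad = int(np.argmax(np.abs(norms - 1.0)))
        raise ValueError(
            f"Row {bad} has norm {norms[bad]:.6f}; "
            "re-encode with normalize_embeddings=True"
        )

# Example: normalize manually, then verify before indexing
vecs = np.random.default_rng(1).standard_normal((4, 8)).astype('float32')
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
assert_normalized(vecs)  # passes silently
```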
RRF fusion heavily favors one retriever
If one retriever consistently returns results with many zero-scored documents, those zeros still occupy rank positions in your results list, which skews RRF. Filter out zero-score results before passing to RRF: [(idx, score) for idx, score in results if score > 0]. This is especially important for BM25 when a query uses vocabulary completely absent from the corpus.
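A minimal helper for that filtering step, with a toy demonstration (the scores are made up):

```python
def filter_zero_scores(results: list[tuple[int, float]]) -> list[tuple[int, float]]:
    """Drop zero-scored entries so they don't occupy rank positions in RRF."""
    return [(doc_idx, score) for doc_idx, score in results if score > 0.0]

# Toy BM25 result list where the query matched only one document:
bm25_results = [(2, 4.7), (0, 0.0), (1, 0.0)]
print(filter_zero_scores(bm25_results))  # → [(2, 4.7)]
```

Applied before fusion, docs 0 and 1 no longer collect RRF credit for ranks they earned only by default ordering.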
For more patterns on handling retrieval failures gracefully in production, the LLM fallback and retry logic guide has applicable patterns — the same degradation principles apply when one retriever path fails.
What to Build Next
Add query rewriting before retrieval. A single LLM call to expand the user’s query into 2-3 variants (one keyword-focused, one semantic paraphrase) before running hybrid search, then fusing all result sets together, dramatically improves recall on ambiguous or poorly-worded queries. The cost is one cheap Haiku call (~$0.0002) per query — easily worth it for production systems where retrieval quality directly affects answer quality.
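A sketch of the fusion side of that idea, with the LLM rewrite stubbed out as a hypothetical `rewrite_query` (in practice you would replace the stub with one cheap LLM call producing the variants); the minimal RRF here mirrors Step 4 but works on doc ids only:

```python
def rewrite_query(query: str) -> list[str]:
    """Stand-in for an LLM rewriter: original query plus two crude variants.
    Hypothetical placeholder — a real version would call an LLM."""
    return [query, f"keywords: {query}", f"explain {query}"]

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Minimal RRF over lists of doc ids, 1-based ranks as in Step 4."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return [doc for doc, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)]

# Toy retrieval results for three query variants (doc ids only):
variant_rankings = [[3, 1, 2], [1, 3, 0], [1, 2, 3]]
print(rrf(variant_rankings))  # doc 1 wins: it is top-ranked in two of three variants
```

The pattern is the same fusion you already have, just with one ranked list per query variant instead of one per retriever.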
Frequently Asked Questions
Is hybrid search RAG retrieval always better than pure vector search?
For general knowledge corpora with natural language queries, pure vector search is often sufficient. Hybrid search shows its biggest advantage when your corpus contains exact identifiers, product codes, error messages, proper nouns, or technical terminology that embeddings tend to smooth over. If your users search with vague intent, vector search alone may be fine — measure it on your actual queries before adding complexity.
What is Reciprocal Rank Fusion and why use it over score normalization?
RRF combines rankings by summing 1 / (k + rank) across retrievers rather than normalizing and averaging raw scores. This matters because BM25 scores are unbounded and query-dependent, while cosine similarity is bounded at 1.0 — normalizing them is mathematically tricky and brittle. RRF sidesteps this entirely by only caring about relative order, not magnitude, making it robust across very different scoring scales.
How do I handle hybrid search with a hosted vector database like Pinecone or Qdrant?
Qdrant has native sparse vector support that lets you store BM25-style sparse vectors alongside dense embeddings and query both in one call — it’s the cleanest production solution. Pinecone introduced sparse-dense hybrid search in their serverless tier. If you’re using either, skip the manual BM25 implementation above and use their native hybrid query APIs — you get the same RRF fusion but with vector DB-scale performance. The tradeoff is vendor lock-in and slightly higher per-query cost.
How many chunks should I retrieve before passing to Claude?
Start with 5 chunks for focused factual queries, 8-10 for broader synthesis questions. More context does not always mean better answers — it increases the chance of irrelevant content confusing the model and raises token costs linearly. Measure answer quality vs. context size on your specific use case. Claude 3.5 Sonnet handles long contexts well, but you’re often paying for tokens that don’t improve the final answer.
Can I use OpenAI embeddings instead of sentence-transformers for the dense retriever?
Yes — replace the model.encode() calls with OpenAI’s text-embedding-3-small API. It costs $0.02 per million tokens at current pricing, which is cheap at document-indexing time but adds up if you re-embed frequently. For the query embedding (once per search), cost is negligible. Local sentence-transformers models are free after the initial download and perform comparably on most English corpora, so I’d default to those unless you have a specific reason to need OpenAI’s embeddings.
Put this into practice
Try the Search Specialist agent — ready to use, no setup required.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

