By the end of this tutorial, you’ll have a working hybrid search pipeline that combines BM25 keyword matching with dense vector retrieval, fused using Reciprocal Rank Fusion (RRF), ready to drop into any RAG system backed by Claude. The improvement over pure vector search is not marginal — on domain-specific corpora with exact product names, error codes, or medical terminology, hybrid search RAG retrieval consistently outperforms either approach alone by 15-30% on recall@10.
The core problem: dense embeddings are excellent at semantic similarity but notoriously bad at exact token matching. Ask a vector-only system for “error code E4023” and it’ll surface documents that are semantically adjacent to error codes — not necessarily the one containing “E4023”. BM25 finds it instantly. But ask BM25 about “memory leaks in concurrent Python applications” without those exact words, and it fails. You need both. If you’re building on top of a RAG pipeline, check this RAG pipeline from scratch guide first if you haven’t set up the base layer yet.
- Install dependencies — Set up rank-bm25, sentence-transformers, and supporting libraries
- Build the BM25 index — Tokenize and index your document corpus
- Build the dense vector index — Embed documents with sentence-transformers
- Implement Reciprocal Rank Fusion — Merge ranked results from both retrievers
- Wire it into Claude — Feed fused results as RAG context to Claude’s API
- Tune and evaluate — Adjust RRF k-parameter and measure retrieval quality
Step 1: Install Dependencies
You need four packages. rank-bm25 for keyword retrieval, sentence-transformers for dense embeddings, anthropic for Claude, and numpy for score fusion math. Optionally, faiss-cpu for fast approximate nearest neighbour search at scale.
```shell
pip install rank-bm25 sentence-transformers anthropic numpy faiss-cpu
```
At time of writing, sentence-transformers==2.7.0 and rank-bm25==0.2.2 are stable. Pin these. The sentence-transformers API has broken silently between minor versions more than once.
Step 2: Build the BM25 Index
BM25 operates on tokenized text. The tokenization quality matters more than most tutorials admit — simple whitespace splitting works, but a proper tokenizer that lowercases and strips punctuation gives meaningfully better results.
```python
from rank_bm25 import BM25Okapi
import re

def tokenize(text: str) -> list[str]:
    # Lowercase, strip punctuation, split on whitespace
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return text.split()

# Your document corpus — list of strings
documents = [
    "Error code E4023 indicates a network timeout in the payment module",
    "Memory management in Python requires understanding garbage collection",
    "Concurrent applications often suffer from race conditions and deadlocks",
    # ... your actual documents
]

tokenized_docs = [tokenize(doc) for doc in documents]
bm25_index = BM25Okapi(tokenized_docs)

def bm25_search(query: str, top_k: int = 20) -> list[tuple[int, float]]:
    tokenized_query = tokenize(query)
    scores = bm25_index.get_scores(tokenized_query)
    # Return (doc_index, score) pairs sorted by score descending
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
```
Request top_k=20 from each retriever before fusion, not just the final number you want. RRF needs ranking depth to work well — if you only fetch 5 from each, you lose the re-ranking benefit.
Step 3: Build the Dense Vector Index
For the embedding model, all-MiniLM-L6-v2 is the default recommendation you’ll see everywhere. It’s fast and decent. For production RAG on technical or domain-specific content, BAAI/bge-base-en-v1.5 scores consistently higher on retrieval benchmarks and costs the same (it’s local). The semantic search implementation guide covers embedding model selection in detail if you want to go deeper on that decision.
```python
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Encode all documents — do this once and cache
doc_embeddings = model.encode(documents, normalize_embeddings=True, show_progress_bar=True)
doc_embeddings = doc_embeddings.astype('float32')

# Build FAISS index (cosine similarity via inner product on normalized vectors)
embedding_dim = doc_embeddings.shape[1]
faiss_index = faiss.IndexFlatIP(embedding_dim)
faiss_index.add(doc_embeddings)

def dense_search(query: str, top_k: int = 20) -> list[tuple[int, float]]:
    query_embedding = model.encode([query], normalize_embeddings=True).astype('float32')
    scores, indices = faiss_index.search(query_embedding, top_k)
    # Returns (doc_index, score) pairs
    return list(zip(indices[0].tolist(), scores[0].tolist()))
```
For corpora under ~50k documents, IndexFlatIP (exact search) is fine. Above that, switch to IndexIVFFlat with a trained quantizer. The FAISS docs are actually good on this — one of the few ML libraries where the documentation matches reality.
Step 4: Implement Reciprocal Rank Fusion
RRF is elegant: for each document, sum 1 / (k + rank) across all retriever rankings, where k is a constant (typically 60). Higher total score wins. It’s robust to score scale differences between BM25 (unbounded) and cosine similarity (bounded in [-1, 1]), which is the main reason to prefer it over simple score averaging.
```python
def reciprocal_rank_fusion(
    results_list: list[list[tuple[int, float]]],
    k: int = 60,
    top_k: int = 10,
) -> list[tuple[int, float]]:
    """
    results_list: list of ranked result lists, each [(doc_idx, score), ...]
    k: RRF constant — higher k reduces the impact of top rankings
    Returns: fused ranking as [(doc_idx, rrf_score), ...]
    """
    fused_scores: dict[int, float] = {}
    for results in results_list:
        for rank, (doc_idx, _score) in enumerate(results):
            if doc_idx not in fused_scores:
                fused_scores[doc_idx] = 0.0
            # RRF formula: 1 / (k + rank) with 1-based ranks;
            # enumerate is 0-indexed, hence the +1
            fused_scores[doc_idx] += 1.0 / (k + rank + 1)
    # Sort by fused score descending
    sorted_results = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:top_k]

def hybrid_search(query: str, top_k: int = 10) -> list[str]:
    """Full hybrid search returning document strings."""
    bm25_results = bm25_search(query, top_k=20)
    dense_results = dense_search(query, top_k=20)
    fused = reciprocal_rank_fusion([bm25_results, dense_results], k=60, top_k=top_k)
    # Return actual document text
    return [documents[doc_idx] for doc_idx, _score in fused]
```
The k=60 default is well-established empirically. In practice I’ve found that for short, factual queries (error codes, IDs), dropping k to 20-30 gives BM25 results more weight in the fusion, which helps. For longer semantic queries, k=60 to 80 is better. Worth exposing as a tunable parameter.
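One hypothetical way to expose that: pick k from cheap query features. The heuristic, regex, and thresholds below are illustrative choices, not values from any benchmark:

```python
import re

def choose_rrf_k(query: str) -> int:
    """Illustrative heuristic: short queries containing code-like tokens
    get a lower k (more BM25 weight); long queries get a higher k."""
    tokens = query.split()
    # Code-like token: letters adjacent to digits, e.g. "E4023" or "v2"
    has_code_like = any(re.search(r'[A-Za-z]\d|\d[A-Za-z]', t) for t in tokens)
    if has_code_like and len(tokens) <= 4:
        return 25   # short factual lookup: favor exact keyword matches
    if len(tokens) >= 8:
        return 80   # long semantic query: flatten the rank contributions
    return 60       # the well-established default

print(choose_rrf_k("error code E4023"))  # → 25
print(choose_rrf_k("how do memory leaks arise in concurrent python applications"))  # → 80
```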
Step 5: Wire It Into Claude
Now the payoff. Pass the hybrid-retrieved context chunks directly into Claude’s message. The quality of what you pass in is what determines hallucination rates — garbage retrieval means Claude will fill gaps with plausible fiction. That’s covered well in the guide on reducing LLM hallucinations in production.
```python
import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY from env

def rag_query(user_question: str, top_k: int = 5) -> str:
    # Retrieve relevant chunks via hybrid search
    relevant_chunks = hybrid_search(user_question, top_k=top_k)

    # Format context block
    context = "\n\n---\n\n".join(
        f"[Source {i+1}]\n{chunk}"
        for i, chunk in enumerate(relevant_chunks)
    )

    system_prompt = (
        "You are a helpful assistant. Answer questions using only the provided context. "
        "If the context doesn't contain enough information to answer, say so explicitly."
    )

    user_message = f"""Context:
{context}

Question: {user_question}

Answer based on the context above."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # ~$0.0008 per 1k input tokens
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text

# Usage
answer = rag_query("What does error code E4023 mean?")
print(answer)
```
Using Haiku here costs roughly $0.0008 per 1k input tokens. For a typical RAG call with 5 chunks averaging 200 tokens each (1,000 token context) plus a short question, you’re looking at about $0.001 per query before output tokens. At 10,000 queries/day that’s ~$10/day — completely viable for most products. If you’re processing at higher volume, batch processing with Claude API can cut this further.
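That back-of-envelope math is easy to parameterize. A small sketch, using the article's assumed numbers; the default price is the Haiku-class input rate at time of writing and will drift, so treat it as a placeholder:

```python
def estimate_daily_cost(
    queries_per_day: int,
    chunks_per_query: int = 5,
    tokens_per_chunk: int = 200,
    question_tokens: int = 100,
    price_per_1k_input: float = 0.0008,  # placeholder: Haiku-class input pricing
) -> float:
    """Rough daily input-token cost in dollars, ignoring output tokens."""
    tokens_per_query = chunks_per_query * tokens_per_chunk + question_tokens
    cost_per_query = tokens_per_query / 1000 * price_per_1k_input
    return queries_per_day * cost_per_query

print(f"${estimate_daily_cost(10_000):.2f}/day")  # → $8.80/day
```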
Step 6: Tune and Evaluate
Don’t ship without measuring retrieval quality. The fastest approach: take 20-30 real queries from your use case, manually label which documents are relevant, then compute recall@5 and recall@10 for BM25-only, dense-only, and hybrid. You will almost always see hybrid win, but the margin tells you how to weight the RRF k parameter.
```python
def recall_at_k(
    relevant_doc_indices: list[int],
    results: list[tuple[int, float]],
    k: int,
) -> float:
    """Calculate recall@k given known relevant document indices."""
    retrieved_indices = {doc_idx for doc_idx, _ in results[:k]}
    relevant_set = set(relevant_doc_indices)
    if not relevant_set:
        return 0.0
    hits = len(retrieved_indices & relevant_set)
    return hits / len(relevant_set)

# Example evaluation loop
test_cases = [
    {"query": "error code E4023", "relevant": [0]},
    {"query": "Python concurrency memory issues", "relevant": [1, 2]},
]

for test in test_cases:
    bm25_r = bm25_search(test["query"])
    dense_r = dense_search(test["query"])
    hybrid_r = reciprocal_rank_fusion([bm25_r, dense_r])
    print(f"Query: {test['query']}")
    print(f"  BM25   recall@5: {recall_at_k(test['relevant'], bm25_r, 5):.2f}")
    print(f"  Dense  recall@5: {recall_at_k(test['relevant'], dense_r, 5):.2f}")
    print(f"  Hybrid recall@5: {recall_at_k(test['relevant'], hybrid_r, 5):.2f}")
```
Common Errors and How to Fix Them
BM25 returns zero scores for all documents
This usually means your tokenizer is producing tokens that don’t overlap with the corpus. Check whether your corpus contains significant non-ASCII content (product names, unicode chars) that your regex strips entirely. The fix: use a proper tokenizer like nltk.word_tokenize, or fall back to character-level n-grams for short, code-like tokens.
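One possible sketch of that character n-gram fallback. The n-gram size and the letters-plus-digits trigger rule are illustrative choices, not tested recommendations:

```python
def tokenize_with_ngram_fallback(text: str, n: int = 3) -> list[str]:
    """Word tokens, plus character n-grams for code-like tokens, so a
    partial match on an identifier like 'E4023' can still score in BM25."""
    words = text.lower().split()
    tokens: list[str] = []
    for w in words:
        tokens.append(w)
        # Illustrative rule: only tokens mixing letters and digits get n-grams
        if any(c.isdigit() for c in w) and any(c.isalpha() for c in w):
            tokens.extend(w[i:i + n] for i in range(len(w) - n + 1))
    return tokens

print(tokenize_with_ngram_fallback("error E4023"))
# → ['error', 'e4023', 'e40', '402', '023']
```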
FAISS index gives wrong results after adding documents
IndexFlatIP requires L2-normalized vectors for cosine similarity. If you forget normalize_embeddings=True in the encode call, or add un-normalized vectors to the index, scores become meaningless inner products instead of cosine similarities. Always normalize before adding to the index and before querying. Quick check: np.linalg.norm(doc_embeddings[0]) should be very close to 1.0 (floating-point error means it won’t be exactly 1.0).
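A quick sanity check for that invariant, as a sketch using numpy only (the tolerance is an arbitrary choice):

```python
import numpy as np

def assert_normalized(embeddings: np.ndarray, atol: float = 1e-5) -> None:
    """Raise if any row of the embedding matrix is not approximately unit length."""
    norms = np.linalg.norm(embeddings, axis=1)
    if not np.allclose(norms, 1.0, atol=atol):
        bad = int(np.argmax(np.abs(norms - 1.0)))
        raise ValueError(
            f"Row {bad} has norm {norms[bad]:.6f}; "
            "re-encode with normalize_embeddings=True"
        )

# Example: normalize manually, then verify before indexing
vecs = np.random.default_rng(1).standard_normal((4, 8)).astype('float32')
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
assert_normalized(vecs)  # passes silently
```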
RRF fusion heavily favors one retriever
If one retriever consistently returns results with many zero-scored documents, those zeros still occupy rank positions in your results list, which skews RRF. Filter out zero-score results before passing to RRF: [(idx, score) for idx, score in results if score > 0]. This is especially important for BM25 when a query uses vocabulary completely absent from the corpus.
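A minimal helper for that filtering step, with a toy demonstration (the scores are made up):

```python
def filter_zero_scores(results: list[tuple[int, float]]) -> list[tuple[int, float]]:
    """Drop zero-scored entries so they don't occupy rank positions in RRF."""
    return [(doc_idx, score) for doc_idx, score in results if score > 0.0]

# Toy BM25 result list where the query matched only one document:
bm25_results = [(2, 4.7), (0, 0.0), (1, 0.0)]
print(filter_zero_scores(bm25_results))  # → [(2, 4.7)]
```

Applied before fusion, docs 0 and 1 no longer collect RRF credit for ranks they earned only by default ordering.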
For more patterns on handling retrieval failures gracefully in production, the LLM fallback and retry logic guide has applicable patterns — the same degradation principles apply when one retriever path fails.
What to Build Next
Add query rewriting before retrieval. A single LLM call to expand the user’s query into 2-3 variants (one keyword-focused, one semantic paraphrase) before running hybrid search, then fusing all result sets together, dramatically improves recall on ambiguous or poorly-worded queries. The cost is one cheap Haiku call (~$0.0002) per query — easily worth it for production systems where retrieval quality directly affects answer quality.
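A sketch of the fusion side of that idea, with the LLM rewrite stubbed out as a hypothetical `rewrite_query` (in practice you would replace the stub with one cheap LLM call producing the variants); the minimal RRF here mirrors Step 4 but works on doc ids only:

```python
def rewrite_query(query: str) -> list[str]:
    """Stand-in for an LLM rewriter: original query plus two crude variants.
    Hypothetical placeholder — a real version would call an LLM."""
    return [query, f"keywords: {query}", f"explain {query}"]

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Minimal RRF over lists of doc ids, 1-based ranks as in Step 4."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return [doc for doc, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)]

# Toy retrieval results for three query variants (doc ids only):
variant_rankings = [[3, 1, 2], [1, 3, 0], [1, 2, 3]]
print(rrf(variant_rankings))  # doc 1 wins: it is top-ranked in two of three variants
```

The pattern is the same fusion you already have, just with one ranked list per query variant instead of one per retriever.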
Frequently Asked Questions
Is hybrid search RAG retrieval always better than pure vector search?
For general knowledge corpora with natural language queries, pure vector search is often sufficient. Hybrid search shows its biggest advantage when your corpus contains exact identifiers, product codes, error messages, proper nouns, or technical terminology that embeddings tend to smooth over. If your users search with vague intent, vector search alone may be fine — measure it on your actual queries before adding complexity.
What is Reciprocal Rank Fusion and why use it over score normalization?
RRF combines rankings by summing 1 / (k + rank) across retrievers rather than normalizing and averaging raw scores. This matters because BM25 scores are unbounded and query-dependent, while cosine similarity is bounded at 1.0 — normalizing them is mathematically tricky and brittle. RRF sidesteps this entirely by only caring about relative order, not magnitude, making it robust across very different scoring scales.
How do I handle hybrid search with a hosted vector database like Pinecone or Qdrant?
Qdrant has native sparse vector support that lets you store BM25-style sparse vectors alongside dense embeddings and query both in one call — it’s the cleanest production solution. Pinecone introduced sparse-dense hybrid search in their serverless tier. If you’re using either, skip the manual BM25 implementation above and use their native hybrid query APIs — you get the same RRF fusion but with vector DB-scale performance. The tradeoff is vendor lock-in and slightly higher per-query cost.
How many chunks should I retrieve before passing to Claude?
Start with 5 chunks for focused factual queries, 8-10 for broader synthesis questions. More context does not always mean better answers — it increases the chance of irrelevant content confusing the model and raises token costs linearly. Measure answer quality vs. context size on your specific use case. Claude 3.5 Sonnet handles long contexts well, but you’re often paying for tokens that don’t improve the final answer.
Can I use OpenAI embeddings instead of sentence-transformers for the dense retriever?
Yes — replace the model.encode() calls with OpenAI’s text-embedding-3-small API. It costs $0.02 per million tokens at current pricing, which is cheap at document-indexing time but adds up if you re-embed frequently. For the query embedding (once per search), cost is negligible. Local sentence-transformers models are free after the initial download and perform comparably on most English corpora, so I’d default to those unless you have a specific reason to need OpenAI’s embeddings.
Put this into practice
Try the Search Specialist agent — ready to use, no setup required.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

