Sunday, April 5

Pure embedding search feels magical until it fails you in production. You ask your agent “what’s the refund policy for orders placed with a discount code?” and it returns three vaguely related chunks about return windows and promotional terms — never surfacing the one paragraph that actually answers the question. The culprit isn’t your vector database. It’s that semantic similarity alone is the wrong tool for retrieval jobs that require exact term matching, recency weighting, or domain-specific jargon. Building robust hybrid semantic search agents means combining BM25 keyword retrieval, dense vector search, and cross-encoder reranking into a single pipeline — and knowing when to weight each component.

This isn’t theoretical. We’ve seen this failure mode across support bots, internal knowledge bases, and document Q&A systems. If you’ve already built a basic RAG pipeline (here’s a from-scratch implementation guide if you haven’t), this is the next layer you need to add before going anywhere near production traffic.

Why Pure Embedding Search Fails in Production

The most common misconception: “better embeddings fix retrieval.” They don’t — not fundamentally. Embeddings compress meaning into a fixed-size vector, which means they trade precision for generalization. That tradeoff is excellent for paraphrase matching (“how do I cancel?” → “subscription termination process”) but terrible for:

  • Exact term lookup: product SKUs, error codes, named entities, version numbers
  • Rare or domain-specific vocabulary: internal project names, legal clause references, medical codes
  • Short queries against long documents: the embedding of “API rate limit” may not strongly match a 2000-word document that mentions it once but is actually the right answer
  • Negation handling: “plans that do NOT include storage” — embeddings are notoriously bad at this

BM25 (Best Match 25) is a probabilistic term-frequency ranking algorithm that’s been the backbone of Elasticsearch and Solr for decades. It has the opposite strengths: exact term matching, TF-IDF-style weighting, and document length normalization. It’s fast, deterministic, and doesn’t require a GPU. What it misses: synonyms, paraphrases, and anything that requires semantic understanding.

Hybrid search combines both. The question is how.

The Architecture: BM25 + Dense Vectors + Reranking

Stage 1: Parallel Retrieval

Run BM25 and vector search simultaneously, retrieving the top-k candidates from each (typically k=20–50 per channel). You’ll get overlap — documents that score well on both — and non-overlapping candidates that are strong on only one dimension. The combined candidate pool before reranking is usually 40–80 documents.

from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    def __init__(self, documents, embedder, vector_index, bm25_weight=0.4, vector_weight=0.6):
        self.docs = documents
        self.embedder = embedder
        self.vector_index = vector_index
        self.bm25_weight = bm25_weight
        self.vector_weight = vector_weight

        # Tokenize for BM25 — simple whitespace split works, but use a real tokenizer in prod
        tokenized = [doc["text"].lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve(self, query: str, top_k: int = 20) -> list[dict]:
        # BM25 retrieval
        query_tokens = query.lower().split()
        bm25_scores = self.bm25.get_scores(query_tokens)
        bm25_top_idx = np.argsort(bm25_scores)[::-1][:top_k]

        # Vector retrieval
        query_embedding = self.embedder.encode(query)
        vector_results = self.vector_index.query(query_embedding, top_k=top_k)
        # Assumes the vector index returns ids that are positions in self.docs
        vector_top_idx = [r["id"] for r in vector_results]

        # Reciprocal Rank Fusion (RRF) to merge scores
        rrf_scores = {}
        k_constant = 60  # standard RRF constant

        for rank, idx in enumerate(bm25_top_idx):
            rrf_scores[idx] = rrf_scores.get(idx, 0) + (
                self.bm25_weight / (k_constant + rank + 1)
            )

        for rank, idx in enumerate(vector_top_idx):
            rrf_scores[idx] = rrf_scores.get(idx, 0) + (
                self.vector_weight / (k_constant + rank + 1)
            )

        # Sort by combined RRF score
        merged = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
        return [self.docs[idx] for idx, _ in merged[:top_k]]

The Reciprocal Rank Fusion formula is the key detail here. Don’t try to normalize raw BM25 scores against cosine similarities — their scales are incompatible and normalizing them makes assumptions that break on out-of-distribution queries. RRF works on rank position only, which makes it distribution-agnostic. The k_constant=60 is empirically validated across multiple IR benchmarks; don’t mess with it unless you have labeled evaluation data.
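To make the rank-only behavior concrete, here's the same fusion as a standalone function over two toy ranked lists of document ids (the ids and orderings are illustrative):

```python
# Minimal Reciprocal Rank Fusion over ranked lists of doc ids.
# Only rank positions matter; raw scores never enter the formula.
def rrf_fuse(rankings, weights, k=60):
    scores = {}
    for ranked_ids, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranked = ["d3", "d1", "d7"]    # BM25's ordering
vector_ranked = ["d1", "d9", "d3"]  # dense retrieval's ordering

fused = rrf_fuse([bm25_ranked, vector_ranked], weights=[0.5, 0.5])
print(fused[0])  # → d1 (ranks 2 and 1 beat d3's ranks 1 and 3)
```

Note that d1 wins even though d3 was BM25's top pick: appearing near the top of *both* lists outweighs winning one of them.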

Stage 2: Cross-Encoder Reranking

Bi-encoder models (the kind that produce embeddings) score query and document independently and compare the vectors. Cross-encoders read the query and document together, which is massively more expensive but dramatically more accurate. You use them on the small candidate pool after first-stage retrieval — never on your full corpus.

from sentence_transformers import CrossEncoder

class ReRanker:
    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        # This model runs locally, ~80MB, decent quality
        # For higher quality: cross-encoder/ms-marco-electra-base (~400MB)
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
        pairs = [(query, doc["text"]) for doc in candidates]
        scores = self.model.predict(pairs)  # returns numpy array of scores

        # Attach scores and sort
        scored = sorted(
            zip(scores, candidates),
            key=lambda x: x[0],
            reverse=True
        )

        # Return top_n with score attached
        return [
            {**doc, "rerank_score": float(score)}
            for score, doc in scored[:top_n]
        ]

On a MacBook Pro M2, `cross-encoder/ms-marco-MiniLM-L-6-v2` reranks 50 candidates in roughly 80–120ms. The larger `electra-base` model takes about 400ms on the same hardware but improves NDCG@10 by ~3–5 points on MSMARCO benchmarks. For latency-sensitive agents, stick with the MiniLM variant. If you’re doing async or batch retrieval, the electra model is worth it.

Alternatively, you can use Cohere’s Rerank API (`rerank-english-v3.0`), which is priced per search request rather than per token: roughly $2 per 1,000 searches at the time of writing, with no infrastructure to manage. At 10,000 queries/day that’s around $20/day, which adds up. The local cross-encoder pays for itself fast at that scale; verify current pricing before committing.

Where to Actually Run BM25: Practical Options

If you’re already using Elasticsearch or OpenSearch, you have BM25 built in. Adding a hybrid retrieval layer means running a BM25 query and a kNN query in parallel, then fusing in application code. Both support native hybrid search in recent versions, but the fusion logic is limited — you’ll want RRF or your own merging for serious use cases.

Qdrant now has native sparse vector support, which lets you run BM25-style retrieval using SPLADE or similar sparse representations entirely within the vector DB. This is architecturally cleaner than maintaining separate systems. Our vector database comparison covers the production tradeoffs in detail, but the short version: if you’re starting fresh and want hybrid, Qdrant is the least painful option.

For lightweight deployments (docs under 100K, no dedicated infra), `rank_bm25` in Python + a local FAISS index is entirely workable. The code above is basically the full implementation. You can keep it in memory for corpora under ~50MB without issue.
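For the dense half of such a lightweight setup, here's a sketch of a brute-force inner-product index in NumPy, which computes the same thing a flat FAISS index (`faiss.IndexFlatIP`) does; the 2-D "embeddings" are illustrative stand-ins, and the `query` method matches the interface the `HybridRetriever` above expects:

```python
import numpy as np

# Brute-force dense retrieval over L2-normalized embeddings, so the
# inner product equals cosine similarity. Swap in faiss.IndexFlatIP
# for larger corpora; the math is identical.
class FlatVectorIndex:
    def __init__(self, embeddings):
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.embeddings = embeddings / norms

    def query(self, query_embedding, top_k=5):
        q = query_embedding / np.linalg.norm(query_embedding)
        sims = self.embeddings @ q
        top = np.argsort(sims)[::-1][:top_k]
        return [{"id": int(i), "score": float(sims[i])} for i in top]

# Toy 2-D "embeddings" for illustration
doc_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
index = FlatVectorIndex(doc_vectors)
results = index.query(np.array([0.9, 0.1]), top_k=2)
print([r["id"] for r in results])  # → [0, 2]
```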

LLM-Based Reranking: When It’s Actually Worth It

There’s a third option beyond cross-encoders: ask an LLM to score each candidate’s relevance. This is slower and more expensive, but it unlocks reasoning that neither BM25 nor embeddings can do — like understanding that a chunk is relevant because it establishes a prerequisite concept, not because it mentions the query terms.

import anthropic
import json

client = anthropic.Anthropic()

def llm_rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    # Format candidates as a numbered list for the LLM
    candidate_text = "\n\n".join([
        f"[{i+1}] {doc['text'][:500]}"  # truncate to keep costs sane
        for i, doc in enumerate(candidates)
    ])

    prompt = f"""You are a relevance judge. Given the query and candidate passages below,
return a JSON array of candidate numbers ranked from most to least relevant.
Only include candidates that are actually relevant.

Query: {query}

Candidates:
{candidate_text}

Return ONLY a JSON array of integers, e.g. [3, 1, 7, 2]"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",  # cheap model for scoring tasks
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )

    # The model is instructed to return only a JSON array; in production,
    # wrap this parse in try/except with a fallback to the original order
    ranking = json.loads(response.content[0].text.strip())
    # ranking is 1-indexed; drop anything out of range
    return [candidates[i - 1] for i in ranking[:top_n] if 1 <= i <= len(candidates)]

Using Claude Haiku for this costs roughly $0.0003–$0.0008 per reranking call (depending on candidate count and truncation). That’s comparable to Cohere Rerank but gives you much more control over the scoring criteria — you can tell it to prioritize recency, penalize contradictory information, or weight certain document types. The tradeoff is latency: expect 800ms–2s per call even with Haiku. This approach makes more sense in async retrieval pipelines than in real-time agent responses. It’s also directly useful for reducing hallucinations — LLM reranking can explicitly filter candidates that contradict established facts.

Tuning the BM25/Vector Weight Split

The single most common mistake: setting weights once and never revisiting them. The right ratio depends on your query distribution:

  • High keyword specificity (error codes, SKUs, named entities): BM25 weight 0.6–0.7
  • Conversational/paraphrase-heavy queries: vector weight 0.7–0.8
  • Mixed enterprise knowledge bases: start at 50/50, evaluate with labeled queries

If you don’t have labeled evaluation data, generate synthetic queries from your documents using an LLM — ask it to write 3 questions that each chunk would answer. Run those questions through your retrieval pipeline and measure how often the source chunk appears in the top-5 results (Recall@5). This gives you a tunable signal without expensive human annotation.
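The evaluation loop itself is tiny. A sketch, assuming you already have (synthetic query, source chunk id) pairs and a retriever that returns ranked chunk ids; the toy retriever here is a stand-in:

```python
def recall_at_k(eval_pairs, retrieve, k=5):
    """eval_pairs: list of (query, source_chunk_id) tuples.
    retrieve: callable returning a ranked list of chunk ids."""
    hits = 0
    for query, source_id in eval_pairs:
        if source_id in retrieve(query)[:k]:
            hits += 1
    return hits / len(eval_pairs)

# Stand-in retriever that always returns the same ranking
fake_retrieve = lambda q: ["c1", "c2", "c3", "c4", "c5", "c6"]
pairs = [("q1", "c2"), ("q2", "c6"), ("q3", "c5")]
print(recall_at_k(pairs, fake_retrieve))  # c2 and c5 are in the top 5 → 2/3
```

Sweep `bm25_weight`/`vector_weight` over this metric and you have a principled tuning loop without any human labels.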

One thing the documentation won’t tell you: BM25 performance degrades when your documents have inconsistent tokenization. If you’re chunking PDFs, you’ll often get hyphenated words, ligatures, and OCR artifacts that fragment terms. Pre-process with a consistent tokenizer (spaCy’s tokenizer is robust for this) before indexing. The embeddings will absorb some of this noise; BM25 won’t.
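A lightweight sketch of that kind of normalization using only the standard library; spaCy's tokenizer, as recommended above, is the more robust production option, and the artifact patterns handled here are illustrative:

```python
import re
import unicodedata

def normalize_for_bm25(text):
    # Fold ligatures (ﬁ → fi) and other compatibility characters
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across line breaks: "imple-\nmentation"
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse whitespace and lowercase for consistent tokenization
    return " ".join(text.lower().split())

print(normalize_for_bm25("The BM25 imple-\nmentation uses ﬁxed weights."))
# → "the bm25 implementation uses fixed weights."
```

Run the same normalization on queries at retrieval time; an index normalized one way and queries normalized another is just a subtler version of the same bug.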

Production Implementation Notes

A few things that bite people in production that aren’t covered in most tutorials:

Index synchronization: When you update documents, you need to update both your BM25 index and your vector index atomically — or you’ll have retrieval divergence. The simplest approach is to version your indices and swap them atomically on update rather than doing partial updates.
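A minimal sketch of the versioned-swap pattern; the index objects are stand-in strings here, and the point is that readers always see a matched BM25/vector pair because publication is a single reference assignment:

```python
import threading

# Build the new indices off to the side, then publish both halves
# (BM25 + vector) together in one atomic reference swap.
class IndexPair:
    def __init__(self, bm25_index, vector_index, version):
        self.bm25 = bm25_index
        self.vector = vector_index
        self.version = version

class IndexHolder:
    def __init__(self, initial):
        self._current = initial
        self._lock = threading.Lock()

    def swap(self, new_pair):
        with self._lock:
            self._current = new_pair  # single assignment, no partial state

    def current(self):
        return self._current  # readers never see a half-updated pair

holder = IndexHolder(IndexPair("bm25_v1", "vec_v1", version=1))
# ... rebuild both indices against the updated documents ...
holder.swap(IndexPair("bm25_v2", "vec_v2", version=2))
print(holder.current().version)  # → 2
```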

Query preprocessing matters more than you think: Lowercasing, removing stopwords, and handling synonyms at the query level (not just the index level) gives consistent BM25 gains. For domain-specific systems, a small synonym dictionary (e.g., “ML” → “machine learning”) can improve BM25 recall significantly.
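The query-side synonym expansion can be a few lines; the dictionary entries here are illustrative:

```python
# Tiny query-side synonym expansion for BM25.
# Expanding the query (not the index) means you can update the
# dictionary without reindexing anything.
SYNONYMS = {
    "ml": ["machine", "learning"],
    "k8s": ["kubernetes"],
    "auth": ["authentication"],
}

def expand_query(query):
    tokens = query.lower().split()
    expanded = list(tokens)
    for tok in tokens:
        expanded.extend(SYNONYMS.get(tok, []))
    return expanded

print(expand_query("ML auth errors"))
# → ['ml', 'auth', 'errors', 'machine', 'learning', 'authentication']
```

Feed the expanded token list straight into `bm25.get_scores()`; the original tokens stay in the query, so exact matches still score highest.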

Caching: If your queries have any repetition (common in agent workflows), cache the full hybrid retrieval result by query hash. Even a 10-minute TTL cache dramatically cuts costs and latency in practice. This is especially relevant if you’re using LLM reranking. For more on handling production reliability patterns, the LLM fallback and retry logic guide has patterns that apply directly to retrieval pipelines.
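A minimal sketch of query-hash caching with a TTL, using only the standard library; the retrieval function is a stand-in:

```python
import hashlib
import time

# TTL cache keyed by a hash of the query string.
class RetrievalCache:
    def __init__(self, ttl_seconds=600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, result)

    def get_or_compute(self, query, retrieve_fn):
        key = hashlib.sha256(query.encode()).hexdigest()
        entry = self.store.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]  # fresh cache hit
        result = retrieve_fn(query)
        self.store[key] = (time.monotonic() + self.ttl, result)
        return result

calls = []
def fake_retrieve(q):
    calls.append(q)  # track how often real retrieval runs
    return [q.upper()]

cache = RetrievalCache(ttl_seconds=600)
cache.get_or_compute("refund policy", fake_retrieve)
cache.get_or_compute("refund policy", fake_retrieve)
print(len(calls))  # → 1 (second call served from cache)
```

In production you'd normalize the query before hashing (see the preprocessing note above) so trivially different phrasings share a cache entry, and use Redis or similar instead of an in-process dict.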

Chunk size interaction: BM25 and dense retrieval have different optimal chunk sizes. BM25 tends to prefer longer chunks (more term context). Dense retrieval often works better with shorter, more focused chunks. A practical solution: index two chunk granularities — sentence-level for dense retrieval, paragraph-level for BM25 — and merge the results. More infrastructure complexity, but meaningfully better retrieval on mixed query types.

The Bottom Line: Which Setup for Which Team

Solo developer or early-stage product: Start with `rank_bm25` + `sentence-transformers` + `cross-encoder/ms-marco-MiniLM-L-6-v2` running locally. Zero infrastructure cost, the code above is 90% of what you need, and it will meaningfully outperform pure vector search with an afternoon of work. Move to managed infrastructure when you have >100K documents or >1000 queries/day.

Team with existing Elasticsearch: Add kNN search to your existing ES cluster (available in ES 8.x) and implement RRF fusion in your application layer. You get hybrid search without new infrastructure. Add the local cross-encoder for reranking — it’s the highest-ROI addition you can make to an existing retrieval stack.

Enterprise / high-throughput: Qdrant with sparse + dense vectors for retrieval, Cohere Rerank or a self-hosted cross-encoder for reranking, Redis for result caching. If you’re evaluating models for the LLM layer of your agent, the LangChain vs LlamaIndex vs plain Python comparison is worth reading before choosing your orchestration approach — retrieval and orchestration are deeply coupled decisions.

The verdict on hybrid semantic search agents: this is not optional complexity for production systems. Pure vector search is a prototype-tier retrieval strategy. BM25 + dense vectors + cross-encoder reranking is the minimum viable stack for agent knowledge bases that need to reliably surface the right information. The implementation overhead is low (the code above is literally what we run in production), the latency cost is 80–200ms on top of your baseline, and the improvement in retrieval quality — especially for domain-specific or keyword-heavy corpora — is substantial enough to justify it every time.

Frequently Asked Questions

What is the difference between BM25 and semantic search?

BM25 is a term-frequency ranking algorithm that scores documents based on exact keyword matches, adjusted for document length and term frequency. Semantic search uses dense vector embeddings to match by meaning, even when the exact words differ. BM25 excels at precise term lookup (product codes, error messages); semantic search excels at paraphrase and synonym handling. Hybrid systems use both in parallel.

How do I combine BM25 and vector search scores?

Don’t try to normalize the raw scores — BM25 scores and cosine similarities have incompatible scales. Use Reciprocal Rank Fusion (RRF) instead: convert each retrieval method’s output into a ranked list, then combine rank positions using the formula score = weight / (k + rank), where k=60 is the standard constant. This is distribution-agnostic and works well without labeled data to tune on.

Can I run cross-encoder reranking locally without a GPU?

Yes. The cross-encoder/ms-marco-MiniLM-L-6-v2 model from sentence-transformers runs on CPU and reranks 50 candidates in roughly 80–150ms on a modern laptop. It’s about 80MB and produces strong results for most English-language retrieval tasks. The larger electra-based models need 400–800ms on CPU, which may be acceptable for async pipelines but too slow for real-time agent responses.

When should I use an LLM for reranking instead of a cross-encoder?

Use LLM reranking when you need to apply complex business logic to relevance scoring — like preferring more recent documents, filtering candidates that contradict known facts, or weighting results by document type. Cross-encoders are faster and cheaper for pure relevance ranking. LLM reranking adds 800ms–2s of latency and costs $0.0003–$0.001 per call with Haiku, so it belongs in async or non-latency-critical pipelines.

Does hybrid search work with existing Pinecone or Weaviate deployments?

Pinecone added sparse-dense hybrid search support (using SPLADE representations) — you can send a sparse and dense vector simultaneously and it handles fusion internally. Weaviate supports hybrid search natively with configurable alpha weighting between BM25 and vector results. Both are valid options, though the fusion logic is less flexible than implementing RRF yourself in application code.

What chunk size should I use for hybrid search?

BM25 prefers longer chunks (more term context, typically 400–600 tokens) while dense retrieval often works better with shorter chunks (100–200 tokens for precision). A practical compromise is 300 tokens with 50-token overlap for a single index, or maintaining two chunk granularities if retrieval quality is critical. Always evaluate against real queries from your use case — optimal chunk size is highly domain-dependent.

Put this into practice

Try the Search Specialist agent — ready to use, no setup required.

Browse Agents →

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
