By the end of this tutorial, you’ll have a working semantic search vector database pipeline that lets a Claude agent retrieve relevant documents using meaning-based similarity — not keyword matching. You’ll embed documents into Qdrant, query them at runtime, and inject the results into Claude’s context window.
Keyword search breaks the moment your users phrase things differently from how your documents are written. A support agent that can only match “refund policy” will miss “can I get my money back” entirely. Semantic search fixes this by comparing meaning in embedding space — and it’s what separates a useful knowledge base from a frustrating one. If you’re also fighting hallucination problems, grounding Claude with retrieved context is one of the most reliable mitigation strategies available.
- Install dependencies — set up Python environment with Qdrant, OpenAI embeddings, and Anthropic SDK
- Prepare and chunk documents — split source text into retrievable chunks with metadata
- Generate embeddings — convert chunks to vectors using text-embedding-3-small
- Index into Qdrant — create a collection and upsert your vectors
- Build the retrieval function — query the vector DB and return top-k results
- Wire it into Claude — inject retrieved context into the system prompt for grounded answers
Step 1: Install Dependencies
I’m using Qdrant for the vector store (it runs locally in Docker with zero configuration), OpenAI’s embedding API for the vectors, and the Anthropic SDK for Claude. You could swap in Pinecone or Weaviate — there’s a detailed comparison of those options here — but for a tutorial Qdrant local is the fastest path to something running.
# Start Qdrant locally via Docker
docker run -p 6333:6333 qdrant/qdrant
# Install Python dependencies
pip install qdrant-client openai anthropic tiktoken
Pin your versions. Qdrant’s Python client has had breaking API changes across its 1.x releases that will burn you if you don’t.
pip install qdrant-client==1.9.1 openai==1.30.1 anthropic==0.28.0 tiktoken==0.7.0
Step 2: Prepare and Chunk Documents
Chunking strategy matters more than most tutorials admit. Too large and your retrieved context is noisy; too small and you lose surrounding context that gives sentences meaning. I’ve landed on 400-token chunks with a 50-token overlap as a solid default for knowledge base documents.
import tiktoken
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[dict]:
    """Split text into overlapping chunks with token-based sizing."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    chunk_index = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunk_tokens = tokens[start:end]
        chunk_text_str = enc.decode(chunk_tokens)
        chunks.append({
            "text": chunk_text_str,
            "chunk_index": chunk_index,
            "token_count": len(chunk_tokens)
        })
        # Move forward by chunk_size minus overlap
        start += chunk_size - overlap
        chunk_index += 1
    return chunks
# Example: load your documents
documents = [
    {"id": "refund-policy", "title": "Refund Policy", "content": "...your policy text..."},
    {"id": "shipping-info", "title": "Shipping Information", "content": "...shipping text..."},
]

all_chunks = []
for doc in documents:
    chunks = chunk_text(doc["content"])
    for chunk in chunks:
        chunk["doc_id"] = doc["id"]
        chunk["doc_title"] = doc["title"]
        all_chunks.append(chunk)
print(f"Total chunks: {len(all_chunks)}")
Step 3: Generate Embeddings
I’m using text-embedding-3-small here. It produces 1536-dimensional vectors and costs $0.02 per million tokens — for a 500-document knowledge base you’re looking at well under $0.10 total for the initial indexing run. text-embedding-3-large is measurably better on retrieval benchmarks but costs 6.5x more at $0.13 per million tokens; the small model is the right call unless you’re working with highly technical or domain-specific content where precision is critical.
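To sanity-check that cost estimate, the arithmetic is simple. The document count and average length below are illustrative assumptions, not measurements:

```python
# Back-of-envelope embedding cost for the initial indexing run.
# Assumed figures: 500 documents averaging 2,000 tokens each.
num_docs = 500
avg_tokens_per_doc = 2_000
price_per_million_tokens = 0.02  # text-embedding-3-small

total_tokens = num_docs * avg_tokens_per_doc  # 1,000,000 tokens
cost = total_tokens / 1_000_000 * price_per_million_tokens

print(f"Estimated indexing cost: ${cost:.2f}")  # $0.02
```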
from openai import OpenAI
import time
client = OpenAI(api_key="your-openai-key")
def embed_chunks(chunks: list[dict], batch_size: int = 100) -> list[dict]:
    """Embed chunks in batches to respect API rate limits."""
    embedded = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c["text"] for c in batch]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=texts
        )
        for chunk, embedding_obj in zip(batch, response.data):
            chunk["embedding"] = embedding_obj.embedding
            embedded.append(chunk)
        # Respect rate limits — 3000 RPM on tier 1
        if i + batch_size < len(chunks):
            time.sleep(0.1)
    return embedded
embedded_chunks = embed_chunks(all_chunks)
print(f"Embedded {len(embedded_chunks)} chunks")
Step 4: Index into Qdrant
Create a collection with the right vector size (1536 for text-embedding-3-small) and cosine distance. Cosine is the conventional metric for text embeddings. OpenAI’s embedding vectors are normalized to unit length, so Euclidean distance would actually rank results identically — but cosine scores are bounded and easier to interpret and threshold.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid
qdrant = QdrantClient(host="localhost", port=6333)
COLLECTION_NAME = "knowledge_base"
VECTOR_SIZE = 1536 # text-embedding-3-small dimensions
# Create the collection. NOTE: recreate_collection drops any existing
# collection with this name (and its data) before recreating it. It's a
# destructive reset, not an idempotent create.
qdrant.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE)
)
# Upsert all embedded chunks
points = []
for chunk in embedded_chunks:
    points.append(
        PointStruct(
            id=str(uuid.uuid4()),
            vector=chunk["embedding"],
            payload={
                "text": chunk["text"],
                "doc_id": chunk["doc_id"],
                "doc_title": chunk["doc_title"],
                "chunk_index": chunk["chunk_index"]
            }
        )
    )
# Upload in batches of 100
batch_size = 100
for i in range(0, len(points), batch_size):
    qdrant.upsert(
        collection_name=COLLECTION_NAME,
        points=points[i:i + batch_size]
    )
print(f"Indexed {len(points)} vectors into Qdrant")
Step 5: Build the Retrieval Function
The retrieval function embeds the incoming query and finds the top-k most similar chunks. I default to top 5 — enough context to answer most questions, small enough to keep Claude’s input tight. You can filter by doc_id or doc_title using Qdrant’s payload filtering if you need scoped search (e.g., “only search the pricing documents”).
def retrieve(query: str, top_k: int = 5, score_threshold: float = 0.70) -> list[dict]:
    """
    Embed query and return top-k semantically similar chunks.
    score_threshold filters out weak matches — tune this for your corpus.
    """
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=[query]
    ).data[0].embedding
    results = qdrant.search(
        collection_name=COLLECTION_NAME,
        query_vector=query_embedding,
        limit=top_k,
        score_threshold=score_threshold,
        with_payload=True
    )
    return [
        {
            "text": hit.payload["text"],
            "doc_title": hit.payload["doc_title"],
            "score": round(hit.score, 4)
        }
        for hit in results
    ]
# Test it
results = retrieve("can I return a product after 30 days?")
for r in results:
    print(f"[{r['score']}] {r['doc_title']}: {r['text'][:100]}...")
Step 6: Wire It Into Claude
The retrieved chunks become injected context in Claude’s prompt. I prefer injecting into the system prompt rather than the user turn — it keeps the user’s question clean and lets you craft exactly how Claude should treat the retrieved material. This is also where your system prompt structure matters a lot for consistent agent behavior.
import anthropic
claude = anthropic.Anthropic(api_key="your-anthropic-key")
def answer_with_context(user_question: str) -> str:
    """Retrieve relevant docs and answer using Claude."""
    # 1. Retrieve relevant chunks
    chunks = retrieve(user_question, top_k=5)
    if not chunks:
        context_block = "No relevant documents found in the knowledge base."
    else:
        # Format retrieved context clearly
        context_block = "\n\n".join([
            f"[Source: {c['doc_title']} | Relevance: {c['score']}]\n{c['text']}"
            for c in chunks
        ])
    # 2. Build system prompt with retrieved context injected
    system_prompt = f"""You are a helpful support agent. Answer the user's question using ONLY the information in the provided context. If the context doesn't contain enough information to answer, say so explicitly — do not speculate.

## Retrieved Context
{context_block}

## Instructions
- Quote or reference specific sources when relevant
- If multiple sources conflict, note the conflict
- Keep answers concise and direct"""
    # 3. Call Claude — using Haiku for speed/cost on retrieval-augmented tasks
    response = claude.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_question}]
    )
    return response.content[0].text
# Test the full pipeline
answer = answer_with_context("What's your return policy for digital products?")
print(answer)
At current Haiku pricing (~$0.80/million input tokens), a typical retrieval-augmented query with 5 chunks averaging 400 tokens each costs roughly $0.002–0.003 per call. For a support agent handling 10,000 queries a month, that’s $20–30 in LLM costs — the embedding lookups are negligible on top of that.
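That per-query figure follows from token counts alone. The prompt-overhead number below is an assumption, and you should plug in the current Haiku input price from Anthropic’s pricing page:

```python
# Rough input-cost model for one retrieval-augmented query.
def query_input_cost(price_per_million_input: float,
                     n_chunks: int = 5,
                     tokens_per_chunk: int = 400,
                     overhead_tokens: int = 500) -> float:
    """Input cost in dollars for one call, given a per-million-token price."""
    total_tokens = n_chunks * tokens_per_chunk + overhead_tokens
    return total_tokens / 1_000_000 * price_per_million_input

# At $0.80/M input: 2,500 tokens comes to roughly $0.002 per call
print(f"${query_input_cost(0.80):.4f}")
```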
Common Errors and How to Fix Them
Error 1: Score threshold kills all results
If you set score_threshold=0.70 and get empty results consistently, your corpus vocabulary or domain is diverging from what the embedding model was trained on. Technical documentation, legal text, or proprietary jargon will score lower. Fix: Drop the threshold to 0.55 and log the actual scores during development. Set your threshold at the 20th percentile of “good” results, not a fixed number.
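One way to implement that 20th-percentile rule during development, assuming you’ve logged the similarity scores of retrievals you judged good (the scores below are made-up examples):

```python
def threshold_from_good_scores(scores: list[float], percentile: float = 0.20) -> float:
    """Derive a score threshold from logged scores of known-good retrievals,
    instead of hardcoding a number like 0.70."""
    ranked = sorted(scores)
    # Nearest-rank index for the requested percentile
    idx = max(0, int(len(ranked) * percentile) - 1)
    return ranked[idx]

# Illustrative scores logged during development
good_scores = [0.52, 0.55, 0.58, 0.61, 0.63, 0.66, 0.70, 0.74, 0.78, 0.81]
print(threshold_from_good_scores(good_scores))  # 0.55
```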
Error 2: Qdrant collection vector size mismatch
You switch from text-embedding-3-small (1536 dims) to text-embedding-3-large (3072 dims) without recreating the collection. Qdrant will throw a dimension mismatch error. Fix: Always call recreate_collection when changing models, and store the model name in your collection metadata so you don’t forget which model was used.
Error 3: Chunking destroys sentence boundaries
Token-based chunking will split mid-sentence when a chunk boundary falls inside a sentence. This produces fragments that embed poorly and confuse Claude. Fix: Use spacy or a simple sentence splitter to snap chunk boundaries to the nearest sentence end. It adds ~10ms of processing per document but meaningfully improves retrieval quality.
import re
def snap_to_sentence_boundary(text: str) -> str:
    """Trim text to end at the last complete sentence."""
    # Simple heuristic — works for most English prose
    match = re.search(r'[.!?][^.!?]*$', text)
    if match:
        return text[:match.start() + 1]
    return text
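In practice you’d run the snapper over each chunk’s text right after decoding. A quick demonstration with a made-up chunk that ends mid-sentence (the helper is repeated here so the snippet runs standalone):

```python
import re

def snap_to_sentence_boundary(text: str) -> str:
    """Trim text to end at the last complete sentence (same helper as above)."""
    match = re.search(r'[.!?][^.!?]*$', text)
    if match:
        return text[:match.start() + 1]
    return text

# A chunk boundary fell mid-sentence; the fragment after the last "." is dropped
chunk = "Refunds are issued within 14 days. Digital products are non-refundable once down"
print(snap_to_sentence_boundary(chunk))
# -> "Refunds are issued within 14 days."
```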
What to Build Next
Add hybrid search. Pure semantic search misses exact matches for proper nouns, SKU codes, and version numbers like “v2.3.1” — things that are unique identifiers rather than concepts. The fix is hybrid search: run both a BM25 keyword search and your vector search in parallel, then merge results using Reciprocal Rank Fusion (RRF). Qdrant has native sparse vector support for this in recent versions. It’s the single biggest quality improvement you can make after getting the baseline pipeline working, and it’s the retrieval architecture most production RAG systems use.
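RRF itself is only a few lines. This sketch merges two ranked ID lists — one from a BM25 keyword search, one from your vector search — using the common default constant k=60:

```python
def rrf_merge(keyword_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) for each doc across both rankings."""
    scores: dict[str, float] = {}
    for ranking in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# "b" (ranked #2 and #1) edges out "a" (ranked #1 and #3):
print(rrf_merge(["a", "b", "c"], ["b", "d", "a"]))  # ['b', 'a', 'd', 'c']
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.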
If you’re scaling this to thousands of documents and want to understand how the full ingestion pipeline fits together — including PDF parsing, metadata extraction, and incremental updates — the end-to-end RAG pipeline guide covers that in full. And if you need the agent itself to have memory across sessions beyond what you can fit in context, persistent memory architecture for Claude agents is the natural next layer to add.
Frequently Asked Questions
What’s the difference between semantic search and keyword search for a knowledge base?
Keyword search matches on exact or stemmed word overlap — it finds “refund” if you searched “refund” but misses “money back” or “return payment.” Semantic search converts text to vector embeddings and matches by meaning in high-dimensional space, so “can I get a refund?” and “return policy” map to nearby vectors even though they share no words. For agent knowledge bases, semantic search dramatically reduces missed retrievals caused by phrasing variation.
Which vector database should I use for a production semantic search setup?
Qdrant is my default recommendation for self-hosted deployments — it’s fast, has a clean API, and runs on minimal infrastructure. Pinecone is the lowest-friction managed option if you don’t want to operate infrastructure. Weaviate makes sense if you need built-in hybrid search and GraphQL querying. For a detailed cost and feature comparison, see the Pinecone vs Qdrant vs Weaviate breakdown on this site.
How many chunks should I retrieve and pass to Claude?
Start with top-5 and measure answer quality. More chunks give Claude more material to work with but dilute signal with lower-relevance content — and above ~10 chunks you’re burning tokens without much benefit. If your chunks are small (under 200 tokens), you can push to 8-10. Always set a score threshold to filter out low-relevance chunks; injecting weak matches actively hurts answer quality.
Can I use a local embedding model instead of OpenAI’s API?
Yes. Models like nomic-embed-text via Ollama or bge-small-en-v1.5 via sentence-transformers run locally at zero per-call cost. The tradeoff is quality: OpenAI’s text-embedding-3-small consistently outperforms open-source models on general retrieval benchmarks by 5-15 points. For sensitive data or high-volume indexing where API costs add up, local models are a reasonable choice — just benchmark on your actual corpus before committing.
How do I update the knowledge base when documents change?
Store each chunk with a doc_id and a content hash in the Qdrant payload. On update, re-chunk and re-embed only the changed document, then delete old points by filtering on doc_id and upsert the new ones. Qdrant’s delete method accepts payload filters, making this a clean two-step operation. Avoid full re-indexing unless you’re changing embedding models — it’s slow and unnecessary.
Put this into practice
Try the Search Specialist agent — ready to use, no setup required.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

