By the end of this tutorial, you’ll have a custom embedding model fine-tuned on your domain’s vocabulary and concepts, integrated into a semantic search pipeline that outperforms text-embedding-ada-002 on your specific data. Domain-specific embeddings training is the difference between a RAG system that returns vaguely related chunks and one that actually understands that “MI” means myocardial infarction in a cardiology context — not Michigan.
Generic embeddings are trained on the internet. Your documents aren’t the internet. If you’re building an agent that searches legal contracts, medical literature, financial filings, or any other specialized corpus, you’re leaving significant retrieval accuracy on the table by using text-embedding-3-small out of the box. The fix takes less than 24 hours and doesn’t require a PhD.
Here’s the full process:
- Install dependencies — Set up sentence-transformers, datasets, and training utilities
- Prepare your training data — Generate contrastive pairs from your domain corpus
- Configure the base model — Choose and load the right pre-trained checkpoint
- Run fine-tuning with MultipleNegativesRankingLoss — Train using contrastive learning
- Evaluate against the baseline — Measure recall improvement on your held-out set
- Integrate into your vector search pipeline — Swap in the custom model and re-index
Why Generic Embeddings Fail on Specialist Domains
OpenAI’s and Cohere’s embedding models are excellent general-purpose tools. But “general purpose” means the training data distribution reflects the open web — Wikipedia, Common Crawl, StackOverflow. If your documents use terminology that’s rare in that distribution (and most professional domains do), the embedding space doesn’t separate your concepts cleanly.
A concrete example: in a cybersecurity corpus, “lateral movement” is a specific attack technique. In a generic embedding space, it clusters near “horizontal movement” and “side-stepping” — which is semantically plausible but retrieval-useless. After fine-tuning on security incident reports, lateral movement clusters tightly with “pass-the-hash”, “credential harvesting”, and other attack vectors where it actually belongs.
If you’ve already built a RAG pipeline and are hitting accuracy issues, this tutorial is a direct fix. If you haven’t built the pipeline yet, building a RAG pipeline from scratch is a good starting point before you come back here to tune the retrieval layer.
Step 1: Install Dependencies
```bash
pip install sentence-transformers==3.0.1 datasets==2.20.0 \
    accelerate==0.33.0 torch==2.3.1 faiss-cpu==1.8.0 \
    huggingface-hub==0.24.0 evaluate==0.4.2
```
Pin these versions. sentence-transformers 3.x changed the training API substantially from 2.x, and if you’re pulling from a requirements file that worked six months ago you’ll hit cryptic errors. The faiss-cpu package is for evaluation; swap for faiss-gpu if you have a CUDA box.
Step 2: Prepare Your Training Data
This is where most tutorials wave their hands. You need contrastive pairs: (anchor, positive) where anchor and positive are semantically equivalent or closely related in your domain. The model learns to pull these together in embedding space while pushing apart random negatives sampled from the same batch.
Three reliable strategies for generating pairs without manual labeling:
- Adjacent sentences — consecutive sentences in a document are highly likely to be topically related
- Question-chunk pairs — generate questions from your chunks using an LLM, then pair question with source chunk
- Title-body pairs — document title or section heading paired with its body text
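Before spending on LLM-generated questions, the first and third strategies can be sketched in a few lines of plain Python. The period-based sentence split and the `heading`/`body` dict shape below are illustrative assumptions, not a prescribed format; swap in nltk or spaCy sentence splitting for a real corpus:

```python
def adjacent_sentence_pairs(document: str) -> list[tuple[str, str]]:
    """Pair each sentence with the one that follows it."""
    # naive period split; use a real sentence splitter for production corpora
    sentences = [s.strip() for s in document.split(".") if len(s.strip()) > 30]
    return [(sentences[i], sentences[i + 1]) for i in range(len(sentences) - 1)]

def title_body_pairs(sections: list[dict]) -> list[tuple[str, str]]:
    """Pair each section heading with its body text."""
    return [(s["heading"], s["body"]) for s in sections if len(s["body"]) > 30]

doc = (
    "Lateral movement is a post-exploitation technique used to pivot between hosts. "
    "Attackers commonly rely on pass-the-hash or harvested credentials to move laterally. "
    "Detection hinges on correlating authentication events across machines."
)
pairs = adjacent_sentence_pairs(doc)
print(len(pairs))  # 2 pairs from 3 sentences
```

In practice you'd run these over every document in your corpus and concatenate the results with any LLM-generated pairs.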
The LLM-generated question approach (sometimes called “InPars” or “GPL”) is the highest quality option. It costs roughly $0.50–$2 per 1,000 pairs using Claude Haiku, depending on chunk size. For 10,000 training pairs — a solid starting point — budget $5–$20.
```python
import anthropic
import json
from pathlib import Path

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

def generate_questions_for_chunk(chunk: str, n: int = 3) -> list[str]:
    """Generate n questions that this chunk would answer."""
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                f"Generate {n} specific questions that the following text answers. "
                f"Return only a JSON array of strings. No explanation.\n\n"
                f"Text: {chunk}"
            )
        }]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return []  # handle gracefully — LLMs occasionally return malformed JSON

def build_training_pairs(chunks: list[str]) -> list[tuple[str, str]]:
    pairs = []
    for chunk in chunks:
        questions = generate_questions_for_chunk(chunk)
        for q in questions:
            pairs.append((q, chunk))  # (anchor=question, positive=chunk)
    return pairs

# Load your domain documents here
chunks = Path("domain_chunks.txt").read_text().split("\n---\n")
training_pairs = build_training_pairs(chunks[:3000])  # ~9k pairs from 3k chunks
```
Step 3: Configure the Base Model
You’re not training from scratch — you’re fine-tuning a pre-trained checkpoint. This is the 24-hour part; from-scratch training takes weeks and a cluster. The base model you choose matters:
- BAAI/bge-base-en-v1.5 — my default pick. 768-dim, 110M params, strong baseline performance, Apache 2.0 license
- sentence-transformers/all-MiniLM-L6-v2 — 384-dim, 22M params, faster and cheaper to serve, slightly lower ceiling
- intfloat/e5-base-v2 — competitive alternative, but requires "query: " / "passage: " prefixes, which adds complexity
I’d pick bge-base-en-v1.5 for most production use cases unless you’re serving at high volume and latency is your primary constraint, in which case drop to MiniLM-L6.
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss
from datasets import Dataset

# Load base model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Convert pairs to HuggingFace Dataset format
train_data = Dataset.from_dict({
    "anchor": [p[0] for p in training_pairs],
    "positive": [p[1] for p in training_pairs],
})

# Hold out 10% for evaluation
split = train_data.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]
```
Step 4: Run Fine-Tuning with MultipleNegativesRankingLoss
MultipleNegativesRankingLoss (MNRL) is the right loss function here. It treats every other item in the batch as a negative example, which means you get batch_size - 1 negatives for free per pair. This is why your batch size matters more than usual — 64 or 128 beats 16 significantly.
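To make the mechanics concrete, here is the MNRL computation sketched by hand in NumPy for a toy batch of three pairs: a scaled similarity matrix fed into cross-entropy, where the correct "class" for row i is column i (its own positive) and every other column is a free in-batch negative. The vectors are made up; the scale factor of 20 matches the sentence-transformers default for this loss.

```python
import numpy as np

# Toy batch: 3 (anchor, positive) pairs, embeddings already L2-normalized
anchors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
positives = np.array([[0.98, 0.2], [0.2, 0.98], [0.7, 0.714]])

scale = 20.0                              # sentence-transformers default for MNRL
logits = scale * (anchors @ positives.T)  # (3, 3): each anchor vs every positive

# Cross-entropy with target class i for row i: the diagonal holds the true
# positives, the off-diagonal entries are the "free" in-batch negatives
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.diag(log_probs).mean()
print(round(float(loss), 4))
```

With a real batch of 64, each row would have 63 in-batch negatives instead of 2, which is exactly why the larger batch size helps.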
```python
from sentence_transformers import SentenceTransformerTrainer

# Loss function — works with (anchor, positive) pairs
loss = MultipleNegativesRankingLoss(model=model)

# Training configuration
args = SentenceTransformerTrainingArguments(
    output_dir="models/domain-embeddings-v1",
    num_train_epochs=3,
    per_device_train_batch_size=64,  # critical for MNRL — bigger = more negatives
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    eval_strategy="steps",  # named "evaluation_strategy" in transformers < 4.41
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    logging_steps=50,
    fp16=True,  # remove if running on CPU or MPS
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()
model.save_pretrained("models/domain-embeddings-v1")
```
On a single A100 with 9,000 training pairs and 3 epochs, this runs in about 45 minutes. On a T4 (Colab free tier or ~$0.35/hr on Lambda Labs), expect 3–4 hours. On CPU only, it’s going to take the full 24 hours — which is why renting a GPU box for one run is usually worth $5–$15.
Step 5: Evaluate Against the Baseline
Don’t ship without measuring. Build a held-out evaluation set of 100–200 queries with known-relevant documents. Calculate Recall@k for k=1, 3, 5, 10. That’s what matters for RAG — does the right chunk appear in the top-k retrieved results?
```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

def recall_at_k(model: SentenceTransformer, queries: list[str],
                corpus: list[str], relevant_ids: list[int], k: int = 5) -> float:
    """Compute Recall@k. relevant_ids[i] is the index of the relevant doc for query i."""
    corpus_embeddings = model.encode(corpus, normalize_embeddings=True)
    query_embeddings = model.encode(queries, normalize_embeddings=True)

    # Build FAISS index
    index = faiss.IndexFlatIP(corpus_embeddings.shape[1])  # inner product = cosine on normalized vecs
    index.add(corpus_embeddings.astype(np.float32))

    _, indices = index.search(query_embeddings.astype(np.float32), k)
    hits = sum(
        1 for i, relevant_id in enumerate(relevant_ids)
        if relevant_id in indices[i]
    )
    return hits / len(queries)

# Compare baseline vs fine-tuned
baseline = SentenceTransformer("BAAI/bge-base-en-v1.5")
finetuned = SentenceTransformer("models/domain-embeddings-v1")

baseline_r5 = recall_at_k(baseline, eval_queries, corpus, relevant_ids, k=5)
finetuned_r5 = recall_at_k(finetuned, eval_queries, corpus, relevant_ids, k=5)

print(f"Baseline Recall@5: {baseline_r5:.3f}")
print(f"Fine-tuned Recall@5: {finetuned_r5:.3f}")
print(f"Improvement: {(finetuned_r5 - baseline_r5) / baseline_r5 * 100:.1f}%")
```
In practice, I’ve seen 15–35% Recall@5 improvements on specialist domains after fine-tuning. Legal and medical corpora tend to show the largest gains because their terminology is most underrepresented in generic training data. This directly reduces hallucinations downstream — your Claude agent isn’t fabricating answers because it’s getting the right context. If you’re hitting hallucination problems, this is often the root cause, and grounding strategies at the retrieval layer are more effective than prompt engineering alone.
Step 6: Integrate Into Your Vector Search Pipeline
Swapping the model into an existing pipeline takes two changes: update the embedding function and re-index your corpus.
```python
from sentence_transformers import SentenceTransformer
import qdrant_client
from qdrant_client.models import PointStruct, VectorParams, Distance

# Load your fine-tuned model
embedder = SentenceTransformer("models/domain-embeddings-v1")

# Connect to your vector store (Qdrant shown here; same pattern for Pinecone/Weaviate)
client = qdrant_client.QdrantClient(host="localhost", port=6333)

# Recreate collection with correct dimensions
COLLECTION_NAME = "domain_docs_v2"
VECTOR_DIM = embedder.get_sentence_embedding_dimension()  # 768 for bge-base

client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=VECTOR_DIM, distance=Distance.COSINE),
)

# Re-embed and index your corpus
def index_documents(docs: list[dict], batch_size: int = 256):
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        texts = [d["text"] for d in batch]
        embeddings = embedder.encode(texts, normalize_embeddings=True)
        points = [
            PointStruct(
                id=i + j,
                vector=embeddings[j].tolist(),
                payload={"text": batch[j]["text"], "source": batch[j]["source"]}
            )
            for j in range(len(batch))
        ]
        client.upsert(collection_name=COLLECTION_NAME, points=points)

# your_docs: a list of {"text": ..., "source": ...} dicts from your corpus
index_documents(your_docs)
```
The re-indexing step is often overlooked. Your new model produces different embedding vectors — your existing index built with the generic model is incompatible. Full re-index is mandatory. For large corpora (1M+ chunks), you’ll want to do this in parallel with your existing index serving traffic. The semantic search implementation guide covers zero-downtime re-indexing strategies in detail.
Common Errors
Training loss doesn’t decrease after epoch 1
Usually a batch size problem. If your batch is too small (under 32), MNRL doesn’t have enough in-batch negatives and the loss landscape is noisy. Increase batch size first. If you’re memory-constrained, enable gradient checkpointing: add gradient_checkpointing=True to your training args.
Fine-tuned model performs worse than baseline on evaluation
Your training pairs have noise or near-duplicates. Common cause: adjacent-sentence pairs where the sentences are from a table of contents or boilerplate headers. Filter out pairs where either the anchor or positive is under 30 characters. Also check that your eval set wasn’t contaminated — if any eval queries were used during training pair generation, your numbers are artificially inflated and masking the real problem.
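A minimal cleaning pass implementing that filter might look like this (the 30-character threshold and exact-duplicate check mirror the advice above; tune both for your corpus):

```python
def clean_pairs(pairs: list[tuple[str, str]], min_len: int = 30) -> list[tuple[str, str]]:
    """Drop short/boilerplate pairs and exact duplicates."""
    seen = set()
    cleaned = []
    for anchor, positive in pairs:
        if len(anchor.strip()) < min_len or len(positive.strip()) < min_len:
            continue  # likely headers, TOC entries, or boilerplate
        key = (anchor.strip().lower(), positive.strip().lower())
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        cleaned.append((anchor, positive))
    return cleaned

raw = [
    ("Table of Contents", "Chapter 1 ......"),  # boilerplate, gets dropped
    ("What attack technique does pass-the-hash support?",
     "Pass-the-hash enables lateral movement by reusing NTLM credential hashes."),
]
print(len(clean_pairs(raw)))  # 1
```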
CUDA out of memory during training
Reduce per_device_train_batch_size in steps of 16 until it fits. For A100 (40GB), 128 is usually fine. For T4 (16GB), you’re looking at 32–48 max with fp16=True. If you still OOM, switch to bf16=True on Ampere+ GPUs — it’s numerically more stable than fp16 for fine-tuning and has the same memory footprint. Also add dataloader_num_workers=4 to avoid GPU starvation.
What to Build Next
The natural extension is hard negative mining. Right now you’re using random in-batch negatives. Hard negatives are examples that are superficially similar but semantically different — the genuinely tricky cases your model needs to learn. Concretely: for each anchor, embed your full corpus, find the top-10 nearest neighbors that are NOT the true positive, and add those as explicit negatives in a TripletLoss or CachedMultipleNegativesRankingLoss setup. This typically pushes Recall@5 another 5–10 percentage points on specialist domains.
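The mining loop itself needs nothing beyond NumPy once you have embeddings from the stage-one model. In this sketch the corpus embeddings are random stand-ins for real encoded chunks; `positive_ids[i]` marks the true positive for anchor i, and its top-k other neighbors become hard negatives:

```python
import numpy as np

def mine_hard_negatives(anchor_emb: np.ndarray, corpus_emb: np.ndarray,
                        positive_ids: list[int], k: int = 10) -> list[list[int]]:
    """For each anchor, return the k nearest corpus ids that are NOT the true positive."""
    sims = anchor_emb @ corpus_emb.T      # cosine sim (embeddings pre-normalized)
    ranked = np.argsort(-sims, axis=1)    # nearest first
    hard = []
    for i, row in enumerate(ranked):
        negatives = [int(j) for j in row if j != positive_ids[i]][:k]
        hard.append(negatives)
    return hard

# Stand-in data: 100 normalized "corpus" vectors, 5 anchors near docs 0-4
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 32))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
anchors = corpus[:5] + 0.05 * rng.normal(size=(5, 32))
anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)

hard_negs = mine_hard_negatives(anchors, corpus, positive_ids=[0, 1, 2, 3, 4])
print(len(hard_negs[0]))  # 10 hard negatives per anchor
```

These id lists map back into your corpus to build (anchor, positive, hard_negative) triplets for the second training pass.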
You’d run the base model trained in this tutorial to mine the hard negatives, then train a second pass with those harder examples. Two-stage training like this is how the best production embedding models are built. Pair it with proper fallback logic — if you’re hitting edge cases where retrieval quality degrades unexpectedly, the graceful degradation patterns for LLM pipelines apply equally to embedding retrieval steps.
When to Use This vs. Just Paying for a Managed Embedding API
Use this approach if: you have a corpus with specialized terminology, your current Recall@5 is under 70%, you’re embedding more than 50M tokens/month (at which point self-hosting the model pays for itself vs. API costs), or you need the model on-prem for compliance reasons.
Stick with OpenAI or Cohere if: your domain is well-represented in general training data, you’re in early prototyping and want zero operational overhead, or you don’t have a team member who can maintain a self-hosted model over time. The managed APIs are genuinely good — this tutorial is for when they’re not good enough for your specific use case.
Solo founders on a tight timeline: start with bge-base-en-v1.5 and the question-generation approach for training pairs. You can have this running in a weekend. Teams with ML capacity: invest the extra day in hard negative mining — the Recall improvement compounds significantly when you have thousands of daily queries.
Domain-specific embeddings training is one of the highest-leverage improvements you can make to a RAG system once the basic pipeline is working. It’s underused because the tooling looked complex a year ago. With sentence-transformers 3.x, it’s genuinely approachable in under 24 hours.
Frequently Asked Questions
How much training data do I need for domain-specific embeddings training?
You can see meaningful improvement with as few as 1,000–2,000 contrastive pairs, but 5,000–15,000 is the practical sweet spot for most domains. Below 1,000, results are inconsistent. Above 50,000, you’re likely hitting diminishing returns unless you also add hard negative mining. Quality matters more than quantity — 3,000 clean question-chunk pairs beats 15,000 noisy adjacent-sentence pairs.
Can I fine-tune embedding models on a CPU, or do I need a GPU?
Technically yes, but it’s not practical for the 24-hour timeline. CPU training for 9,000 pairs at 3 epochs takes 18–24 hours on a modern laptop. A rented T4 GPU on Lambda Labs or Google Colab Pro costs $2–$5 for the same run and finishes in 3–4 hours. A100 instances cut that to under an hour. Rent a GPU for the training run — it’s a one-time cost per model version.
What’s the difference between fine-tuning an embedding model and just using a larger embedding model?
Scale and domain coverage are different problems. A larger model like text-embedding-3-large has better general coverage but still doesn’t understand your domain-specific terminology any better than a smaller general model. Fine-tuning reshapes the embedding space specifically around your domain concepts — you’re not just getting more capacity, you’re getting different geometry. In practice, a fine-tuned bge-base routinely outperforms a generic text-embedding-3-large on domain retrieval tasks.
Do I need to re-embed my entire corpus every time I fine-tune?
Yes, always. The fine-tuned model produces different vector representations — your existing index is built with a different model’s geometry and the vectors are incompatible. There’s no way around a full re-index. The practical approach is to keep your previous index live while the new one builds, then do an atomic swap at the application layer once the new index is validated. Expect re-indexing to take roughly 10–20 minutes per million chunks on a modern CPU, or 2–5 minutes on GPU.
Can I use this with OpenAI’s API or does it only work with self-hosted models?
Fine-tuning is only available for self-hosted open models via the sentence-transformers / Hugging Face ecosystem. OpenAI doesn’t offer embedding model fine-tuning through their API (as of mid-2025). If you need to stay on OpenAI’s API for other reasons, your alternative is improving retrieval through query expansion, hypothetical document embeddings (HyDE), or reranking rather than model fine-tuning.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

