If you’ve shipped a RAG pipeline and noticed your retrieval quality tanking on domain-specific queries — legal contracts, medical notes, internal product documentation — you already know the problem. General-purpose embeddings like text-embedding-ada-002 or all-MiniLM-L6-v2 were trained on the open web, not your corpus. Fine-tuning an embedding model on your own domain data is the fix, and HuggingFace’s tooling in 2024 has made it fast enough to go from zero to a deployable custom model in a single working day. This article walks through the exact pipeline: dataset creation, fine-tuning with Sentence Transformers v3, and evaluation — no hand-waving about steps that actually take a week.
Why General Embeddings Fail on Domain Data
The failure mode is specific and predictable. You embed a query like “indemnification clause limitation of liability” and retrieve a paragraph about general contract definitions instead of the actual indemnification section. The cosine similarity scores are close — maybe 0.71 vs 0.68 — so the model isn’t obviously broken, it’s just subtly wrong in ways that compound through your whole retrieval chain.
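To make that “subtly wrong” failure concrete, here is a minimal pure-Python sketch of cosine similarity ranking two passages against a query. The vectors are made up for illustration; real embeddings have hundreds of dimensions, but the mechanism is the same: a small similarity margin silently decides which passage ranks first.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" -- real models produce 384-1024 dims
query = [0.9, 0.3, 0.1]
definitions_para = [0.8, 0.4, 0.2]  # generic contract definitions
indemnification = [0.7, 0.2, 0.5]   # the section the user actually wants

wrong = cosine_similarity(query, definitions_para)
right = cosine_similarity(query, indemnification)
# The generic passage outranks the relevant one by a modest margin,
# and nothing in the pipeline flags that as an error.
print(f"definitions: {wrong:.2f}, indemnification: {right:.2f}")
```

Nothing downstream ever sees that the ranking was wrong; the scores look perfectly healthy.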
General embeddings have no representation of how concepts in your domain cluster together. A healthcare model doesn’t know that “MI” and “myocardial infarction” should be near-identical vectors. A codebase assistant doesn’t know that useState and “React state hook” are the same thing. You can try prompt engineering around this, but you’re fighting the embedding space itself.
The good news: you don’t need a dataset of 100k labeled pairs to fix this. HuggingFace’s current training stack — specifically Sentence Transformers v3 with the new training API — can produce meaningful improvements with as few as 1,000–5,000 training pairs, and generating those pairs from your existing documents is now a solved problem.
Dataset Creation: Synthetic Pairs from Your Own Corpus
This is where most teams get stuck because they assume they need human-annotated data. You don’t — not for a first-pass model. The approach that works in practice is synthetic pair generation using an LLM to create (query, passage) pairs from your documents.
Generating Training Pairs with GPT-4o-mini or Claude Haiku
The pattern is simple: chunk your documents, then prompt a cheap LLM to generate a realistic query that the chunk would answer. At roughly $0.00015 per 1K input tokens for GPT-4o-mini, generating 5,000 pairs from typical 512-token chunks costs around $0.40 in API fees. Claude Haiku is comparable.
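The cost estimate is simple arithmetic, worth sanity-checking before you kick off a large run (the rate below is the one quoted above; verify current pricing before relying on it):

```python
# Back-of-envelope cost for synthetic pair generation with gpt-4o-mini
pairs = 5_000
tokens_per_chunk = 512          # input tokens per passage
price_per_1k_input = 0.00015    # $ per 1K input tokens, as quoted; verify current pricing

input_tokens = pairs * tokens_per_chunk
cost = input_tokens / 1_000 * price_per_1k_input
print(f"~${cost:.2f} in input-token fees")  # output tokens add a little on top
```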
```python
from openai import OpenAI

client = OpenAI()

def generate_query_for_passage(passage: str) -> str:
    """Generate a realistic retrieval query for a given passage."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You generate realistic search queries that a user would type "
                    "to find the given passage. Return only the query, no explanation."
                ),
            },
            {
                "role": "user",
                "content": f"Passage:\n{passage}\n\nGenerate one search query:",
            },
        ],
        temperature=0.7,
        max_tokens=80,
    )
    return response.choices[0].message.content.strip()

# Build your dataset
def build_training_dataset(passages: list[str]) -> list[dict]:
    dataset = []
    for passage in passages:
        query = generate_query_for_passage(passage)
        dataset.append({
            "query": query,
            "positive": passage,  # the passage that answers the query
        })
    return dataset
```
One thing the documentation glosses over: you need hard negatives, not just positives, or your model will learn almost nothing useful. Hard negatives are passages that look relevant but aren’t the correct answer. The easiest way to generate them is to use your existing base embedding model to retrieve the top-ranked passages for each query (the code below takes the top 10), then drop the true positive — what’s left are your hard negatives.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

def add_hard_negatives(
    dataset: list[dict],
    all_passages: list[str],
    base_model_name: str = "all-MiniLM-L6-v2",
    num_negatives: int = 3,
) -> list[dict]:
    model = SentenceTransformer(base_model_name)
    # Embed all passages once — don't do this inside the loop
    passage_embeddings = model.encode(
        all_passages, batch_size=64, normalize_embeddings=True, show_progress_bar=True
    )
    # Embed all queries in one batch as well
    query_embeddings = model.encode(
        [item["query"] for item in dataset], batch_size=64, normalize_embeddings=True
    )
    for item, query_embedding in zip(dataset, query_embeddings):
        # With normalized embeddings, the dot product is cosine similarity
        scores = passage_embeddings @ query_embedding
        # Get the top-10 passages, exclude the true positive
        top_indices = np.argsort(scores)[::-1][:10]
        negatives = [
            all_passages[i] for i in top_indices
            if all_passages[i] != item["positive"]
        ][:num_negatives]
        item["negatives"] = negatives
    return dataset
```
Fine-Tuning with Sentence Transformers v3
Sentence Transformers v3 shipped a redesigned training API in early 2024 that’s significantly cleaner than the old model.fit() approach. The new SentenceTransformerTrainer integrates with HuggingFace’s Trainer under the hood, which means you get proper evaluation callbacks, gradient checkpointing, and checkpoint saving without writing boilerplate.
Setting Up the Training Pipeline
```python
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
from datasets import Dataset

# Load your base model — BAAI/bge-base-en-v1.5 is a strong starting point
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Convert your dataset to HuggingFace Dataset format
def prepare_hf_dataset(dataset: list[dict]) -> Dataset:
    rows = []
    for item in dataset:
        for neg in item.get("negatives", []):
            rows.append({
                "anchor": item["query"],
                "positive": item["positive"],
                "negative": neg,
            })
    return Dataset.from_list(rows)

# training_data: the output of build_training_dataset + add_hard_negatives
train_dataset = prepare_hf_dataset(training_data)

# MultipleNegativesRankingLoss works well for retrieval tasks:
# it treats other items in the batch as implicit negatives too
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="./custom-embedding-model",
    num_train_epochs=3,
    per_device_train_batch_size=32,  # bigger batches = more implicit negatives
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    bf16=True,  # use bf16 on Ampere GPUs or newer
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # keeps duplicate texts out of a batch, so a positive never doubles as a negative
    save_steps=100,
    logging_steps=20,
    # To evaluate during training, set eval_strategy="steps" and eval_steps here,
    # and pass an eval_dataset (or evaluator) to the trainer below.
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

model.save_pretrained("./custom-embedding-model")
```
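To build intuition for what MultipleNegativesRankingLoss actually optimizes, here is a small NumPy sketch of the in-batch objective — my own illustration, not the library’s implementation. Each anchor is scored against every positive in the batch; a softmax cross-entropy rewards the matching pair (the diagonal) and penalizes everything else, which is why larger batches mean more implicit negatives:

```python
import numpy as np

def mnrl_sketch(anchor_embs: np.ndarray, positive_embs: np.ndarray, scale: float = 20.0) -> float:
    """In-batch softmax cross-entropy: row i's true positive is column i;
    every other column in the row acts as an implicit negative."""
    # Normalize so the dot product is cosine similarity
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    p = positive_embs / np.linalg.norm(positive_embs, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # (batch, batch) similarity matrix
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The "label" for row i is column i (the true pair)
    idx = np.arange(len(scores))
    return float(-log_softmax[idx, idx].mean())

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 16))
# Loss is near zero when each anchor matches its own positive exactly...
aligned = mnrl_sketch(batch, batch)
# ...and much higher for random, unrelated pairs
random_pairs = mnrl_sketch(batch, rng.normal(size=(8, 16)))
print(aligned, random_pairs)
```

The `scale` of 20 matches the library’s default temperature; the real loss also folds in your explicit hard negatives as extra columns.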
On a single A100 (available on Lambda Labs at roughly $1.10/hr), 5,000 training pairs for 3 epochs takes about 25 minutes. On a T4 (Colab Pro or cheaper cloud), expect 60–90 minutes. The full fine-tune including dataset generation will cost you $3–8 in compute and API fees if you’re careful about batching.
Base Model Selection Matters More Than You Think
Don’t start from all-MiniLM-L6-v2 unless you’re severely constrained on inference latency. It’s fast (384 dimensions, ~22M params) but you’re leaving a lot of performance on the table. My current recommendation for most production use cases:
- BAAI/bge-base-en-v1.5 — 768 dimensions, strong baseline, good fine-tuning behavior. Start here.
- BAAI/bge-small-en-v1.5 — if you need faster inference and 384 dims is acceptable.
- nomic-ai/nomic-embed-text-v1.5 — 768 dims, supports Matryoshka representation (you can truncate to 256 dims at query time without retraining). Genuinely useful if your index is large.
- intfloat/e5-large-v2 — better out-of-the-box quality but slower to fine-tune and heavier at inference. Worth it for offline batch workloads.
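The Matryoshka trick mentioned for nomic-embed-text-v1.5 is just truncate-then-renormalize. A sketch with a stand-in vector (real usage would truncate the model’s actual output, and only works well if the model was trained with Matryoshka loss):

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length,
    so cosine similarity still behaves sensibly in the smaller space."""
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)

full = np.random.default_rng(0).normal(size=768)  # stand-in for a 768-dim embedding
small = truncate_matryoshka(full, dims=256)
print(small.shape, float(np.linalg.norm(small)))  # (256,) and unit norm
```

A 256-dim index is a third the memory of a 768-dim one, which is the whole appeal for large corpora.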
Don’t plan on fine-tuning OpenAI’s embedding models — you can’t, they’re closed. If you’re using text-embedding-3-small and it’s not working for your domain, your only real options are query rewriting, adding a reranker, or switching to an open model.
Evaluation That Actually Tells You Something
Training loss going down doesn’t mean your retrieval is improving. You need retrieval-specific metrics: NDCG@10, MRR@10, and Recall@k against a held-out evaluation set. Sentence Transformers provides this via InformationRetrievalEvaluator.
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Build a small held-out eval set — 200-500 query/passage pairs is enough
# eval_data: held-out list of {"query": ..., "positive": ...} dicts
queries = {str(i): item["query"] for i, item in enumerate(eval_data)}
corpus = {str(i): item["positive"] for i, item in enumerate(eval_data)}
relevant_docs = {str(i): {str(i)} for i in range(len(eval_data))}  # each query maps to its passage

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="domain-eval",  # scoring defaults to cosine similarity
)

# Run against both the base model and the fine-tuned model to compare
base_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
fine_tuned = SentenceTransformer("./custom-embedding-model")
print("Base model:", evaluator(base_model))
print("Fine-tuned:", evaluator(fine_tuned))
```
In real domain fine-tuning projects I’ve run, NDCG@10 improvements of 8–20 percentage points are typical when the domain is genuinely specialized. If you’re seeing less than 5pp improvement, your training data probably doesn’t reflect real user queries — go back and make the synthetic queries more realistic, or collect some actual search logs.
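If you want to sanity-check the evaluator’s numbers, MRR@k and Recall@k are simple enough to compute by hand. A pure-Python sketch over ranked document-ID lists (hypothetical helpers for illustration, not part of sentence-transformers):

```python
def mrr_at_k(ranked_ids: list[list[str]], relevant: list[str], k: int = 10) -> float:
    """Mean reciprocal rank: 1/rank of the first relevant hit, 0 if absent from top-k."""
    total = 0.0
    for ranking, rel in zip(ranked_ids, relevant):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id == rel:
                total += 1.0 / rank
                break
    return total / len(ranked_ids)

def recall_at_k(ranked_ids: list[list[str]], relevant: list[str], k: int = 10) -> float:
    """Fraction of queries whose relevant doc appears in the top-k."""
    hits = sum(1 for ranking, rel in zip(ranked_ids, relevant) if rel in ranking[:k])
    return hits / len(ranked_ids)

# Two toy queries: one hit at rank 1, one at rank 2
rankings = [["d1", "d2", "d3"], ["d9", "d7", "d8"]]
gold = ["d1", "d7"]
print(mrr_at_k(rankings, gold))     # (1/1 + 1/2) / 2 = 0.75
print(recall_at_k(rankings, gold))  # 2/2 = 1.0
```

NDCG@10 adds a log-discounted gain on top of the same ranked lists, but for the single-relevant-doc setup here, MRR and Recall already tell most of the story.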
Deployment and What Breaks in Production
Once you’ve got a model you’re happy with, push it to the HuggingFace Hub (private repo is fine) and serve it. Two practical options:
- HuggingFace Inference Endpoints — easiest path, ~$0.06/hr for a CPU endpoint on a small model, auto-scales to zero. Works for low-to-medium throughput.
- Self-hosted with FastAPI + sentence-transformers — more control, better cost at scale. A single A10G can push on the order of a thousand embeddings per second for a bge-base-sized model, depending on batch size and sequence length.
The thing that actually breaks in production: embedding dimension mismatches when you swap out your model but forget to rebuild your vector index. If you’re using Pinecone, Weaviate, or pgvector, you need to re-index your entire corpus when you upgrade your embedding model. Build this into your deployment checklist — it sounds obvious but it catches teams off guard every time.
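A cheap guard against that dimension-mismatch failure is a startup check that compares the model’s output dimension against what your index expects. A minimal sketch — the `index_dim` value would come from your vector store’s config, and `model` is anything exposing SentenceTransformer’s `get_sentence_embedding_dimension()`:

```python
def check_embedding_dim(model, index_dim: int) -> None:
    """Fail fast at startup instead of silently writing mismatched vectors.

    `model` is duck-typed: anything with get_sentence_embedding_dimension(),
    e.g. a SentenceTransformer instance.
    """
    model_dim = model.get_sentence_embedding_dimension()
    if model_dim != index_dim:
        raise ValueError(
            f"Model produces {model_dim}-dim vectors but the index expects "
            f"{index_dim}-dim. Re-index the corpus before deploying this model."
        )
```

Wire this into your service’s startup path so a model swap without a re-index fails loudly instead of degrading retrieval silently.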
Also watch for query-document length asymmetry. Models like BGE are trained with short queries against longer passages. If you embed long queries (e.g., a full paragraph) against short passages, you’ll get degraded similarity scores. Keep your query embedding inputs short and document-like, or use an asymmetric model explicitly designed for this.
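A related, easy-to-miss detail: the BGE v1.5 models expect an instruction prefix on queries for retrieval, while passages are embedded as-is. A small wrapper keeps the convention in one place (the prefix string below is BGE’s documented one at time of writing; verify it for the exact model you deploy):

```python
# BGE v1.5 retrieval convention: instruct-prefix the query, embed passages bare
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def format_query(query: str) -> str:
    """Prepend the BGE query instruction; keep the query itself short."""
    return BGE_QUERY_PREFIX + query.strip()

def format_passage(passage: str) -> str:
    """Passages go in unmodified for BGE v1.5."""
    return passage

print(format_query("indemnification clause limitation of liability"))
```

If you fine-tune with prefixed queries, keep prefixing at inference time too — mixing the two conventions quietly degrades scores.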
When to Do This vs. When Not To
Do custom embedding model training if:
- Your domain has specialized vocabulary that general models butcher (legal, medical, scientific, code)
- You have at least a few hundred domain documents to generate training pairs from
- You’re running retrieval at scale where a 10% accuracy improvement has real business impact
- You can tolerate a one-time 3–8 hour setup investment
Skip it and stick with general embeddings if:
- Your retrieval corpus is small (<500 documents) — sparse retrieval like BM25 often wins here
- You’re still experimenting with your product and the retrieval schema changes weekly
- Your domain language is close enough to standard English that general models already perform well
For solo founders building an initial RAG product: start with BAAI/bge-base-en-v1.5 out of the box. Run it for a month, collect real user queries that failed, then use those as the foundation for your fine-tuning dataset. You’ll get far better training signal from 500 real failure cases than from 5,000 synthetic pairs generated blind.
For teams with an established product and measurable retrieval metrics: run the synthetic pair generation pipeline now, build a baseline eval set, and treat embedding model training as a recurring improvement cycle rather than a one-time project. The tooling is mature enough that each iteration should take a day, not a sprint.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

