Overview
This chapter explores methods for both training and fine-tuning embedding models. Text embedding models convert text into dense vector representations that capture semantic meaning, enabling similarity search, clustering, and retrieval tasks.
Use a GPU for training embedding models. In Google Colab, select Runtime > Change runtime type > Hardware accelerator and choose a T4 GPU.
When to Train Embedding Models
Consider training or fine-tuning embedding models when:
- You need domain-specific embeddings (medical, legal, technical documentation)
- Existing models don’t capture your data’s nuances
- You want to optimize for specific similarity tasks
- You have labeled pairs or triplets of similar/dissimilar texts
Dataset Preparation
We’ll use the MNLI (Multi-Genre Natural Language Inference) dataset from GLUE, which contains premise-hypothesis pairs with entailment labels.
from datasets import load_dataset
# Load MNLI dataset from GLUE
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")
Example data point:
{
    'premise': 'One of our number will carry out your instructions minutely.',
    'hypothesis': 'A member of my team will execute your orders with immense precision.',
    'label': 0  # entailment
}
Training from Scratch
Model Initialization
Start with a base BERT model without sentence-transformers weights:
from sentence_transformers import SentenceTransformer
# Use a base model - creates mean pooling automatically
embedding_model = SentenceTransformer('bert-base-uncased')
Loss Functions
The choice of loss function depends on your data format and task requirements.
Softmax Loss
Best for classification-style training with labeled categories:
from sentence_transformers import losses
# Define the loss function with number of labels
train_loss = losses.SoftmaxLoss(
    model=embedding_model,
    sentence_embedding_dimension=embedding_model.get_sentence_embedding_dimension(),
    num_labels=3
)
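Under the hood, SoftmaxLoss concatenates the two sentence embeddings u and v together with their element-wise difference |u − v| and feeds the result to a linear classifier over the label set. A minimal sketch of that feature construction (the function name is illustrative, not part of the library API):

```python
def softmax_loss_features(u, v):
    # SoftmaxLoss default features: concat(u, v, |u - v|)
    # -> input to a linear classifier over num_labels classes
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]

u = [0.1, 0.9, -0.3]
v = [0.2, 0.7, 0.1]
features = softmax_loss_features(u, v)
# the classifier input is 3x the embedding dimension
print(len(features))  # 9
```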
Cosine Similarity Loss
For training with similarity scores between sentence pairs:
from datasets import Dataset
# Remap labels: (neutral/contradiction)=0, (entailment)=1
mapping = {2: 0, 1: 0, 0: 1}
train_dataset = Dataset.from_dict({
    "sentence1": train_dataset["premise"],
    "sentence2": train_dataset["hypothesis"],
    "label": [float(mapping[label]) for label in train_dataset["label"]]
})
# Loss function
train_loss = losses.CosineSimilarityLoss(model=embedding_model)
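CosineSimilarityLoss computes the cosine similarity between each pair of embeddings and minimizes the squared error against the gold score (here 0.0 or 1.0 after remapping). In plain Python, for a single pair:

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(u, v, label):
    # squared error between predicted similarity and the gold label
    return (cosine_sim(u, v) - label) ** 2

# identical directions with label 1.0 -> zero loss
print(cosine_similarity_loss([1.0, 0.0], [2.0, 0.0], 1.0))  # 0.0
```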
Training results:
Training loss progression: 0.232 → 0.169 → 0.142
Spearman correlation: 0.73
Multiple Negatives Ranking Loss
Most effective for retrieval tasks with anchor-positive-negative triplets:
import random
from datasets import load_dataset, Dataset

# Reload MNLI and keep only entailment pairs (label == 0)
mnli = load_dataset("glue", "mnli", split="train").select(range(50_000))
mnli = mnli.remove_columns("idx")
mnli = mnli.filter(lambda x: x["label"] == 0)

# Prepare data with soft negatives: randomly shuffled hypotheses from other pairs
train_dataset = {"anchor": [], "positive": [], "negative": []}
soft_negatives = mnli["hypothesis"]
random.shuffle(soft_negatives)
for row, soft_negative in zip(mnli, soft_negatives):
    train_dataset["anchor"].append(row["premise"])
    train_dataset["positive"].append(row["hypothesis"])
    train_dataset["negative"].append(soft_negative)
train_dataset = Dataset.from_dict(train_dataset)
# Loss function
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)
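MultipleNegativesRankingLoss treats every other positive in the batch as an additional negative: for each anchor it applies a softmax over its similarities to all candidates and uses cross-entropy so the anchor's own positive wins. A toy version over a batch similarity matrix, assuming anchor i is paired with positive i (a simplified sketch of the idea, not the library implementation):

```python
import math

def mnr_loss(sim):
    # sim[i][j] = similarity(anchor_i, positive_j); diagonal = true pairs
    # cross-entropy pushes each anchor's own positive to win the softmax
    n = len(sim)
    total = 0.0
    for i in range(n):
        denom = sum(math.exp(s) for s in sim[i])
        total += -math.log(math.exp(sim[i][i]) / denom)
    return total / n

# strong diagonal -> near-zero loss; uniform similarities -> log(n)
print(round(mnr_loss([[10.0, 0.0], [0.0, 10.0]]), 4))
print(round(mnr_loss([[1.0, 1.0], [1.0, 1.0]]), 4))
```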
Training results:
Training loss progression: 0.345 → 0.105 → 0.069
Spearman correlation: 0.82 (best performance)
Training Configuration
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer
# Define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="base_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)
# Train embedding model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator  # defined under Evaluation Setup below
)
trainer.train()
Evaluation Setup
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
# Create an embedding similarity evaluator for STS-B
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score / 5 for score in val_sts["label"]],
    main_similarity="cosine"
)
# Evaluate the trained model
results = evaluator(embedding_model)
Fine-Tuning Existing Models
Starting from Pre-trained Model
from sentence_transformers import SentenceTransformer
# Load a pre-trained model
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Use Multiple Negatives Ranking Loss
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)
# Fine-tune with same training setup
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)
trainer.train()
Fine-tuning results:
Original model: Spearman 0.867
Fine-tuned model: Spearman 0.848
Note that the STS-B score dropped slightly: fine-tuning on MNLI-style pairs shifts the model toward that task, which can trade away some performance on a general benchmark. Evaluate on your actual target task to judge whether the trade-off is worthwhile.
Evaluation with MTEB
Massive Text Embedding Benchmark (MTEB) provides standardized evaluation:
from mteb import MTEB
# Choose evaluation task
evaluation = MTEB(tasks=["Banking77Classification"])
# Calculate results
results = evaluation.run(embedding_model)
Results:
{
    'Banking77Classification': {
        'accuracy': 0.460,
        'f1': 0.458,
        'accuracy_stderr': 0.0096
    }
}
Best Practices
Choose the Right Loss Function
- SoftmaxLoss: Classification tasks with categories
- CosineSimilarityLoss: Pairwise similarity learning
- MultipleNegativesRankingLoss: Retrieval and semantic search (recommended)
Prepare Quality Data
- Use domain-specific text pairs
- Include hard negatives for better discrimination
- Balance positive and negative examples
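Hard negatives are texts that look similar to the anchor but are not actually paired with it; they force the model to learn finer distinctions than the random ("soft") negatives used earlier. As an illustrative heuristic only (in practice a pretrained embedding model is usually used to score candidates rather than word overlap), one can pick the most lexically similar non-positive candidate:

```python
def word_overlap(a, b):
    # Jaccard overlap between word sets: a crude lexical similarity
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def pick_hard_negative(anchor, positive, candidates):
    # hardest negative = most anchor-like candidate that isn't the positive
    pool = [c for c in candidates if c != positive]
    return max(pool, key=lambda c: word_overlap(anchor, c))

anchor = "a member of my team will execute your orders"
positive = "one of our number will carry out your instructions"
candidates = [positive, "the weather was pleasant", "my team ignored all your orders"]
print(pick_hard_negative(anchor, positive, candidates))
```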
Optimize Training
- Start with learning rate 2e-5
- Use warmup steps (10% of total)
- Enable fp16 for faster training
- Monitor evaluation metrics during training
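Applied to the earlier training arguments, these guidelines would look roughly like the following; `learning_rate` and `warmup_ratio` are standard `SentenceTransformerTrainingArguments` parameters, and the values shown are common starting points rather than tuned settings:

```python
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="base_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    learning_rate=2e-5,   # common starting point for transformer fine-tuning
    warmup_ratio=0.1,     # 10% of total steps as linear warmup
    fp16=True,            # mixed precision for faster training
    eval_steps=100,
    logging_steps=100,
)
```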
Evaluate Thoroughly
- Test on multiple downstream tasks
- Use MTEB for standardized benchmarking
- Compare with baseline models
Clear VRAM between training runs to avoid out-of-memory errors:
import gc
import torch
gc.collect()
torch.cuda.empty_cache()
Training Metrics Comparison
| Loss Function | Training Loss | Spearman Correlation | Use Case |
|---|---|---|---|
| SoftmaxLoss | 0.845 | 0.45 | Classification |
| CosineSimilarityLoss | 0.157 | 0.73 | Pairwise similarity |
| MultipleNegativesRankingLoss | 0.128 | 0.82 | Semantic search |
Next Steps
- Experiment with different base models (RoBERTa, MPNet)
- Try domain adaptation with your specific data
- Implement hard negative mining for better performance
- Evaluate on multiple MTEB tasks for comprehensive assessment