
Overview

This chapter explores methods for both training and fine-tuning embedding models. Text embedding models convert text into dense vector representations that capture semantic meaning, enabling similarity search, clustering, and retrieval tasks.
Training embedding models is compute-intensive, so use a GPU. In Google Colab, select Runtime > Change runtime type > Hardware accelerator and choose a T4 GPU.

When to Train Embedding Models

Consider training or fine-tuning embedding models when:
  • You need domain-specific embeddings (medical, legal, technical documentation)
  • Existing models don’t capture your data’s nuances
  • You want to optimize for specific similarity tasks
  • You have labeled pairs or triplets of similar/dissimilar texts

Dataset Preparation

We’ll use the MNLI (Multi-Genre Natural Language Inference) dataset from GLUE, which contains premise-hypothesis pairs with entailment labels.
from datasets import load_dataset

# Load MNLI dataset from GLUE
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")
Example data point:
{
  'premise': 'One of our number will carry out your instructions minutely.',
  'hypothesis': 'A member of my team will execute your orders with immense precision.',
  'label': 0  # entailment
}

Training from Scratch

Model Initialization

Start with a base BERT model without sentence-transformers weights:
from sentence_transformers import SentenceTransformer

# Use a base model - creates mean pooling automatically
embedding_model = SentenceTransformer('bert-base-uncased')
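Loading a plain transformers checkpoint this way wraps it with a mean-pooling head: the token embeddings are averaged, weighted by the attention mask so padding tokens are ignored. A minimal NumPy sketch of that pooling operation (toy shapes, illustrative only):

```python
import numpy as np

# Toy batch: 2 sentences, 4 token positions, 3-dim token embeddings
token_embeddings = np.arange(24, dtype=float).reshape(2, 4, 3)
# Attention mask: the second sentence has one padding token
attention_mask = np.array([[1, 1, 1, 1],
                           [1, 1, 1, 0]], dtype=float)

# Zero out padding positions, then divide by the number of real tokens
mask = attention_mask[:, :, None]                  # (2, 4, 1)
summed = (token_embeddings * mask).sum(axis=1)     # (2, 3)
counts = mask.sum(axis=1)                          # (2, 1)
sentence_embeddings = summed / counts

print(sentence_embeddings.shape)  # (2, 3)
```

Padding tokens contribute nothing to the sentence vector, which is why the mask-weighted mean is preferred over a plain average over all positions.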

Loss Functions

The choice of loss function depends on your data format and task requirements.

Softmax Loss

Best for classification-style training with labeled categories:
from sentence_transformers import losses

# Define the loss function with number of labels
train_loss = losses.SoftmaxLoss(
    model=embedding_model,
    sentence_embedding_dimension=embedding_model.get_sentence_embedding_dimension(),
    num_labels=3
)
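SoftmaxLoss follows the SBERT classification recipe: embed both sentences, build the feature vector (u, v, |u − v|), and feed it through a linear classifier trained with cross-entropy. A toy NumPy sketch of the forward pass (random weights; illustrative, not the library's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_labels = 4, 3

# Toy sentence embeddings for one premise-hypothesis pair
u = rng.normal(size=dim)
v = rng.normal(size=dim)

# SBERT-style classification features: (u, v, |u - v|)
features = np.concatenate([u, v, np.abs(u - v)])   # shape (3 * dim,)

# Linear classifier followed by softmax
W = rng.normal(size=(num_labels, 3 * dim))
logits = W @ features
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Cross-entropy against the gold label (0 = entailment)
label = 0
loss = -np.log(probs[label])
print(features.shape, float(loss))
```

The gradient of this cross-entropy flows back through u and v into the encoder, which is how classification labels shape the embedding space.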

Cosine Similarity Loss

For training with similarity scores between sentence pairs:
from datasets import Dataset

# Remap labels: (neutral/contradiction)=0, (entailment)=1
mapping = {2: 0, 1: 0, 0: 1}
train_dataset = Dataset.from_dict({
    "sentence1": train_dataset["premise"],
    "sentence2": train_dataset["hypothesis"],
    "label": [float(mapping[label]) for label in train_dataset["label"]]
})

# Loss function
train_loss = losses.CosineSimilarityLoss(model=embedding_model)
Training results:
Training loss progression: 0.232 → 0.169 → 0.142
Spearman correlation: 0.73
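Under the hood, CosineSimilarityLoss computes the cosine similarity of the two sentence embeddings and regresses it toward the label with mean squared error. A toy NumPy sketch (illustrative):

```python
import numpy as np

def cosine_similarity_loss(u, v, label):
    """Squared error between cosine(u, v) and the target similarity label."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return (cos - label) ** 2

u = np.array([1.0, 0.0])
v = np.array([1.0, 0.0])

# Identical vectors labeled 1.0 (similar) give zero loss
print(cosine_similarity_loss(u, v, 1.0))  # 0.0
# The same pair labeled 0.0 (dissimilar) is maximally penalized
print(cosine_similarity_loss(u, v, 0.0))  # 1.0
```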

Multiple Negatives Ranking Loss

Most effective for retrieval tasks with anchor-positive-negative triplets:
import random

# Reload MNLI and keep only the entailment pairs (label == 0)
mnli = load_dataset("glue", "mnli", split="train").select(range(50_000))
mnli = mnli.remove_columns("idx")
mnli = mnli.filter(lambda x: x["label"] == 0)

# Prepare data with soft negatives
train_dataset = {"anchor": [], "positive": [], "negative": []}
soft_negatives = mnli["hypothesis"]
random.shuffle(soft_negatives)

for row, soft_negative in zip(mnli, soft_negatives):
    train_dataset["anchor"].append(row["premise"])
    train_dataset["positive"].append(row["hypothesis"])
    train_dataset["negative"].append(soft_negative)

train_dataset = Dataset.from_dict(train_dataset)

# Loss function
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)
Training results:
Training loss progression: 0.345 → 0.105 → 0.069
Spearman correlation: 0.82 (best performance)
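MultipleNegativesRankingLoss treats each anchor's paired positive as the correct "class" and every other positive in the batch as an in-batch negative: it applies softmax cross-entropy over a batch similarity matrix, with the diagonal as the gold labels. A small NumPy sketch (toy embeddings; 20.0 is the library's default similarity scale):

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch negatives: row i's target is column i of the similarity matrix."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)                 # (batch, batch) scaled cosine similarities
    # Softmax cross-entropy with the diagonal as the gold labels
    sims -= sims.max(axis=1, keepdims=True)
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
positives = np.array([[0.9, 0.1], [0.1, 0.9]])  # aligned pairs -> low loss
print(mnr_loss(anchors, positives))
```

Because every other example in the batch serves as a free negative, larger batch sizes tend to make this loss more effective.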

Training Configuration

from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer

# Define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="base_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_strategy="steps",  # run the evaluator every eval_steps
    eval_steps=100,
    logging_steps=100,
)

# Train the embedding model (the evaluator is defined in the next section)
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)
trainer.train()
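The warmup_steps=100 setting ramps the learning rate linearly from zero before the regular schedule takes over, which stabilizes early training. A sketch of linear warmup followed by linear decay, the shape transformers uses by default (toy numbers, illustrative):

```python
def lr_at_step(step, base_lr=2e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

print(lr_at_step(50))    # halfway through warmup
print(lr_at_step(100))   # peak learning rate
print(lr_at_step(1000))  # decayed to 0.0 at the end
```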

Evaluation Setup

from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Create an embedding similarity evaluator for STS-B
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine"
)

# Evaluate the trained model
results = evaluator(embedding_model)
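The evaluator reports the Spearman correlation between the model's cosine similarities and the human scores, i.e. the Pearson correlation of their ranks. A small NumPy sketch of Spearman for data without ties (illustrative):

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks (no-ties case)."""
    def ranks(a):
        r = np.empty(len(a))
        r[np.argsort(a)] = np.arange(len(a))
        return r
    rx, ry = ranks(np.asarray(x)), ranks(np.asarray(y))
    return np.corrcoef(rx, ry)[0, 1]

model_sims  = [0.95, 0.10, 0.60, 0.30]
gold_scores = [5.0, 0.0, 4.0, 2.0]
print(spearman(model_sims, gold_scores))  # perfect rank agreement
```

Because only the ranks matter, Spearman rewards ordering sentence pairs correctly rather than matching the absolute score values.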

Fine-Tuning Existing Models

Starting from Pre-trained Model

from sentence_transformers import SentenceTransformer

# Load a pre-trained model
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Use Multiple Negatives Ranking Loss
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)

# Fine-tune with same training setup
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)
trainer.train()
Fine-tuning results:
Original all-MiniLM-L6-v2: Spearman 0.867
Fine-tuned model: Spearman 0.848
The general STS-B score dips slightly after fine-tuning on MNLI: fine-tuning adapts the model to the target data, which can cost a few points on generic benchmarks.

Evaluation with MTEB

The Massive Text Embedding Benchmark (MTEB) provides standardized evaluation across a wide range of embedding tasks:
from mteb import MTEB

# Choose evaluation task
evaluation = MTEB(tasks=["Banking77Classification"])

# Calculate results
results = evaluation.run(embedding_model)
Results:
{
  'Banking77Classification': {
    'accuracy': 0.460,
    'f1': 0.458,
    'accuracy_stderr': 0.0096
  }
}

Best Practices

1. Choose the Right Loss Function
  • SoftmaxLoss: classification tasks with labeled categories
  • CosineSimilarityLoss: pairwise similarity learning
  • MultipleNegativesRankingLoss: retrieval and semantic search (recommended)

2. Prepare Quality Data
  • Use domain-specific text pairs
  • Include hard negatives for better discrimination
  • Balance positive and negative examples

3. Optimize Training
  • Start with learning rate 2e-5
  • Use warmup steps (roughly 10% of total steps)
  • Enable fp16 for faster training
  • Monitor evaluation metrics during training

4. Evaluate Thoroughly
  • Test on multiple downstream tasks
  • Use MTEB for standardized benchmarking
  • Compare with baseline models
Clear VRAM between training runs to avoid out-of-memory errors:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

Training Metrics Comparison

| Loss Function                | Training Loss | Spearman Correlation | Use Case            |
|------------------------------|---------------|----------------------|---------------------|
| SoftmaxLoss                  | 0.845         | 0.45                 | Classification      |
| CosineSimilarityLoss         | 0.157         | 0.73                 | Pairwise similarity |
| MultipleNegativesRankingLoss | 0.128         | 0.82                 | Semantic search     |

Next Steps

  • Experiment with different base models (RoBERTa, MPNet)
  • Try domain adaptation with your specific data
  • Implement hard negative mining for better performance
  • Evaluate on multiple MTEB tasks for comprehensive assessment
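Hard negative mining typically uses an existing model: embed the corpus, and for each anchor pick a highly similar candidate that is not the true positive. A toy NumPy sketch of the selection step (illustrative; real pipelines use a pretrained encoder and often an ANN index such as FAISS):

```python
import numpy as np

def mine_hard_negative(anchor, candidates, positive_idx):
    """Return the index of the most similar candidate that isn't the positive."""
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ a
    sims[positive_idx] = -np.inf   # exclude the true positive
    return int(np.argmax(sims))

anchor = np.array([1.0, 0.0])
candidates = np.array([
    [0.99, 0.01],   # 0: the true positive
    [0.80, 0.60],   # 1: similar but wrong -> hard negative
    [0.00, 1.00],   # 2: easy negative
])
print(mine_hard_negative(anchor, candidates, positive_idx=0))  # 1
```

Swapping such mined negatives for the randomly shuffled "soft" negatives used earlier usually gives the loss a more informative training signal.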
