Overview
This chapter explores methods for both training and fine-tuning embedding models. Text embedding models convert text into dense vector representations that capture semantic meaning, enabling similarity search, clustering, and retrieval tasks.
Use a GPU for training embedding models. In Google Colab, select Runtime > Change runtime type > Hardware accelerator and choose a T4 GPU.
When to Train Embedding Models
Consider training or fine-tuning embedding models when:
- You need domain-specific embeddings (medical, legal, technical documentation)
- Existing models don’t capture your data’s nuances
- You want to optimize for specific similarity tasks
- You have labeled pairs or triplets of similar/dissimilar texts
Dataset Preparation
We’ll use the MNLI (Multi-Genre Natural Language Inference) dataset from GLUE, which contains premise-hypothesis pairs with entailment labels.
from datasets import load_dataset
# Load MNLI dataset from GLUE
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")
Example data point:
{
    'premise': 'One of our number will carry out your instructions minutely.',
    'hypothesis': 'A member of my team will execute your orders with immense precision.',
    'label': 0  # entailment
}
Training from Scratch
Model Initialization
Start with a base BERT model without sentence-transformers weights:
from sentence_transformers import SentenceTransformer
# Use a base model - creates mean pooling automatically
embedding_model = SentenceTransformer('bert-base-uncased')
Loss Functions
The choice of loss function depends on your data format and task requirements.
Softmax Loss
Best for classification-style training with labeled categories:
from sentence_transformers import losses
# Define the loss function with number of labels
train_loss = losses.SoftmaxLoss(
    model=embedding_model,
    sentence_embedding_dimension=embedding_model.get_sentence_embedding_dimension(),
    num_labels=3
)
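Under the hood, SoftmaxLoss concatenates the two sentence embeddings u and v together with their element-wise difference |u − v| and feeds the result to a linear classifier over the label set. A minimal sketch of that feature construction (the function name is illustrative, not part of the library API):

```python
def softmax_loss_features(u, v):
    # SoftmaxLoss default features: concat(u, v, |u - v|)
    # -> input to a linear classifier over num_labels classes
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]

u = [0.1, 0.9, -0.3]
v = [0.2, 0.7, 0.1]
features = softmax_loss_features(u, v)
# the classifier input is 3x the embedding dimension
print(len(features))  # 9
```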
Cosine Similarity Loss
For training with similarity scores between sentence pairs:
from datasets import Dataset
# Remap labels: (neutral/contradiction)=0, (entailment)=1
mapping = {2: 0, 1: 0, 0: 1}
train_dataset = Dataset.from_dict({
    "sentence1": train_dataset["premise"],
    "sentence2": train_dataset["hypothesis"],
    "label": [float(mapping[label]) for label in train_dataset["label"]]
})
# Loss function
train_loss = losses.CosineSimilarityLoss(model=embedding_model)
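CosineSimilarityLoss computes the cosine similarity between each pair of embeddings and minimizes the squared error against the gold score (here 0.0 or 1.0 after remapping). In plain Python, for a single pair:

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(u, v, label):
    # squared error between predicted similarity and the gold label
    return (cosine_sim(u, v) - label) ** 2

# identical directions with label 1.0 -> zero loss
print(cosine_similarity_loss([1.0, 0.0], [2.0, 0.0], 1.0))  # 0.0
```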
Training results:
Training loss progression: 0.232 → 0.169 → 0.142
Spearman correlation: 0.73
Multiple Negatives Ranking Loss
Most effective for retrieval tasks with anchor-positive-negative triplets:
import random
from datasets import load_dataset, Dataset

# Reload MNLI and keep only entailment pairs (label == 0)
mnli = load_dataset("glue", "mnli", split="train").select(range(50_000))
mnli = mnli.remove_columns("idx")
mnli = mnli.filter(lambda x: x["label"] == 0)

# Prepare data with soft negatives: randomly shuffled hypotheses from other pairs
train_dataset = {"anchor": [], "positive": [], "negative": []}
soft_negatives = mnli["hypothesis"]
random.shuffle(soft_negatives)
for row, soft_negative in zip(mnli, soft_negatives):
    train_dataset["anchor"].append(row["premise"])
    train_dataset["positive"].append(row["hypothesis"])
    train_dataset["negative"].append(soft_negative)
train_dataset = Dataset.from_dict(train_dataset)
# Loss function
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)
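MultipleNegativesRankingLoss treats every other positive in the batch as an additional negative: for each anchor it applies a softmax over its similarities to all candidates and uses cross-entropy so the anchor's own positive wins. A toy version over a batch similarity matrix, assuming anchor i is paired with positive i (a simplified sketch of the idea, not the library implementation):

```python
import math

def mnr_loss(sim):
    # sim[i][j] = similarity(anchor_i, positive_j); diagonal = true pairs
    # cross-entropy pushes each anchor's own positive to win the softmax
    n = len(sim)
    total = 0.0
    for i in range(n):
        denom = sum(math.exp(s) for s in sim[i])
        total += -math.log(math.exp(sim[i][i]) / denom)
    return total / n

# strong diagonal -> near-zero loss; uniform similarities -> log(n)
print(round(mnr_loss([[10.0, 0.0], [0.0, 10.0]]), 4))
print(round(mnr_loss([[1.0, 1.0], [1.0, 1.0]]), 4))
```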
Training results:
Training loss progression: 0.345 → 0.105 → 0.069
Spearman correlation: 0.82 (best performance)
Training Configuration
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer
# Define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="base_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)
# Train embedding model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator  # defined under Evaluation Setup below
)
trainer.train()
Evaluation Setup
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
# Create an embedding similarity evaluator for STS-B
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score / 5 for score in val_sts["label"]],
    main_similarity="cosine"
)
# Evaluate the trained model
results = evaluator(embedding_model)
Fine-Tuning Existing Models
Starting from Pre-trained Model
from sentence_transformers import SentenceTransformer
# Load a pre-trained model
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Use Multiple Negatives Ranking Loss
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)
# Fine-tune with same training setup
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)
trainer.train()
Fine-tuning results:
Original model: Spearman 0.867
Fine-tuned model: Spearman 0.848
Note that the STS-B score dropped slightly: fine-tuning on MNLI-style pairs shifts the model toward that task, which can trade away some performance on a general benchmark. Evaluate on your actual target task to judge whether the trade-off is worthwhile.
Evaluation with MTEB
Massive Text Embedding Benchmark (MTEB) provides standardized evaluation:
from mteb import MTEB
# Choose evaluation task
evaluation = MTEB(tasks=["Banking77Classification"])
# Calculate results
results = evaluation.run(embedding_model)
Results:
{
    'Banking77Classification': {
        'accuracy': 0.460,
        'f1': 0.458,
        'accuracy_stderr': 0.0096
    }
}
Best Practices
Choose the Right Loss Function
- SoftmaxLoss: Classification tasks with categories
- CosineSimilarityLoss: Pairwise similarity learning
- MultipleNegativesRankingLoss: Retrieval and semantic search (recommended)
Prepare Quality Data
- Use domain-specific text pairs
- Include hard negatives for better discrimination
- Balance positive and negative examples
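Hard negatives are texts that look similar to the anchor but are not actually paired with it; they force the model to learn finer distinctions than the random ("soft") negatives used earlier. As an illustrative heuristic only (in practice a pretrained embedding model is usually used to score candidates rather than word overlap), one can pick the most lexically similar non-positive candidate:

```python
def word_overlap(a, b):
    # Jaccard overlap between word sets: a crude lexical similarity
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def pick_hard_negative(anchor, positive, candidates):
    # hardest negative = most anchor-like candidate that isn't the positive
    pool = [c for c in candidates if c != positive]
    return max(pool, key=lambda c: word_overlap(anchor, c))

anchor = "a member of my team will execute your orders"
positive = "one of our number will carry out your instructions"
candidates = [positive, "the weather was pleasant", "my team ignored all your orders"]
print(pick_hard_negative(anchor, positive, candidates))
```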
Optimize Training
- Start with learning rate 2e-5
- Use warmup steps (10% of total)
- Enable fp16 for faster training
- Monitor evaluation metrics during training
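Applied to the earlier training arguments, these guidelines would look roughly like the following; `learning_rate` and `warmup_ratio` are standard `SentenceTransformerTrainingArguments` parameters, and the values shown are common starting points rather than tuned settings:

```python
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="base_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    learning_rate=2e-5,   # common starting point for transformer fine-tuning
    warmup_ratio=0.1,     # 10% of total steps as linear warmup
    fp16=True,            # mixed precision for faster training
    eval_steps=100,
    logging_steps=100,
)
```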
Evaluate Thoroughly
- Test on multiple downstream tasks
- Use MTEB for standardized benchmarking
- Compare with baseline models
Clear VRAM between training runs to avoid out-of-memory errors:
import gc
import torch
gc.collect()
torch.cuda.empty_cache()
Training Metrics Comparison
| Loss Function | Training Loss | Spearman Correlation | Use Case |
|---|---|---|---|
| SoftmaxLoss | 0.845 | 0.45 | Classification |
| CosineSimilarityLoss | 0.157 | 0.73 | Pairwise similarity |
| MultipleNegativesRankingLoss | 0.128 | 0.82 | Semantic search |
Next Steps
- Experiment with different base models (RoBERTa, MPNet)
- Try domain adaptation with your specific data
- Implement hard negative mining for better performance
- Evaluate on multiple MTEB tasks for comprehensive assessment