Embedding Models

Overview

REMem supports multiple embedding models for encoding text into dense vectors. The embedding model is specified via the embedding_model_name parameter in BaseConfig.

Supported Models

NV-Embed-v2 (Default)

NVIDIA’s NV-Embed-v2 embedding model, which produces 4096-dimensional embeddings and runs on local GPUs.

from remem.utils.config_utils import BaseConfig

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",  # Default
    embedding_batch_size=16,
    embedding_max_seq_len=2048
)

Features:

4096-dimensional embeddings
Supports instruction-based encoding
Multi-GPU support with automatic device mapping
Requires local GPU inference

OpenAI Embeddings

Use OpenAI’s hosted embedding models:

config = BaseConfig(
    embedding_model_name="text-embedding-3-large",  # 3072 dimensions
    # Alternatives:
    # embedding_model_name="text-embedding-3-small",  # 1536 dimensions
    # embedding_model_name="text-embedding-ada-002",  # 1536 dimensions
    embedding_batch_size=16,
)

Features:

Cloud-based (no local GPU required)
Automatic caching via SQLite
Handles content filtering gracefully
Parallel encoding for faster throughput

GritLM

Unified model for both retrieval and generation:

config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16
)

Features:

Can perform both embedding and text generation
Instruction-based encoding
Multi-GPU support

Custom OpenAI-Compatible Servers

Use local embedding servers with OpenAI-compatible APIs:

config = BaseConfig(
    embedding_model_name="custom-model-name",
    # The embedding server URL is passed through llm_base_url
    llm_base_url="http://localhost:8001/v1/"
)

Configuration Options

Batch Size

Control encoding throughput:

config = BaseConfig(
    embedding_batch_size=16,  # Default: 16
    # Increase for faster encoding (if GPU memory allows)
    # Decrease if running out of memory
)
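To make the trade-off concrete, the following minimal sketch shows what a batch size does to a list of inputs: the texts are split into chunks that are encoded one chunk at a time, so a larger value means fewer, larger forward passes. The helper name iter_batches is hypothetical and is not part of REMem's API.

```python
def iter_batches(texts, batch_size=16):
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# 40 texts with the default batch size of 16 produce three batches.
texts = [f"passage {i}" for i in range(40)]
batch_lengths = [len(batch) for batch in iter_batches(texts, batch_size=16)]
```

Halving the batch size roughly halves the activation memory needed per pass, at the cost of more passes.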

Sequence Length

Set maximum input length:

config = BaseConfig(
    embedding_max_seq_len=2048,  # Default: 2048 tokens
    # Adjust based on your document lengths
)

Normalization

Control whether embeddings are normalized:

config = BaseConfig(
    embedding_return_as_normalized=True,  # Default: True
    # Normalized embeddings enable cosine similarity via dot product
)
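The comment above relies on a standard identity: for unit-length vectors, the dot product equals the cosine similarity. A small pure-Python check, with hypothetical helper names and toy 2-dimensional vectors standing in for real embeddings, might look like this:

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

# After normalization, the plain dot product matches cosine similarity.
similarity = dot(l2_normalize(a), l2_normalize(b))
```

With normalization turned off, a retrieval step that uses a raw dot product would also reward vector length, not only direction.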

Using NV-Embed-v2

Installation

Install dependencies for NV-Embed-v2:

pip install transformers torch pynvml

Multi-GPU Setup

NV-Embed-v2 automatically uses multiple GPUs:

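import os

from remem.utils.config_utils import BaseConfig

# Specify visible GPUs (set this before any CUDA-using library initializes)
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2"
)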

From src/remem/embedding_model/NVEmbedV2.py:21-53, the model checks per-GPU usage and distributes the embedding workload across the devices that remain available:

# Automatic GPU allocation based on free memory
# GPUs with >10% usage are excluded
# Remaining GPUs share the embedding workload
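The selection rule described above can be sketched as a simple filter. This is an illustration under stated assumptions, not the code in NVEmbedV2.py: the function name is hypothetical, "usage" is treated as the fraction of memory already in use, and the readings are passed in as a plain dict so the sketch runs without a GPU.

```python
def select_idle_gpus(used_fraction_by_gpu, max_used_fraction=0.10):
    """Return the GPU indices whose usage does not exceed the threshold.

    `used_fraction_by_gpu` maps a GPU index to a usage fraction in [0, 1].
    GPUs above `max_used_fraction` are excluded, mirroring the ">10%" rule.
    """
    return sorted(
        idx
        for idx, used in used_fraction_by_gpu.items()
        if used <= max_used_fraction
    )


# Hypothetical readings: GPU 1 is busy with another job and gets skipped.
readings = {0: 0.02, 1: 0.55, 2: 0.10, 3: 0.0}
selected = select_idle_gpus(readings)
```

If every visible GPU is above the threshold, this sketch returns an empty list; how REMem handles that case is not stated here.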

Instruction-Based Encoding

NV-Embed-v2 supports task-specific instructions:

# Internal usage (handled by REMem)
embeddings = embedding_model.batch_encode(
    texts=["query text"],
    instruction="Retrieve relevant passages"  # Optional instruction
)

Using OpenAI Embeddings

Set Up API Key

export OPENAI_API_KEY="your-api-key-here"

Basic Usage

config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    embedding_batch_size=100  # OpenAI allows larger batches
)

Caching

OpenAI embeddings are automatically cached to reduce API costs:

# Cache location (auto-created):
# outputs/{dataset}/embedding_cache/{model_name}_embedding_cache.sqlite

config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    dataset="musique"  # Creates outputs/musique/embedding_cache/
)
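For readers who want a mental model of such a cache, here is a minimal sketch built on the standard sqlite3 module. The table schema, the SHA-256 key, the class name, and the replacement of "/" in model names are all assumptions made for illustration; only the directory layout comes from the comment above.

```python
import hashlib
import json
import sqlite3
from pathlib import Path


def cache_path(dataset, model_name, root="outputs"):
    """Build the documented cache location for a dataset and model."""
    safe_name = model_name.replace("/", "_")  # assumption: keep the file name flat
    return Path(root) / dataset / "embedding_cache" / f"{safe_name}_embedding_cache.sqlite"


class EmbeddingCache:
    """Tiny text-to-vector cache; vectors are stored as JSON strings."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )

    @staticmethod
    def _key(model_name, text):
        # Include the model name so different models never share an entry.
        return hashlib.sha256(f"{model_name}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, model_name, text):
        row = self.conn.execute(
            "SELECT vector FROM cache WHERE key = ?", (self._key(model_name, text),)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, model_name, text, vector):
        with self.conn:  # commits on success
            self.conn.execute(
                "INSERT OR REPLACE INTO cache (key, vector) VALUES (?, ?)",
                (self._key(model_name, text), json.dumps(vector)),
            )
```

A caller would check get() first and only call the API, followed by put(), on a miss, which is why repeated runs over the same dataset cost less.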

Azure OpenAI

Use Azure-hosted OpenAI models:

export AZURE_OPENAI_API_KEY="your-azure-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export OPENAI_API_VERSION="2024-02-15-preview"

config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    use_azure=True
)

Parallel Encoding

OpenAI embeddings support parallel processing:

# From src/remem/embedding_model/openai_embedding_client.py:283-309
# Automatic parallel encoding for large batches
# Each text is encoded independently to prevent batch failures
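The idea of encoding each text independently, so that one bad input cannot fail a whole batch, can be sketched with a thread pool. The function names and the fake encoder below are hypothetical stand-ins; the actual client code is at the path cited above.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_all(texts, encode_one, max_workers=8):
    """Encode texts in parallel; a failure yields None for that text only."""

    def safe_encode(text):
        try:
            return encode_one(text)
        except Exception:
            return None  # isolate the failure to this single text

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(safe_encode, texts))


def fake_encode(text):
    """Stand-in for an API call: rejects empty input, else returns a 1-d vector."""
    if not text:
        raise ValueError("empty input")
    return [float(len(text))]


results = encode_all(["alpha", "", "hi"], fake_encode)
```

Threads suit this workload because each call spends most of its time waiting on the network.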

Using GritLM

Installation

pip install gritlm

Configuration

config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16
)

Instruction Format

GritLM uses a specific instruction format:

# From src/remem/embedding_model/GritLM.py:71-72
# Format: "<|user|>\n{instruction}\n<|embed|>\n"
# Or just: "<|embed|>\n" if no instruction
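Written out as a function, the two formats above look as follows. The function name is hypothetical; the strings are taken directly from the format described in the comments.

```python
def gritlm_instruction(instruction=""):
    """Wrap an optional instruction in GritLM's embedding prompt format."""
    if instruction:
        return f"<|user|>\n{instruction}\n<|embed|>\n"
    return "<|embed|>\n"


with_instruction = gritlm_instruction("Retrieve relevant passages")
without_instruction = gritlm_instruction()
```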

Custom Embedding Servers

Local Server Setup

Run a local embedding server with OpenAI-compatible API:

# Example with sentence-transformers server
python -m remem.embedding_model.sentence_transformer_server \
    --model nvidia/NV-Embed-v2 \
    --port 8001

Client Configuration

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",  # Model name
    llm_base_url="http://localhost:8001/v1/"  # Server URL
)

When using custom servers, ensure the server is running before initializing REMem.
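A quick way to see what "OpenAI-compatible" means in practice is to build the request such a server is expected to accept: a POST to the embeddings path under the base URL, with a JSON body containing model and input. The sketch below only constructs the request with the standard library and does not send it; the function name and the placeholder API key are assumptions.

```python
import json
from urllib.request import Request


def build_embeddings_request(base_url, model, texts, api_key="EMPTY"):
    """Build (but do not send) an OpenAI-style embeddings request."""
    url = base_url.rstrip("/") + "/embeddings"
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Local servers often ignore the key, but clients usually send one.
        "Authorization": f"Bearer {api_key}",
    }
    return Request(url, data=body, headers=headers, method="POST")


req = build_embeddings_request(
    "http://localhost:8001/v1/", "nvidia/NV-Embed-v2", ["query text"]
)
```

Passing req to urllib.request.urlopen would send it, which makes a convenient smoke test that the server is up before REMem is initialized.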

Performance Optimization

GPU Memory Management

For NV-Embed-v2, optimize GPU allocation:

import os

# Use specific GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=8  # Reduce if OOM
)

Batch Size Tuning

Suggested starting batch sizes by model (tune for your hardware and rate limits):

# NV-Embed-v2 (local GPU)
config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=16  # Default, adjust based on GPU memory
)

# OpenAI (API)
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    embedding_batch_size=100  # Larger batches for API calls
)

# GritLM (local GPU)
config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16
)

Caching for OpenAI

Maximize cache hits to reduce costs:

config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    force_index_from_scratch=False,  # Reuse cached embeddings
    dataset="my_dataset"  # Consistent dataset name for cache
)

Embedding Dimensions

Different models produce different embedding sizes:

Model                     Dimensions
nvidia/NV-Embed-v2        4096
text-embedding-3-large    3072
text-embedding-3-small    1536
text-embedding-ada-002    1536
GritLM/GritLM-7B          4096
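The dimension matters mostly for storage: the raw size of an index grows linearly with it. The following back-of-the-envelope sketch assumes vectors are stored as 32-bit floats (4 bytes per value), which is an assumption and not something this page states; the helper name is hypothetical.

```python
EMBEDDING_DIMS = {
    "nvidia/NV-Embed-v2": 4096,
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
    "GritLM/GritLM-7B": 4096,
}


def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw size of `num_vectors` dense vectors, ignoring any index overhead."""
    return num_vectors * dim * bytes_per_value


# One million passages, in gigabytes (10**9 bytes), per model.
size_gb = {
    name: index_size_bytes(1_000_000, dim) / 1e9
    for name, dim in EMBEDDING_DIMS.items()
}
```

Under that assumption, one million vectors take about 16.4 GB at 4096 dimensions and about 6.1 GB at 1536.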

From src/remem/embedding_model/openai_embedding_client.py:24-35:

def _get_embedding_dimension(embedding_model_name: str) -> int:
    if "text-embedding-3-large" in embedding_model_name:
        return 3072
    elif "text-embedding-3-small" in embedding_model_name:
        return 1536
    elif "Qwen3-Embedding-8B" in embedding_model_name:
        return 4096
    elif "NV-Embed-v2" in embedding_model_name:
        return 4096
    # ...

Troubleshooting

Out of Memory (OOM) Errors

Reduce the batch size first. If that is not enough, lowering embedding_max_seq_len also reduces the memory needed per input:

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=8,  # Reduced from 16
)

OpenAI Rate Limits

The client automatically retries with exponential backoff:

# From src/remem/embedding_model/openai_embedding_client.py:54-87
# Automatic retry with backoff:
# - Base delay: 1 second
# - Max delay: 60 seconds
# - Exponential factor: 2
# - Max retries: 5 (configurable)
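With those parameters the wait times are easy to work out: 1, 2, 4, 8, and 16 seconds, with the 60-second cap only reached if the retry count is raised. The sketch below reproduces that schedule and wraps it in a generic retry loop. Function names are hypothetical, the loop catches every exception for brevity (a real client would retry only rate-limit errors), and whether the actual client adds jitter is not stated here.

```python
import time


def backoff_delays(max_retries=5, base_delay=1.0, factor=2.0, max_delay=60.0):
    """Delay before each retry: base * factor**attempt, capped at max_delay."""
    return [min(base_delay * factor ** attempt, max_delay) for attempt in range(max_retries)]


def call_with_retry(fn, max_retries=5, sleep=time.sleep, **backoff_kwargs):
    """Call fn(); on failure wait and retry, re-raising after the last retry."""
    delays = backoff_delays(max_retries=max_retries, **backoff_kwargs)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(delays[attempt])


default_schedule = backoff_delays()
```

The sleep parameter is injectable so the loop can be exercised in tests without actually waiting.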

Content Filtering (OpenAI)

OpenAI may reject certain content. When that happens, REMem substitutes a zero vector for the rejected text, logs a warning, and continues:

# From src/remem/embedding_model/openai_embedding_client.py:367-371
# Automatically handled - creates zero vector
# Logs warning but continues processing
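The fallback pattern can be sketched as follows. The exception class and function names are hypothetical stand-ins, since this page does not show which error the client catches; only the behavior (zero vector, warning, keep going) comes from the comments above.

```python
import logging

logger = logging.getLogger(__name__)


class ContentFilterError(Exception):
    """Hypothetical stand-in for the error raised when a text is rejected."""


def encode_with_fallback(text, encode_one, dim):
    """Encode a text, substituting a zero vector if the content is rejected."""
    try:
        return encode_one(text)
    except ContentFilterError:
        logger.warning("Content filter rejected a text; using a zero vector")
        return [0.0] * dim


def is_zero_vector(vec):
    """True when every component is zero, i.e. the text was never embedded."""
    return not any(vec)
```

A zero vector has a dot product of zero with every query, so a rejected text stays in the index but gets no similarity signal; checking the log for these warnings shows which texts were affected.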

Next Steps

Configuration

Indexing
