Overview

REMem supports multiple embedding models for encoding text into dense vectors. The embedding model is specified via the embedding_model_name parameter in BaseConfig.

Supported Models

NV-Embed-v2 (Default)

NVIDIA’s NV-Embed-v2 is the default model. It produces 4096-dimensional embeddings and runs locally on GPU.
from remem.utils.config_utils import BaseConfig

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",  # Default
    embedding_batch_size=16,
    embedding_max_seq_len=2048
)
Features:
  • 4096-dimensional embeddings
  • Supports instruction-based encoding
  • Multi-GPU support with automatic device mapping
  • Requires local GPU inference

OpenAI Embeddings

Use OpenAI’s hosted embedding models:
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",  # 3072 dimensions
    # Alternatives:
    # embedding_model_name="text-embedding-3-small",  # 1536 dimensions
    # embedding_model_name="text-embedding-ada-002",  # 1536 dimensions
    embedding_batch_size=16,
)
Features:
  • Cloud-based (no local GPU required)
  • Automatic caching via SQLite
  • Handles content filtering gracefully
  • Parallel encoding for faster throughput

GritLM

Unified model for both retrieval and generation:
config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16
)
Features:
  • Can perform both embedding and text generation
  • Instruction-based encoding
  • Multi-GPU support

Custom OpenAI-Compatible Servers

Use local embedding servers with OpenAI-compatible APIs:
config = BaseConfig(
    embedding_model_name="custom-model-name",
    llm_base_url="http://localhost:8001/v1/"
)

Configuration Options

Batch Size

Control encoding throughput:
config = BaseConfig(
    embedding_batch_size=16,  # Default: 16
    # Increase for faster encoding (if GPU memory allows)
    # Decrease if running out of memory
)
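To illustrate what the batch size controls, here is a minimal sketch of splitting a corpus into batches. The helper `iter_batches` is hypothetical and is not part of REMem; it only shows how many forward passes a given `embedding_batch_size` implies.

```python
from typing import Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int = 16) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# 40 passages with a batch size of 16 give two full batches and one partial batch.
docs = [f"passage {i}" for i in range(40)]
batch_lengths = [len(batch) for batch in iter_batches(docs, batch_size=16)]
print(batch_lengths)  # [16, 16, 8]
```

Halving the batch size roughly doubles the number of batches, which is the trade-off between throughput and peak memory described above.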

Sequence Length

Set maximum input length:
config = BaseConfig(
    embedding_max_seq_len=2048,  # Default: 2048 tokens
    # Adjust based on your document lengths
)

Normalization

Control whether embeddings are normalized:
config = BaseConfig(
    embedding_return_as_normalized=True,  # Default: True
    # Normalized embeddings enable cosine similarity via dot product
)
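The following sketch shows why normalization matters: once both vectors have unit length, their dot product equals their cosine similarity. It uses plain Python lists and toy three-dimensional vectors for illustration; the helper names are assumptions and not REMem functions.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale `vec` to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0, 0.0]
passage = [1.0, 2.0, 2.0]

score_cosine = cosine(query, passage)
score_dot = dot(l2_normalize(query), l2_normalize(passage))

# Both scores agree up to floating-point rounding (11/15 for these vectors).
print(math.isclose(score_cosine, score_dot))  # True
```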

Using NV-Embed-v2

Installation

Install dependencies for NV-Embed-v2:
pip install transformers torch pynvml

Multi-GPU Setup

NV-Embed-v2 automatically uses multiple GPUs:
import os

from remem.utils.config_utils import BaseConfig

# Specify visible GPUs (set this before any CUDA library is initialized)
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2"
)
From src/remem/embedding_model/NVEmbedV2.py:21-53, the model checks GPU usage and distributes across available devices:
# Automatic GPU allocation based on free memory
# GPUs with >10% usage are excluded
# Remaining GPUs share the embedding workload
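The selection rule described above can be sketched as a simple filter. This is an illustration only: the function `select_idle_gpus` and the usage figures are hypothetical, and how REMem actually measures usage (for example through `pynvml`) is not shown here.

```python
from typing import Dict, List


def select_idle_gpus(used_fraction: Dict[int, float], threshold: float = 0.10) -> List[int]:
    """Return GPU indices whose usage does not exceed `threshold`.

    `used_fraction` maps a GPU index to a usage fraction in [0, 1].
    GPUs above the threshold are assumed to be busy and are excluded.
    """
    return sorted(idx for idx, used in used_fraction.items() if used <= threshold)


# Hypothetical readings for four GPUs.
usage = {0: 0.02, 1: 0.85, 2: 0.10, 3: 0.31}
idle = select_idle_gpus(usage)
print(idle)  # [0, 2]
```

In this sketch GPUs 1 and 3 exceed 10% usage, so only GPUs 0 and 2 would share the embedding workload.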

Instruction-Based Encoding

NV-Embed-v2 supports task-specific instructions:
# Internal usage (handled by REMem)
embeddings = embedding_model.batch_encode(
    texts=["query text"],
    instruction="Retrieve relevant passages"  # Optional instruction
)

Using OpenAI Embeddings

Set Up the API Key

export OPENAI_API_KEY="your-api-key-here"

Basic Usage

config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    embedding_batch_size=100  # OpenAI allows larger batches
)

Caching

OpenAI embeddings are automatically cached to reduce API costs:
# Cache location (auto-created):
# outputs/{dataset}/embedding_cache/{model_name}_embedding_cache.sqlite

config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    dataset="musique"  # Creates outputs/musique/embedding_cache/
)
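As a rough picture of how an SQLite-backed embedding cache can work, the sketch below keys each vector by a hash of the model name and the text. The class `EmbeddingCache`, its table schema, and the stand-in encoder `fake_embed` are all assumptions made for illustration; REMem's actual cache layout may differ.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List


class EmbeddingCache:
    """Hypothetical text-to-vector cache stored in SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )

    @staticmethod
    def _key(model: str, text: str) -> str:
        # Include the model name so different models never share entries.
        return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()

    def get_or_compute(
        self, model: str, text: str, compute: Callable[[str], List[float]]
    ) -> List[float]:
        key = self._key(model, text)
        row = self.conn.execute(
            "SELECT vector FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no API call needed
        vector = compute(text)  # cache miss: call the encoder once
        self.conn.execute(
            "INSERT INTO cache (key, vector) VALUES (?, ?)", (key, json.dumps(vector))
        )
        self.conn.commit()
        return vector


calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    """Stand-in for a paid API call; records every invocation."""
    calls.append(text)
    return [float(len(text)), 1.0]


cache = EmbeddingCache()
first = cache.get_or_compute("text-embedding-3-large", "hello", fake_embed)
second = cache.get_or_compute("text-embedding-3-large", "hello", fake_embed)
print(len(calls))  # 1
```

The second lookup is served from SQLite, so the encoder runs only once for a repeated text.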

Azure OpenAI

Use Azure-hosted OpenAI models:
export AZURE_OPENAI_API_KEY="your-azure-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export OPENAI_API_VERSION="2024-02-15-preview"
Then enable Azure in the configuration:
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    use_azure=True
)

Parallel Encoding

OpenAI embeddings support parallel processing:
# From src/remem/embedding_model/openai_embedding_client.py:283-309
# Automatic parallel encoding for large batches
# Each text is encoded independently to prevent batch failures
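The idea of encoding each text independently, so that one failing input does not abort the rest, can be sketched with a thread pool. The function `encode_parallel` and the stand-in encoder `fake_encode` are hypothetical and do not reproduce the REMem client.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def encode_parallel(
    texts: Sequence[str],
    encode_one: Callable[[str], List[float]],
    max_workers: int = 4,
) -> List[Optional[List[float]]]:
    """Encode texts concurrently; a failed text yields None instead of raising."""

    def safe_encode(text: str) -> Optional[List[float]]:
        try:
            return encode_one(text)
        except Exception:
            return None  # isolate the failure to this single text

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(safe_encode, texts))


def fake_encode(text: str) -> List[float]:
    """Stand-in encoder that rejects empty input."""
    if not text:
        raise ValueError("empty input")
    return [float(len(text))]


results = encode_parallel(["alpha", "", "be"], fake_encode)
print(results)  # [[5.0], None, [2.0]]
```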

Using GritLM

Installation

pip install gritlm

Configuration

config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16
)

Instruction Format

GritLM uses a specific instruction format:
# From src/remem/embedding_model/GritLM.py:71-72
# Format: "<|user|>\n{instruction}\n<|embed|>\n"
# Or just: "<|embed|>\n" if no instruction
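The two formats above can be expressed as a small helper. The function name `gritlm_instruction` is chosen here for illustration and is not necessarily the name used in `GritLM.py`.

```python
def gritlm_instruction(instruction: str = "") -> str:
    """Build the GritLM embedding prompt prefix for an optional instruction."""
    if instruction:
        return "<|user|>\n" + instruction + "\n<|embed|>\n"
    return "<|embed|>\n"


with_instruction = gritlm_instruction("Retrieve relevant passages")
without_instruction = gritlm_instruction()
print(repr(with_instruction))     # '<|user|>\nRetrieve relevant passages\n<|embed|>\n'
print(repr(without_instruction))  # '<|embed|>\n'
```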

Custom Embedding Servers

Local Server Setup

Run a local embedding server with an OpenAI-compatible API:
# Example with sentence-transformers server
python -m remem.embedding_model.sentence_transformer_server \
    --model nvidia/NV-Embed-v2 \
    --port 8001

Client Configuration

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",  # Model name
    llm_base_url="http://localhost:8001/v1/"  # Server URL
)
When using custom servers, ensure the server is running before initializing REMem.

Performance Optimization

GPU Memory Management

For NV-Embed-v2, optimize GPU allocation:
import os

# Use specific GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=8  # Reduce if OOM
)

Batch Size Tuning

Suggested starting batch sizes by model (tune them for your hardware and rate limits):
# NV-Embed-v2 (local GPU)
config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=16  # Default, adjust based on GPU memory
)

# OpenAI (API)
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    embedding_batch_size=100  # Larger batches for API calls
)

# GritLM (local GPU)
config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16
)

Caching for OpenAI

Maximize cache hits to reduce costs:
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    force_index_from_scratch=False,  # Reuse cached embeddings
    dataset="my_dataset"  # Consistent dataset name for cache
)

Embedding Dimensions

Different models produce different embedding sizes:
Model                     Dimensions
nvidia/NV-Embed-v2        4096
text-embedding-3-large    3072
text-embedding-3-small    1536
text-embedding-ada-002    1536
GritLM/GritLM-7B          4096
From src/remem/embedding_model/openai_embedding_client.py:24-35:
def _get_embedding_dimension(embedding_model_name: str) -> int:
    if "text-embedding-3-large" in embedding_model_name:
        return 3072
    elif "text-embedding-3-small" in embedding_model_name:
        return 1536
    elif "Qwen3-Embedding-8B" in embedding_model_name:
        return 4096
    elif "NV-Embed-v2" in embedding_model_name:
        return 4096
    # ...
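The embedding dimension directly affects how much space an index needs. The sketch below estimates raw vector storage from the dimensions listed above, assuming 4-byte float32 values and ignoring any index or cache overhead; the helper `index_size_bytes` is hypothetical.

```python
EMBEDDING_DIMS = {
    "nvidia/NV-Embed-v2": 4096,
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
    "GritLM/GritLM-7B": 4096,
}


def index_size_bytes(model: str, num_vectors: int, bytes_per_value: int = 4) -> int:
    """Raw storage for `num_vectors` embeddings (float32 assumed by default)."""
    return EMBEDDING_DIMS[model] * num_vectors * bytes_per_value


large = index_size_bytes("nvidia/NV-Embed-v2", 100_000)
small = index_size_bytes("text-embedding-3-small", 100_000)
print(large)  # 1638400000 bytes, about 1.6 GB
print(small)  # 614400000 bytes, about 0.6 GB
```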

Troubleshooting

Out of Memory (OOM) Errors

Reduce the batch size, or restrict CUDA_VISIBLE_DEVICES to GPUs that have enough free memory:
config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=8,  # Reduced from 16
)

OpenAI Rate Limits

The client automatically retries with exponential backoff:
# From openai_embedding_client.py:54-87
# Automatic retry with backoff:
# - Base delay: 1 second
# - Max delay: 60 seconds
# - Exponential factor: 2
# - Max retries: 5 (configurable)
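With the parameters listed above, the wait between attempts doubles from 1 second and is capped at 60 seconds. The following is a minimal sketch of that schedule and a retry loop around it. The names `backoff_delays` and `call_with_retry` are hypothetical, and the sketch retries on any exception, whereas a real client would typically retry only on rate-limit or transient errors.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(
    max_retries: int = 5, base: float = 1.0, factor: float = 2.0, cap: float = 60.0
) -> List[float]:
    """Delay before each retry: base * factor**attempt, capped at `cap`."""
    return [min(cap, base * factor ** attempt) for attempt in range(max_retries)]


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying up to `max_retries` times with exponential backoff."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(delays[attempt])
    raise RuntimeError("unreachable")


print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]

# A stand-in request that fails twice and then succeeds.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


slept: List[float] = []
result = call_with_retry(flaky, sleep=slept.append)  # record delays instead of sleeping
print(result, slept)  # ok [1.0, 2.0]
```

With the default of 5 retries the cap is never reached; it only takes effect from the seventh retry onward.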

Content Filtering (OpenAI)

OpenAI may reject certain content. When that happens, REMem substitutes a zero vector as a fallback:
# From openai_embedding_client.py:367-371
# Automatically handled - creates zero vector
# Logs warning but continues processing
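The fallback behavior can be pictured with the sketch below. It is not the REMem implementation: `encode_with_zero_fallback` and `fake_encode` are hypothetical, and a `ValueError` stands in for whatever error the API raises when content is filtered.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger("embedding_sketch")


def encode_with_zero_fallback(
    texts: Sequence[str], encode_one: Callable[[str], List[float]], dim: int
) -> List[List[float]]:
    """Encode each text; on rejection, log a warning and use a zero vector."""
    vectors: List[List[float]] = []
    for text in texts:
        try:
            vectors.append(encode_one(text))
        except ValueError as err:  # stand-in for a content-filter rejection
            logger.warning("Using zero vector for rejected text: %s", err)
            vectors.append([0.0] * dim)
    return vectors


def fake_encode(text: str) -> List[float]:
    """Stand-in encoder that rejects texts containing 'blocked'."""
    if "blocked" in text:
        raise ValueError("content filtered")
    return [1.0, 0.0, 0.0]


vectors = encode_with_zero_fallback(["fine text", "blocked text"], fake_encode, dim=3)
print(vectors)  # [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```

A zero vector has a dot product of 0 with every query, so a rejected passage keeps its position in the index but never scores above zero on dot-product similarity.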

Next Steps

  • Configuration: explore all configuration options
  • Indexing: learn how to index documents