Overview
REMem supports multiple embedding models for encoding text into dense vectors. The embedding model is specified via the embedding_model_name parameter in BaseConfig.
Supported Models
NV-Embed-v2 (Default)
NVIDIA’s NV-Embed-v2 model, which produces 4096-dimensional embeddings and runs on a local GPU.
from remem.utils.config_utils import BaseConfig

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",  # Default
    embedding_batch_size=16,
    embedding_max_seq_len=2048,
)
Features:
4096-dimensional embeddings
Supports instruction-based encoding
Multi-GPU support with automatic device mapping
Requires local GPU inference
OpenAI Embeddings
Use OpenAI’s hosted embedding models:
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",  # 3072 dimensions
    # OR
    # embedding_model_name="text-embedding-3-small",  # 1536 dimensions
    # embedding_model_name="text-embedding-ada-002",  # 1536 dimensions
    embedding_batch_size=16,
)
Features:
Cloud-based (no local GPU required)
Automatic caching via SQLite
Handles content filtering gracefully
Parallel encoding for faster throughput
GritLM
Unified model for both retrieval and generation:
config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16,
)
Features:
Can perform both embedding and text generation
Instruction-based encoding
Multi-GPU support
Custom OpenAI-Compatible Servers
Use local embedding servers with OpenAI-compatible APIs:
config = BaseConfig(
    embedding_model_name="custom-model-name",
    llm_base_url="http://localhost:8001/v1/",
)
Configuration Options
Batch Size
Control encoding throughput:
config = BaseConfig(
    embedding_batch_size=16,  # Default: 16
    # Increase for faster encoding (if GPU memory allows)
    # Decrease if running out of memory
)
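To make the effect of embedding_batch_size concrete, the sketch below splits a list of texts into batches. The helper iter_batches is hypothetical and not part of REMem; it only illustrates that a larger batch size means fewer encoding calls, each of which needs more memory.

```python
from typing import Iterator, List


def iter_batches(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` holding at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = [f"passage {i}" for i in range(40)]
batch_lengths = [len(batch) for batch in iter_batches(texts, batch_size=16)]
print(batch_lengths)  # 40 texts at batch size 16 give batches of 16, 16, and 8
```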
Sequence Length
Set maximum input length:
config = BaseConfig(
    embedding_max_seq_len=2048,  # Default: 2048 tokens
    # Adjust based on your document lengths
)
Normalization
Control whether embeddings are normalized:
config = BaseConfig(
    embedding_return_as_normalized=True,  # Default: True
    # Normalized embeddings enable cosine similarity via dot product
)
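The claim that normalized embeddings turn cosine similarity into a plain dot product can be checked with a few lines of standard-library Python. This is an illustrative sketch with made-up two-dimensional vectors, not REMem code; the helper names are hypothetical.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
cosine = cosine_similarity(a, b)
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))
print(cosine, dot_of_normalized)  # both are 0.96 up to floating-point rounding
```

Because the two values agree, a retriever that stores normalized vectors can rank by dot product alone and skip the per-query norm computation.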
Using NV-Embed-v2
Installation
Install dependencies for NV-Embed-v2:
pip install transformers torch pynvml
Multi-GPU Setup
NV-Embed-v2 automatically uses multiple GPUs:
import os

# Specify visible GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
)
From src/remem/embedding_model/NVEmbedV2.py:21-53, REMem checks the usage of each GPU and spreads the embedding workload across the devices that remain:
# Automatic GPU allocation based on free memory
# GPUs with >10% usage are excluded
# Remaining GPUs share the embedding workload
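The selection rule described in the comments above can be sketched as a small filter. This is not the code from NVEmbedV2.py: the function name select_idle_gpus is hypothetical, and the usage numbers are passed in as plain fractions (in practice they would come from a library such as pynvml) so that the example runs without a GPU.

```python
from typing import Dict, List

USAGE_THRESHOLD = 0.10  # per the comments above, GPUs above 10% usage are skipped


def select_idle_gpus(usage_by_gpu: Dict[int, float],
                     threshold: float = USAGE_THRESHOLD) -> List[int]:
    """Return the indices of GPUs whose usage fraction is at most `threshold`."""
    return sorted(idx for idx, usage in usage_by_gpu.items() if usage <= threshold)


# Hypothetical readings: GPU 1 is busy with another job.
usage = {0: 0.02, 1: 0.85, 2: 0.10, 3: 0.00}
idle = select_idle_gpus(usage)
print(idle)  # GPU 1 is excluded; GPUs 0, 2, and 3 remain
```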
Instruction-Based Encoding
NV-Embed-v2 supports task-specific instructions:
# Internal usage (handled by REMem)
embeddings = embedding_model.batch_encode(
    texts=["query text"],
    instruction="Retrieve relevant passages",  # Optional instruction
)
Using OpenAI Embeddings
Setup API Key
export OPENAI_API_KEY="your-api-key-here"
Basic Usage
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    embedding_batch_size=100,  # OpenAI allows larger batches
)
Caching
OpenAI embeddings are automatically cached to reduce API costs:
# Cache location (auto-created):
# outputs/{dataset}/embedding_cache/{model_name}_embedding_cache.sqlite
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    dataset="musique",  # Creates outputs/musique/embedding_cache/
)
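The idea behind the SQLite cache can be sketched as follows. The table layout, the SHA-256 key, and the EmbeddingCache class are assumptions made for illustration and are not REMem's actual schema; the example uses an in-memory database and a fake embedding function so it runs without an API key.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List


class EmbeddingCache:
    """Hypothetical text -> vector cache backed by SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, text: str, embed: Callable[[str], List[float]]) -> List[float]:
        key = self._key(text)
        row = self.conn.execute("SELECT vector FROM cache WHERE key = ?", (key,)).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no API call
        vector = embed(text)  # cache miss: pay for one embedding
        self.conn.execute(
            "INSERT INTO cache (key, vector) VALUES (?, ?)", (key, json.dumps(vector))
        )
        self.conn.commit()
        return vector


calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    calls.append(text)
    return [float(len(text)), 1.0]


cache = EmbeddingCache()
first = cache.get_or_compute("hello", fake_embed)
second = cache.get_or_compute("hello", fake_embed)
print(len(calls))  # the second lookup is served from the cache
```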
Azure OpenAI
Use Azure-hosted OpenAI models:
export AZURE_OPENAI_API_KEY="your-azure-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export OPENAI_API_VERSION="2024-02-15-preview"

config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    use_azure=True,
)
Parallel Encoding
OpenAI embeddings support parallel processing:
# From src/remem/embedding_model/openai_embedding_client.py:283-309
# Automatic parallel encoding for large batches
# Each text is encoded independently to prevent batch failures
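A minimal sketch of this pattern, using a thread pool and a fake encoder, is shown here. It is not the code from openai_embedding_client.py; encode_in_parallel and fake_encode are hypothetical names, and the real client may differ in how it schedules work and reports failures.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def encode_in_parallel(texts: List[str],
                       encode_one: Callable[[str], List[float]],
                       max_workers: int = 4) -> List[Optional[List[float]]]:
    """Encode each text on its own so one failure cannot sink the whole batch."""

    def safe_encode(text: str) -> Optional[List[float]]:
        try:
            return encode_one(text)
        except Exception:
            return None  # mark the failed item; the other texts are unaffected

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_encode, texts))  # map() preserves input order


def fake_encode(text: str) -> List[float]:
    if not text:
        raise ValueError("empty input")
    return [float(len(text))]


results = encode_in_parallel(["alpha", "", "be"], fake_encode)
print(results)  # the empty string fails alone; its neighbours still get vectors
```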
Using GritLM
Installation
Configuration
config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16,
)
GritLM uses a specific instruction format:
# From src/remem/embedding_model/GritLM.py:71-72
# Format: "<|user|>\n{instruction}\n<|embed|>\n"
# Or just: "<|embed|>\n" if no instruction
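The two formats in the comments above can be expressed as one small function. The name gritlm_instruction is hypothetical; only the format strings themselves come from the excerpt.

```python
from typing import Optional


def gritlm_instruction(instruction: Optional[str] = None) -> str:
    """Build the prefix placed before the text to embed."""
    if instruction:
        return "<|user|>\n" + instruction + "\n<|embed|>\n"
    return "<|embed|>\n"


with_instruction = gritlm_instruction("Retrieve relevant passages")
without_instruction = gritlm_instruction()
print(repr(with_instruction))
print(repr(without_instruction))
```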
Custom Embedding Servers
Local Server Setup
Run a local embedding server with OpenAI-compatible API:
# Example with sentence-transformers server
python -m remem.embedding_model.sentence_transformer_server \
--model nvidia/NV-Embed-v2 \
--port 8001
Client Configuration
config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",  # Model name
    llm_base_url="http://localhost:8001/v1/",  # Server URL
)
When using custom servers, ensure the server is running before initializing REMem.
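One way to confirm the server is up before building the config is a plain TCP connection check, sketched here. The helper server_is_reachable is hypothetical and not part of REMem; it only verifies that something is listening on the host and port, not that the server speaks the OpenAI-compatible API.

```python
import socket
from urllib.parse import urlparse


def server_is_reachable(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(base_url)
    host = parsed.hostname
    if host is None:
        return False
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if server_is_reachable("http://localhost:8001/v1/"):
    print("embedding server is up")
else:
    print("start the embedding server first")
```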
GPU Memory Management
For NV-Embed-v2, optimize GPU allocation:
import os

# Use specific GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=8,  # Reduce if OOM
)
Batch Size Tuning
Optimal batch sizes by model:
# NV-Embed-v2 (local GPU)
config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=16,  # Default, adjust based on GPU memory
)

# OpenAI (API)
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    embedding_batch_size=100,  # Larger batches for API calls
)

# GritLM (local GPU)
config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16,
)
Caching for OpenAI
Maximize cache hits to reduce costs:
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    force_index_from_scratch=False,  # Reuse cached embeddings
    dataset="my_dataset",  # Consistent dataset name for cache
)
Embedding Dimensions
Different models produce different embedding sizes:
Model | Dimensions
nvidia/NV-Embed-v2 | 4096
text-embedding-3-large | 3072
text-embedding-3-small | 1536
text-embedding-ada-002 | 1536
GritLM/GritLM-7B | 4096
From src/remem/embedding_model/openai_embedding_client.py:24-35:
def _get_embedding_dimension(embedding_model_name: str) -> int:
    if "text-embedding-3-large" in embedding_model_name:
        return 3072
    elif "text-embedding-3-small" in embedding_model_name:
        return 1536
    elif "Qwen3-Embedding-8B" in embedding_model_name:
        return 4096
    elif "NV-Embed-v2" in embedding_model_name:
        return 4096
    # ...
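The dimension also determines how much space an index takes. As a rough sketch, the estimate below multiplies the number of vectors by the dimension and by 4 bytes per value. The 4-byte (float32) storage is an assumption, not something this page states about REMem, and estimate_index_bytes is a hypothetical helper.

```python
EMBEDDING_DIMENSIONS = {
    "nvidia/NV-Embed-v2": 4096,
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
    "GritLM/GritLM-7B": 4096,
}
BYTES_PER_VALUE = 4  # assumption: vectors are stored as float32


def estimate_index_bytes(model_name: str, num_vectors: int) -> int:
    """Approximate raw storage for `num_vectors` embeddings of the given model."""
    return num_vectors * EMBEDDING_DIMENSIONS[model_name] * BYTES_PER_VALUE


size = estimate_index_bytes("nvidia/NV-Embed-v2", 100_000)
print(f"{size / 1024 ** 3:.2f} GiB")
```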
Troubleshooting
Out of Memory (OOM) Errors
Reduce batch size or use fewer GPUs:
config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=8,  # Reduced from 16
)
OpenAI Rate Limits
The client automatically retries with exponential backoff:
# From openai_embedding_client.py:54-87
# Automatic retry with backoff:
# - Base delay: 1 second
# - Max delay: 60 seconds
# - Exponential factor: 2
# - Max retries: 5 (configurable)
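With the parameters listed above, the wait times work out as in this sketch. It is an illustration of capped exponential backoff, not the code from openai_embedding_client.py; backoff_delays and call_with_retry are hypothetical names, and a real client would typically retry only on rate-limit errors and may add jitter.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   max_delay: float = 60.0, max_retries: int = 5) -> List[float]:
    """Delay before each retry: base * factor**attempt, capped at max_delay."""
    return [min(base * factor ** attempt, max_delay) for attempt in range(max_retries)]


def call_with_retry(fn: Callable[[], T], delays: List[float],
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, sleeping between failed attempts; the last error propagates."""
    for delay in delays:
        try:
            return fn()
        except Exception:  # simplification: a real client would filter by error type
            sleep(delay)
    return fn()


print(backoff_delays())  # five retries wait 1, 2, 4, 8, and 16 seconds
print(backoff_delays(max_retries=8))  # later delays are capped at 60 seconds
```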
Content Filtering (OpenAI)
OpenAI may reject certain content. When that happens, REMem substitutes a zero vector for the rejected text and continues:
# From openai_embedding_client.py:367-371
# Automatically handled - creates zero vector
# Logs warning but continues processing
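The fallback can be sketched as a wrapper around a single encoding call. This is not the code from openai_embedding_client.py: ContentRejectedError, encode_with_fallback, and fake_encode are hypothetical stand-ins used to show the pattern.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


class ContentRejectedError(Exception):
    """Hypothetical stand-in for the error raised when the API rejects a text."""


def encode_with_fallback(text: str,
                         encode_one: Callable[[str], List[float]],
                         dimension: int) -> List[float]:
    """Return the embedding, or a zero vector of `dimension` if the text is rejected."""
    try:
        return encode_one(text)
    except ContentRejectedError:
        logger.warning("Embedding rejected; using a zero vector of size %d", dimension)
        return [0.0] * dimension


def fake_encode(text: str) -> List[float]:
    if "blocked" in text:
        raise ContentRejectedError(text)
    return [1.0, 0.0, 0.0]


ok = encode_with_fallback("fine text", fake_encode, dimension=3)
fallback = encode_with_fallback("blocked text", fake_encode, dimension=3)
print(ok, fallback)
```

A zero vector has a dot product of 0 with every query, so a rejected passage keeps its place in the index but does not score above other passages on similarity alone.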
Next Steps
Configuration: Explore all configuration options
Indexing: Learn how to index documents