

When DeepWiki clones a repository it splits the source files into chunks, converts each chunk into a vector embedding, and stores those vectors in a local index under ~/.adalflow/databases/. These embeddings power two features: wiki generation, where relevant code is retrieved as context for each documentation section, and the Ask feature, where your question is matched against the index to return accurate, code-grounded answers. Choosing an embedding provider is therefore a foundational configuration decision.
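The retrieval flow described above can be illustrated with a toy bag-of-words sketch. Everything below (the `embed`, `cosine`, and `ask` helpers, the example chunks) is invented for illustration; a real run would call the configured embedding provider and DeepWiki's own index, not this code:

```python
import math
import re

def tokens(text):
    """Lowercase word tokens; a crude stand-in for real tokenization."""
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text, vocab):
    """Toy bag-of-words embedding, L2-normalized so cosine similarity
    reduces to a dot product. A real embedder returns dense vectors."""
    toks = tokens(text)
    counts = [toks.count(w) for w in vocab]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

# "Chunks" standing in for split source files.
chunks = [
    "def parse_config(path): load the embedder settings",
    "class WikiGenerator: renders documentation sections",
    "def embed_chunks(files): convert code chunks to vectors",
]
vocab = sorted({w for c in chunks for w in tokens(c)})
index = [(c, embed(c, vocab)) for c in chunks]  # the stored vector index

def ask(question):
    """Match a question against the index, as the Ask feature does."""
    q = embed(question, vocab)
    return max(index, key=lambda item: cosine(q, item[1]))[0]
```

Calling `ask("how are code chunks converted to vectors")` returns the third chunk, because it shares the most vocabulary with the question; the real system does the same nearest-neighbour lookup in the embedding space of the provider you configure below.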

Embedder types

DeepWiki supports four embedding providers, selected with the DEEPWIKI_EMBEDDER_TYPE environment variable.
| Type | Model | API key required | Notes |
| --- | --- | --- | --- |
| openai | text-embedding-3-small (256 dimensions) | OPENAI_API_KEY | Default. Batch size 500. |
| google | gemini-embedding-001 | GOOGLE_API_KEY | Reuses your existing Gemini key. Batch size 100. |
| ollama | nomic-embed-text | None | Requires a local Ollama installation. |
| bedrock | amazon.titan-embed-text-v2:0 (256 dimensions) | AWS credentials | Batch size 100. |

Setting the embedder type

Add DEEPWIKI_EMBEDDER_TYPE to your .env file or environment:
# OpenAI (default — no variable needed, but explicit here for clarity)
DEEPWIKI_EMBEDDER_TYPE=openai

# Google AI
DEEPWIKI_EMBEDDER_TYPE=google

# Local Ollama
DEEPWIKI_EMBEDDER_TYPE=ollama

# AWS Bedrock
DEEPWIKI_EMBEDDER_TYPE=bedrock
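If you choose the ollama embedder, the nomic-embed-text model must be available locally before indexing. Pulling it uses the standard Ollama CLI (this assumes Ollama itself is already installed; the guard below only keeps the snippet safe to run on machines without it):

```shell
# Download the embedding model DeepWiki's ollama embedder uses,
# then list local models to confirm it is available.
if command -v ollama >/dev/null 2>&1; then
  ollama pull nomic-embed-text
  ollama list
else
  echo "ollama not found; install it from https://ollama.com first"
fi
```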

Provider setup

OpenAI is the default embedder. The text-embedding-3-small model is used with 256 dimensions and float encoding.

Required environment variable:

OPENAI_API_KEY=your_openai_api_key

Optional custom base URL. If you need to route embedding requests through a private endpoint, set:

OPENAI_BASE_URL=https://your-endpoint.com/v1

This is the same variable used by the OpenAI text generation client (see Model providers).

embedder.json excerpt:
{
  "embedder": {
    "client_class": "OpenAIClient",
    "batch_size": 500,
    "model_kwargs": {
      "model": "text-embedding-3-small",
      "dimensions": 256,
      "encoding_format": "float"
    }
  }
}

Using OpenAI-compatible embedding models

Some providers (such as Alibaba Cloud’s Qwen family) expose an OpenAI-compatible embeddings API. DeepWiki ships a ready-made config template for this case at api/config/embedder.openai_compatible.json.bak. To switch to an OpenAI-compatible embedder:
  1. Replace api/config/embedder.json with the contents of the compatible template:
{
  "embedder": {
    "client_class": "OpenAIClient",
    "initialize_kwargs": {
      "api_key": "${OPENAI_API_KEY}",
      "base_url": "${OPENAI_BASE_URL}"
    },
    "batch_size": 10,
    "model_kwargs": {
      "model": "text-embedding-v3",
      "dimensions": 256,
      "encoding_format": "float"
    }
  },
  "retriever": {
    "top_k": 20
  },
  "text_splitter": {
    "split_by": "word",
    "chunk_size": 350,
    "chunk_overlap": 100
  }
}
  2. Set the environment variables:
OPENAI_API_KEY=your_provider_api_key
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
DeepWiki automatically substitutes ${OPENAI_API_KEY} and ${OPENAI_BASE_URL} placeholders in embedder.json with the values from your environment. No code changes are needed.
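The substitution behaviour can be mimicked in a few lines. The `substitute_env` helper below is a hypothetical sketch of what happens, not DeepWiki's actual implementation:

```python
import json
import os
import re

def substitute_env(text):
    """Replace ${VAR} placeholders with environment values.
    Placeholders for unset variables are left untouched."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: os.environ.get(m.group(1), m.group(0)),
        text,
    )

# A fragment shaped like the initialize_kwargs block above.
raw = '{"api_key": "${OPENAI_API_KEY}", "base_url": "${OPENAI_BASE_URL}"}'

os.environ["OPENAI_API_KEY"] = "sk-demo"
os.environ["OPENAI_BASE_URL"] = "https://dashscope.aliyuncs.com/compatible-mode/v1"

config = json.loads(substitute_env(raw))
```

After substitution, `config["api_key"]` holds the value from the environment rather than the literal placeholder, which is why no code changes are needed when you switch providers.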

Switching embedders

Switching DEEPWIKI_EMBEDDER_TYPE after a repository has already been indexed requires regenerating that repository’s embeddings. Embeddings from different models occupy different vector spaces and are not interchangeable. Delete the existing database for the repository under ~/.adalflow/databases/ and regenerate the wiki to rebuild the index with the new embedder.
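A deletion helper might look like the sketch below. The naming scheme under ~/.adalflow/databases/ is an assumption here; inspect the directory to find the entry that matches your repository before deleting anything:

```python
import shutil
from pathlib import Path

def clear_repo_index(db_dir, repo_key):
    """Delete cached index entries whose names start with repo_key.

    Returns the number of entries removed. repo_key is a hypothetical
    per-repository prefix; check the actual file names first.
    """
    removed = 0
    for entry in Path(db_dir).glob(f"{repo_key}*"):
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed

# Typical call (path and repository key are illustrative):
# clear_repo_index(Path.home() / ".adalflow" / "databases", "owner_repo")
```

After clearing the entry, regenerate the wiki so the repository is re-chunked and re-embedded with the new provider.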

Text splitting configuration

Regardless of which embedder you choose, DeepWiki splits source files into overlapping chunks before embedding them. The defaults are defined in embedder.json:
{
  "text_splitter": {
    "split_by": "word",
    "chunk_size": 350,
    "chunk_overlap": 100
  },
  "retriever": {
    "top_k": 20
  }
}
  • chunk_size: Maximum number of words per chunk (350 by default).
  • chunk_overlap: Number of words shared between adjacent chunks (100 by default), preserving context at boundaries.
  • top_k: Number of chunks retrieved per query during RAG (20 by default).
These values can be tuned in embedder.json or in a custom config directory specified by DEEPWIKI_CONFIG_DIR.
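The word-based splitting with overlap can be sketched as follows; this mirrors the configured behaviour but is not DeepWiki's exact splitter:

```python
def split_by_word(text, chunk_size=350, chunk_overlap=100):
    """Split text into overlapping word chunks.

    Consecutive chunks share chunk_overlap words, so context that
    straddles a chunk boundary appears in both neighbours.
    """
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final window already reached the end
    return chunks
```

With the defaults, an 800-word file yields three chunks (windows starting at words 0, 250, and 500), each sharing its last 100 words with the start of the next. Raising chunk_overlap improves boundary context at the cost of more chunks to embed.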
