Vector Search

TiDB supports a native VECTOR data type designed for storing and querying high-dimensional vector embeddings produced by AI/ML models. Combined with vector indexes and distance functions, you can build semantic search, recommendation systems, and other embedding-based applications directly in TiDB without a separate vector database.

The VECTOR data type

The VECTOR(n) type stores a fixed-dimension array of 32-bit floats, where n is the number of dimensions (up to 16383).

-- Create a table with a 1536-dimensional embedding column (OpenAI ada-002 output size)
CREATE TABLE documents (
    id        INT          PRIMARY KEY AUTO_INCREMENT,
    content   TEXT         NOT NULL,
    embedding VECTOR(1536) NOT NULL
);

TiDB vector types store single-precision (32-bit) floats only. Double-precision values are converted to single precision when they are stored in a VECTOR column.

Inserting vectors

Pass vector values as a JSON-style array string:

INSERT INTO documents (content, embedding)
VALUES (
    'TiDB is a distributed SQL database',
    '[0.021, -0.134, 0.305, ...]'  -- 1536 values total
);
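
As a minimal sketch of how the fixed dimension is enforced, the statements below use a hypothetical 3-dimensional table named vector_demo (not part of the schema above) so that the literals can be written out in full. The second INSERT is expected to be rejected because it supplies two values for a VECTOR(3) column; the exact error text may vary by version.

```sql
-- Hypothetical demo table; 3 dimensions keeps the literals readable.
CREATE TABLE vector_demo (
    id        INT       PRIMARY KEY,
    embedding VECTOR(3) NOT NULL
);

-- Accepted: the literal has exactly 3 values.
INSERT INTO vector_demo (id, embedding) VALUES (1, '[0.3, 0.5, -0.1]');

-- Expected to fail: 2 values do not fit a VECTOR(3) column.
INSERT INTO vector_demo (id, embedding) VALUES (2, '[0.3, 0.5]');

-- VEC_DIMS reports the dimension of a stored vector.
SELECT id, VEC_DIMS(embedding) AS dims FROM vector_demo;
```

In application code, validating the embedding length before the INSERT avoids a round trip that is bound to fail.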

Distance functions

TiDB provides built-in functions to compute similarity between vectors:

Function	Description
`VEC_COSINE_DISTANCE(a, b)`	Cosine distance (0 = identical, 2 = opposite)
`VEC_L2_DISTANCE(a, b)`	Euclidean (L2) distance
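`VEC_NEGATIVE_INNER_PRODUCT(a, b)`	Negative inner product (the dot product multiplied by -1)
`VEC_L1_DISTANCE(a, b)`	Manhattan (L1) distance

Lower values indicate higher similarity for every function in this table. The inner product is negated for that reason: a larger dot product means a closer match, so its negative sorts correctly in ascending order.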
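
To get a feel for the value ranges, the distance functions can be called on literal vectors without any table. The following is a small sketch using hand-picked 2-dimensional vectors whose results are easy to verify by hand; the column aliases are illustrative only.

```sql
-- L2: the 3-4-5 right triangle gives a distance of 5.
SELECT VEC_L2_DISTANCE('[0, 3]', '[4, 0]') AS l2;

-- L1: |3 - 0| + |4 - 0| = 7.
SELECT VEC_L1_DISTANCE('[0, 0]', '[3, 4]') AS l1;

-- Cosine: same direction -> 0, orthogonal -> 1, opposite -> 2.
SELECT VEC_COSINE_DISTANCE('[1, 0]', '[2, 0]')  AS same_direction,
       VEC_COSINE_DISTANCE('[1, 0]', '[0, 1]')  AS orthogonal,
       VEC_COSINE_DISTANCE('[1, 0]', '[-1, 0]') AS opposite;
```

Note that cosine distance ignores magnitude ([1, 0] and [2, 0] are at distance 0), while L1 and L2 do not.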

Similarity search (ANN)

Use ORDER BY with a distance function and LIMIT to retrieve the nearest neighbors:

-- Find the 5 most similar documents to a query embedding
SELECT id, content,
       VEC_COSINE_DISTANCE(embedding, '[0.021, -0.134, 0.305, ...]') AS distance
FROM documents
ORDER BY distance
LIMIT 5;

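This ORDER BY ... LIMIT pattern is a k-nearest-neighbor (KNN) query. When a matching vector index exists, TiDB answers it as an Approximate Nearest Neighbor (ANN) search, trading a small amount of recall for speed instead of scanning the full table. The index applies only when the query orders by the same distance function the index was built on, in ascending order.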

Creating a vector index

Vector indexes accelerate ANN queries. TiDB implements the HNSW (Hierarchical Navigable Small World) algorithm, which provides fast approximate search with configurable recall/speed trade-offs.

-- Create an HNSW vector index using cosine distance
CREATE VECTOR INDEX idx_embedding
    USING HNSW
    ON documents ((VEC_COSINE_DISTANCE(embedding)));

You can also define the index inline during table creation:

CREATE TABLE documents (
    id        INT          PRIMARY KEY AUTO_INCREMENT,
    content   TEXT         NOT NULL,
    embedding VECTOR(1536) NOT NULL,
    VECTOR INDEX idx_embedding USING HNSW ((VEC_COSINE_DISTANCE(embedding)))
);

Or add it with ALTER TABLE:

ALTER TABLE documents
    ADD VECTOR INDEX idx_embedding
    USING HNSW ((VEC_COSINE_DISTANCE(embedding)));

Vector indexes are stored in TiFlash, TiDB’s columnar storage engine. Ensure TiFlash replicas are configured before creating a vector index.
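
A minimal sketch of that prerequisite, assuming the cluster already has at least one TiFlash node and that the table is the documents table defined earlier. The replica count of 1 is an illustrative choice, not a recommendation.

```sql
-- Request one TiFlash replica for the table (requires TiFlash nodes in the cluster).
ALTER TABLE documents SET TIFLASH REPLICA 1;

-- Check replication status: AVAILABLE = 1 and PROGRESS = 1 mean the replica is ready.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_NAME = 'documents';
```

Replication runs in the background, so on a large table it can take some time before the replica is reported as available.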

Using a specific index

You can hint the optimizer to use a vector index:

SELECT id, content,
       VEC_COSINE_DISTANCE(embedding, '[0.021, -0.134, 0.305, ...]') AS distance
FROM documents USE INDEX (idx_embedding)
ORDER BY distance
LIMIT 5;

Example: storing and searching OpenAI embeddings

The following example shows an end-to-end workflow using OpenAI text-embedding-ada-002 (1536 dimensions):

-- 1. Create the table
CREATE TABLE articles (
    id        INT          PRIMARY KEY AUTO_INCREMENT,
    title     VARCHAR(500) NOT NULL,
    body      TEXT         NOT NULL,
    embedding VECTOR(1536) NOT NULL,
    VECTOR INDEX idx_emb USING HNSW ((VEC_COSINE_DISTANCE(embedding)))
);

-- 2. Insert documents with precomputed embeddings
INSERT INTO articles (title, body, embedding)
VALUES ('Introduction to TiDB', 'TiDB is a distributed SQL database...', '[...]');

-- 3. Search by semantic similarity
--    $query_embedding is a placeholder for the query vector supplied by your application
SELECT title,
       VEC_COSINE_DISTANCE(embedding, $query_embedding) AS distance
FROM articles
ORDER BY distance
LIMIT 10;
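
How the query embedding is bound depends on the client library. One way to try the search step in plain SQL is a prepared statement with a user variable standing in for the value the application would supply. The sketch below assumes a hypothetical 3-dimensional table named articles_demo so that every literal is complete; the table name, sample rows, and statement name are illustrative, and whether a parameterized query is served by the vector index should be confirmed with EXPLAIN on your version.

```sql
-- Hypothetical 3-dimensional stand-in for the articles table.
CREATE TABLE articles_demo (
    id        INT          PRIMARY KEY AUTO_INCREMENT,
    title     VARCHAR(500) NOT NULL,
    embedding VECTOR(3)    NOT NULL
);

INSERT INTO articles_demo (title, embedding) VALUES
    ('Introduction to TiDB',   '[0.9, 0.1, 0.0]'),
    ('Cooking with cast iron', '[0.0, 0.2, 0.9]');

-- The client would normally bind this value; a user variable stands in here.
SET @query_embedding = '[1.0, 0.0, 0.0]';

PREPARE nearest FROM
    'SELECT title, VEC_COSINE_DISTANCE(embedding, ?) AS distance
     FROM articles_demo
     ORDER BY distance
     LIMIT 10';
EXECUTE nearest USING @query_embedding;
DEALLOCATE PREPARE nearest;
```

With these sample rows, 'Introduction to TiDB' should be returned first, since its vector points in nearly the same direction as the query vector.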

Use cases

Semantic search: find documents, products, or records by meaning rather than keywords.
Recommendation systems: surface similar items based on user behavior embeddings.
Image similarity: compare image feature vectors from vision models.
Anomaly detection: identify outliers by measuring distance from cluster centers.
RAG (Retrieval-Augmented Generation): retrieve relevant context chunks to augment LLM prompts.

Verifying index usage

Use EXPLAIN to confirm the query plan uses the vector index:

EXPLAIN
SELECT id, VEC_COSINE_DISTANCE(embedding, '[...]') AS distance
FROM documents
ORDER BY distance
LIMIT 5;

Look for annIndex: in the operator info column of the TableFullScan operator, which indicates the HNSW index is being used.

Without a vector index, ANN queries perform a full table scan. For tables with more than a few thousand rows, always create a vector index to ensure acceptable query latency.
