TiDB supports a native VECTOR data type designed for storing and querying high-dimensional vector embeddings produced by AI/ML models. By combining it with vector indexes and distance functions, you can build semantic search, recommendation systems, and other embedding-based applications directly in TiDB without a separate vector database.

The VECTOR data type

The VECTOR(n) type stores a fixed-dimension array of 32-bit floats, where n is the number of dimensions (up to 16383).
-- Create a table with a 1536-dimensional embedding column (OpenAI ada-002 output size)
CREATE TABLE documents (
    id        INT          PRIMARY KEY AUTO_INCREMENT,
    content   TEXT         NOT NULL,
    embedding VECTOR(1536) NOT NULL
);
Vector elements are always stored as single-precision (32-bit) floats. TiDB does not currently offer a double-precision vector type.
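To make the 32-bit storage concrete, here is a small Python sketch of what single-precision elements imply for size and precision. The helper names `raw_vector_bytes` and `float32_roundtrip` are hypothetical, and the size counts only the element payload (4 bytes each), not any per-value or per-row overhead TiDB may add.

```python
import struct


def raw_vector_bytes(dims: int) -> int:
    """Element payload of one vector: 4 bytes per 32-bit float (overhead not modelled)."""
    return dims * 4


def float32_roundtrip(x: float) -> float:
    """Return x after being squeezed through a 32-bit float, as a stored element would be."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


if __name__ == "__main__":
    print(raw_vector_bytes(1536))   # payload bytes for one 1536-dimensional vector
    print(float32_roundtrip(0.1))   # nearest 32-bit float to 0.1, shown as a Python float
```

Under these assumptions a 1536-dimensional vector carries 6144 bytes of element data, and a value such as 0.1 is stored as the nearest 32-bit float rather than exactly.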

Inserting vectors

Pass vector values as a JSON-style array string:
INSERT INTO documents (content, embedding)
VALUES (
    'TiDB is a distributed SQL database',
    '[0.021, -0.134, 0.305, ...]'  -- 1536 values total
);
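In application code the array string is usually built from a list of floats returned by an embedding model. The following Python sketch shows one way to do that. The function name `to_vector_literal` and the `expected_dims` parameter are hypothetical, and rejecting non-finite values is a precaution taken by this sketch.

```python
import math
from typing import Sequence


def to_vector_literal(values: Sequence[float], expected_dims: int) -> str:
    """Serialize floats into a '[v1,v2,...]' string suitable for a VECTOR(n) column."""
    if len(values) != expected_dims:
        raise ValueError(f"expected {expected_dims} dimensions, got {len(values)}")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("vector elements must be finite numbers")
    # repr() of a Python float is the shortest string that round-trips exactly.
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


if __name__ == "__main__":
    print(to_vector_literal([0.021, -0.134, 0.305], expected_dims=3))
```

Checking the length before sending the statement surfaces a dimension mismatch in the application rather than as a database error.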

Distance functions

TiDB provides built-in functions to compute similarity between vectors:
| Function | Description |
| --- | --- |
| VEC_COSINE_DISTANCE(a, b) | Cosine distance (0 = same direction, 2 = opposite direction) |
| VEC_L2_DISTANCE(a, b) | Euclidean (L2) distance |
| VEC_NEGATIVE_INNER_PRODUCT(a, b) | Negative inner product (the dot product, negated) |
| VEC_L1_DISTANCE(a, b) | Manhattan (L1) distance |

For all four functions, lower values indicate higher similarity.
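For intuition about what these functions measure, the following Python sketch computes the same mathematical quantities on plain lists. It is a reference for the formulas only, not TiDB's implementation, and the function names are hypothetical. How TiDB treats edge cases such as zero vectors is not covered here; this sketch simply raises an error.

```python
import math
from typing import Sequence


def _check_dims(a: Sequence[float], b: Sequence[float]) -> None:
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance: square root of the summed squared differences."""
    _check_dims(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Manhattan distance: sum of absolute differences."""
    _check_dims(a, b)
    return sum(abs(x - y) for x, y in zip(a, b))


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product: larger means more similar, so negate it to sort ascending."""
    _check_dims(a, b)
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cos(angle between a and b); ranges from 0 to 2."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)
```

Cosine distance ignores vector length and compares direction only, which is why it is a common choice for text embeddings.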

Similarity search (ANN)

Use ORDER BY with a distance function and LIMIT to retrieve the nearest neighbors:
-- Find the 5 most similar documents to a query embedding
SELECT id, content,
       VEC_COSINE_DISTANCE(embedding, '[0.021, -0.134, 0.305, ...]') AS distance
FROM documents
ORDER BY distance
LIMIT 5;
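This ORDER BY ... LIMIT pattern is a top-k nearest neighbor query. When a matching vector index exists, TiDB can answer it with Approximate Nearest Neighbor (ANN) search instead of scanning the full table. Without one, the query is still valid but computes the distance for every row.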
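As a mental model, the query above asks for the k rows with the smallest distance to the query vector. The Python sketch below computes that answer exactly by brute force over in-memory rows. An index-backed search trades some of this exactness for speed. The names `top_k` and `cosine_distance` are hypothetical helpers for this illustration.

```python
import heapq
import math
from typing import Iterable, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(
    rows: Iterable[Tuple[int, Sequence[float]]],
    query: Sequence[float],
    k: int,
) -> List[Tuple[float, int]]:
    """Return the k (distance, id) pairs with the smallest distance, ascending."""
    scored = ((cosine_distance(embedding, query), row_id) for row_id, embedding in rows)
    return heapq.nsmallest(k, scored)


if __name__ == "__main__":
    rows = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [-1.0, 0.0])]
    for distance, row_id in top_k(rows, query=[1.0, 0.0], k=2):
        print(row_id, distance)
```

The cost of this exact approach grows linearly with the number of rows, which is the motivation for a vector index.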

Creating a vector index

Vector indexes accelerate ANN queries. TiDB implements the HNSW (Hierarchical Navigable Small World) algorithm, which provides fast approximate search with configurable recall/speed trade-offs.
-- Create an HNSW vector index using cosine distance
CREATE VECTOR INDEX idx_embedding
    USING HNSW
    ON documents ((VEC_COSINE_DISTANCE(embedding)));
You can also define the index inline during table creation:
CREATE TABLE documents (
    id        INT          PRIMARY KEY AUTO_INCREMENT,
    content   TEXT         NOT NULL,
    embedding VECTOR(1536) NOT NULL,
    VECTOR INDEX idx_embedding USING HNSW ((VEC_COSINE_DISTANCE(embedding)))
);
Or add it with ALTER TABLE:
ALTER TABLE documents
    ADD VECTOR INDEX idx_embedding
    USING HNSW ((VEC_COSINE_DISTANCE(embedding)));
Vector indexes are stored in TiFlash, TiDB’s columnar storage engine. Ensure TiFlash replicas are configured before creating a vector index.
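Because HNSW is approximate, it can help to measure how many of the true nearest neighbors an indexed query actually returns. The Python sketch below computes recall@k from two lists of row IDs. It assumes you have obtained the exact IDs separately (for example from a brute-force computation), and `recall_at_k` is a hypothetical helper name.

```python
from typing import Hashable, Sequence


def recall_at_k(approx_ids: Sequence[Hashable], exact_ids: Sequence[Hashable]) -> float:
    """Fraction of the exact top-k IDs that also appear in the approximate result."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


if __name__ == "__main__":
    # Three of the four true neighbors were found by the approximate search.
    print(recall_at_k(approx_ids=[1, 2, 3, 9], exact_ids=[1, 2, 3, 4]))
```

A recall close to 1.0 on a representative sample of queries suggests the approximation is not costing much result quality for that workload.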

Using a specific index

You can hint the optimizer to use a vector index:
SELECT id, content,
       VEC_COSINE_DISTANCE(embedding, '[0.021, -0.134, 0.305, ...]') AS distance
FROM documents USE INDEX (idx_embedding)
ORDER BY distance
LIMIT 5;

Example: storing and searching OpenAI embeddings

The following example shows an end-to-end workflow using OpenAI text-embedding-ada-002 (1536 dimensions):
-- 1. Create the table
CREATE TABLE articles (
    id        INT          PRIMARY KEY AUTO_INCREMENT,
    title     VARCHAR(500) NOT NULL,
    body      TEXT         NOT NULL,
    embedding VECTOR(1536) NOT NULL,
    VECTOR INDEX idx_emb USING HNSW ((VEC_COSINE_DISTANCE(embedding)))
);

-- 2. Insert documents with precomputed embeddings
INSERT INTO articles (title, body, embedding)
VALUES ('Introduction to TiDB', 'TiDB is a distributed SQL database...', '[...]');

-- 3. Search by semantic similarity
--    ($query_embedding is a placeholder for the query vector supplied by your application)
SELECT title,
       VEC_COSINE_DISTANCE(embedding, $query_embedding) AS score
FROM articles
ORDER BY score
LIMIT 10;
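From application code, the query vector in step 3 is typically bound as a parameter rather than pasted into the SQL text. The Python sketch below prepares the statement and its parameters for a DB-API driver that uses `%s` placeholders (PyMySQL is one such driver). The names `SEARCH_SQL` and `build_search` are hypothetical, and no database connection is made here.

```python
from typing import Sequence, Tuple

# Same query as step 3, with the vector and the limit as bound parameters.
SEARCH_SQL = (
    "SELECT title, VEC_COSINE_DISTANCE(embedding, %s) AS score "
    "FROM articles ORDER BY score LIMIT %s"
)


def build_search(query_embedding: Sequence[float], limit: int = 10) -> Tuple[str, Tuple[str, int]]:
    """Return (sql, params) ready to pass to cursor.execute(sql, params)."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # The query vector travels as a '[v1,v2,...]' string, like an inserted value.
    literal = "[" + ",".join(repr(float(v)) for v in query_embedding) + "]"
    return SEARCH_SQL, (literal, limit)


if __name__ == "__main__":
    sql, params = build_search([0.5, -1.0], limit=3)
    print(sql)
    print(params)
```

With an open cursor from such a driver, the call would look like `cursor.execute(*build_search(embedding))`, where `embedding` is the list of floats returned by your embedding model.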

Use cases

  • Semantic search: find documents, products, or records by meaning rather than keywords.
  • Recommendation systems: surface similar items based on user behavior embeddings.
  • Image similarity: compare image feature vectors from vision models.
  • Anomaly detection: identify outliers by measuring distance from cluster centers.
  • RAG (Retrieval-Augmented Generation): retrieve relevant context chunks to augment LLM prompts.

Verifying index usage

Use EXPLAIN to confirm the query plan uses the vector index:
EXPLAIN
SELECT id, VEC_COSINE_DISTANCE(embedding, '[...]') AS distance
FROM documents
ORDER BY distance
LIMIT 5;
Look for annIndex: in the operator info column of the TableFullScan operator, which indicates that the scan is using the vector index.
Without a vector index, these queries fall back to a full table scan that computes the distance for every row. For tables with more than a few thousand rows, create a vector index to keep query latency acceptable.
